MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing

Summary

MULTIHALUDET is a four-stage framework designed to detect hallucinations in Large Language Models (LLMs) by probing their full hidden state trajectories without requiring language-specific fine-tuning. Addressing the limitations of existing methods that rely on surface-level heuristics or single-layer representations, this approach extracts sequential features across multiple transformer layers. It processes these features using a hybrid architecture featuring multi-scale attention and self-attention pooling to capture both fine-grained and coarse-grained patterns of factual inconsistency. The system generates out-of-fold embeddings that feed into a learned ensemble meta-learner, combining diverse classifiers for robust prediction. Extensive experiments on the HaluEval and TriviaQA benchmarks demonstrate that MULTIHALUDET achieves state-of-the-art performance, reaching up to 98.55% AUROC with Mistral-7B and LLaMA2-7B architectures. Crucially, the framework exhibits strong cross-lingual generalization. Evaluations across high-resource (French), medium-resource (Bangla), and low-resource (Amharic) languages show consistent superiority over baselines, maintaining high detection accuracy even in typologically diverse linguistic tiers. Ablation studies confirm that key components, particularly out-of-fold stacking and multi-scale attention, are essential for capturing the complex semantic shifts indicative of hallucinations. While the method requires white-box access to internal model states and incurs higher computational costs than simple heuristics, it offers a reliable solution for detecting deep factual inconsistencies across diverse languages and model architectures.

MULTIHALUDET:
Multilingual Hallucination Detection via LLM Hidden State Probing
Riasad Alvi1, Nurul Labib Sayeedi1, Md. Faiyaz Abdullah Sayeedi1,2
1United International University, 2BRAC University
ralvi212069@bscse.uiu.ac.bd, nsayeedi2410045@bsds.uiu.ac.bd, msayeedi212049@bscse.uiu.ac.bd
 https://github.com/alvi-uiu/MultiHaluDet
Abstract
Hallucinations in Large Language Models
(LLMs) represent a critical barrier to their reli-
able deployment, a vulnerability heavily exacer-
bated in non-English and resource-constrained
contexts. Existing detection approaches that
rely on output confidence heuristics or single-
layer internal representations frequently fail
to capture deep, complex factual inconsisten-
cies across diverse languages. To address this,
we introduce MULTIHALUDET, a novel four-
stage framework that detects multilingual hal-
lucinations by probing the full hidden state
trajectories of frozen LLMs without requiring
language-specific fine-tuning. Our method ex-
tracts sequential features across multiple layers
and processes them via a hybrid architecture
using multi-scale attention and self-attention
pooling. By generating out-of-fold embeddings
that feed into a learned ensemble meta-learner,
MULTIHALUDET captures both fine-grained
and coarse-grained patterns of factual inconsis-
tency. Extensive experiments demonstrate that
our framework achieves state-of-the-art detec-
tion performance, reaching up to 98.55% AU-
ROC on the English HaluEval and TriviaQA
benchmarks using Mistral-7B and LLaMA2-
7B architectures. Crucially, we rigorously eval-
uate our framework’s cross-lingual generaliza-
tion across high (French), medium (Bangla),
and low-resource (Amharic) languages. MUL-
TIHALUDET demonstrates exceptional repre-
sentational robustness, consistently outperform-
ing baselines and successfully transferring hal-
lucination detection capabilities across typolog-
ically diverse linguistic tiers.
1 Introduction
Hallucinations in LLMs have emerged as a criti-
cal barrier to their reliable deployment across high-
stakes domains (Farquhar et al., 2024; Mishra et al.,
2024; Varshney et al., 2023). These generations
pose significant risks (Farquhar et al., 2024), mo-
tivating diverse detection techniques. Evidence-
based methods (Chern et al., 2023; Zhang et al.,
2024) retrieve external information to verify fac-
tual consistency, but are computationally intensive.
Evidence-free methods leverage inherent model
characteristics: logit-based and consistency-based
methods estimate uncertainty and output stability,
while classification-based methods (Orgad et al.,
2024; Binkowski et al., 2025) probe internal hidden
states without external retrieval.
While internal state probing shows promise (Or-
gad et al., 2024; Binkowski et al., 2025), current
approaches remain inadequate, particularly across
diverse languages. Recent work reveals truthful-
ness information is concentrated in specific to-
kens (Orgad et al., 2024); however, single-position
methods struggle when non-factual tokens are dis-
tributed across sequences. Furthermore, attention-
based approaches (Chuang et al., 2024; Binkowski
et al., 2025) using simple ratios achieve limited dis-
crimination. Probabilistic frameworks (Hou et al.,
2025) require complex reasoning pipelines, while
active validation (Varshney et al., 2023) and tool-
augmented systems (Chern et al., 2023; Zhang
et al., 2024) introduce substantial latency. As noted
in Farquhar et al. (2024), hallucinations manifest
as semantic-level confabulations rather than token-
level uncertainty. Existing methods largely fail to
address these confabulations across non-English
and resource-constrained languages.
Transformer architectures process information
through deep layers, embedding language-agnostic
hallucination signals within hidden state trajecto-
ries. Deep sequence modeling with multi-scale fea-
ture aggregation offers a promising solution. We in-
troduce MULTIHALUDET, a supervised framework
leveraging multi-scale attention and transformer
encoders to model hidden state dynamics across
the full depth of frozen LLMs. Unlike methods
focused on individual tokens or static layers, our
approach aggregates information across multiple
scales through self-attention pooling. We evaluate
across multiple architectures and language tiers,
demonstrating strong cross-lingual performance.
arXiv:2605.24919v1 [cs.CL] 24 May 2026
Our contributions are summarized as follows:
• We introduce MULTIHALUDET, a four-stage
framework comprising dynamic feature ex-
traction, multi-scale attention encoding, out-
of-fold deep feature generation, and learned
ensemble meta-learning, which jointly cap-
ture fine-grained and coarse-grained patterns
from hidden-state trajectories.
• We evaluate our framework on the HaluE-
val and TriviaQA benchmarks using Llama-
2-7B and Mistral-7B-Instruct, demonstrating
substantial performance gains and representa-
tional robustness over existing baselines.
• We comprehensively evaluate the cross-
lingual generalization of our framework by
translating standard benchmarks into French
(high-resource), Bangla (medium-resource),
and Amharic (low-resource). Our results
demonstrate that internal state probing can
maintain strong detection signals across ty-
pologically diverse linguistic tiers under

Chunk 3 · 1,996 chars

aluate the cross-
lingual generalization of our framework by
translating standard benchmarks into French
(high-resource), Bangla (medium-resource),
and Amharic (low-resource). Our results
demonstrate that internal state probing can
maintain strong detection signals across ty-
pologically diverse linguistic tiers under con-
trolled translation-based evaluation, without
language-specific fine-tuning.
2 Related Work
Hallucination detection approaches fall into three
categories: (i) evidence-based methods verify-
ing outputs against external knowledge, (ii) self-
detection methods leveraging internal model states,
and (iii) consistency-based approaches assessing
output stability.
Evidence-based methods retrieve external infor-
mation for verification. Kale and Alfeo (2025)
convert LLM responses into knowledge graphs
for atomic fact verification. Vangala et al. (2025)
introduced multi-source retrieval with contradic-
tion graph analysis. Zhang and Wang (2025) em-
ployed HHEM for lightweight consistency assess-
ment. While effective, these methods depend on
retrieval quality, making them vulnerable to gaps
and latency.
Self-detection methods probe LLM internal rep-
resentations without external resources. Su et al.
(2024) introduced MIND, training classifiers on
auto-generated pseudo-labels. Liang and Wang
(2025) employed Bayesian optimization to identify
optimal layer insertion points. Zhang et al. (2025)
proposed MHAD, selecting specific neurons via
linear probing. Kossen et al. (2024) approximated
semantic uncertainty from hidden states. Chen
et al. (2024) proposed INSIDE, measuring consis-
tency through eigenvalues of response covariances.
Kim et al. (2025) analyzed layer-wise usable infor-
mation across transformer depths. Quevedo et al.
(2024) demonstrated strong detection using only
four token probability features. Consistency-based
methods assess output stability. Yang et al. (2025a)
proposed MetaQA, leveraging metamorphic rela-
tions for semantic consistency verification.
Despite these advances, challenges remain.
Evidence-based methods suffer from retrieval de-
pendency. Self-detection methods relying on
single-position representations struggle when non-
factual tokens appear at sequence beginnings.
Consistency-based methods incur costs from multi-
ple generation passes. Simhi et al. (2025) demon-
strated that LLMs can hallucinate with high cer-
tainty despite possessing correct knowledge. Yang
et al. (2025b) revealed the need to distinguish be-
tween intelligent and defective hallucinations.
Our work aligns with internal state probing meth-
ods but addresses their key limitations. While
Liang and Wang (2025) and Zhang et al. (2025)
focus on specific layers or tokens, we employ dy-
namic layer sampling and multi-scale attention to
aggregate information across the full depth trajec-
tory. Unlike Kossen et al. (2024) and Chen et al.
(2024), which probe specific positions, our archi-
tecture processes sequential layer features with
transformer encoders and self-attention pooling,
enabling adaptive depth selection. Our out-of-fold
stacking with ensemble meta-learning provides
more robust generalization than single-classifier ap-
proaches (Su et al., 2024; Liang and Wang, 2025).
3 Methodology
We present MULTIHALUDET (Figure 1), a four-
stage framework for hallucination detection.
3.1 LLM-Based Feature Extraction
3.1.1 Prompt Construction and Forward Pass
Let $D = \{(q_i, a_i, y_i)\}_{i=1}^{N}$ denote a dataset of question–answer pairs, where $q_i$ is a natural-language question, $a_i$ is a candidate answer, and $y_i \in \{0, 1\}$ is a binary label indicating whether $a_i$ is a hallucination ($y_i = 1$) or a faithful response ($y_i = 0$). Each sample $(q_i, a_i)$ is formatted as a structured natural-language prompt and tokenized with a fixed maximum sequence length. The prompt is passed through a frozen, quantized LLM in a single forward pass, yielding a sequence of hidden state tensors $\{H^{(l)}\}_{l=0}^{L}$, where $L$ is the number of transformer layers and $H^{(l)} \in \mathbb{R}^{T \times d}$ collects the $d$-dimensional representations of $T$ tokens at layer $l$. The next-token logit vector $z \in \mathbb{R}^{V}$ at the final position is retained for global statistics. No gradient is computed during feature extraction; the LLM parameters remain fully frozen throughout all experiments.

Figure 1: Overview of the four-stage MULTIHALUDET framework for multilingual hallucination detection.
3.1.2 Dynamic Layer Sampling
To ensure architectural compatibility across LLMs
of varying depth, we introduce a dynamic layer
sampling strategy that maps any model’s L trans-
former layers to a fixed target count K, producing
a uniform sequential representation regardless of
model size. Three regimes are handled without
model-specific configuration:
• Exact match ($L = K$): identity mapping $I_K = \{1, \dots, K\}$.
• Shallow model ($L < K$): all $L$ layers are used and the deepest layer is repeated to pad the sequence to $K$.
• Deep model ($L > K$): $K$ indices are selected by uniform interpolation over the full depth:
$$I_K = \left\{ \operatorname{clip}\!\left( \left\lfloor 1 + \frac{(L - 1) \cdot k}{K - 1} \right\rceil,\; 1,\; L \right) \right\}_{k=0}^{K-1} \tag{1}$$
where ⌊·⌉ denotes rounding to the nearest inte-
ger and clip(·, 1, L) clamps indices to valid range.
This formulation is differentiable with respect to K
and requires no architecture-specific configuration,
making it directly applicable to any transformer-
based LLM.
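The three regimes can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, ties in the rounding operator are broken upward (the text does not specify a tie rule), and the fallback for $K = 1$ is an added assumption because Eq. 1 divides by $K - 1$.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer; ties go up (assumed tie-breaking rule)."""
    return math.floor(x + 0.5)


def sample_layers(num_layers: int, target: int) -> list:
    """Map a model with `num_layers` layers to `target` 1-based layer indices."""
    if num_layers < 1 or target < 1:
        raise ValueError("num_layers and target must be positive")
    if num_layers == target:
        # Exact match: identity mapping.
        return list(range(1, target + 1))
    if num_layers < target:
        # Shallow model: use every layer, then repeat the deepest one as padding.
        return list(range(1, num_layers + 1)) + [num_layers] * (target - num_layers)
    if target == 1:
        # Eq. 1 is undefined for K = 1; returning the deepest layer is an assumption.
        return [num_layers]
    # Deep model: uniform interpolation over the full depth (Eq. 1).
    indices = []
    for k in range(target):
        raw = 1 + (num_layers - 1) * k / (target - 1)
        indices.append(min(max(round_half_up(raw), 1), num_layers))
    return indices


if __name__ == "__main__":
    print(sample_layers(9, 5))   # deep model
    print(sample_layers(3, 5))   # shallow model, padded with the deepest layer
```

For a deep model the first and last selected indices are always 1 and $L$, so the embedding-adjacent and final layers are both represented.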
3.1.3 Sequential Per-Layer Features
For each sampled layer index $l \in I_K$, we extract a compact descriptor from two complementary views of the hidden state tensor: the last-token representation $h^{(l)} = H^{(l)}_{T,:}$, which reflects the model's final contextual state, and the sequence mean $\bar{h}^{(l)} = \frac{1}{T} \sum_{t=1}^{T} H^{(l)}_{t,:}$, which captures the aggregate token-level context. From these, we compute a per-layer descriptor $s^{(l)} \in \mathbb{R}^{d_s}$ comprising distributional statistics: the $\ell_2$ norm, mean, standard deviation, extremal values, activation sparsity, near-zero mass, and the kurtosis and median absolute deviation (MAD) of the hidden state, capturing the peakedness and robust dispersion of the activation distribution:
$$\kappa(h) = \frac{1}{d} \sum_{j=1}^{d} \left( \frac{h_j - \bar{h}}{\hat{\sigma}} \right)^{4}, \qquad \mathrm{MAD}(h) = \operatorname{median}_j \left( \left| h_j - \operatorname{median}(h) \right| \right) \tag{2}$$
where $\bar{h}$ and $\hat{\sigma}$ denote the empirical mean and standard deviation taken over the $d$ coordinates of the hidden-state vector, and $\operatorname{median}(h)$ is the median of those coordinates. Collecting descriptors across all sampled layers yields the sequential representation $S \in \mathbb{R}^{K \times d_s}$, which encodes how the LLM's internal dynamics evolve as a function of depth.
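As a rough illustration of Eq. 2, the sketch below computes kurtosis, MAD, and a few of the simpler statistics for a single hidden-state vector. The function names are hypothetical, the population ($1/d$) standard deviation is an assumption (the text does not say which estimator is used), and activation sparsity and near-zero mass are omitted because their thresholds are not given.

```python
import math
import statistics


def kurtosis(h):
    """Fourth standardized moment of the coordinates of h (Eq. 2)."""
    d = len(h)
    mean = sum(h) / d
    # Population (1/d) standard deviation: an assumption, not stated in the text.
    std = math.sqrt(sum((x - mean) ** 2 for x in h) / d)
    if std == 0.0:
        return 0.0  # assumed convention for a constant vector
    return sum(((x - mean) / std) ** 4 for x in h) / d


def mad(h):
    """Median absolute deviation around the median of the coordinates (Eq. 2)."""
    med = statistics.median(h)
    return statistics.median([abs(x - med) for x in h])


def layer_descriptor(h):
    """Partial per-layer descriptor for one hidden-state vector h."""
    d = len(h)
    mean = sum(h) / d
    return {
        "l2_norm": math.sqrt(sum(x * x for x in h)),
        "mean": mean,
        "std": math.sqrt(sum((x - mean) ** 2 for x in h) / d),
        "min": min(h),
        "max": max(h),
        "kurtosis": kurtosis(h),
        "mad": mad(h),
    }


if __name__ == "__main__":
    print(layer_descriptor([1.0, 2.0, 3.0, 4.0, 100.0]))
```

Applying `layer_descriptor` to each sampled layer and stacking the results row by row would give a matrix with the shape of $S$.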
3.1.4 Anchor-Based Depth Probing
To enable consistent cross-architecture comparisons at semantically meaningful network depths, we define four anchor layers corresponding to proportional depth fractions $\{\alpha_j\}_{j=1}^{4}$. For each anchor, the closest available sampled layer is identified:
$$l^{*}_{j} = \arg\min_{l \in I_K} \left| \, l - \lfloor \alpha_j \cdot L \rceil \, \right| \tag{3}$$
and the corresponding descriptor $s^{(l^{*}_{j})}$ is retained as an anchor feature. This construction ensures that features at early, middle, and late network stages are explicitly represented in the global feature vector regardless of the total model depth, providing a principled substitute for hardcoded layer indices.
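Eq. 3 amounts to a nearest-neighbor lookup over the sampled indices. The sketch below assumes hypothetical depth fractions of 0.25, 0.5, 0.75, and 1.0 (the values of $\alpha_j$ are not given in the text), rounds ties upward, and resolves equidistant candidates in favor of the shallower layer; all three choices are illustrative assumptions.

```python
import math


def nearest_sampled_layer(sampled, num_layers, alpha):
    """Return the sampled layer index closest to round(alpha * num_layers) (Eq. 3)."""
    target = math.floor(alpha * num_layers + 0.5)  # round half up (assumed tie rule)
    # min() keeps the first minimizer, so equidistant candidates resolve to the
    # shallower layer when `sampled` is sorted in increasing order.
    return min(sampled, key=lambda layer: abs(layer - target))


def anchor_layers(sampled, num_layers, alphas=(0.25, 0.5, 0.75, 1.0)):
    """Anchor indices for hypothetical depth fractions `alphas`."""
    return [nearest_sampled_layer(sampled, num_layers, a) for a in alphas]


if __name__ == "__main__":
    print(anchor_layers([1, 4, 8, 12, 16], num_layers=16))
```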
3.1.5 Global Feature Vector
A global feature vector $g \in \mathbb{R}^{d_g}$ aggregates information across the full forward pass. It comprises: (i) the top-$k$ next-token probabilities and their pairwise differences, capturing the sharpness of the model's next-token distribution; (ii) the entropy, standard deviation, and maximum of the logit vector $z$; (iii) first- and second-order statistics of the layer-wise $\ell_2$-norm trajectory $\{r_l\}_{l \in I_K}$, $r_l = \|h^{(l)}\|_2$, capturing norm growth and volatility across depth; (iv) the anchor descriptors from Eq. 3; and (v) cross-feature interaction terms between logit statistics and norm dynamics, which encode the coupling between the model's output confidence and its internal representational geometry. Together, $S$ and $g$ form the complete feature representation passed to the classification stage.
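Items (i) and (ii) can be illustrated with a small sketch. It assumes that the entropy is taken over the softmax distribution induced by $z$ and uses a hypothetical `top_k=3`; neither detail is stated in the text, and the remaining items (iii) to (v) are not covered here.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logit_features(z, top_k=3):
    """Top-k probabilities, their pairwise gaps, and simple logit statistics."""
    p = softmax(z)
    top = sorted(p, reverse=True)[:top_k]
    gaps = [top[i] - top[j] for i in range(len(top)) for j in range(i + 1, len(top))]
    # Entropy of softmax(z): an assumed reading of "entropy of the logit vector".
    entropy = -sum(q * math.log(q) for q in p if q > 0.0)
    mean = sum(z) / len(z)
    std = math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    return {
        "top_probs": top,
        "top_gaps": gaps,
        "entropy": entropy,
        "logit_std": std,
        "logit_max": max(z),
    }


if __name__ == "__main__":
    print(logit_features([4.0, 1.0, 0.5, -2.0]))
```

A peaked distribution yields a large first gap and low entropy, while a flat one yields gaps near zero and entropy near $\log V$.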
3.2 MultiHaluDet Architecture
3.2.1 Projection and Multi-Scale Attention
The sequential input $S$ is first projected into a uniform hidden space of dimension $H$ via a linear layer followed by layer normalization and a GELU activation. A multi-scale attention module then processes the projected sequence at multiple temporal resolutions simultaneously. For each scale factor $c \in C$, the sequence is locally average-pooled to compress $K$ positions into $\lceil K/c \rceil$ positions, passed through a scale-specific linear projection, and upsampled back to length $K$ via nearest-neighbor interpolation. The contributions of each scale are combined through a learned position-wise gate:
$$\tilde{h}_t = \sum_{c \in C} w_{t,c} \, \Big[ P_c \, \mathrm{Upsample}\big( \mathrm{Pool}_c ( H_{\mathrm{proj}} ) \big) \Big]_t \tag{4}$$
where $w_{t,c} = \mathrm{softmax}(W_g h_t)_c$ is the position-wise scale gate computed from the original projected hidden state, and $P_c$ is the per-scale linear projection. The output is combined with the original projection via a residual connection and layer normalization, preserving fine-grained layer information while enriching it with multi-scale context.
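The pooling, upsampling, and gating steps of Eq. 4 can be sketched without any deep learning library. This is a simplified illustration under stated assumptions: the per-scale projection $P_c$ is taken as the identity, the gate logits are passed in directly instead of being computed as $W_g h_t$, upsampling maps position $t$ to pooled slot $\lfloor t / c \rfloor$, and the residual connection and layer normalization are omitted. All function names are hypothetical.

```python
import math


def avg_pool(seq, c):
    """Average non-overlapping windows of c rows; the last window may be shorter."""
    pooled = []
    for start in range(0, len(seq), c):
        window = seq[start:start + c]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def upsample_nearest(pooled, c, length):
    """Copy each pooled row back to the positions it was computed from."""
    return [pooled[t // c] for t in range(length)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_fuse(h_proj, scales, gate_logits):
    """Gate-weighted sum over scales (Eq. 4 with P_c assumed to be the identity).

    h_proj: K rows of H floats; gate_logits: K rows with one logit per scale.
    """
    k_len = len(h_proj)
    branches = [upsample_nearest(avg_pool(h_proj, c), c, k_len) for c in scales]
    fused = []
    for t in range(k_len):
        w = softmax(gate_logits[t])
        fused.append([
            sum(w[i] * branches[i][t][j] for i in range(len(scales)))
            for j in range(len(h_proj[t]))
        ])
    return fused


if __name__ == "__main__":
    h = [[1.0], [3.0], [5.0], [7.0], [9.0]]
    print(multi_scale_fuse(h, scales=(1, 2), gate_logits=[[0.0, 0.0]] * 5))
```

With scale 1 the branch reproduces the input, so the gate effectively decides, per position, how much smoothed context to mix into the fine-grained signal.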
3.2.2 Layer-Weighted Transformer Encoder
Because different LLM layers carry information of varying discriminative value for hallucination detection, we modulate the fused sequence by a learnable, softmax-normalized importance vector $\lambda \in \mathbb{R}^{K}$ before encoding:
$$\hat{h}_k = \bar{\lambda}_k \cdot \tilde{h}_k + p_k, \quad k = 1, \dots, K, \qquad \bar{\lambda} = \mathrm{softmax}(\lambda) \tag{5}$$
where $p_k$ is a learned positional embedding that encodes the relative depth of each sampled layer within the representational sequence. The modulated sequence is encoded by a stack of Pre-LN Transformer encoder layers with multi-head self-attention and GELU feed-forward sublayers, enabling the model to attend across LLM depths and capture long-range inter-layer dependencies in the hidden state trajectory.
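The modulation in Eq. 5 is a per-layer rescaling followed by an additive positional term. The sketch below is only illustrative: `layer_weighted` is a hypothetical name, and the importance vector and positional embeddings are supplied as plain lists rather than learned parameters.

```python
import math


def layer_weighted(h_tilde, lam, pos):
    """Apply Eq. 5: scale row k by softmax(lam)[k], then add positional row pos[k]."""
    m = max(lam)
    exps = [math.exp(v - m) for v in lam]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        [w * x + p for x, p in zip(row, pos_row)]
        for w, row, pos_row in zip(weights, h_tilde, pos)
    ]


if __name__ == "__main__":
    h_tilde = [[2.0, 4.0], [6.0, 8.0]]
    print(layer_weighted(h_tilde, lam=[0.0, 0.0], pos=[[1.0, 1.0], [0.0, 0.0]]))
```

Because the softmax weights sum to one, a uniform $\lambda$ scales every layer by $1/K$; only the relative differences between entries of $\lambda$ change the emphasis across depth.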
3.2.3 Self-Attention Pooling
Rather than discarding positional information
through mean pooling or relying on a fixed sum-
mary token, we aggregate the transformer output
via a learned attention pooling mechanism. A two-
layer MLP with a tanh nonlinearity assigns a scalar
relevance score to each position, and the final se-
quential representation is their weighted sum:
$$u = \sum_{k=1}^{K} \alpha_k e_k, \qquad \alpha_k = \frac{\exp(a(e_k))}{\sum_{j=1}^{K} \exp(a(e_j))} \tag{6}$$
where $e_k$ is the encoder output at position $k$ and $a : \mathbb{R}^{H} \to \mathbb{R}$ is the learned scoring MLP. This allows the model to focus on the LLM layers most informative for the current input, providing a form of input-adaptive depth selection.
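A minimal sketch of the pooling in Eq. 6 follows. The scoring function is passed in as a callable, and `make_tanh_mlp` builds a toy two-layer tanh scorer from hand-picked weights; these names and weights are hypothetical stand-ins for the learned MLP.

```python
import math


def make_tanh_mlp(w1, b1, w2, b2):
    """Build a(e) = w2 . tanh(W1 e + b1) + b2 from explicit (hypothetical) weights."""
    def score(e):
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, e)) + b)
            for row, b in zip(w1, b1)
        ]
        return sum(w * h for w, h in zip(w2, hidden)) + b2
    return score


def attention_pool(encoded, score_fn):
    """Return the attention-weighted sum of encoder outputs and the weights (Eq. 6)."""
    scores = [score_fn(e) for e in encoded]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    dim = len(encoded[0])
    pooled = [sum(a * e[j] for a, e in zip(alphas, encoded)) for j in range(dim)]
    return pooled, alphas


if __name__ == "__main__":
    scorer = make_tanh_mlp(w1=[[1.0, 0.0]], b1=[0.0], w2=[1.0], b2=0.0)
    print(attention_pool([[0.0, 0.0], [5.0, 0.0]], scorer))
```

With a constant scoring function the weights are uniform and the result reduces to mean pooling, which makes the learned scorer the only source of input-adaptive behavior.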
3.2.4 Global Branch and Gated Fusion
The global feature vector $g$ is processed independently through a two-layer MLP with layer normalization and dropout, yielding a compact global representation $v \in \mathbb{R}^{H/2}$. The sequential representation $u \in \mathbb{R}^{H}$ and global representation are concatenated to form the joint embedding $c = [u; v] \in \mathbb{R}^{3H/2}$. A sigmoid gate then performs element-wise re-weighting to suppress uninformative dimensions:
$$\tilde{c} = c \odot \sigma\!\left( W_{\mathrm{gate}} \, c + b_{\mathrm{gate}} \right) \tag{7}$$
The gated representation $\tilde{c}$ is passed through a three-layer MLP classifier to produce the final hallucination logit. A separate two-layer projection head maps $\tilde{c}$ onto a unit hypersphere for the contrastive objective described in Section 4.
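Eq. 7 can be illustrated as follows. The gate weights here are explicit placeholder values rather than learned parameters, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def gated_fusion(u, v, w_gate, b_gate):
    """Concatenate u and v, then re-weight each dimension by a sigmoid gate (Eq. 7)."""
    c = list(u) + list(v)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, c)) + b)
        for row, b in zip(w_gate, b_gate)
    ]
    return [x * g for x, g in zip(c, gate)]


if __name__ == "__main__":
    zeros = [[0.0] * 3 for _ in range(3)]
    print(gated_fusion([2.0, 4.0], [6.0], zeros, [0.0, 0.0, 0.0]))
```

Each gate value lies strictly between 0 and 1, so the gate can attenuate a dimension but never amplify it or flip its sign.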
3.3 Out-of-Fold Stacking
To obtain unbiased deep representations for the meta-learner without data leakage, we employ K-fold out-of-fold (OOF) stacking over the training set. For fold $k$, a fresh MULTIHALUDET instance is trained on the remaining $K - 1$ folds and used to extract the gated fusion representation $\tilde{c}$ for the held-out fold. After all $K$ folds, the OOF features form a complete training matrix:
$$X_{\mathrm{OOF}} \in \mathbb{R}^{N_{\mathrm{train}} \times d_c}, \qquad d_c = \dim(\tilde{c}) \tag{8}$$
with every training sample represented exactly once without ever being used in its own fold's training. Test-set representations are obtained by averaging the gated fusion embeddings across all $K$ fold models in feature space:
$$\tilde{c}_{\mathrm{test}} = \frac{1}{K} \sum_{k=1}^{K} \tilde{c}^{(k)}_{\mathrm{test}} \tag{9}$$
This averaging operates at the representation level rather than the probability level, providing implicit test-time ensembling and producing a more stable input to the meta-learner. The OOF and test features are standardized before being passed to the ensemble stage.
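The bookkeeping behind Eqs. 8 and 9 can be sketched with generic `fit` and `embed` callables standing in for training a MULTIHALUDET instance and extracting $\tilde{c}$. Both callables and the function names are hypothetical, and the folds here are contiguous and unshuffled for brevity, whereas the paper reports stratified folds.

```python
def kfold_indices(n, n_folds):
    """Split range(n) into n_folds contiguous folds (no shuffling or stratification)."""
    folds, start = [], 0
    for f in range(n_folds):
        size = n // n_folds + (1 if f < n % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def oof_features(X, y, X_test, n_folds, fit, embed):
    """Return out-of-fold training features (Eq. 8) and fold-averaged test features (Eq. 9)."""
    oof = [None] * len(X)
    test_sum = None
    for held_out in kfold_indices(len(X), n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            # Each training sample is embedded by a model that never saw it.
            oof[i] = embed(model, X[i])
        test_emb = [embed(model, x) for x in X_test]
        if test_sum is None:
            test_sum = test_emb
        else:
            test_sum = [[a + b for a, b in zip(r, s)] for r, s in zip(test_sum, test_emb)]
    test_avg = [[v / n_folds for v in row] for row in test_sum]
    return oof, test_avg


if __name__ == "__main__":
    # Toy stand-ins: the "model" is the training mean, the embedding is the offset from it.
    toy_fit = lambda xs, ys: sum(x[0] for x in xs) / len(xs)
    toy_embed = lambda model, x: [x[0] - model]
    X = [[0.0], [1.0], [2.0], [3.0]]
    print(oof_features(X, [0, 0, 1, 1], [[10.0]], 2, toy_fit, toy_embed))
```

Averaging test embeddings rather than test probabilities means the meta-learner sees test inputs in the same feature space as the OOF training matrix.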
3.4 Ensemble Meta-Learner
The OOF deep features serve as input to a diverse ensemble of classifiers comprising gradient-boosted trees, random forests, a kernel support vector machine, a multi-layer perceptron, and a logistic regression baseline. Model diversity across both inductive biases (linear, kernel, tree, neural) and hyperparameter profiles reduces variance and improves robustness to the distributional properties of the deep features. Rather than fixed heuristic weights, the ensemble prediction is formed by stacking the base classifier probability estimates through a logistic meta-regressor trained on held-out out-of-fold predictions. Denoting the log-odds transform of each base probability by $\ell_m = \log(\hat{p}_m / (1 - \hat{p}_m))$, the final ensemble probability is:
$$\hat{p}_{\mathrm{ens}} = \sigma\!\left( \beta_0 + \sum_{m=1}^{M} \beta_m \ell_m \right) \tag{10}$$
where $\sigma(\cdot)$ is the sigmoid function and the coefficients $\{\beta_m\}_{m=0}^{M}$ are learned by an $\ell_2$-regularized logistic regression meta-learner trained on a secondary hold-out split of the out-of-fold probability outputs. This formulation adapts the ensemble composition to the empirical discriminative strength of each base learner while producing inherently calibrated probability estimates. All component models are configured with balanced class weights to maintain sensitivity under any residual distributional skew. The classification threshold is set by Youden's J statistic, $\tau^{*} = \arg\max_{\tau} [\mathrm{TPR}(\tau) - \mathrm{FPR}(\tau)]$, which maximizes the simultaneous improvement in sensitivity and specificity.
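The sketch below illustrates Eq. 10 and the Youden threshold with fixed, hand-picked coefficients; fitting $\beta$ by regularized logistic regression is not shown. The function names are hypothetical, the probability clipping is an added safeguard not described in the text, and a sample is treated as positive when its score is at least $\tau$, which is an assumed convention.

```python
import math


def log_odds(p, eps=1e-6):
    """Logit transform; clipping to [eps, 1 - eps] is an added safeguard."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def ensemble_probability(base_probs, beta0, betas):
    """Eq. 10: sigmoid of an affine combination of base-model log-odds."""
    return sigmoid(beta0 + sum(b * log_odds(p) for b, p in zip(betas, base_probs)))


def youden_threshold(scores, labels):
    """Return (tau, J) maximizing TPR(tau) - FPR(tau), predicting 1 when score >= tau."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required")
    best_tau, best_j = None, float("-inf")
    for tau in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_tau, best_j = tau, j
    return best_tau, best_j


if __name__ == "__main__":
    print(ensemble_probability([0.9, 0.7], beta0=0.0, betas=[0.5, 0.5]))
    print(youden_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

With a single base model, $\beta_0 = 0$, and $\beta_1 = 1$, the ensemble returns that model's probability unchanged, which is a quick sanity check on the log-odds formulation.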
3.5 Multilingual Adaptation Strategy
To evaluate the multilingual generalization of
our hidden state probing framework across vary-
ing resource tiers, we translated the source En-
glish datasets into French (high-resource), Bangla
(medium-resource), and Amharic (low-resource)
utilizing the Gemini 2.5 Flash model. Formally, let the original source dataset be defined as $D_{\mathrm{en}} = \{(q_i, a_i, y_i)\}_{i=1}^{N}$. We introduce a translation mapping $T_\ell$ for the set of target languages $\mathcal{L} = \{\mathrm{fr}, \mathrm{bn}, \mathrm{am}\}$. For every query-response pair, this operation strictly preserves the original binary hallucination label, yielding a new language-specific instance defined as: $(q^{(\ell)}_i, a^{(\ell)}_i, y^{(\ell)}_i) = (T_\ell(q_i), T_\ell(a_i), y_i)$. The multilingual evaluation space is then constructed as the union of the original English data and all language-specific subsets.
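The label-preserving mapping can be sketched as below. The `translate` argument is a hypothetical callable standing in for the translation model; no real translation API is invoked, and the function name is illustrative.

```python
def translate_dataset(dataset, translate, languages=("fr", "bn", "am")):
    """Build per-language copies of (question, answer, label) triples.

    `translate(text, lang)` is a caller-supplied stand-in for the translation
    model. Labels are copied unchanged, matching y_i^(l) = y_i.
    """
    out = {"en": list(dataset)}
    for lang in languages:
        out[lang] = [
            (translate(q, lang), translate(a, lang), y) for q, a, y in dataset
        ]
    return out


if __name__ == "__main__":
    fake_translate = lambda text, lang: f"{lang}:{text}"  # placeholder, not a real translator
    data = [("q1", "a1", 1), ("q2", "a2", 0)]
    print(translate_dataset(data, fake_translate)["am"])
```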
To guarantee the semantic integrity and con-
text preservation of the translated data, we con-
ducted human evaluation for quality assurance. We
sampled 100 instances per language from each
dataset, resulting in 300 evaluated samples per
dataset and a total of 600 evaluated samples across
both datasets. As three of the authors are native
Bangla speakers, the Bangla subset was evaluated
directly. For French and Amharic, we employed a
back-translation methodology, translating the tar-
get samples back into English, to verify semantic
equivalence against the original source. This man-
ual evaluation confirmed an initial translation ac-
curacy of 96%. The remaining 4% of instances,
which exhibited minor semantic drift or structural
errors, were explicitly polished and regenerated
using Gemini 2.5 Flash to ensure strict alignment
with the source truth conditions.
By passing these high-fidelity translated inputs
through the frozen LLM, we generate the sequen-
tial representation S and global feature vector g
natively within each target language’s token space.
This formulation ensures that the subsequent deep
architecture evaluates intrinsic, language-agnostic
hallucination signals embedded within the model’s
internal dynamics, allowing us to probe representa-
tion alignment without requiring language-specific
fine-tuning.
4 Experimental Setup
4.1 Dataset
We evaluate on two question-answering benchmarks. Let $Q$ denote the set of questions and $A$ the set of candidate answers. Each dataset $D = \{(q_i, a_i, y_i)\}_{i=1}^{N}$ consists of $N$ samples where $q_i \in Q$ is a question, $a_i \in A$ is an answer, and $y_i \in \{0, 1\}$ is a binary label indicating non-hallucination ($y_i = 0$) or hallucination ($y_i = 1$).
HaluEval (Li et al., 2023) provides $N = 10{,}000$ human-annotated QA pairs with native hallucination labels derived from real LLM outputs, where each $(q_i, a_i, y_i)$ is explicitly labeled based on factual consistency with verified knowledge sources.
TriviaQA (Joshi et al., 2017) is adapted for our evaluation by collecting realistic, model-generated hallucinations. Let $D_{\mathrm{Trivia}} = \{(q_i, a^{*}_i)\}_{i=1}^{M}$ denote the original dataset, where $q_i$ is a question and $a^{*}_i$ is the ground-truth correct answer. To generate plausible hard negatives, we prompt an early-generation language model known for its propensity to hallucinate, Gemma-2-2B (approximately 32%–38% hallucination rate), to answer each $q_i$. Responses that are plausible but factually incorrect are extracted and denoted as $a^{-}_i$. We then construct our final evaluation dataset by pairing each question with its true correct answer and its corresponding model-generated hallucination. To facilitate our multilingual evaluation, we also expanded these source English datasets into three different target languages.
4.2 Baselines
We compare against four types of hallucination de-
tection methods. Prompt-based methods include
P(True) (Kadavath et al., 2022), which utilizes a
simple prompt template to enable the model to
assess the correctness of its own response. Logit-
based methods use the uncertainty of LLM out-
puts to detect hallucination; we adopt AvgProb
and AvgEnt from Huang et al. (2023) to aggregate
logit-based uncertainty across all tokens, and also
compare with EUBHD (Su et al., 2024), which fo-
cuses on key tokens rather than considering all
tokens. Consistency-based methods are moti-
vated by the idea that consistent responses indicate
factual knowledge; we apply Unigram and NLI
variants, as well as INSIDE (Chen et al., 2024),
which leverages eigenvalues of the covariance ma-
trix of responses. Classification-based methods
train a classifier on labeled statements; we com-
pare with SAPLMA (Azaria and Mitchell, 2023),
which trains a classifier on the last token’s last-layer
hidden state; MIND (Su et al., 2024), which uses
unsupervised training on auto-generated pseudo-
labels; and Probe@Exact, which relies on infor-
mation from potentially correct tokens. We also
include HD-NDEs (Li et al., 2025) (Neural ODEs,
CDEs, and SDEs), which model hidden state tra-
jectories using neural differential equations.
4.3 Hyperparameters
We extract hidden states from frozen LLMs in half-
precision without gradient computation. For all
LLM forward passes and synthetic generation tasks,
the temperature is set to 0.0 to ensure deterministic
outputs. Dynamic layer sampling maps variable-
depth models to K = 32 uniform indices. The
deep architecture uses hidden dimension d = 384,
8 attention heads, and 6 transformer encoder layers.
Training employs AdamW optimizer with learning
rate 2 × 10−4, weight decay 6 × 10−5, and Re-
duceLROnPlateau scheduling over 45 epochs with
early stopping patience of 15 epochs. We apply a
composite loss combining BCE, focal, asymmetric,
and contrastive objectives with label smoothing,
along with data augmentation via Mixup and Cut-
Mix. We employ 5-fold stratified cross-validation
with out-of-fold feature extraction to prevent leak-
age. The stacked ensemble combines 6 classi-
fiers (RandomForest, XGBoost, GradientBoosting,
LightGBM, LogisticRegression, Support Vector
Machine). Their probability outputs are fused by a
logistic meta-regressor trained on held-out out-of-
fold predictions, producing inherently calibrated
ensemble probabilities. All experiments are conducted on an AMD Ryzen 5 8500G CPU with 32GB RAM and an NVIDIA GeForce RTX 5060 Ti GPU with 16GB VRAM.

Method         HaluEval                  TriviaQA
               Mistral-7B   LLaMA2-7B    Mistral-7B   LLaMA2-7B
P(True)        49.7         46.7         48.3         42.3
AvgProb        43.6         42.1         48.5         44.1
AvgEnt         49.7         47.3         47.6         41.1
EUBHD          70.5         71.9         80.6         80.5
Unigram        62.3         58.2         59.5         56.8
NLI            63.1         61.3         61.4         59.4
INSIDE         76.0         74.5         81.3         81.7
SAPLMA         89.4         87.0         84.1         80.0
MIND           94.5         86.1         84.5         79.4
Probe@Exact    93.4         88.3         84.1         81.3
Neural ODEs    91.2         89.5         83.7         81.7
Neural CDEs    95.4         91.4         84.1         83.7
Neural SDEs    93.7         92.8         85.1         81.0
MultiHaluDet   98.43        98.55        98.30        98.26
Table 1: Hallucination detection performance (AUROC %) on HaluEval and TriviaQA using Mistral-7B-Instruct and LLaMA2-7B. Best results in bold.
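
The dynamic layer sampling step can be pictured with a small sketch. The paper states only that variable-depth models are mapped to K = 32 uniform indices; the rounding rule, the zero-based indexing, and the names `sample_layer_indices` and `select_layers` are assumptions made here for illustration, not the authors' implementation.

```python
def sample_layer_indices(num_layers: int, k: int = 32) -> list[int]:
    """Pick k layer indices spread evenly over 0 .. num_layers - 1.

    Assumption: both endpoints are included and intermediate positions
    are rounded half-up to the nearest layer (integer arithmetic only).
    """
    if num_layers < 1 or k < 1:
        raise ValueError("num_layers and k must be positive")
    if k == 1:
        return [num_layers - 1]
    span = num_layers - 1
    # Integer form of round_half_up(i * span / (k - 1)).
    return [(2 * i * span + (k - 1)) // (2 * (k - 1)) for i in range(k)]


def select_layers(hidden_states: list, k: int = 32) -> list:
    """Return the k sampled per-layer states from a full-depth list."""
    indices = sample_layer_indices(len(hidden_states), k)
    return [hidden_states[i] for i in indices]
```

Under this rule a 32-layer model maps to itself, a deeper model is subsampled, and a shallower model repeats some layers, so the downstream encoder always receives a length-32 sequence.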
4.4 Evaluation Metric
We report AUROC (%), the area under the receiver operating characteristic (ROC) curve, to evaluate detection effectiveness. A higher AUROC indicates a stronger ability to separate hallucinated from faithful responses. All experiments employ 5-fold cross-validation with stratified sampling to ensure reliable estimates.
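
For readers who want the metric spelled out, the following minimal sketch computes AUROC through its pairwise interpretation: the probability that a randomly chosen hallucinated example receives a higher score than a randomly chosen faithful one, with ties counted as one half. The function name `auroc` and the label convention (1 = hallucinated, 0 = faithful) are assumptions for illustration; the paper does not say which implementation it uses.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A detector that assigns the same score to every example obtains 0.5 under this definition, which is why values below 50% indicate scores that rank hallucinations lower than faithful responses on average.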
5 Results and Analysis
5.1 RQ1: To what extent does MultiHaluDet improve hallucination detection compared to existing internal-state and confidence-based baselines?
Table 1 presents the detection performance (AUROC %) of MULTIHALUDET compared against thirteen baseline methods, evaluated on both the HaluEval and TriviaQA datasets using Mistral-7B and LLaMA2-7B.
Failure of Surface-Level Heuristics. The results reveal a clear hierarchy in the efficacy of different detection paradigms. Token-level probability and entropy metrics (P(True), AvgProb, AvgEnt) fail to provide meaningful detection signals, scoring below random chance in every setting (41.1%–49.7% AUROC). This suggests that raw output confidence is poorly calibrated and insufficient for identifying deep factual inconsistencies. Intermediate methods incorporating structural linguistic features (Unigram, NLI) offer only modest improvements, ranging from 56.8% to 63.1%.
Representational Baselines. Approaches that leverage deeper internal representations show significantly more promise. Methods like SAPLMA, MIND, and traditional probing (Probe@Exact) push performance into the 79%–95% range. The strongest baselines are those modeling the continuous dynamics of hidden states (Neural ODEs, CDEs, SDEs). Neural CDEs, in particular, achieve a competitive 95.4% on Mistral-7B for HaluEval. However, representational baselines vary considerably between model architectures; for instance, MIND's HaluEval performance drops sharply from 94.5% on Mistral-7B to 86.1% on LLaMA2-7B.
Ours. MULTIHALUDET achieves state-of-the-art performance across all experimental conditions, outperforming the best baseline in each column of Table 1 by 3.0 to 14.6 points. On HaluEval, we achieve 98.43% AUROC with Mistral-7B and 98.55% with LLaMA2-7B. Crucially, MULTIHALUDET demonstrates strong cross-architecture robustness. While most baselines lose performance when moving from Mistral-7B to LLaMA2-7B, our framework maintains nearly identical efficacy (scoring slightly higher on LLaMA2-7B on HaluEval). Similarly, on the TriviaQA dataset, which features plausible but factually incorrect synthetic hard negatives, we achieve 98.30% and 98.26%, respectively. These consistent gains across datasets and architectures suggest that our multi-scale attention mechanism and out-of-fold stacking approach aggregate robust, architecture-agnostic signals from LLM hidden state trajectories.
5.2 RQ2: How robust is MultiHaluDet across
typologically diverse languages with
varying resource availability?
To demonstrate the cross-lingual generalization of MULTIHALUDET, we extend our evaluation beyond English to include three typologically diverse languages: French, Bangla, and Amharic.

Language          Method          HaluEval (M / L)   TriviaQA (M / L)
English (Base)    Best Baseline   95.4 / 92.8        85.1 / 83.7
                  MultiHaluDet    98.4 / 98.5        98.3 / 98.2
French (High)     Best Baseline   92.1 / 89.4        81.3 / 80.5
                  MultiHaluDet    96.2 / 95.8        95.5 / 94.9
Bangla (Medium)   Best Baseline   78.4 / 75.1        69.2 / 67.8
                  MultiHaluDet    89.1 / 88.4        87.6 / 86.3
Amharic (Low)     Best Baseline   62.3 / 59.8        54.1 / 52.6
                  MultiHaluDet    78.5 / 76.2        75.8 / 73.4
M: Mistral-7B, L: LLaMA2-7B
Table 2: Cross-lingual hallucination detection performance (AUROC %). MultiHaluDet consistently outperforms the best baseline across all resource settings.

Table 2 shows that hallucination detection performance correlates with the representational frequency of these languages in the base models. MULTIHALUDET maintains strong performance on French, achieving 96.2% and 95.8% on HaluEval for Mistral-7B and LLaMA2-7B respectively, trailing the English results (98.4% and 98.5%) by only 2.2 and 2.7 points. Similar retention is observed on TriviaQA, where French scores reach 95.5% and 94.9%.
For Bangla, we observe a more noticeable but controlled degradation, yielding HaluEval scores of 89.1% (Mistral-7B) and 88.4% (LLaMA2-7B), alongside 87.6% and 86.3% on TriviaQA. While the framework successfully adapts to the language, accurately detecting factual inconsistencies requires the model to effectively navigate Bangla's rich morphology and the complex linguistic adaptations necessary for regional dialect variations.
Amharic, a severely low-resource language with limited representation in both Mistral and LLaMA2, presents the greatest challenge. The base models inherently struggle with factual recall in Amharic, which limits the quality of the internal representations our detector relies on. As a result, absolute performance drops to 78.5% and 76.2% on HaluEval, and 75.8% and 73.4% on TriviaQA for Mistral-7B and LLaMA2-7B, respectively. However, MULTIHALUDET still extracts meaningful detection signals, staying well above random chance and exceeding the best baseline in Table 2 by 16.2 to 21.7 points.
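
The cross-lingual comparison can be re-derived directly from Table 2. The sketch below is only a bookkeeping aid: the dictionary layout and the helper names `margin_over_baseline` and `gap_to_english` are assumptions introduced here, while the numbers are copied from the table in the order HaluEval (Mistral-7B, LLaMA2-7B), TriviaQA (Mistral-7B, LLaMA2-7B).

```python
# AUROC (%) values copied from Table 2.
TABLE2 = {
    "English": {"baseline": (95.4, 92.8, 85.1, 83.7), "ours": (98.4, 98.5, 98.3, 98.2)},
    "French":  {"baseline": (92.1, 89.4, 81.3, 80.5), "ours": (96.2, 95.8, 95.5, 94.9)},
    "Bangla":  {"baseline": (78.4, 75.1, 69.2, 67.8), "ours": (89.1, 88.4, 87.6, 86.3)},
    "Amharic": {"baseline": (62.3, 59.8, 54.1, 52.6), "ours": (78.5, 76.2, 75.8, 73.4)},
}


def margin_over_baseline(language: str) -> list[float]:
    """MultiHaluDet minus best baseline, per column, in AUROC points."""
    row = TABLE2[language]
    return [round(o - b, 1) for o, b in zip(row["ours"], row["baseline"])]


def gap_to_english(language: str) -> list[float]:
    """English MultiHaluDet score minus the score in the given language."""
    english = TABLE2["English"]["ours"]
    return [round(e - o, 1) for e, o in zip(english, TABLE2[language]["ours"])]


for lang in TABLE2:
    print(lang, margin_over_baseline(lang), gap_to_english(lang))
```

With these numbers, the HaluEval margin on Mistral-7B grows from 3.0 points in English to 16.2 points in Amharic, even as the absolute score falls.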
5.3 Ablation Study
To understand the necessity of our proposed architectural choices, we conduct an ablation study by systematically removing key components of the MULTIHALUDET framework. We evaluate the ablated models on both HaluEval and TriviaQA using Mistral-7B and LLaMA2-7B. The results are summarized in Table 3.

Method    Mistral-7B                        LLaMA2-7B
          HaluEval        TriviaQA          HaluEval        TriviaQA
Full      98.43           98.30             98.55           98.26
w/o MSA   91.45 (↓6.98)   90.82 (↓7.48)     92.14 (↓6.41)   91.33 (↓6.93)
w/o OOF   88.67 (↓9.76)   87.41 (↓10.89)    89.25 (↓9.30)   88.19 (↓10.07)
w/o TP    93.28 (↓5.15)   92.56 (↓5.74)     93.71 (↓4.84)   93.04 (↓5.22)
Table 3: Ablation study of MULTIHALUDET. Performance is measured in AUROC (%). Values in parentheses (↓) give the absolute performance drop when the corresponding core component is removed. Abbreviations: w/o: without, MSA: Multi-Scale Attention, OOF: Out-of-Fold Stacking, TP: Trajectory Probing.

The removal of the Out-of-Fold (OOF) Stacking mechanism causes the most severe degradation across all configurations, a drop of 9.30 to 10.89 percentage points (for example, to 88.67% on HaluEval and 87.41% on TriviaQA with Mistral-7B). Without OOF stacking, the meta-classifier heavily overfits to the localized noise of early hidden layers and fails to generalize when faced with the plausible hard negatives in TriviaQA. This indicates that out-of-fold stacking is essential for robust feature aggregation.
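
The leakage that out-of-fold stacking avoids is easiest to see in code. The sketch below is a toy illustration, not the authors' pipeline: a one-dimensional nearest-centroid scorer stands in for the six base classifiers, and the names `stratified_folds`, `fit_centroids`, `predict_score`, and `out_of_fold_scores` are hypothetical. The point it shows is that every sample's score comes from a model that never saw that sample during fitting, so a meta-learner trained on these scores is not fed memorized labels.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample to one of k folds, balancing labels across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    fold_of = [0] * len(labels)
    for idxs in by_label.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            fold_of[i] = pos % k
    return fold_of


def fit_centroids(xs, ys):
    """Toy base learner: the mean feature value of each class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_score(centroids, x):
    """Score in [0, 1]; higher means closer to the class-1 centroid.

    Assumes both classes were present in the training split.
    """
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def out_of_fold_scores(xs, ys, k=5, seed=0):
    """Score each sample with a model fitted on the other k - 1 folds."""
    fold_of = stratified_folds(ys, k, seed)
    oof = [None] * len(xs)
    for f in range(k):
        train = [i for i in range(len(xs)) if fold_of[i] != f]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        for i in range(len(xs)):
            if fold_of[i] == f:
                oof[i] = predict_score(model, xs[i])
    return oof, fold_of
```

In a full stacking setup, the out-of-fold scores from each base learner would form the columns of the meta-learner's training matrix.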
Bypassing the Multi-Scale Attention module and relying on standard global average pooling results in a drop of 6.41 to 7.48 AUROC points. Hallucination signals are not uniformly distributed across the generation trajectory; they often manifest as sudden, localized semantic shifts in the middle layers. The sharp decline in performance without this module suggests that capturing these local trajectory shifts is important, and that standard pooling mechanisms are too coarse to detect subtle factual deviations.
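
Why global average pooling can wash out a localized shift is shown by the toy comparison below. It is deliberately not the paper's Multi-Scale Attention module, whose internals are not reproduced here; it swaps in a much simpler multi-scale windowed pooling over a scalar trajectory of 32 layers. The function names and the window sizes (2, 4, 8) are assumptions chosen for illustration.

```python
def global_average(trajectory):
    """Single summary value: the mean over all layers."""
    return sum(trajectory) / len(trajectory)


def multi_scale_max(trajectory, windows=(2, 4, 8)):
    """For each window size, the largest mean over any contiguous window."""
    features = []
    for w in windows:
        means = [sum(trajectory[i:i + w]) / w
                 for i in range(len(trajectory) - w + 1)]
        features.append(max(means))
    return features


# A flat trajectory versus one with a two-layer spike in the middle.
flat = [0.0] * 32
spiked = list(flat)
spiked[15] = spiked[16] = 1.0

print(global_average(spiked))    # 0.0625
print(multi_scale_max(spiked))   # [1.0, 0.5, 0.25]
print(multi_scale_max(flat))     # [0.0, 0.0, 0.0]
```

The global mean moves by only 2/32 when the spike appears, whereas the finest window retains its full magnitude; coarser windows dilute it progressively, which is the intuition behind combining several scales.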
Finally, replacing our continuous trajectory probing with a static, final-layer representation (w/o Trajectory Probing) degrades performance by approximately 5 percentage points (4.84 to 5.74). While the final layer contains significant semantic information, this result supports our core hypothesis: the process by which a model arrives at an answer (the hidden state evolution) contains truthfulness signals that are lost if one analyzes only the final output state.
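
The drops quoted in this section follow from Table 3 by simple subtraction, which the short sketch below reproduces. The data layout and the helper names `drops` and `largest_drop_per_column` are assumptions introduced for illustration; the AUROC values are copied from the table.

```python
COLUMNS = ("Mistral/HaluEval", "Mistral/TriviaQA", "LLaMA2/HaluEval", "LLaMA2/TriviaQA")
FULL = (98.43, 98.30, 98.55, 98.26)
ABLATED = {
    "w/o MSA": (91.45, 90.82, 92.14, 91.33),
    "w/o OOF": (88.67, 87.41, 89.25, 88.19),
    "w/o TP":  (93.28, 92.56, 93.71, 93.04),
}


def drops(variant: str) -> list[float]:
    """Absolute AUROC drop of an ablated variant relative to the full model."""
    return [round(f - a, 2) for f, a in zip(FULL, ABLATED[variant])]


def largest_drop_per_column() -> dict[str, str]:
    """Which removed component hurts most in each evaluation column."""
    result = {}
    for j, column in enumerate(COLUMNS):
        result[column] = max(ABLATED, key=lambda v: FULL[j] - ABLATED[v][j])
    return result


for variant in ABLATED:
    print(variant, drops(variant))
print(largest_drop_per_column())
```

Removing OOF stacking gives the largest drop in all four columns, followed by MSA and then TP, matching the ordering discussed above.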
6 Conclusion
We presented MULTIHALUDET, a four-stage framework that detects hallucinations by probing the hidden state trajectories of frozen LLMs without fine-tuning. By integrating multi-scale attention with out-of-fold deep feature generation and learned ensemble meta-learning, our approach effectively captures complex semantic shifts indicating factual inconsistencies. Extensive evaluations show that MULTIHALUDET achieves state-of-the-art performance on standard benchmarks, while also demonstrating robust cross-lingual generalization across high-, medium-, and low-resource languages.

Limitations
While MULTIHALUDET demonstrates strong performance in detecting multilingual hallucinations, several limitations must be acknowledged.
First, our framework inherently relies on white-box access to the internal hidden states and logits of the target LLMs. Consequently, it cannot be directly applied to proprietary, black-box models (e.g., GPT-4 or Claude) where access to internal representational trajectories is restricted.
Second, although our approach successfully bypasses the computational burden of language-specific fine-tuning, extracting and processing full-depth hidden states across multiple layers still incurs non-trivial memory overhead and requires a complete forward pass. This makes it more computationally demanding than simple surface-level heuristics.
Finally, our multilingual evaluation leverages datasets translated from English using Gemini 2.5 Flash. Despite implementing a rigorous human quality assurance and back-translation pipeline to ensure semantic integrity, evaluating native, naturally occurring prompts in medium- and low-resource languages (like Bangla and Amharic) might reveal cultural and linguistic nuances that are not fully captured by translated benchmarks.
References
Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.
Jakub Binkowski, Denis Janiak, Albert Sawczyn, Bogdan Gabrys, and Tomasz Jan Kajdanowicz. 2025. Hallucination detection in LLMs using spectral features of attention maps. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24365–24396.
Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: LLMs' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744.
I Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, and 1 others. 2023. FacTool: Factuality detection in generative AI – a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528.
Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, and James Glass. 2024. Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1419–1436.
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630.
Bairu Hou, Yang Zhang, Jacob Andreas, and Shiyu Chang. 2025. A probabilistic framework for LLM hallucination detection via belief tree propagation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3076–3099.
Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. 2023. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and 1 others. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
Sahil Kale and Antonio Luca Alfeo. 2025. Lie to me: Knowledge graphs for robust hallucination self-detection in LLMs. arXiv preprint arXiv:2512.23547.
Hazel Kim, Tom A Lamb, Adel Bibi, Philip Torr, and Yarin Gal. 2025. Detecting LLM hallucination through layer-wise information deficiency: Analysis of ambiguous prompts and unanswerable questions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32298–32310.
Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. 2024. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A large-scale hallucination evaluation benchmark for large language models. URL https://arxiv.org/abs/2305.11747.
Qing Li, Jiahui Geng, Zongxiong Chen, Derui Zhu, Yuxia Wang, Congbo Ma, Chenyang Lyu, and Fakhri Karray. 2025. HD-NDEs: Neural differential equations for hallucination detection in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6173–6186.
Shize Liang and Hongzhi Wang. 2025. Neural probe-based hallucination detection for large language models. arXiv preprint arXiv:2512.20949.
Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and Hannaneh Hajishirzi. 2024. Fine-grained hallucination detection and editing for language models. arXiv preprint arXiv:2401.06855.
Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2024. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. arXiv preprint arXiv:2410.02707.
Ernesto Quevedo, Jorge Yero Salazar, Rachel Koerner, Pablo Rivas, and Tomas Cerny. 2024. Detecting hallucinations in large language model generation: A token probability approach. In World Congress in Computer Science, Computer Engineering & Applied Computing, pages 154–173. Springer.
Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. 2025. Trust me, I'm wrong: High-certainty hallucinations in LLMs. arXiv e-prints, pages arXiv–2502.
Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14379–14391.
Bhanu Prakash Vangala, Sajid Mahmud, Pawan Neupane, Joel Selvaraj, and Jianlin Cheng. 2025. HalluMat: Detecting hallucinations in LLM-generated materials science content through multi-stage verification. arXiv preprint arXiv:2512.22396.
Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987.
Borui Yang, Md Afif Al Mamun, Jie M Zhang, and Gias Uddin. 2025a. Hallucination detection in large language models with metamorphic relations. Proceedings of the ACM on Software Engineering, 2(FSE):425–445.
Chengxu Yang, Jingling Yuan, Siqi Cai, Jiawei Jiang, and Chuang Hu. 2025b. Heaven-sent or hell-bent? Benchmarking the intelligence and defectiveness of LLM hallucinations. arXiv preprint arXiv:2512.21635.
Chenggong Zhang and Haopeng Wang. 2025. Hallucination detection and evaluation of large language model. arXiv preprint arXiv:2512.22416.
Jiawei Zhang, Chejian Xu, Yu Gai, Freddy Lecue, Dawn Song, and Bo Li. 2024. KnowHalu: Hallucination detection via multi-form knowledge based factual checking. arXiv preprint arXiv:2404.02935.
Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, and Shuhao Zhang. 2025. Detecting hallucination in large language models through deep internal representation analysis. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 8357–8365.

A Multilingual Data Examples
In this section, we provide representative data points from our evaluation benchmarks across all four languages: English (Base), French (High-Resource), Bangla (Medium-Resource), and Amharic (Low-Resource). Figure 2 presents an instance from the HaluEval dataset, illustrating the grounding knowledge, the dialogue history, and both the faithful and hallucinated responses.
Language: English
Knowledge: Iron Man is starring Robert Downey Jr. Robert Downey Jr. starred in Zodiac (Crime Fiction Film). Zodiac (Crime Fiction Film) is starring Jake Gyllenhaal.
Dialogue: [Human]: Do you like Iron Man [Assistant]: Sure do! Robert Downey Jr. is a favorite. [Human]: Yes i like him too did you know he also was in Zodiac a crime fiction film.
Right Response: I like crime fiction! Didn't know RDJ was in there. Jake Gyllenhaal starred as well.
Right Response Label: 0
Hallucinated: I'm not a fan of crime movies, but I did know that RDJ starred in Zodiac with Tom Hanks.
Hallucinated Label: 1

Language: French
Knowledge: Iron Man met en vedette Robert Downey Jr. Robert Downey Jr. a joué dans Zodiac (film policier). Zodiac (film policier) met en vedette Jake Gyllenhaal.
Dialogue: [Humain]: Aimez-vous Iron Man [Assistant]: Bien sûr ! Robert Downey Jr. est un de mes favoris. [Humain]: Oui, je l'aime bien aussi. Saviez-vous qu'il a également joué dans Zodiac, un film policier.
Right Response: J'aime les films policiers ! Je ne savais pas que RDJ y était. Jake Gyllenhaal y a également joué.
Right Response Label: 0
Hallucinated: Je ne suis pas fan de films policiers, mais je savais que RDJ avait joué dans Zodiac avec Tom Hanks.
Hallucinated Label: 1

Language: Bangla
Knowledge: আয়রন ম্যান-এ অভিনয় করেছেন রবার্ট ডাউনি জুনিয়র। রবার্ট ডাউনি জুনিয়র জোডিয়াক (ক্রাইম ফিকশন ফিল্ম)-এ অভিনয় করেছেন। জোডিয়াক (ক্রাইম ফিকশন ফিল্ম)-এ অভিনয় করেছেন জেক জিলেনহাল।
Dialogue: [মানুষ]: আপনি কি আয়রন ম্যান পছন্দ করেন [সহকারী]: অবশ্যই! রবার্ট ডাউনি জুনিয়র আমার অন্যতম প্রিয়। [মানুষ]: হ্যাঁ, আমিও তাকে পছন্দ করি। আপনি কি জানেন যে তিনি জোডিয়াক নামে একটি ক্রাইম ফিকশন ফিল্মেও ছিলেন।
Right Response: আমি ক্রাইম ফিকশন পছন্দ করি! জানতাম না যে আরডিজে সেখানে ছিলেন। জেক জিলেনহালও এতে অভিনয় করেছেন।
Right Response Label: 0
Hallucinated: আমি ক্রাইম মুভির ভক্ত নই, তবে আমি জানতাম যে আরডিজে টম হ্যাঙ্কসের সাথে জোডিয়াক-এ অভিনয় করেছেন।
Hallucinated Label: 1

Language: Amharic
Knowledge: አይረን ማን (Iron Man) ሮበርት ዳውኒ ጁኒየርን ያካትታል። ሮበርት ዳውኒ ጁኒየር በዞዲያክ (የወንጀል ልብወለድ ፊልም) ላይ ተጫውቷል። ዞዲያክ (የወንጀል ልብወለድ ፊልም) ጄክ ጂለንሃልን ያካትታል።
Dialogue: [ሰው]፦ አይረን ማን ትወዳለህ? [ረዳት]፦ በእርግጥ! ሮበርት ዳውኒ ጁኒየር ከምወዳቸው አንዱ ነው። [ሰው]፦ አዎ እኔም እወደዋለሁ፣ በዞዲያክ የወንጀል ልብወለድ ፊልም ላይም እንደነበረ ታውቃለህ?
Right Response: የወንጀል ልብወለድ እወዳለሁ! RDJ እዚያ እንደነበረ አላውቅም ነበር። ጄክ ጂለንሃልም ተጫውቷል።
Right Response Label: 0
Hallucinated: የወንጀል ፊልሞች አድናቂ አይደለሁም፣ ግን RDJ ከቶም ሃንክስ ጋር በዞዲያክ ላይ እንደተጫወተ አውቃለሁ።
Hallucinated Label: 1

Figure 2: Representative multilingual data points from the HaluEval dataset, demonstrating the semantic preservation of entities and factual inconsistencies across translations.
Figure 3 illustrates an example from our modified TriviaQA dataset. TriviaQA (Joshi et al., 2017) is adapted for our evaluation by collecting realistic, model-generated hallucinations. In the original dataset, each entry consists of a question and its ground-truth correct answer. To generate plausible hard negatives, we prompt an early-generation language model known for its propensity to hallucinate, Gemma-2-2B, to answer each question. Responses that are plausible but factually incorrect are extracted to serve as the hallucinated answers. We then construct our final evaluation dataset by pairing each question with its true correct answer and its corresponding model-generated hallucination.
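
The pairing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the model answers are taken as already generated, the paper's plausibility judgment is replaced by a simple normalized string mismatch (the actual filtering criterion is not specified in this section), and the names `Example`, `normalize`, and `build_pairs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    answer: str
    label: int  # 0 = ground-truth answer, 1 = hallucinated answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude equality check."""
    return " ".join(text.lower().split())


def build_pairs(records):
    """Turn (question, ground_truth, model_answer) triples into labeled pairs.

    Assumption: a model answer that matches the ground truth after
    normalization is not a hallucination, so that question is skipped.
    """
    examples = []
    for question, truth, generated in records:
        if normalize(generated) == normalize(truth):
            continue
        examples.append(Example(question, truth, 0))
        examples.append(Example(question, generated, 1))
    return examples


records = [
    ("Do You Know Where You're Going To? was the theme from which film?",
     "Mahogany", "Love Story"),
]
for example in build_pairs(records):
    print(example.label, example.answer)
```

Each retained question contributes exactly one example with label 0 and one with label 1, so the resulting evaluation set is balanced by construction.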

Language: English
Context (Truncated): [DOC] [TLE] THEME FROM MAHOGANY... The Theme from the movie "Mahogany" also titled "Do You Know Where You're Going To" is a song written by Michael Masser and Gerald Giffin and was sung by Dianna Ross as the theme to the 1975 Paramount film...
Question: Do You Know Where You're Going To? was the theme from which film?
Ground Truth: Mahogany
Right Response Label: 0
Hallucinated: Love Story
Hallucinated Label: 1

Language: French
Context (Truncated): [DOC] [TLE] THÈME DE MAHOGANY... Le thème du film « Mahogany », également intitulé « Do You Know Where You're Going To », est une chanson écrite par Michael Masser et Gerald Giffin et a été chantée par Diana Ross comme thème du film de Paramount de 1975...
Question: « Do You Know Where You're Going To ? » était le thème de quel film ?
Ground Truth: Mahogany
Right Response Label: 0
Hallucinated: Love Story
Hallucinated Label: 1

Language: Bangla
Context (Truncated): [DOC] [TLE] মেহগনি থিম... "মেহগনি" মুভির থিম, যার শিরোনাম "ডু ইউ নো হোয়ার ইউ আর গোয়িং টু", এটি মাইকেল মাসার এবং জেরাল্ড গিফিনের লেখা একটি গান এবং এটি ১৯৭৫ সালের প্যারামাউন্ট চলচ্চিত্রের থিম হিসেবে ডায়ানা রস গেয়েছিলেন...
Question: "ডু ইউ নো হোয়ার ইউ আর গোয়িং টু?" গানটি কোন চলচ্চিত্রের থিম ছিল?
Ground Truth: মেহগনি (Mahogany)
Right Response Label: 0
Hallucinated: লাভ স্টোরি (Love Story)
Hallucinated Label: 1

Language: Amharic
Context (Truncated): [DOC] [TLE] የማሆጋኒ ጭብጥ (THEME FROM MAHOGANY)... ከ "ማሆጋኒ" ፊልም የተወሰደው እና "Do You Know Where You're Going To" በመባል የሚታወቀው ዘፈን በሚካኤል ማሰር እና ጄራልድ ጊፊን የተጻፈ ሲሆን በ1975 የፓራማውንት ፊልም ጭብጥ ሆኖ በዲያና ሮስ ተዘፍኗል።...
Question: "Do You Know Where You're Going To?" የትኛው ፊልም ጭብጥ ነበር?
Ground Truth: ማሆጋኒ (Mahogany)
Right Response Label: 0
Hallucinated: ላቭ ስቶሪ (Love Story)
Hallucinated Label: 1

Figure 3: Representative multilingual data points from the modified TriviaQA dataset. The context has been truncated for brevity. The hallucinated response ($a_i^-$) is generated by prompting Gemma-2-2B to produce a plausible but factually incorrect answer.