MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing
Summary
MULTIHALUDET is a four-stage framework designed to detect hallucinations in Large Language Models (LLMs) by probing their full hidden state trajectories without requiring language-specific fine-tuning. Addressing the limitations of existing methods that rely on surface-level heuristics or single-layer representations, this approach extracts sequential features across multiple transformer layers. It processes these features using a hybrid architecture featuring multi-scale attention and self-attention pooling to capture both fine-grained and coarse-grained patterns of factual inconsistency. The system generates out-of-fold embeddings that feed into a learned ensemble meta-learner, combining diverse classifiers for robust prediction. Extensive experiments on the HaluEval and TriviaQA benchmarks demonstrate that MULTIHALUDET achieves state-of-the-art performance, reaching up to 98.55% AUROC with Mistral-7B and LLaMA2-7B architectures. Crucially, the framework exhibits strong cross-lingual generalization. Evaluations across high-resource (French), medium-resource (Bangla), and low-resource (Amharic) languages show consistent superiority over baselines, maintaining high detection accuracy even in typologically diverse linguistic tiers. Ablation studies confirm that key components, particularly out-of-fold stacking and multi-scale attention, are essential for capturing the complex semantic shifts indicative of hallucinations. While the method requires white-box access to internal model states and incurs higher computational costs than simple heuristics, it offers a reliable solution for detecting deep factual inconsistencies across diverse languages and model architectures.
MULTIHALUDET: Multilingual Hallucination Detection via LLM Hidden State Probing
Riasad Alvi¹, Nurul Labib Sayeedi¹, Md. Faiyaz Abdullah Sayeedi¹,²
¹United International University, ²BRAC University
ralvi212069@bscse.uiu.ac.bd, nsayeedi2410045@bsds.uiu.ac.bd, msayeedi212049@bscse.uiu.ac.bd
https://github.com/alvi-uiu/MultiHaluDet
Abstract
Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MULTIHALUDET, a novel four-stage framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a learned ensemble meta-learner, MULTIHALUDET captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework’s cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MULTIHALUDET demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.
1 Introduction
Hallucinations in LLMs have emerged as a
critical barrier to their reliable deployment across high-stakes domains (Farquhar et al., 2024; Mishra et al., 2024; Varshney et al., 2023). These generations pose significant risks (Farquhar et al., 2024), motivating diverse detection techniques. Evidence-based methods (Chern et al., 2023; Zhang et al., 2024) retrieve external information to verify factual consistency, but are computationally intensive. Evidence-free methods leverage inherent model characteristics: logit-based and consistency-based methods estimate uncertainty and output stability, while classification-based methods (Orgad et al., 2024; Binkowski et al., 2025) probe internal hidden states without external retrieval.
While internal state probing shows promise (Orgad et al., 2024; Binkowski et al., 2025), current approaches remain inadequate, particularly across diverse languages. Recent work reveals truthfulness information is concentrated in specific tokens (Orgad et al., 2024); however, single-position methods struggle when non-factual tokens are distributed across sequences. Furthermore, attention-based approaches (Chuang et al., 2024; Binkowski et al., 2025) using simple ratios achieve limited discrimination. Probabilistic frameworks (Hou et al., 2025) require complex reasoning pipelines, while active validation (Varshney et al., 2023) and tool-augmented systems (Chern et al., 2023; Zhang et al., 2024) introduce substantial latency. As noted in Farquhar et al. (2024), hallucinations manifest as semantic-level confabulations rather than token-level uncertainty. Existing methods largely fail to address these confabulations across non-English and
resource-constrained languages.
Transformer architectures process information through deep layers, embedding language-agnostic hallucination signals within hidden state trajectories. Deep sequence modeling with multi-scale feature aggregation offers a promising solution. We introduce MULTIHALUDET, a supervised framework leveraging multi-scale attention and transformer encoders to model hidden state dynamics across the full depth of frozen LLMs. Unlike methods focused on individual tokens or static layers, our approach aggregates information across multiple scales through self-attention pooling. We evaluate across multiple architectures and language tiers, demonstrating strong cross-lingual performance.
arXiv:2605.24919v1 [cs.CL] 24 May 2026
Our contributions are summarized as follows:
• We introduce MULTIHALUDET, a four-stage framework comprising dynamic feature extraction, multi-scale attention encoding, out-of-fold deep feature generation, and learned ensemble meta-learning, which jointly capture fine-grained and coarse-grained patterns from hidden-state trajectories.
• We evaluate our framework on the HaluEval and TriviaQA benchmarks using Llama-2-7B and Mistral-7B-Instruct, demonstrating substantial performance gains and representational robustness over existing baselines.
• We comprehensively evaluate the cross-lingual generalization of our framework by translating standard benchmarks into French (high-resource), Bangla (medium-resource), and Amharic (low-resource). Our results demonstrate that internal state probing can maintain strong detection signals across typologically diverse linguistic tiers under
controlled translation-based evaluation, without language-specific fine-tuning.
2 Related Work
Hallucination detection approaches fall into three categories: (i) evidence-based methods verifying outputs against external knowledge, (ii) self-detection methods leveraging internal model states, and (iii) consistency-based approaches assessing output stability.
Evidence-based methods retrieve external information for verification. Kale and Alfeo (2025) convert LLM responses into knowledge graphs for atomic fact verification. Vangala et al. (2025) introduced multi-source retrieval with contradiction graph analysis. Zhang and Wang (2025) employed HHEM for lightweight consistency assessment. While effective, these methods depend on retrieval quality, making them vulnerable to gaps and latency.
Self-detection methods probe LLM internal representations without external resources. Su et al. (2024) introduced MIND, training classifiers on auto-generated pseudo-labels. Liang and Wang (2025) employed Bayesian optimization to identify optimal layer insertion points. Zhang et al. (2025) proposed MHAD, selecting specific neurons via linear probing. Kossen et al. (2024) approximated semantic uncertainty from hidden states. Chen et al. (2024) proposed INSIDE, measuring consistency through eigenvalues of response covariances. Kim et al. (2025) analyzed layer-wise usable information across transformer depths. Quevedo et al. (2024) demonstrated strong detection using only four token probability features.
Consistency-based methods assess output stability. Yang et al. (2025a) proposed MetaQA, leveraging metamorphic relations for semantic consistency
verification.
Despite these advances, challenges remain. Evidence-based methods suffer from retrieval dependency. Self-detection methods relying on single-position representations struggle when non-factual tokens appear at sequence beginnings. Consistency-based methods incur costs from multiple generation passes. Simhi et al. (2025) demonstrated that LLMs can hallucinate with high certainty despite possessing correct knowledge. Yang et al. (2025b) revealed the need to distinguish between intelligent and defective hallucinations.
Our work aligns with internal state probing methods but addresses their key limitations. While Liang and Wang (2025) and Zhang et al. (2025) focus on specific layers or tokens, we employ dynamic layer sampling and multi-scale attention to aggregate information across the full depth trajectory. Unlike Kossen et al. (2024) and Chen et al. (2024), which probe specific positions, our architecture processes sequential layer features with transformer encoders and self-attention pooling, enabling adaptive depth selection. Our out-of-fold stacking with ensemble meta-learning provides more robust generalization than single-classifier approaches (Su et al., 2024; Liang and Wang, 2025).
3 Methodology
We present MULTIHALUDET (Figure 1), a four-stage framework for hallucination detection.
3.1 LLM-Based Feature Extraction
3.1.1 Prompt Construction and Forward Pass
Let $D = \{(q_i, a_i, y_i)\}_{i=1}^{N}$ denote a dataset of question–answer pairs, where $q_i$ is a natural-language question, $a_i$ is a candidate answer, and $y_i \in \{0, 1\}$ is a binary label indicating whether $a_i$ is a hallucination ($y_i = 1$) or a faithful response ($y_i = 0$). Each sample $(q_i, a_i)$ is formatted as a structured natural-language prompt and tokenized with a fixed maximum sequence length. The prompt is passed through a frozen, quantized LLM in a single forward pass, yielding a sequence of hidden state tensors $\{H^{(l)}\}_{l=0}^{L}$, where $L$ is the number of transformer layers and $H^{(l)} \in \mathbb{R}^{T \times d}$ collects the $d$-dimensional representations of $T$ tokens at layer $l$. The next-token logit vector $z \in \mathbb{R}^{V}$ at the final position is retained for global statistics. No gradient is computed during feature extraction; the LLM parameters remain fully frozen throughout all experiments.
Figure 1: Overview of the four-stage MULTIHALUDET framework for multilingual hallucination detection.
3.1.2 Dynamic Layer Sampling
To ensure architectural compatibility across LLMs of varying depth, we introduce a dynamic layer sampling strategy that maps any model’s $L$ transformer layers to a fixed target count $K$, producing a uniform sequential representation regardless of model size. Three regimes are handled without model-specific configuration:
• Exact match ($L = K$): identity mapping $I_K = \{1, \dots, K\}$.
• Shallow model ($L < K$): all $L$ layers are used and the deepest layer is repeated to pad the sequence to $K$.
• Deep model ($L > K$): $K$ indices are selected by uniform interpolation over the full depth:
$$I_K = \left\{ \mathrm{clip}\!\left( \left\lfloor 1 + \frac{(L - 1) \cdot k}{K - 1} \right\rceil,\ 1,\ L \right) \right\}_{k=0}^{K-1} \tag{1}$$
where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer and $\mathrm{clip}(\cdot, 1, L)$ clamps indices to the valid range. This formulation is differentiable with respect to $K$ and requires no architecture-specific configuration, making it directly applicable to any transformer-based LLM.
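The three regimes above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `sample_layer_indices`, the round-half-up tie rule, and the handling of K = 1 (where Eq. 1 would divide by zero) are all assumptions made for the example.

```python
import math


def sample_layer_indices(num_layers: int, target: int) -> list[int]:
    """Return `target` (K) 1-based layer indices for a model with `num_layers` (L) layers."""
    if num_layers < 1 or target < 1:
        raise ValueError("num_layers and target must be positive")
    if num_layers <= target:
        # Exact match or shallow model: use every layer, then repeat the deepest one.
        return list(range(1, num_layers + 1)) + [num_layers] * (target - num_layers)
    if target == 1:
        # Eq. 1 divides by K - 1; returning the last layer here is our own choice.
        return [num_layers]
    indices = []
    for k in range(target):
        raw = 1 + (num_layers - 1) * k / (target - 1)   # uniform interpolation over depth
        nearest = math.floor(raw + 0.5)                  # round half up (assumed tie rule)
        indices.append(min(max(nearest, 1), num_layers)) # clip to [1, L]
    return indices
```

For a 32-layer model with K = 32, the value used in Section 4.3, the function falls into the exact-match regime and returns the identity mapping.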
3.1.3 Sequential Per-Layer Features
For each sampled layer index $l \in I_K$, we extract a compact descriptor from two complementary views of the hidden state tensor: the last-token representation $h^{(l)} = H^{(l)}_{T,:}$, which reflects the model’s final contextual state, and the sequence mean $\bar{h}^{(l)} = \frac{1}{T} \sum_{t=1}^{T} H^{(l)}_{t,:}$, which captures the aggregate token-level context. From these, we compute a per-layer descriptor $s^{(l)} \in \mathbb{R}^{d_s}$ comprising distributional statistics: the $\ell_2$ norm, mean, standard deviation, extremal values, activation sparsity, near-zero mass, and the kurtosis and median absolute deviation (MAD) of the hidden state, capturing the peakedness and robust dispersion of the activation distribution:
$$\kappa(h) = \frac{1}{d} \sum_{j=1}^{d} \left( \frac{h_j - \bar{h}}{\hat{\sigma}} \right)^{4}, \qquad \mathrm{MAD}(h) = \mathrm{median}_j \left( \left| h_j - \mathrm{median}(h) \right| \right) \tag{2}$$
where $\bar{h}$ and $\hat{\sigma}$ denote the empirical mean and standard deviation of the hidden-state vector, and $\mathrm{median}(h)$ is its coordinate-wise median. Collecting descriptors across all sampled layers yields the sequential representation $S \in \mathbb{R}^{K \times d_s}$, which encodes how the LLM’s internal dynamics evolve as a function of depth.
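A rough NumPy sketch of the per-layer descriptor may help make Eq. 2 concrete. The paper does not define activation sparsity or near-zero mass precisely, nor how the two views are combined, so the definitions below (share of exact zeros, share of coordinates with magnitude under a hypothetical `eps`, and simple concatenation of the two descriptors) are assumptions. The function names are illustrative only.

```python
import numpy as np


def layer_descriptor(h: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Distributional statistics of one hidden-state vector h of shape (d,)."""
    h = np.asarray(h, dtype=np.float64)
    mean, std, med = h.mean(), h.std(), np.median(h)
    # Kurtosis from Eq. 2: mean fourth power of the standardized coordinates.
    kurt = float(np.mean(((h - mean) / std) ** 4)) if std > 0 else 0.0
    mad = float(np.median(np.abs(h - med)))  # MAD from Eq. 2
    return np.array([
        np.linalg.norm(h),         # l2 norm
        mean,
        std,
        h.max(),
        h.min(),
        np.mean(h == 0.0),         # sparsity: share of exact zeros (assumed definition)
        np.mean(np.abs(h) < eps),  # near-zero mass (assumed definition and eps)
        kurt,
        mad,
    ])


def sequential_representation(hidden_states: list[np.ndarray]) -> np.ndarray:
    """Build S from one (T, d) array per sampled layer.

    Each row concatenates the last-token and sequence-mean descriptors,
    so the result has shape (K, 18) under the assumptions above.
    """
    rows = []
    for H in hidden_states:
        last_token = H[-1]          # h^(l) = H^(l)[T, :]
        seq_mean = H.mean(axis=0)   # mean over the T token positions
        rows.append(np.concatenate([layer_descriptor(last_token),
                                    layer_descriptor(seq_mean)]))
    return np.stack(rows)
```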
3.1.4 Anchor-Based Depth Probing
To enable consistent cross-architecture comparisons at semantically meaningful network depths, we define four anchor layers corresponding to proportional depth fractions $\{\alpha_j\}_{j=1}^{4}$. For each anchor, the closest available sampled layer is identified:
$$l^{*}_{j} = \arg\min_{l \in I_K} \left| \, l - \lfloor \alpha_j \cdot L \rceil \, \right| \tag{3}$$
and the corresponding descriptor $s^{(l^{*}_{j})}$ is retained as an anchor feature. This construction ensures that features at early, middle, and late network stages are explicitly represented in the global feature vector regardless of the total model depth, providing a principled substitute for hardcoded layer indices.
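Eq. 3 amounts to a nearest-neighbor lookup over the sampled indices. The sketch below assumes depth fractions of 0.25, 0.5, 0.75 and 1.0, which the paper does not state. The function name, the round-half-up rule, and breaking ties toward the shallower layer are likewise illustrative choices.

```python
import math


def anchor_layers(sampled: list[int], num_layers: int,
                  fractions: tuple[float, ...] = (0.25, 0.5, 0.75, 1.0)) -> list[int]:
    """Pick, for each depth fraction, the sampled layer closest to round(alpha * L)."""
    anchors = []
    for alpha in fractions:
        target = math.floor(alpha * num_layers + 0.5)  # nearest-integer depth
        # min() keeps the first minimum, so ties go to the shallower sampled layer.
        anchors.append(min(sampled, key=lambda layer: abs(layer - target)))
    return anchors
```

With every layer of a 32-layer model sampled, these hypothetical fractions select layers 8, 16, 24 and 32.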
3.1.5 Global Feature Vector
A global feature vector $g \in \mathbb{R}^{d_g}$ aggregates information across the full forward pass. It comprises: (i) the top-$k$ next-token probabilities and their pairwise differences, capturing the sharpness of the model’s next-token distribution; (ii) the entropy, standard deviation, and maximum of the logit vector $z$; (iii) first- and second-order statistics of the layer-wise $\ell_2$-norm trajectory $\{r_l\}_{l \in I_K}$, $r_l = \|h^{(l)}\|_2$, capturing norm growth and volatility across depth; (iv) the anchor descriptors from Eq. 3; and (v) cross-feature interaction terms between logit statistics and norm dynamics, which encode the coupling between the model’s output confidence and its internal representational geometry. Together, $S$ and $g$ form the complete feature representation passed to the classification stage.
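Items (i) to (iii) can be illustrated with a short NumPy sketch. Several details are assumptions rather than the authors' choices: the entropy is taken over the softmax of z, "pairwise differences" is read as gaps between adjacent top-k probabilities, and the trajectory statistics are the mean and standard deviation of the norms and of their layer-to-layer changes. The value k = 5 is also hypothetical.

```python
import numpy as np


def logit_features(z: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k probabilities, their adjacent gaps, and entropy/std/max statistics."""
    z = np.asarray(z, dtype=np.float64)
    p = np.exp(z - z.max())                    # numerically stable softmax
    p /= p.sum()
    top = np.sort(p)[::-1][:k]                 # k largest next-token probabilities
    gaps = top[:-1] - top[1:]                  # sharpness of the distribution
    entropy = -np.sum(p * np.log(p + 1e-12))   # entropy of the softmax (assumption)
    return np.concatenate([top, gaps, [entropy, z.std(), z.max()]])


def norm_trajectory_features(r: np.ndarray) -> np.ndarray:
    """Summary of the layer-wise l2-norm trajectory r_l over the sampled layers."""
    r = np.asarray(r, dtype=np.float64)
    step = np.diff(r)                          # norm growth between consecutive layers
    return np.array([r.mean(), r.std(), step.mean(), step.std()])
```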
3.2 MultiHaluDet Architecture
3.2.1 Projection and Multi-Scale Attention
The sequential input $S$ is first projected into a uniform hidden space of dimension $H$ via a linear layer followed by layer normalization and a GELU activation. A multi-scale attention module then processes the projected sequence at multiple temporal resolutions simultaneously. For each scale factor $c \in C$, the sequence is locally average-pooled to compress $K$ positions into $\lceil K/c \rceil$ positions, passed through a scale-specific linear projection, and upsampled back to length $K$ via nearest-neighbor interpolation. The contributions of each scale are combined through a learned position-wise gate:
$$\tilde{h}_t = \sum_{c \in C} w_{t,c}\, P_c \left[ \mathrm{Upsample}\!\left( \mathrm{Pool}_c\!\left( H^{\mathrm{proj}} \right) \right) \right]_t \tag{4}$$
where $w_{t,c} = \mathrm{softmax}(W_g h_t)_c$ is the position-wise scale gate computed from the original projected hidden state, and $P_c$ is the per-scale linear projection. The output is combined with the original projection via a residual connection and layer normalization, preserving fine-grained layer information while enriching it with multi-scale context.
3.2.2 Layer-Weighted Transformer Encoder
Because different LLM layers carry information of varying discriminative value for hallucination detection, we modulate the fused sequence by a learnable, softmax-normalized importance vector $\lambda \in \mathbb{R}^{K}$ before encoding:
$$\hat{h}_k = \bar{\lambda}_k \cdot \tilde{h}_k + p_k, \quad k = 1, \dots, K, \qquad \bar{\lambda} = \mathrm{softmax}(\lambda) \tag{5}$$
where $p_k$ is a learned positional embedding that encodes the relative depth of each sampled layer within the representational sequence. The modulated sequence is encoded by a stack of Pre-LN Transformer encoder layers with multi-head self-attention and GELU feed-forward sublayers, enabling the model to attend across LLM depths and capture long-range inter-layer dependencies in the hidden state trajectory.
3.2.3 Self-Attention Pooling
Rather than discarding positional information through mean pooling or relying on a fixed summary token, we aggregate the transformer output via a learned attention pooling mechanism. A two-layer MLP with a tanh nonlinearity assigns a scalar relevance score to each position, and the final sequential representation is their weighted sum:
$$u = \sum_{k=1}^{K} \alpha_k e_k, \qquad \alpha_k = \frac{\exp(a(e_k))}{\sum_{j=1}^{K} \exp(a(e_j))} \tag{6}$$
where $e_k$ is the encoder output at position $k$ and $a : \mathbb{R}^{H} \to \mathbb{R}$ is the learned scoring MLP. This allows the model to focus on the LLM layers most informative for the current input, providing a form of
input-adaptive depth selection.
3.2.4 Global Branch and Gated Fusion
The global feature vector $g$ is processed independently through a two-layer MLP with layer normalization and dropout, yielding a compact global representation $v \in \mathbb{R}^{H/2}$. The sequential representation $u \in \mathbb{R}^{H}$ and global representation are concatenated to form the joint embedding $c = [u; v] \in \mathbb{R}^{3H/2}$. A sigmoid gate then performs element-wise re-weighting to suppress uninformative dimensions:
$$\tilde{c} = c \odot \sigma\!\left( W_{\mathrm{gate}}\, c + b_{\mathrm{gate}} \right) \tag{7}$$
The gated representation $\tilde{c}$ is passed through a three-layer MLP classifier to produce the final hallucination logit. A separate two-layer projection head maps $\tilde{c}$ onto a unit hypersphere for the contrastive objective described in Section 4.
3.3 Out-of-Fold Stacking
To obtain unbiased deep representations for the meta-learner without data leakage, we employ $K$-fold out-of-fold (OOF) stacking over the training set. For fold $k$, a fresh MULTIHALUDET instance is trained on the remaining $K - 1$ folds and used to extract the gated fusion representation $\tilde{c}$ for the held-out fold. After all $K$ folds, the OOF features form a complete training matrix:
$$X_{\mathrm{OOF}} \in \mathbb{R}^{N_{\mathrm{train}} \times d_c}, \qquad d_c = \dim(\tilde{c}) \tag{8}$$
with every training sample represented exactly once without ever being used in its own fold’s training. Test-set representations are obtained by averaging the gated fusion embeddings across all $K$ fold models in feature space:
$$\tilde{c}_{\mathrm{test}} = \frac{1}{K} \sum_{k=1}^{K} \tilde{c}^{(k)}_{\mathrm{test}} \tag{9}$$
This averaging operates at the representation level rather than the probability level, providing implicit test-time ensembling and producing a more stable input to the meta-learner. The OOF and test features are standardized before being passed to the ensemble stage.
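The out-of-fold procedure of Eqs. 8 and 9 can be sketched with NumPy alone. Here `fit_encoder` is a hypothetical callable standing in for "train a fresh MULTIHALUDET instance and return its embedding function", and `toy_fit_encoder` is a deliberately trivial placeholder so the sketch runs end to end. Fitting the standardization statistics on the OOF matrix only is also an assumption.

```python
import numpy as np


def stratified_folds(y: np.ndarray, n_folds: int, seed: int = 0) -> np.ndarray:
    """Assign a fold id to every sample while keeping class ratios roughly equal."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold


def oof_embeddings(X_train, y_train, X_test, fit_encoder, n_folds=5, seed=0):
    """Return (X_OOF, averaged test embeddings) as in Eqs. 8 and 9."""
    y_train = np.asarray(y_train)
    fold = stratified_folds(y_train, n_folds, seed)
    oof, test_sum = None, None
    for k in range(n_folds):
        held = fold == k
        embed = fit_encoder(X_train[~held], y_train[~held])  # never sees the held-out fold
        e_held = embed(X_train[held])
        if oof is None:
            oof = np.zeros((len(X_train), e_held.shape[1]))
            test_sum = np.zeros((len(X_test), e_held.shape[1]))
        oof[held] = e_held             # each training row is filled exactly once
        test_sum += embed(X_test)      # accumulate for feature-space averaging
    return oof, test_sum / n_folds


def standardize(oof, test):
    """Scale both matrices with statistics taken from the OOF features (assumption)."""
    mu, sd = oof.mean(axis=0), oof.std(axis=0) + 1e-12
    return (oof - mu) / sd, (test - mu) / sd


def toy_fit_encoder(X, y):
    """Hypothetical stand-in encoder: centers inputs on the training-fold mean."""
    mu = X.mean(axis=0)
    return lambda Z: Z - mu
```

Because each held-out row is embedded by a model that never saw it, the meta-learner trains on features whose distribution resembles what it will receive at test time.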
3.4 Ensemble Meta-Learner
The OOF deep features serve as input to a diverse ensemble of classifiers comprising gradient-boosted trees, random forests, a kernel support vector machine, a multi-layer perceptron, and a logistic regression baseline. Model diversity across both inductive biases (linear, kernel, tree, neural) and hyperparameter profiles reduces variance and improves robustness to the distributional properties of the deep features. Rather than fixed heuristic weights, the ensemble prediction is formed by stacking the base classifier probability estimates through a logistic meta-regressor trained on held-out out-of-fold predictions. Denoting the log-odds transform of each base probability by $\ell_m = \log(\hat{p}_m / (1 - \hat{p}_m))$, the final ensemble probability is:
$$\hat{p}_{\mathrm{ens}} = \sigma\!\left( \beta_0 + \sum_{m=1}^{M} \beta_m \ell_m \right) \tag{10}$$
where $\sigma(\cdot)$ is the sigmoid function and the coefficients $\{\beta_m\}_{m=0}^{M}$ are learned by an $\ell_2$-regularized logistic regression meta-learner trained on a secondary hold-out split of the out-of-fold probability outputs. This formulation adapts the ensemble composition to the empirical discriminative strength of each base learner while producing inherently calibrated probability estimates. All component models are configured with balanced class weights to maintain sensitivity under any residual distributional skew. The classification threshold is set by Youden’s J statistic $\tau^{*} = \arg\max_{\tau} \left[ \mathrm{TPR}(\tau) - \mathrm{FPR}(\tau) \right]$, which maximizes the simultaneous improvement in sensitivity and specificity.
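Eq. 10 and the Youden threshold are simple enough to spell out in NumPy. In this sketch the coefficients are passed in directly, whereas the paper learns them with an ℓ2-regularized logistic regression. The clipping constant `eps` and the exhaustive threshold scan over observed scores are assumptions for illustration.

```python
import numpy as np


def log_odds(p, eps: float = 1e-6):
    """Log-odds transform, with clipping so probabilities of 0 or 1 stay finite."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1 - eps)
    return np.log(p / (1 - p))


def ensemble_probability(base_probs, beta0, beta):
    """Eq. 10: sigmoid of a weighted sum of base-learner log-odds.

    base_probs has shape (N, M); beta has shape (M,).
    """
    logits = beta0 + log_odds(base_probs) @ np.asarray(beta, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-logits))


def youden_threshold(y_true, scores):
    """Scan observed scores and return (tau*, J*) maximizing TPR - FPR."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    best_tau, best_j = None, -np.inf
    for tau in np.unique(scores):
        pred = scores >= tau
        j = np.mean(pred[y_true]) - np.mean(pred[~y_true])  # TPR - FPR
        if j > best_j:
            best_tau, best_j = tau, j
    return best_tau, best_j
```

With a single base learner, an intercept of 0 and a weight of 1, the ensemble probability reduces to that learner's own probability, which is a quick sanity check on the log-odds parameterization.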
3.5 Multilingual Adaptation Strategy
To evaluate the multilingual generalization of our hidden state probing framework across varying resource tiers, we translated the source English datasets into French (high-resource), Bangla (medium-resource), and Amharic (low-resource) utilizing the Gemini 2.5 Flash model. Formally, let the original source dataset be defined as $D_{\mathrm{en}} = \{(q_i, a_i, y_i)\}_{i=1}^{N}$. We introduce a translation mapping $T_\ell$ for the set of target languages $\mathcal{L} = \{\mathrm{fr}, \mathrm{bn}, \mathrm{am}\}$. For every query-response pair, this operation strictly preserves the original binary hallucination label, yielding a new language-specific instance defined as: $(q^{(\ell)}_i, a^{(\ell)}_i, y^{(\ell)}_i) = (T_\ell(q_i), T_\ell(a_i), y_i)$. The multilingual evaluation space is then constructed as the union of the original English data and all language-specific subsets.
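The label-preserving construction can be expressed as a small Python sketch. The `translate` argument is a hypothetical callable standing in for the translation model, and the `Sample` type and function names are invented for this example. No real translation API is assumed.

```python
from typing import Callable, NamedTuple


class Sample(NamedTuple):
    question: str
    answer: str
    label: int        # 1 = hallucination, 0 = faithful
    lang: str = "en"


def translate_dataset(samples: list[Sample],
                      translate: Callable[[str, str], str],
                      lang: str) -> list[Sample]:
    """Apply T_l to question and answer while copying the label unchanged."""
    return [Sample(translate(s.question, lang), translate(s.answer, lang), s.label, lang)
            for s in samples]


def build_multilingual(samples: list[Sample],
                       translate: Callable[[str, str], str],
                       langs: tuple[str, ...] = ("fr", "bn", "am")) -> list[Sample]:
    """Union of the English data and every translated subset."""
    combined = list(samples)
    for lang in langs:
        combined.extend(translate_dataset(samples, translate, lang))
    return combined
```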
To guarantee the semantic integrity and context preservation of the translated data, we conducted human evaluation for quality assurance. We sampled 100 instances per language from each dataset, resulting in 300 evaluated samples per dataset and a total of 600 evaluated samples across both datasets. As three of the authors are native Bangla speakers, the Bangla subset was evaluated directly. For French and Amharic, we employed a back-translation methodology, translating the target samples back into English, to verify semantic equivalence against the original source. This manual evaluation confirmed an initial translation accuracy of 96%. The remaining 4% of instances, which exhibited minor semantic drift or structural errors, were explicitly polished and regenerated using Gemini 2.5 Flash to ensure strict alignment with the source truth conditions.
By passing these high-fidelity translated inputs through the frozen LLM, we generate the sequential representation $S$ and global feature vector $g$ natively within each target language’s token space. This formulation ensures that the subsequent deep architecture evaluates intrinsic, language-agnostic hallucination signals embedded within the model’s internal dynamics, allowing us to probe representation alignment without requiring language-specific fine-tuning.
4 Experimental Setup
4.1 Dataset
We evaluate on two question-answering benchmarks. Let $Q$ denote the set of questions and $A$ the set of candidate answers. Each dataset $D = \{(q_i, a_i, y_i)\}_{i=1}^{N}$ consists of $N$ samples where $q_i \in Q$ is a question, $a_i \in A$ is an answer, and $y_i \in \{0, 1\}$ is a binary label indicating non-hallucination ($y_i = 0$) or hallucination ($y_i = 1$).
HaluEval (Li et al., 2023) provides $N = 10{,}000$ human-annotated QA pairs with native hallucination labels derived from real LLM outputs, where each $(q_i, a_i, y_i)$ is explicitly labeled based on factual consistency with verified knowledge sources.
TriviaQA (Joshi et al., 2017) is adapted for our evaluation by collecting realistic, model-generated hallucinations. Let $D_{\mathrm{Trivia}} = \{(q_i, a^{*}_i)\}_{i=1}^{M}$ denote the original dataset, where $q_i$ is a question and $a^{*}_i$ is the ground-truth correct answer. To generate plausible hard negatives, we prompt an early-generation language model known for its propensity to hallucinate, Gemma-2-2B (∼32%–38% hallucination rate), to answer each $q_i$. Responses that are plausible but factually incorrect are extracted and denoted as $a^{-}_i$. We then construct our final evaluation dataset by pairing each question with its true correct answer and its corresponding
model-generated hallucination. To facilitate our multilingual evaluation, we also expanded these source English datasets into three different target languages.
4.2 Baselines
We compare against four types of hallucination detection methods. Prompt-based methods include P(True) (Kadavath et al., 2022), which utilizes a simple prompt template to enable the model to assess the correctness of its own response. Logit-based methods use the uncertainty of LLM outputs to detect hallucination; we adopt AvgProb and AvgEnt from Huang et al. (2023) to aggregate logit-based uncertainty across all tokens, and also compare with EUBHD (Su et al., 2024), which focuses on key tokens rather than considering all tokens. Consistency-based methods are motivated by the idea that consistent responses indicate factual knowledge; we apply Unigram and NLI variants, as well as INSIDE (Chen et al., 2024), which leverages eigenvalues of the covariance matrix of responses. Classification-based methods train a classifier on labeled statements; we compare with SAPLMA (Azaria and Mitchell, 2023), which trains a classifier on the last token’s last-layer hidden state; MIND (Su et al., 2024), which uses unsupervised training on auto-generated pseudo-labels; and Probe@Exact, which relies on information from potentially correct tokens. We also include HD-NDEs (Li et al., 2025) (Neural ODEs, CDEs, and SDEs), which model hidden state trajectories using neural differential equations.
4.3 Hyperparameters
We extract hidden states from frozen LLMs in half-precision without gradient computation. For all LLM forward passes and synthetic generation tasks, the temperature is set
Chunk 14 · 1,989 chars
(Li et al., 2025) (Neural ODEs, CDEs, and SDEs), which model hidden state tra- jectories using neural differential equations. 4.3 Hyperparameters We extract hidden states from frozen LLMs in half- precision without gradient computation. For all LLM forward passes and synthetic generation tasks, the temperature is set to 0.0 to ensure deterministic outputs. Dynamic layer sampling maps variable- depth models to K = 32 uniform indices. The deep architecture uses hidden dimension d = 384, 8 attention heads, and 6 transformer encoder layers. Training employs AdamW optimizer with learning rate 2 × 10−4, weight decay 6 × 10−5, and Re- duceLROnPlateau scheduling over 45 epochs with early stopping patience of 15 epochs. We apply a composite loss combining BCE, focal, asymmetric, and contrastive objectives with label smoothing, along with data augmentation via Mixup and Cut- Mix. We employ 5-fold stratified cross-validation with out-of-fold feature extraction to prevent leak- age. The stacked ensemble combines 6 classi- fiers (RandomForest, XGBoost, GradientBoosting, LightGBM, LogisticRegression, Support Vector Machine). Their probability outputs are fused by a logistic meta-regressor trained on held-out out-of- fold predictions, producing inherently calibrated ensemble probabilities. All experiments are con- ducted on AMD Ryzen 5 8500G CPU with 32GB 6 -- 6 of 12 -- Method HaluEval TriviaQA Mistral-7B LLaMA2-7B Mistral-7B LLaMA2-7B P(True) 49.7 46.7 48.3 42.3 AvgProb 43.6 42.1 48.5 44.1 AvgEnt 49.7 47.3 47.6 41.1 EUBHD 70.5 71.9 80.6 80.5 Unigram 62.3 58.2 59.5 56.8 NLI 63.1 61.3 61.4 59.4 INSIDE 76.0 74.5 81.3 81.7 SAPLMA 89.4 87.0 84.1 80.0 MIND 94.5 86.1 84.5 79.4 Probe@Exact 93.4 88.3 84.1 81.3 Neural ODEs 91.2 89.5 83.7 81.7 Neural CDEs 95.4 91.4 84.1 83.7 Neural SDEs 93.7 92.8 85.1 81.0 MultiHaluDet 98.43 98.55 98.30 98.26 Table 1: Hallucination detection performance (AUROC %) on HaluEval and TriviaQA using Mistral-7B-Instruct and LLaMA2-7B. Best results in
Chunk 15 · 1,996 chars
RAM and an NVIDIA GeForce RTX 5060 Ti GPU with 16GB VRAM.

4.4 Evaluation Metric

We use AUROC (%), the area under the ROC curve, to evaluate the effectiveness of each method. A higher AUROC indicates a stronger hallucination detection ability. All experiments employ 5-fold cross-validation with stratified sampling to ensure reliable estimates.

5 Results and Analysis

5.1 RQ1: To what extent does MultiHaluDet improve hallucination detection compared to existing internal-state and confidence-based baselines?

Table 1 presents the detection performance (AUROC %) of MULTIHALUDET compared against thirteen baseline methods, evaluated across both the HaluEval and TriviaQA datasets using Mistral-7B and LLaMA2-7B.

Failure of Surface-Level Heuristics. The results reveal a clear hierarchy in the efficacy of different detection paradigms. Token-level probability and entropy metrics (P(True), AvgProb, AvgEnt) systematically fail to provide meaningful detection signals, hovering around or below random chance (41.1%–49.7% AUROC). This indicates that raw output confidence is severely miscalibrated and insufficient for identifying deep factual inconsistencies. Intermediate methods incorporating structural linguistic features (Unigram, NLI) offer only marginal improvements, plateauing between 56.8% and 63.1%.
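To make the AUROC metric of Section 4.4 concrete, the following is a minimal sketch that computes it as the probability that a randomly chosen hallucinated example receives a higher score than a randomly chosen faithful one (ties count as one half). The function name `auroc` and the toy labels and scores are illustrative assumptions, not values or code from the paper. The sketch also shows why an uninformative scorer sits at 0.5 and why scores below 0.5 correspond to a signal that is anti-correlated with the label.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg).

    labels: 1 = hallucinated, 0 = faithful (hypothetical convention).
    scores: higher means "more likely hallucinated".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration only).
labels = [1, 1, 1, 0, 0, 0]
informative = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]   # mostly ranks positives higher
constant = [0.5] * 6                            # carries no information
inverted = [1.0 - s for s in informative]       # ranks positives lower

print(f"informative: {auroc(labels, informative):.3f}")  # 8 of 9 pairs ordered correctly
print(f"constant:    {auroc(labels, constant):.3f}")     # all ties, i.e. chance level
print(f"inverted:    {auroc(labels, inverted):.3f}")     # below chance
```

The quadratic pairwise loop is chosen for readability; for real evaluation sets a rank-based implementation from an established library would be the more practical choice.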
Chunk 16 · 1,988 chars
Representational Baselines. Approaches that leverage deeper internal representations show significantly more promise. Methods like SAPLMA, MIND, and traditional probing (Probe@Exact) push performance into the 79%–95% range. The strongest baselines are those modeling the continuous dynamics of hidden states (Neural ODEs, CDEs, SDEs). Neural CDEs, in particular, achieve a competitive 95.4% on Mistral-7B for HaluEval. However, these methods exhibit high variance between model architectures; for instance, MIND's HaluEval performance drops sharply from 94.5% on Mistral-7B to 86.1% on LLaMA2-7B.

Ours. MULTIHALUDET achieves state-of-the-art performance across all experimental conditions, substantially outperforming the strongest continuous-time baselines. On HaluEval, we achieve 98.43% AUROC with Mistral-7B and 98.55% with LLaMA2-7B. Crucially, MULTIHALUDET demonstrates exceptional cross-architecture robustness. While most baselines suffer noticeable performance degradation when transitioning from Mistral to LLaMA2, our framework maintains near-identical efficacy (scoring slightly higher on LLaMA2 in HaluEval). Similarly, on the TriviaQA dataset, which features plausible but factually incorrect synthetic hard negatives, we achieve 98.30% with Mistral-7B and 98.26% with LLaMA2-7B. These consistent gains across diverse datasets and architectures suggest that our multi-scale attention mechanism and out-of-fold stacking approach successfully aggregate robust, architecture-agnostic signals from LLM hidden state trajectories.

5.2 RQ2: How robust is MultiHaluDet across typologically diverse languages with varying resource availability?

To demonstrate the cross-lingual generalization of MULTIHALUDET, we extend our evaluation beyond English to include three typologically diverse languages: French, Bangla, and Amharic.
Chunk 17 · 1,996 chars
Table 2 shows that hallucination detection performance correlates with the representational frequency of these languages in the base models. MULTIHALUDET maintains exceptionally strong performance on French, achieving 96.2% and 95.8% on HaluEval for Mistral-7B and LLaMA2-7B respectively, trailing its English results (98.4% and 98.5%) by only 2.2 and 2.7 points. Similar high retention is observed on TriviaQA, where French scores reach 95.5% and 94.9%.

Language          Method          HaluEval (M / L)   TriviaQA (M / L)
English (Base)    Best Baseline   95.4 / 92.8        85.1 / 83.7
                  MultiHaluDet    98.4 / 98.5        98.3 / 98.2
French (High)     Best Baseline   92.1 / 89.4        81.3 / 80.5
                  MultiHaluDet    96.2 / 95.8        95.5 / 94.9
Bangla (Medium)   Best Baseline   78.4 / 75.1        69.2 / 67.8
                  MultiHaluDet    89.1 / 88.4        87.6 / 86.3
Amharic (Low)     Best Baseline   62.3 / 59.8        54.1 / 52.6
                  MultiHaluDet    78.5 / 76.2        75.8 / 73.4
M: Mistral-7B, L: LLaMA2-7B

Table 2: Cross-lingual hallucination detection performance. MultiHaluDet consistently outperforms the best baseline across all resource settings.

For Bangla, we observe a more noticeable but controlled degradation, yielding HaluEval scores of 89.1% (Mistral-7B) and 88.4% (LLaMA2-7B), alongside 87.6% and 86.3% on TriviaQA. While the framework successfully adapts to the language, accurately detecting factual inconsistencies requires the model to effectively navigate Bangla's rich morphology and the complex linguistic adaptations necessary for regional dialect variations.
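The cross-lingual trend in Table 2 can be checked with simple arithmetic. The sketch below transcribes the table values and computes, for each language, the margin of MultiHaluDet over the best baseline and the drop relative to the English MultiHaluDet scores. The dictionary layout and the helper names `gain_over_baseline` and `drop_from_english` are illustrative assumptions; only the numbers come from Table 2.

```python
# AUROC (%) values transcribed from Table 2.
COLUMNS = (
    "HaluEval/Mistral-7B",
    "HaluEval/LLaMA2-7B",
    "TriviaQA/Mistral-7B",
    "TriviaQA/LLaMA2-7B",
)

TABLE2 = {
    "English": {"best_baseline": (95.4, 92.8, 85.1, 83.7),
                "multihaludet": (98.4, 98.5, 98.3, 98.2)},
    "French":  {"best_baseline": (92.1, 89.4, 81.3, 80.5),
                "multihaludet": (96.2, 95.8, 95.5, 94.9)},
    "Bangla":  {"best_baseline": (78.4, 75.1, 69.2, 67.8),
                "multihaludet": (89.1, 88.4, 87.6, 86.3)},
    "Amharic": {"best_baseline": (62.3, 59.8, 54.1, 52.6),
                "multihaludet": (78.5, 76.2, 75.8, 73.4)},
}


def gain_over_baseline(table):
    """Per-column margin (in AUROC points) of MultiHaluDet over the best baseline."""
    return {
        lang: tuple(round(m - b, 1)
                    for m, b in zip(rows["multihaludet"], rows["best_baseline"]))
        for lang, rows in table.items()
    }


def drop_from_english(table):
    """Per-column drop (in AUROC points) of MultiHaluDet relative to English."""
    english = table["English"]["multihaludet"]
    return {
        lang: tuple(round(e - m, 1) for e, m in zip(english, rows["multihaludet"]))
        for lang, rows in table.items()
    }


gains = gain_over_baseline(TABLE2)
drops = drop_from_english(TABLE2)

for lang in TABLE2:
    print(lang)
    for col, g, d in zip(COLUMNS, gains[lang], drops[lang]):
        print(f"  {col}: +{g} over best baseline, -{d} vs. English")
```

On these numbers, the margin over the best baseline on HaluEval with Mistral-7B grows from 3.0 points in English to 16.2 points in Amharic, while the absolute score falls by 19.9 points over the same span.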
Chunk 18 · 1,991 chars
Amharic, a severely low-resource language with limited representation in both Mistral and LLaMA2, presents the greatest challenge. The base models inherently struggle with factual recall in Amharic, which limits the quality of the internal representations our detector relies on. As a result, absolute performance drops to 78.5% and 76.2% on HaluEval, and 75.8% and 73.4% on TriviaQA for Mistral-7B and LLaMA2-7B, respectively. However, MULTIHALUDET still manages to extract meaningful detection signals, maintaining a robust lead over random chance and standard confidence baselines.

5.3 Ablation Study

To understand the necessity of our proposed architectural choices, we conduct an ablation study by systematically removing key components of the MULTIHALUDET framework. We evaluate the degraded models on both HaluEval and TriviaQA using Mistral-7B and LLaMA2-7B. The results are summarized in Table 3.

Method     Mistral-7B                        LLaMA2-7B
           HaluEval        TriviaQA          HaluEval        TriviaQA
Full       98.43           98.30             98.55           98.26
w/o MSA    91.45 (↓6.98)   90.82 (↓7.48)     92.14 (↓6.41)   91.33 (↓6.93)
w/o OOF    88.67 (↓9.76)   87.41 (↓10.89)    89.25 (↓9.30)   88.19 (↓10.07)
w/o TP     93.28 (↓5.15)   92.56 (↓5.74)     93.71 (↓4.84)   93.04 (↓5.22)

Table 3: Ablation study of MULTIHALUDET. Performance is measured in AUROC (%). The values in parentheses indicate the absolute performance drop when a specific core component is removed, confirming the structural necessity of the complete framework. Abbreviations – w/o: without, MSA: Multi-Scale Attention, OOF: Out-of-Fold Stacking, TP: Trajectory Probing.

The removal of the Out-of-Fold (OOF) Stacking mechanism causes the most severe degradation across all configurations, resulting in a drop of roughly 9 to 11 percentage points (falling to 88.67% on HaluEval and 87.41% on TriviaQA with Mistral-7B). Without OOF stacking, the meta-classifier heavily overfits to the localized noise of early hidden layers, failing to generalize when faced with the plausible hard negatives in
Chunk 19 · 1,995 chars
TriviaQA. This confirms that our stacking approach is essential for robust feature aggregation.

Bypassing the Multi-Scale Attention module and relying on standard global average pooling results in a substantial drop of 6.4 to 7.5 percentage points in AUROC. Hallucination signals are not uniformly distributed across the generation trajectory; they often manifest as sudden, localized semantic shifts in the middle layers. The sharp decline in performance without this module indicates that capturing these local trajectory shifts is necessary, and that standard pooling mechanisms are too coarse to detect subtle factual deviations.

Finally, replacing our continuous trajectory probing with a static, final-layer representation (w/o Trajectory Probing) degrades performance by approximately 5 percentage points. While the final layer contains significant semantic information, this result supports our core hypothesis: the process of how a model arrives at an answer (the hidden state evolution) contains critical truthfulness signals that are lost if one only analyzes the final output state.

6 Conclusion

We presented MULTIHALUDET, a four-stage framework that detects hallucinations by probing the hidden state trajectories of frozen LLMs without fine-tuning. By integrating multi-scale attention with out-of-fold deep feature generation and learned ensemble meta-learning, our approach effectively captures complex semantic shifts indicating factual inconsistencies.
Chunk 20 · 1,998 chars
Extensive evaluations show that MULTIHALUDET achieves state-of-the-art performance on standard benchmarks, while also demonstrating robust cross-lingual generalization across high, medium, and low-resource languages.

Limitations

While MULTIHALUDET demonstrates strong performance in detecting multilingual hallucinations, several limitations must be acknowledged.

First, our framework inherently relies on white-box access to the internal hidden states and logits of the target LLMs. Consequently, it cannot be directly applied to proprietary, black-box models (e.g., GPT-4 or Claude) where access to internal representational trajectories is restricted.

Second, although our approach successfully bypasses the computational burden of language-specific fine-tuning, extracting and processing full-depth hidden states across multiple layers still incurs non-trivial memory overhead and requires a complete forward pass. This makes it more computationally demanding than simple surface-level heuristics.

Finally, our multilingual evaluation leverages datasets translated from English using Gemini 2.5 Flash. Despite implementing a rigorous human quality assurance and back-translation pipeline to ensure semantic integrity, evaluating native, naturally occurring prompts in medium and low-resource languages (like Bangla and Amharic) might reveal cultural and linguistic nuances that are not fully captured by translated benchmarks.

References

Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.
Chunk 21 · 1,994 chars
Jakub Binkowski, Denis Janiak, Albert Sawczyn, Bogdan Gabrys, and Tomasz Jan Kajdanowicz. 2025. Hallucination detection in LLMs using spectral features of attention maps. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24365–24396.

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: LLMs' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744.

I Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, and others. 2023. FacTool: Factuality detection in generative AI – a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528.

Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, and James Glass. 2024. Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1419–1436.

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630.

Bairu Hou, Yang Zhang, Jacob Andreas, and Shiyu Chang. 2025. A probabilistic framework for LLM hallucination detection via belief tree propagation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3076–3099.
Chunk 22 · 1,997 chars
Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. 2023. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and others. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

Sahil Kale and Antonio Luca Alfeo. 2025. Lie to me: Knowledge graphs for robust hallucination self-detection in LLMs. arXiv preprint arXiv:2512.23547.

Hazel Kim, Tom A Lamb, Adel Bibi, Philip Torr, and Yarin Gal. 2025. Detecting LLM hallucination through layer-wise information deficiency: Analysis of ambiguous prompts and unanswerable questions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32298–32310.

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. 2024. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A large-scale hallucination evaluation benchmark for large language models. URL https://arxiv.org/abs/2305.11747.
Chunk 23 · 1,993 chars
Qing Li, Jiahui Geng, Zongxiong Chen, Derui Zhu, Yuxia Wang, Congbo Ma, Chenyang Lyu, and Fakhri Karray. 2025. HD-NDEs: Neural differential equations for hallucination detection in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6173–6186.

Shize Liang and Hongzhi Wang. 2025. Neural probe-based hallucination detection for large language models. arXiv preprint arXiv:2512.20949.

Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and Hannaneh Hajishirzi. 2024. Fine-grained hallucination detection and editing for language models. arXiv preprint arXiv:2401.06855.

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2024. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. arXiv preprint arXiv:2410.02707.

Ernesto Quevedo, Jorge Yero Salazar, Rachel Koerner, Pablo Rivas, and Tomas Cerny. 2024. Detecting hallucinations in large language model generation: A token probability approach. In World Congress in Computer Science, Computer Engineering & Applied Computing, pages 154–173. Springer.

Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. 2025. Trust me, I'm wrong: High-certainty hallucinations in LLMs. arXiv e-prints, pages arXiv–2502.

Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14379–14391.
Chunk 24 · 1,995 chars
Bhanu Prakash Vangala, Sajid Mahmud, Pawan Neupane, Joel Selvaraj, and Jianlin Cheng. 2025. HalluMat: Detecting hallucinations in LLM-generated materials science content through multi-stage verification. arXiv preprint arXiv:2512.22396.

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987.

Borui Yang, Md Afif Al Mamun, Jie M Zhang, and Gias Uddin. 2025a. Hallucination detection in large language models with metamorphic relations. Proceedings of the ACM on Software Engineering, 2(FSE):425–445.

Chengxu Yang, Jingling Yuan, Siqi Cai, Jiawei Jiang, and Chuang Hu. 2025b. Heaven-sent or hell-bent? Benchmarking the intelligence and defectiveness of LLM hallucinations. arXiv preprint arXiv:2512.21635.

Chenggong Zhang and Haopeng Wang. 2025. Hallucination detection and evaluation of large language model. arXiv preprint arXiv:2512.22416.

Jiawei Zhang, Chejian Xu, Yu Gai, Freddy Lecue, Dawn Song, and Bo Li. 2024. KnowHalu: Hallucination detection via multi-form knowledge based factual checking. arXiv preprint arXiv:2404.02935.

Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, and Shuhao Zhang. 2025. Detecting hallucination in large language models through deep internal representation analysis. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 8357–8365.
Chunk 25 · 1,995 chars
gence, IJCAI-25, pages 8357–8365. 10 -- 10 of 12 -- A Multilingual Data Examples In this section, we provide representative data points from our evaluation benchmarks across all four languages: English (Base), French (High-Resource), Bangla (Medium-Resource), and Amharic (Low- Resource). Table 2 presents an instance from the HaluEval dataset, illustrating the grounding knowledge, the dialogue history, and both the faithful and hallucinated responses. Language Knowledge Dialogue Right Response Right Response Label Hallucinated Hallucinated Label English Iron Man is starring Robert Downey Jr. Robert Downey Jr. starred in Zodiac (Crime Fiction Film). Zodiac (Crime Fiction Film) is starring Jake Gyllenhaal. [Human]: Do you like Iron Man [Assistant]: Sure do! Robert Downey Jr. is a favorite. [Human]: Yes i like him too did you know he also was in Zodiac a crime fiction film. I like crime fiction! Didn't know RDJ was in there. Jake Gyllenhaal starred as well. 0 I'm not a fan of crime movies, but I did know that RDJ starred in Zodiac with Tom Hanks. 1 French Iron Man met en vedette Robert Downey Jr. Robert Downey Jr. a joué dans Zodiac (film policier). Zodiac (film policier) met en vedette Jake Gyllenhaal. [Humain]: Aimez-vous Iron Man [Assistant]: Bien sûr ! Robert Downey Jr. est un de mes favoris. [Humain]: Oui, je l'aime bien aussi. Saviez-vous qu'il a également joué dans Zodiac, un film policier. J'aime les films policiers ! Je ne savais pas que RDJ y était. Jake Gyllenhaal y a également joué. 0 Je ne suis pas fan de films policiers, mais je savais que RDJ avait joué dans Zodiac avec Tom Hanks. 1 Bangla আয়রন মান-এ অিভনয় কেরেছন রবাট ডাউিন জুিনয়র। রবাট ডাউিন জুিনয়র জািডয়াক (াইম িফকশন িফল্ম )-এ অিভনয় কেরেছন। জািডয়াক (াইম িফকশন িফল্ম )-এ অিভনয় কেরেছন জক িজেলনহাল। [মানুষ]: আপিন িক আয়রন মান পছন্দ কেরন [সহকারী]: অবশই! রবাট ডাউিন জুিনয়র আমার অনতম িয়। [মানুষ]: হঁা, আিমও তােক পছন্দ কির। আপিন িক জােনন য িতিন জািডয়াক নােম একটি াইম িফকশন িফেল্ম ও িছেলন। আিম াইম িফকশন
Chunk 26 · 1,993 chars
রবাট ডাউিন জুিনয়র জািডয়াক (াইম িফকশন িফল্ম )-এ অিভনয় কেরেছন। জািডয়াক (াইম িফকশন িফল্ম )-এ অিভনয় কেরেছন জক িজেলনহাল। [মানুষ]: আপিন িক আয়রন মান পছন্দ কেরন [সহকারী]: অবশই! রবাট ডাউিন জুিনয়র আমার অনতম িয়। [মানুষ]: হঁা, আিমও তােক পছন্দ কির। আপিন িক জােনন য িতিন জািডয়াক নােম একটি াইম িফকশন িফেল্ম ও িছেলন। আিম াইম িফকশন পছন্দ কির! জানতাম না য আরিডেজ সখােন িছেলন। জক িজেলনহালও এেত অিভনয় কেরেছন। 0 আিম াইম মুিভর ভক্ত নই, তেব আিম জানতাম য আরিডেজ টম হাঙ্ক েসর সােথ জািডয়াক-এ অিভনয় কেরেছন। 1 Amharic አይረን ማን (Iron Man) ሮበርት ዳውኒ ጁኒየርን ያካትታል። ሮበርት ዳውኒ ጁኒየር በዞዲያክ (የወንጀል ልብወለድ ፊልም) ላይ ተጫውቷል። ዞዲያክ (የወንጀል ልብወለድ ፊልም) ጄክ ጂለንሃልን ያካትታል። [ሰው]፦ አይረን ማን ትወዳለህ? [ረዳት]፦ በእርግጥ! ሮበርት ዳውኒ ጁኒየር ከምወዳቸው አንዱ ነው። [ሰው]፦ አዎ እኔም እወደዋለሁ፣ በዞዲያክ የወንጀል ልብወለድ ፊልም ላይም እንደነበረ ታውቃለህ? የወንጀል ልብወለድ እወዳለሁ! RDJ እዚያ እንደነበረ አላውቅም ነበር። ጄክ ጂለንሃልም ተጫውቷል። 0 የወንጀል ፊልሞች አድናቂ አይደለሁም፣ ግን RDJ ከቶም ሃንክስ ጋር በዞዲያክ ላይ እንደተጫወተ አውቃለሁ። 1 Figure 2: Representative multilingual data points from the HaluEval dataset, demonstrating the semantic preservation of entities and factual inconsistencies across translations. Table 3 illustrates an example from our modified TriviaQA dataset. TriviaQA (Joshi et al., 2017) is adapted for our evaluation by collecting realistic, model-generated hallucinations. In the original dataset, each entry consists of a question and its ground-truth correct answer. To generate plausible hard negatives, we prompt an early-generation language model known for its propensity to hallucinate, Gemma-2-2B, to answer each question. Responses that are plausible but factually incorrect are extracted to serve as the hallucinated answers. We then construct our final evaluation dataset by pairing each question with its true correct answer and its corresponding model-generated hallucination. 11 -- 11 of 12 -- Language Context (Truncated) Question Ground Truth Right Response Label Hallucinated Hallucinated Label English [DOC] [TLE] THEME FROM MAHOGANY... The Theme from the movie "Mahogany" also titled "Do You Know
Chunk 27 · 1,813 chars
estion with its true correct answer and its corresponding model-generated hallucination. 11 -- 11 of 12 -- Language Context (Truncated) Question Ground Truth Right Response Label Hallucinated Hallucinated Label English [DOC] [TLE] THEME FROM MAHOGANY... The Theme from the movie "Mahogany" also titled "Do You Know Where You're Going To" is a song written by Michael Masser and Gerald Giffin and was sung by Dianna Ross as the theme to the 1975 Paramount film... Do You Know Where You're Going To? was the theme from which film? Mahogany 0 Love Story 1 French [DOC] [TLE] THÈME DE MAHOGANY... Le thème du film « Mahogany », également intitulé « Do You Know Where You're Going To », est une chanson écrite par Michael Masser et Gerald Giffin et a été chantée par Diana Ross comme thème du film de Paramount de 1975... « Do You Know Where You're Going To ? » était le thème de quel film ? Mahogany 0 Love Story 1 Bangla [DOC] [TLE] মহগিন িথম... "মহগিন" মুিভর িথম, যার িশেরানাম "ড ইউ না হায়ার ইউ আর গািয়ং ট", এটি মাইেকল মাসার এবং জরাল্ড িগিফেনর লখা একটি গান এবং এটি ১৯৭৫ সােলর পারামাউন্ট চলিের িথম িহেসেব ডায়ানা রস গেয়িছেলন... "ড ইউ না হায়ার ইউ আর গািয়ং ট?" গানটি কান চলিের িথম িছল? মহগিন (Mahogany) 0 লাভ াির (Love Story) 1 Amharic [DOC] [TLE] የማሆጋኒ ጭብጥ (THEME FROM MAHOGANY)... ከ "ማሆጋኒ" ፊልም የተወሰደው እና "Do You Know Where You're Going To" በመባል የሚታወቀው ዘፈን በሚካኤል ማሰር እና ጄራልድ ጊፊን የተጻፈ ሲሆን በ1975 የፓራማውንት ፊልም ጭብጥ ሆኖ በዲያና ሮስ ተዘፍኗል።... "Do You Know Where You're Going To?" የትኛው ፊልም ጭብጥ ነበር? ማሆጋኒ (Mahogany) 0 ላቭ ስቶሪ (Love Story) 1 Figure 3: Representative multilingual data points from the modified TriviaQA dataset. The context has been truncated for brevity. The hallucinated response (a− i ) is generated by prompting Gemma-2-B to produce a plausible but factually incorrect answer. 12 -- 12 of 12 --