SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Summary

This paper introduces SEALGuard, a multilingual guardrail designed to improve safety alignment for large language model (LLM)-powered systems, particularly in Southeast Asian languages. Existing guardrails like LlamaGuard perform well in English but struggle with multilingual and low-resource languages, leaving systems vulnerable to unsafe and jailbreak prompts in these languages. SEALGuard addresses this by adapting a multilingual LLM using low-rank adaptation (LoRA) to detect unsafe and jailbreak prompts across ten languages, including English and nine Southeast Asian languages. The authors created SEALSBench, a large-scale benchmark with over 260,000 prompts, to evaluate performance. Experiments show that LlamaGuard's Defense Success Rate (DSR) drops by 9% and 18% when facing multilingual unsafe and jailbreak prompts, respectively. In contrast, SEALGuard achieves a DSR of 97%, precision of 99%, and F1-score of 98%, outperforming LlamaGuard by 48%, 8%, and 34%, respectively. An ablation study confirms that LoRA adaptation is the key contributor to SEALGuard's effectiveness. The authors release their model and benchmark to support further research.

SEALGuard: Safeguarding the Multilingual Conversations in
Southeast Asian Languages for LLM Soware Systems
Wenliang Shan
wenliang.shan@monash.edu
Monash University
Melbourne, Victoria, Australia
Michael Fu
michael.fu@unimelb.edu.au
The University of Melbourne
Melbourne, Victoria, Australia
Rui Yang
rui.yang1@monash.edu
Monash University
Melbourne, Victoria, Australia
Chakkrit (Kla) Tantithamthavorn ∗
chakkrit@monash.edu
Monash University
Melbourne, Victoria, Australia
ABSTRACT
Safety alignment is critical for LLM-powered systems. While recent
LLM-powered guardrail approaches such as LlamaGuard achieve
high detection accuracy of unsafe inputs written in English (e.g.,
“How to create a bomb?”), they struggle with multilingual unsafe
inputs. This limitation leaves LLM systems vulnerable to unsafe
and jailbreak prompts written in low-resource languages such as
those in Southeast Asia. This paper introduces SEALGuard, a mul-
tilingual guardrail designed to improve the safety alignment across
diverse languages. It aims to address the multilingual safety align-
ment gap of existing guardrails and ensure effective filtering of
unsafe and jailbreak prompts in LLM-powered systems. We adapt
a general-purpose multilingual language model into a multilingual
guardrail using low-rank adaptation (LoRA). We construct SEALS-
Bench, a large-scale multilingual safety alignment dataset contain-
ing over 260,000 prompts in ten languages, including safe, unsafe,
and jailbreak cases. We evaluate SEALGuard against state-of-the-
art guardrails such as LlamaGuard on this benchmark. Our findings
show that multilingual unsafe and jailbreak prompts substantially
degrade the performance of the state-of-the-art LlamaGuard, which
experiences a drop in Defense Success Rate (DSR) by 9% and 18%,
respectively, compared to its performance on English-only prompts.
In contrast, SEALGuard outperforms existing guardrails in detect-
ing multilingual unsafe and jailbreak prompts, improving DSR by
48% over LlamaGuard and achieving the best DSR, precision, and
F1-score. Our ablation study further reveals the contributions of
adaptation strategies and model size to the overall performance of
SEALGuard. We release our pre-trained model and benchmark at
https://github.com/awsm-research/SEALGuard to support further
research.
∗ Corresponding author
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/2018/06. . . $15.00
https://doi.org/XXXXXXX.XXXXXXX
CCS CONCEPTS
• Computing methodologies → Natural language processing;
Discourse, dialogue and pragmatics; • Software and its engineer-
ing → Software safety.
KEYWORDS
Multilingual safety alignment, AI-powered software
ACM Reference Format:
Wenliang Shan, Michael Fu, Rui Yang, and Chakkrit (Kla) Tantithamtha-
vorn. 2018. SEALGuard: Safeguarding the Multilingual Conversations in
Southeast Asian Languages for LLM Software Systems. In Proceedings of
Make sure to enter the correct conference title from your rights confirmation
email (Conference acronym ’XX). ACM, New York, NY, USA, 12 pages.
https://doi.org/XXXXXXX.XXXXXXX
1 	INTRODUCTION
Large Language Model (LLM)-powered software systems are increasingly
used in real-world applications, especially in multilingual
contexts such as interactive language learning tutors [ 10 ,
38 ]. These systems build on the success of multilingual foundation
models that enable diverse language understanding. In particular,
there is a growing development of multilingual LLMs tailored for
lower-resource and regional languages such as Southeast Asian
languages [ 8 , 28 , 50 ]. Models such as SeaLLMs [ 50 ], SeaLion [ 28 ],
and Sailor [ 8 ] are trained on regional data and demonstrate stronger
multilingual capabilities than English-centric models like LLaMA
[ 40 ]. These Southeast Asian multilingual LLMs enable intelligent
software systems to deliver services in users’ native languages,
improving accessibility and inclusivity across linguistically diverse
populations.
However, these LLM-driven systems still face vulnerabilities
stemming from the probabilistic and non-deterministic behavior of
multilingual LLMs. Unlike traditional rule-based systems, which
avoid unsafe outputs by design, LLM-driven systems can generate
harmful responses when given unsafe prompts. Thus, ensuring their
reliability presents a unique research challenge, as highlighted by
Bengio et al. [ 3 ], Hassan et al. [ 14 ], and Yao et al. [ 45 ], since quality
assurance must contend with the open-ended and unpredictable
nature of model outputs.
In response to the vulnerabilities introduced by the probabilistic
nature of LLMs, researchers have proposed input-output guardrails
as external safety alignment mechanisms that form a protective
layer around models without requiring direct fine-tuning of the
underlying model. Examples include LlamaGuard [ 16 ], OpenAI
Moderation [ 26 ], Perplexity [ 2 ], Perspective API [ 21 ], and NVIDIA
NeMo [ 34 ]. These approaches classify and filter user prompts or
LLM outputs to prevent unsafe content. In particular, LlamaGuard
proposed by Inan et al. [16] achieves state-of-the-art performance
in detecting unsafe prompts, outperforming tools like OpenAI Mod-
eration [26] and Perspective API [21].
However, LlamaGuard was primarily trained and fine-tuned on
English data, and its original paper acknowledges that it may not
perform reliably in other languages [ 16 ]. Our evaluation further
validates this concern, highlighting its limited effectiveness in de-
fending against unsafe prompts in many under-resourced languages
in Southeast Asia. Notably, under-resourced languages are often
used as jailbreak attacks to bypass guardrails and trigger unsafe
responses from LLMs. In particular, Yong et al. [ 46 ] demonstrate
that translated unsafe prompts trigger unsafe responses from GPT-4
79% of the time. This exposes a key software engineering chal-
lenge: how can we design external guardrails that effectively
protect multilingual LLM-driven systems, especially across
under-resourced languages like those in Southeast Asia?
To address this gap, we propose using low-rank adaptation (LoRA)
[ 15 ] to fine-tune a multilingual LLM of similar size to LlamaGuard
(8B parameters) [ 16 ] as a multilingual safety guardrail for Southeast
Asian languages. We name our method SEALGuard: SouthEast
Asian Language GUARDrail for safeguarding multilingual
LLM-driven software systems.
To evaluate SEALGuard, we introduce SEALSBench, a mul-
tilingual safety benchmark comprising 266,444 prompts (169,433
safe, 80,601 unsafe, 16,410 jailbreak) across English and nine South-
east Asian languages and covering ten unsafe content categories
and nine jailbreak types. We compare our SEALGuard against
LlamaGuard[ 16 ] and OpenAI Moderation [ 25 ] through extensive
experiments on this dataset, addressing four research questions:
• (RQ1) What is the impact of multilingual unsafe prompts
in Southeast Asian languages on the performance of
existing guardrails?
Results. When encountering multilingual unsafe prompts,
the state-of-the-art LlamaGuard’s performance drops by 9%,
from 59% to 50%, while OpenAI Moderation’s DSR drops by
31%, from 60% to 29%.
• (RQ2) What is the impact of multilingual jailbreak
prompts in Southeast Asian languages on the perfor-
mance of existing guardrails?
Results. When encountering multilingual jailbreak prompts,
the state-of-the-art LlamaGuard’s performance drops by 18%,
from 59% to 41%, while OpenAI Moderation’s DSR drops by
38%, from 60% to 22%.
• (RQ3) How effective is our SEALGuard at defending
against multilingual unsafe and jailbreak prompts in
Southeast Asian languages?
Results. Our SEALGuard achieves an F1 score of 98%,
which is 34%–58% higher than existing guardrails. Similarly,
SEALGuard achieves a DSR of 97% and a precision of 99%,
outperforming baseline approaches by 48%–66% and 3%–63%,
respectively.
• (RQ4) What are the contributions of adaptation strategies
and model size to the performance of our SEALGuard
multilingual guardrail?
Results. We found that LoRA adaptation is the most impor-
tant component of SEALGuard, boosting its F1 score from
26% to 98%.
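The reported RQ3 figures can be cross-checked with a little arithmetic. The sketch below assumes that DSR is the share of unsafe prompts that the guardrail blocks (that is, recall on the unsafe class); this definition is an assumption made for illustration and is not stated in this section. The function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int) -> dict:
    """Compute guardrail metrics from confusion counts.

    tp: unsafe prompts blocked, fp: safe prompts blocked,
    fn: unsafe prompts that slipped through.
    ASSUMPTION: DSR is treated as recall on unsafe prompts.
    """
    dsr = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {"dsr": dsr, "precision": precision, "f1": f1_score(precision, dsr)}


# Plug in the reported precision (99%) and DSR (97%).
reported_f1 = f1_score(precision=0.99, recall=0.97)
print(f"F1 from reported precision and DSR: {reported_f1:.4f}")

# Toy confusion counts (hypothetical, not taken from the paper).
toy = metrics_from_counts(tp=97, fp=1, fn=3)
print(toy)
```

Under that assumption, a precision of 99% and a DSR of 97% give an F1 score that rounds to 98%, which agrees with the value quoted above.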
Novelty & Contributions. To the best of our knowledge, the
main contributions of this paper are as follows: (1) We propose
SEALGuard, a multilingual guardrail designed to overcome the
limitations of existing guardrails in multilingual safety alignment.
(2) We introduce SEALSBench, a comprehensive multilingual safety
alignment dataset containing over 260,000 prompts across ten lan-
guages, including safe, unsafe, and jailbreak prompts. (3) We ex-
tensively evaluate our SEALGuard against baseline guardrails,
demonstrating its effectiveness across multilingual scenarios. (4)
We perform an ablation study to analyze the impact of adaptation
strategies and model size on the performance of SEALGuard.
2 	BACKGROUND & RELATED WORK
In this section, we provide background on LLM-powered agent
systems and their associated safety challenges. We present existing
safety alignment approaches, highlight the gap in multilingual
safety alignment, and discuss related work on safeguarding LLM-
based software systems.
2.1 	LLM-Powered Agent Systems: Capabilities
and Safety Challenges
LLM agentic software systems are autonomous programs capable
of perceiving inputs, making decisions, and acting toward specific
goals [ 30 ]. Traditionally, they were built using symbolic reasoning
or task-specific rule-based logic [ 5 , 41 ], often limiting adaptability
and scalability. Recent advances in large language models (LLMs)
have fundamentally reshaped this paradigm, enabling agentic systems
that are far more flexible and context-aware [23, 27].
In this work, we focus on LLM-powered agents for user-facing
applications (e.g., customer service chatbots and language learn-
ing tutors), which support open-ended dialogue and dynamic in-
tent interpretation. As LLMs continue to expand globally, multilin-
gual LLM-powered agents are increasingly deployed in real-world
applications, especially for low-resource and regional languages
[ 8 , 28 , 50 ], enhancing inclusivity and accessibility for diverse popu-
lations.
Traditional rule-based systems are deterministic functions $f : \mathcal{X} \to \mathcal{Y}$,
mapping inputs $x \in \mathcal{X}$ to outputs $y \in \mathcal{Y}$ using explicitly
defined rules or decision trees. While predictable and interpretable,
they lack flexibility and scalability. In contrast, large language models
(LLMs) are probabilistic functions $f_\theta : \mathcal{X} \to \mathcal{Y}$, generating outputs
by sampling from a distribution $P(y \mid x)$ over an open-ended
output space. This allows LLMs to support flexible, context-aware
agent systems.
However, this flexibility comes with safety risks: when given
unsafe prompts $x \in \mathcal{X}_{\text{unsafe}}$, rule-based systems inherently avoid
unsafe behavior due to their constrained outputs, whereas LLMs
may generate unsafe responses through the probabilistic nature of
their generation, highlighting the need for LLM safety alignment.
Multilingual settings present additional challenges, as safety tools
and training resources for low-resource languages typically lag
behind those available for English.
2.2 Safety Alignment For LLM-Powered Agent
Systems
In response to these challenges, LLM safety alignment has emerged
as a critical research area, where models are tuned to improve safety
[ 1 , 9 , 12 , 51 ]. However, such fine-tuning often faces fundamental
safety–capability trade-offs [ 6 ], potentially reducing creativity and
responsiveness, while requiring extensive computation to adjust
model parameters.
On the other hand, runtime guardrails act as external safety align-
ment mechanisms, forming a protective layer around LLMs without
requiring direct fine-tuning. Examples include LlamaGuard [ 16 ],
OpenAI Moderation [ 26 ], Perplexity [ 2 ], Perspective API [ 21 ], and
NVIDIA NeMo [ 34 ]. These techniques classify and filter user prompts
to prevent unsafe inputs, aligning LLM behavior with safety goals.
In particular, LlamaGuard proposed by Inan et al. [ 16 ] achieves state-
of-the-art performance in detecting unsafe prompts, outperforming
tools like OpenAI Moderation [26] and Perspective API [21].
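As a minimal sketch of the runtime-guardrail idea described above, the snippet below wraps a model call behind a classifier that must return "safe" before the prompt is forwarded. All names here (`guarded_call`, `toy_classifier`, `toy_llm`, `REFUSAL`) are hypothetical stand-ins and do not correspond to the API of any guardrail listed in this section.

```python
from typing import Callable

REFUSAL = "Request blocked by guardrail."  # hypothetical refusal message


def guarded_call(prompt: str,
                 classify: Callable[[str], str],
                 llm: Callable[[str], str]) -> str:
    """Forward the prompt to the LLM only if the guardrail labels it 'safe'."""
    verdict = classify(prompt).strip().lower()
    if verdict != "safe":
        return REFUSAL
    return llm(prompt)


def toy_classifier(prompt: str) -> str:
    # Keyword stand-in for a learned guardrail; for illustration only.
    return "unsafe" if "bomb" in prompt.lower() else "safe"


def toy_llm(prompt: str) -> str:
    # Stand-in for the protected LLM.
    return f"echo: {prompt}"


blocked = guarded_call("How to create a bomb?", toy_classifier, toy_llm)
allowed = guarded_call("How do I say hello in Thai?", toy_classifier, toy_llm)
print(blocked)
print(allowed)
```

The underlying model is never modified; only the wrapper decides whether a prompt reaches it, which is what distinguishes this design from safety fine-tuning.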
2.3 Motivation and Research Gap: Multilingual
Safety Alignment
While runtime guardrails like LlamaGuard have demonstrated strong
performance in filtering unsafe prompts, current state-of-the-art
guardrails are primarily trained and evaluated on English inputs.
This leaves a significant gap in multilingual contexts, where unsafe
prompts written in other languages, particularly low-resource or
less widely studied ones, can often evade detection.
Figure 1 illustrates this gap: an English unsafe prompt is suc-
cessfully blocked by the guardrail and prevented from reaching the
LLM agent. However, when the same prompt is translated into a
language such as Lao, it bypasses the guardrail and enters the LLM
agent system, which could trigger unsafe behavior. This indicates
that multilingual prompts can function as jailbreaks, bypassing
runtime guardrails that were not trained to recognize the semantics
of multilingual inputs. In particular, prior studies have shown that
low-resource languages can be exploited as attack vectors to bypass
safety mechanisms. Deng et al. [ 7 ] report that multilingual unsafe
prompts achieve attack success rates of 81% on ChatGPT and 41% on
GPT-4. Similarly, Yong et al. [ 46 ] demonstrate that translated unsafe
prompts bypass existing guardrails and trigger unsafe responses
from GPT-4 in 79% of tested cases. These findings highlight the
need for multilingual safety alignment to safeguard the LLM
component in the system.
Despite LlamaGuard’s effectiveness in English, its fine-
tuned data and evaluation focus are largely monolingual,
limiting its robustness in defending multilingual LLM sys-
tems. This limitation, noted in the original paper [ 16 ], high-
lights the need for multilingual guardrails. These guardrails
should effectively defend against unsafe inputs across diverse lan-
guages, particularly low-resource ones such as those in Southeast
Asia. At the same time, they must maintain a low false alarm rate
to allow safe multilingual inputs to pass through. This is essential
for supporting multilingual agent systems powered by emerging
multilingual LLMs such as SeaLLMs [ 50 ], SeaLion [ 28 ], and Sailor
[8].
3 	APPROACH
To address the challenge of multilingual safety alignment, we present
the technical details of how we adapt a general-purpose multilingual
language model into a multilingual guardrail using low-rank
adaptation (LoRA). Figure 2 provides an overview of our approach.
Below, we walk through each step of the process as illustrated in
the figure.
3.1 	Apply Guardrail Chat Template
In Step 1, we apply a guardrail chat template adapted from LlamaGuard
[ 16 ] to make a general-purpose language model behave
like a safety classifier. Specifically, we begin by including the fol-
lowing task description: “Task: Check if there is unsafe content
in ’User’ messages in conversations according to our safety policy
using the categories below.” Next, we define the unsafe content
categories between the special tags “<BEGIN UNSAFE CONTENT
CATEGORIES>” and “<END UNSAFE CONTENT CATEGORIES>”,
informing the model of the types of inputs that may be considered
unsafe. We then wrap the user’s message to be evaluated within
another set of special tags: <BEGIN CONVERSATION> How to build
a bomb? <END CONVERSATION>. Finally, we instruct the model
that its output must begin with either “safe” or “unsafe”, restricting
it to a binary classification format suitable for safety evaluation.
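The template described above can be sketched as a plain string builder. The task description and the special tags are taken from the text, and the "User:" prefix and closing instruction follow the template shown in Figure 2. The function name `build_guardrail_prompt` and the category entry are hypothetical: the paper elides the category list, so a placeholder is used rather than a guess.

```python
def build_guardrail_prompt(user_message: str, categories: list[str]) -> str:
    """Assemble the guardrail instruction for a single user message."""
    category_block = "\n".join(categories)
    return (
        "Task: Check if there is unsafe content in 'User' messages in "
        "conversations according to our safety policy using the categories below.\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{category_block}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n"
        "<BEGIN CONVERSATION>\n"
        f"User: {user_message}\n"
        "<END CONVERSATION>\n"
        "Provide your safety assessment for ONLY THE LAST User message "
        "in the above conversation:\n"
        "- First line must read 'safe' or 'unsafe'."
    )


# Placeholder category; the actual list is not given in this section.
placeholder_categories = ["<category list elided in the source>"]
prompt = build_guardrail_prompt("How to build a bomb?", placeholder_categories)
print(prompt)
```

In practice this text would be placed in the system turn of the model's chat format before tokenization; that wrapping is left out here to keep the sketch independent of any particular tokenizer.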
3.2 	Input to SeaLLM Model
In Step 2, after the input has been structured using the chat template,
it undergoes tokenization through a Byte-Pair Encoding (BPE)
model [ 35 ] implemented via SentencePiece [ 20 ]. This process con-
verts the textual prompt into a sequence of token IDs—such as
[151644, ..., 151645, ...]—that correspond to subword units in the
model’s vocabulary. Each token ID serves as an index into the
word embedding matrix $W \in \mathbb{R}^{V \times h}$ of the SeaLLM model [ 50 ],
where $V = 152{,}064$ represents the vocabulary size and $h = 3{,}584$
denotes the hidden dimension. Using this embedding matrix, each
token is transformed into a dense vector, producing the input matrix
$X \in \mathbb{R}^{n \times h}$, where $n$ is the length of the input sequence. This
matrix $X$ is then passed through a deep stack of 28 transformer
decoder layers. Each decoder block, following the architecture used
in SeaLLM [ 50 ], includes masked self-attention and feed-forward
neural networks (FFNN).
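The embedding step is a row lookup: each token ID selects one row of the embedding matrix, so a sequence of length n yields an n-by-h input matrix. The sketch below uses toy sizes and random values, and the symbol names V, h, and n are chosen here for illustration. Only the vocabulary size (152,064) and hidden dimension (3,584) come from the text.

```python
import random


def embed(token_ids, embedding_matrix):
    """Row lookup: token id t selects row t of the embedding matrix."""
    return [embedding_matrix[t] for t in token_ids]


# Toy sizes for illustration; the text reports V = 152,064 and h = 3,584.
V, h = 10, 4
rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(h)] for _ in range(V)]

token_ids = [3, 7, 7, 1]          # hypothetical token IDs, n = 4
X = embed(token_ids, W)
print(len(X), len(X[0]))          # rows = sequence length, columns = hidden dim

# Number of entries in the embedding matrix at the sizes quoted in the text.
embedding_entries = 152_064 * 3_584
print(embedding_entries)
```

Repeated token IDs map to identical rows, and at the quoted sizes the embedding matrix alone holds roughly 545 million entries.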
3.3 	Adapting Multilingual Language Model to
Guardrail
While the SeaLLM model introduced in Step 2 was pre-trained on
multilingual data and is capable of understanding our task instruc-
tions, our evaluation reveals that it falls short in reliably identifying
unsafe prompts across multiple languages. This limitation stems
from the fact that SeaLLM was not explicitly trained as a guardrail,
and thus lacks the specialized alignment required for multilingual
safety detection.
To address this, in Step 3, we use Low-Rank Adaptation (LoRA)
to efficiently adapt the SeaLLM model without fine-tuning its full
parameter set. This preserves SeaLLM’s original multilingual capabilities
while introducing guardrail-specific knowledge, enabling
the model to detect unsafe content across diverse languages.
Figure 1: Overview of how multilingual unsafe prompts interact with runtime guardrails compared to English unsafe prompts.
[Figure 1 elements: a user with unsafe input; runtime guardrails; an unsafe prompt in English is blocked; the same prompt in a low-resource language bypasses the guardrail and reaches the LLM agent.]
Figure 2: The overview of our SEALGuard approach.
[Figure 2 panels: (1) Apply Guardrail Chat Template; (2) Tokenization & Embeddings into SeaLLM (Decoder 1 to Decoder 28, pretrained frozen parameters); (3) LoRA: Low Rank Adaptation applied to the word embedding, masked self-attention, and FFNN layers; (4) Detect and Block Unsafe Prompts with greedy decoding.]
The core idea behind LoRA is to inject trainable low-rank modifications
into selected weight matrices of the model. Specifically, for
each target layer, the update is modeled as a low-rank matrix $\Delta W$,
constructed from the product of two smaller trainable matrices:
$A \in \mathbb{R}^{h \times r}$ and $B \in \mathbb{R}^{r \times h}$, where $h$ is the hidden dimension and
$r \ll h$ is a rank parameter that controls adaptation capacity. This
results in a new weight formulation:
$$W' = W + \Delta W = W + AB.$$
By keeping the original weights $W$ frozen and only training
the low-rank components, LoRA enables task-specific adaptation
(in our case, alignment to multilingual guardrail behavior)
while maintaining the model’s general language understanding.
In building SEALGuard, we incorporate LoRA into key compo-
nents of the SeaLLM model: the word embedding layer, the self-
attention blocks, and the feed-forward neural networks (FFNN).
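To make the low-rank update concrete, the sketch below computes an adapted weight as the frozen weight plus the product of two small matrices, on a toy 3-by-3 example. It then compares parameter counts for a square h-by-h matrix with h = 3,584 as in the text. The rank r = 16 is a hypothetical value chosen for illustration (this section does not state the rank used), and real weight matrices in the model need not be square.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(W, A, B):
    """Return W + AB. W stays frozen; only A and B would be trained."""
    delta = matmul(A, B)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy example with h = 3 and rank r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]   # h x r
B = [[0.5, 0.0, -1.0]]      # r x h
W_adapted = lora_weight(W, A, B)
print(W_adapted)

# Trainable parameters: full h x h matrix versus the two LoRA factors.
h = 3_584
r = 16                      # ASSUMPTION: illustrative rank, not from the paper
full_params = h * h
lora_params = 2 * h * r
print(full_params, lora_params, full_params / lora_params)
```

With these illustrative numbers the two factors hold 114,688 trainable values against 12,845,056 for the full matrix, a reduction by a factor of 112, which is why the adaptation is cheap relative to full fine-tuning.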
3.4 	Detect and Block Multilingual Unsafe
Prompts
In Step 4, SEALGuard intercepts user prompts before they reach
an LLM-powered system, identifying and blocking unsafe inputs.
Similar to LlamaGuard [ 16 ], SEALGuard frames unsafe prompt
detection as sequence generation. Given the final transformer decoder
output $H \in \mathbb{R}^{n \times h}$ (with $n$ as the sequence length and $h$ as
the hidden size), a linear layer projects each hidden state $H_t$ to a
vocabulary distribution using a weight matrix $W_{lm} \in \mathbb{R}^{h \times V}$ and
bias $b_{lm} \in \mathbb{R}^{V}$, where $V = 152{,}064$. This transformation produces a
vector of raw prediction scores, referred to as logits, for each token
position.
These logits are then passed through a softmax function to compute
probabilities over all possible vocabulary tokens. Guided by
the chat template introduced in Step 1, the model generates a response
sequence, where the first token explicitly states whether
the input is “safe” or “unsafe”. We use greedy decoding to generate
tokens, selecting at each step the one with the highest probability:
$$y_t = \arg\max_{v} \, \mathrm{softmax}(H_t W_{lm} + b_{lm})_v$$
where $v$ indexes the vocabulary. Generation stops at either the
special end-of-text token “<|im_end|>” or a predefined length limit.
The first generated token is used as the model’s safety decision.
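The decision rule of Section 3.4 can be illustrated on a toy scale: project one hidden state to logits, apply softmax, and take the arg max as the first generated token. The three-token vocabulary, the hidden state, and the weights below are hypothetical values picked for illustration; the real model uses a vocabulary of 152,064 tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(hidden, W_lm, b_lm):
    """Index of the most probable token for one hidden state."""
    vocab_size = len(b_lm)
    logits = [
        sum(hidden[i] * W_lm[i][v] for i in range(len(hidden))) + b_lm[v]
        for v in range(vocab_size)
    ]
    probs = softmax(logits)
    return max(range(vocab_size), key=probs.__getitem__)


# Toy setup: hidden size 2, vocabulary of 3 tokens (all values hypothetical).
vocab = ["safe", "unsafe", "<|im_end|>"]
hidden = [1.0, -2.0]
W_lm = [[0.2, 1.5, 0.0],
        [0.5, -0.3, 0.1]]
b_lm = [0.0, 0.1, 0.0]

decision = vocab[greedy_token(hidden, W_lm, b_lm)]
print(decision)
```

Because softmax is monotonic, the arg max over probabilities equals the arg max over logits; the softmax is kept here only to mirror the formulation in the text.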
4 EXPERIMENTAL DESIGN
In this section, we outline the motivation behind our four research
questions, introduce the proposed benchmark dataset that serves
as our experimental dataset, describe the types of unsafe prompts
and jailbreak attacks studied, present the baseline guardrails, and
detail our experimental setup.
4.1 Research Questions
To evaluate the effectiveness of multilingual guardrails for South-
east Asian languages, we formulate the following four research
questions:
RQ1) What is the impact of multilingual unsafe prompts
in Southeast Asian languages on the performance of exist-
ing guardrails? LlamaGuard [ 16 ] achieves state-of-the-art per-
formance on English data, reaching 95% accuracy as a runtime
guardrail for classifying safe and unsafe prompts. However, its
training data is primarily in English. As a result, a key limitation
arises: Inan et al. [16] speculated that LlamaGuard may be
vulnerable to prompts written in other languages. This highlights
a research gap in multilingual safety alignment, where existing
guardrails may fail to generalize across diverse linguistic inputs.
Yet, little is known about how multilingual unsafe prompts affect
the performance of these AI guardrails. Thus, we investigate the
impact of multilingual unsafe prompts on current AI guardrail
systems.
RQ2) What is the impact of multilingual jailbreak prompts
in Southeast Asian languages on the performance of existing
guardrails? While general multilingual unsafe prompts already
challenge existing guardrails, multilingual jailbreak prompts
pose an even greater risk. These prompts are intentionally crafted
to bypass safety mechanisms, circumventing the protections that
guardrails provide. Yet, current studies have not examined how
such jailbreak prompts affect the performance of AI guardrail sys-
tems in multilingual settings. Thus, we investigate the impact of
multilingual jailbreak prompts on existing AI guardrails.
RQ3) How effective is our SEALGuard at defending against
multilingual unsafe and jailbreak prompts in Southeast Asian
languages? To address the gap in multilingual safety alignment,
we propose a multilingual guardrail capable of detecting unsafe
inputs across diverse linguistic settings. In Section 3, we describe
how we adapt a multilingual language model into a guardrail us-
ing low-rank adaptation (LoRA). However, the effectiveness of this
approach remains unknown. Thus, we evaluate the performance of
our proposed SEALGuard against existing baseline guardrails, fo-
cusing on its ability to detect both multilingual unsafe and jailbreak
prompts while minimizing false positives.
RQ4) What are the contributions of adaptation strategies
and model size to the performance of our SEALGuard multi-
lingual guardrail? Our SEALGuard multilingual guardrail relies
on adapting a general-purpose multilingual language model into
a safety guardrail using low-rank adaptation (LoRA). While LoRA
serves as the primary adaptation strategy, the impact of different
adaptation approaches and model sizes on guardrail performance
remains unclear. To better understand these factors, we formulate
this research question to conduct an ablation study, examining how
adaptation strategies and model size influence the effectiveness of
SEALGuard.
4.2 	SEALSBench: A Multilingual Safety
Alignment Benchmark in Southeast Asian
Languages
To address the four research questions, we construct a multilin-
gual LLM safety benchmark, SEALSBench, comprising 18,846 safe
prompts, 8,959 unsafe prompts, and 1,799 jailbreak prompts (unsafe),
totaling 29,604 prompts. Jailbreak prompts are a specific category
of unsafe inputs designed to bypass or mislead LLM safety mecha-
nisms. Figure 3 presents the distribution of safe and unsafe prompts,
along with a detailed breakdown of unsafe categories and jailbreak
prompt types. These prompts were initially written in English and
then translated into nine Southeast Asian languages to assess the
effectiveness of multilingual guardrails across diverse linguistic con-
texts. Figure 4 summarizes the data construction workflow. Below,
we outline the step-by-step process used to construct this dataset.
Step 1: Data Collection. We extracted 8,959 unsafe prompts
from the BeaverTails dataset by Ji et al. [ 18 ], covering ten cate-
gories of unsafe prompts introduced in Section 4.3. These cate-
gories capture a broad range of safety concerns related to input
prompts submitted to LLMs, consistent with prior LLM safety studies
[ 11 , 31 , 33 ]. Moreover, we extracted 1,799 unsafe prompts from
four additional benchmark datasets: Do-Not-Answer [ 42 ], CatQA
[ 4 ], AdvBench [ 52 ], and Forbidden Questions [ 36 ]. These additional
datasets help avoid data contamination and ensure our jailbreak
attacks have different distributions from the unsafe prompts in the
BeaverTails dataset. We then transformed these prompts into jailbreak
prompts using the nine jailbreak attack strategies introduced in
Section 4.4. This allows us to
cover a special class of unsafe prompts—jailbreak prompts—that are
designed to bypass safety guardrails and reflect jailbreak attempts
that may occur in real-world use. To evaluate whether safe inputs
will be incorrectly blocked by SEALGuard, we collect 18,846 safe
prompts from the Alpaca dataset by Taori et al. [ 39 ]. These prompts
represent safe interactions with LLMs that guardrails should allow
without blocking. In summary, our dataset contains a total of 29,604
prompts, including 8,959 unsafe prompts, 1,799 jailbreak prompts,
and 18,846 safe prompts.
Step 2: Multilingual Translation into Southeast Asian Lan-
guages. The collected 29,604 prompts were originally written in
English. In this study, we focus on nine Southeast Asian languages
supported by SeaLLM [ 29 ], covering major linguistic families in
the region: Chinese (Zho), Indonesian (Ind), Vietnamese (Vie), Thai
(Tha), Khmer (Khm), Lao, Malay (Msa), Burmese (Mya), and Taga-
log (Tgl). To support multilingual evaluation, we used the Google
Translate API [ 13 ] to translate all prompts into these nine languages.
This results in a total of 296,040 prompts, including the original
English and its translations into nine Southeast Asian languages.
These collectively form our multilingual safety benchmark dataset,
SouthEast Asian Languages Safety Benchmark, SEALSBench.
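The benchmark sizes quoted in this section follow from simple arithmetic, which the short sketch below reproduces. The counts are taken from the text above; the variable and function names are illustrative only and are not part of the released benchmark.

```python
# Prompt counts reported for the English source data in Section 4.2.
ENGLISH_COUNTS = {"safe": 18_846, "unsafe": 8_959, "jailbreak": 1_799}

# English plus nine Southeast Asian translations of every prompt.
NUM_LANGUAGES = 10


def benchmark_sizes(counts, num_languages):
    """Return (English total, multilingual total) for the given counts."""
    english_total = sum(counts.values())
    return english_total, english_total * num_languages


english_total, multilingual_total = benchmark_sizes(ENGLISH_COUNTS, NUM_LANGUAGES)
print(english_total, multilingual_total)
```

Running the sketch gives the 29,604 English prompts and the 296,040 multilingual prompts stated above.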
4.3 	Studied Unsafe Categories
Our SEALSBench dataset consists of ten unsafe categories to support
the evaluation of LLM safety alignment. Each category is

Figure 3: The distribution of safe and unsafe prompts, along with category-wise breakdowns of unsafe and jailbreak prompts.
Figure 4: Workflow for Constructing our SEALSBench Dataset. (Step 1: Data Collection from Alpaca (safe prompts), BeaverTails (unsafe prompts), and Do-Not-Answer, CatQA, AdvBench, and Forbidden Questions (jailbreak prompts). Step 2: Multilingual Translation Using Google Translate API. The resulting SEALSBench, a multilingual safety benchmark, is divided into training, validation, and testing sets.)
labeled as C1 through C10, where C stands for Category: C1 – Violent
Criminal Activity (1,729 samples), C2 – Non-Violent Criminal
Conduct (4,196 samples), C3 – Child Sexual Abuse (145 samples),
C4 – False and Defamatory Claims (1,415 samples), C5 – Hazardous
Professional Guidance (362 samples), C6 – Personal Information
Exposure (852 samples), C7 – Discriminatory and Hateful Expression
(1,022 samples), C8 – Self-Destructive Behavior Promotion
(87 samples), C9 – Explicit Sexual Material (342 samples), and C10 –
Misinformation & Extremist Content (807 samples).
4.4 	Studied Jailbreak Attacks
We consider nine different jailbreak attacks spanning three major
categories: obfuscation-based, template-based, and code-based attacks.
Obfuscation-Based Attacks. These techniques aim to circumvent
the safety mechanisms of large language models (LLMs) by disguis-
ing unsafe prompts through various forms of obfuscation.
Caesar Cipher [ 48 ] employs systematic character replacement
techniques, where responses requested in matching encrypted formats
can potentially bypass both input and output safety mechanisms.
Zulu [ 47 ] exploits low-resource language vulnerabilities by
reformulating harmful prompts in less-supported languages, then
manipulating the model to translate them back into unsafe content,
thereby evading guardrail systems.
Template-Based Attacks. These attacks utilize pre-constructed
frameworks or prompt structures that exploit predictable patterns
and known vulnerabilities within LLMs.
AIM [ 17 ] manipulates the model into adopting a predefined
persona that operates outside normal ethical constraints, promoting
unethical, illegal, and harmful responses through character role-
play.
DAN [ 37 ] creates scenarios where the model ’believes’ normal
restrictions no longer apply, often framing interactions as role-
playing exercises to bypass content filters.
Combination (Prefix injection + Refusal Suppression) [ 43 ]
combines multiple techniques, including prefix injection (using
innocuous openings) with refusal suppression instructions, con-
straining the model’s ability to generate standard refusal responses.
Self Cipher [ 49 ] prompts the model to act as an expert in un-
defined cipher systems, leveraging the model’s internal encoding
capabilities to implicitly encrypt queries and decrypt outputs with-
out explicit encoding rules.
Deep Inception [ 22 ] employs multi-layered narrative struc-
tures that progressively guide the model toward restricted behavior
through incremental logical steps, often embedded within fictional
scenarios.
Code-Based Attacks. These attacks exploit LLMs’ programming
capabilities by disguising harmful content within technical instruc-
tions or programming logic.

Dual use [ 19 ] combines code injection techniques with pay-
load splitting to craft harmful prompts that appear as legitimate
programming tasks while containing malicious intent.
Code Chameleon [ 24 ] reformulates harmful instructions as
code completion tasks, using custom encryption functions embed-
ded within programming contexts to enable decryption and execu-
tion of harmful queries while bypassing safety mechanisms.
4.5 Baseline Guardrails
(1) LlamaGuard [ 16 ]: An LLM-based guardrail fine-tuned on a
proprietary moderation dataset to classify prompts as “safe”
or “unsafe.” We evaluate two variants: “Llama-Guard-3-8B”
and the smaller “Llama-Guard-3-1B”.
(2) OpenAI Moderation [ 25 ]: A GPT-based moderation sys-
tem trained via active learning on public data. It flags prompts
as unsafe if the returned boolean is “True”, covering diverse
safety categories.
4.6 Experimental Setup
Data Splitting. We split our dataset into 5% training (14,800 sam-
ples), 5% validation (14,800 samples), and 90% testing (266,444 sam-
ples). We ensure that all language variants of a given prompt remain
in the same split by partitioning based on unique English prompt
IDs rather than random sampling.
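The split described above can be sketched as a group-aware partition: prompt IDs are shuffled and assigned to a split first, and every language variant then follows its ID. This is a minimal illustration, not the authors' script; the record keys `prompt_id` and `lang`, the function name, and the seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_prompt_id(records, train_frac=0.05, val_frac=0.05, seed=0):
    """Assign whole prompt IDs (not individual rows) to train/val/test.

    `records` is a list of dicts with hypothetical keys 'prompt_id' and 'lang'.
    """
    ids = sorted({r["prompt_id"] for r in records})
    random.Random(seed).shuffle(ids)

    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)

    assignment = {}
    for i, pid in enumerate(ids):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"

    # Every translation of a prompt lands in the split chosen for its ID.
    splits = defaultdict(list)
    for r in records:
        splits[assignment[r["prompt_id"]]].append(r)
    return dict(splits)


# Toy data: 100 English prompt IDs, each with three language variants.
toy = [{"prompt_id": i, "lang": lang} for i in range(100) for lang in ("eng", "tha", "vie")]
splits = split_by_prompt_id(toy)
print({name: len(rows) for name, rows in splits.items()})
```

Partitioning on IDs prevents a translated copy of a training prompt from appearing in the test set, which row-level random sampling would not guarantee.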
Model Implementation and Optimization. We developed SEAL-
Guard using Transformers [ 44 ] and PyTorch [ 32 ], fine-tuning the
multilingual SeaLLMs-v3-7B-Chat [ 50 ] with LoRA [ 15 ]. Training
was conducted on an AMD 5950X CPU with two NVIDIA RTX
3090 GPUs. Input prompts were wrapped in a chat template, and
the model was trained to autoregressively generate the prompt
followed by a classification token (“safe” or “unsafe”). We optimized
the model using Cross-Entropy Loss, masking prompt tokens dur-
ing loss computation to focus learning on the classification output.
The loss is defined as
\[ \mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, x) \tag{1} \]
where $T$ is the length of the target sequence, $y_t$ is the target token
at position $t$, $y_{<t}$ are previously generated tokens, $x$ is the input, and
$P_\theta$ is the model's predicted probability distribution parameterized
by $\theta$.
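The effect of masking prompt tokens can be illustrated with a small, framework-free sketch. It assumes the per-token log-probabilities of the target tokens are already available and sums the negative log-likelihood only over unmasked positions; the function name and the toy numbers are hypothetical and do not come from the replication package.

```python
import math


def masked_nll(token_log_probs, loss_mask):
    """Negative log-likelihood summed over positions where loss_mask is 1.

    token_log_probs[t] is log P(y_t | y_<t, x) for the target token at step t.
    """
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probabilities and mask must have the same length")
    return -sum(lp for lp, keep in zip(token_log_probs, loss_mask) if keep)


# Toy sequence: three prompt tokens (masked out) followed by one label token.
log_probs = [math.log(0.2), math.log(0.5), math.log(0.1), math.log(0.9)]
mask = [0, 0, 0, 1]

loss = masked_nll(log_probs, mask)
print(loss)
```

With the mask applied, only the final classification token contributes, so poorly predicted prompt tokens do not affect the loss.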
Hyper-Parameter Settings. We followed standard hyperparam-
eter settings commonly used for LoRA fine-tuning. Specifically, we
used a learning rate of 1e-4 with a LoRA dropout rate of 0.05. The
rank (r) of the LoRA modules was set to 8, and the scaling factor
(α) was set to 32. We applied gradient clipping with a maximum
gradient norm of 1.0. For learning rate scheduling, we used a cosine
scheduler with 5% of the total training steps allocated to warm-
up. The complete training recipe of our SEALGuard approach is
available in our replication package at
https://github.com/awsm-research/SEALGuard.
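For readers who want to see how these settings interact, the sketch below collects the reported values in a plain dictionary and implements one common form of a cosine schedule with warm-up (linear ramp, then cosine decay to zero). The dictionary keys and the exact schedule shape are assumptions made for illustration; the replication package is the authoritative source for the training recipe.

```python
import math

# Values reported in the text; the key names are illustrative only.
HPARAMS = {
    "learning_rate": 1e-4,
    "lora_dropout": 0.05,
    "lora_rank": 8,
    "lora_alpha": 32,
    "max_grad_norm": 1.0,
    "warmup_ratio": 0.05,
}


def lr_at(step, total_steps, base_lr=HPARAMS["learning_rate"],
          warmup_ratio=HPARAMS["warmup_ratio"]):
    """Learning rate at `step`: linear warm-up, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 525, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With 1,000 total steps, the first 50 steps ramp the learning rate up to 1e-4, after which it decays smoothly toward zero.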
5 	EXPERIMENTAL RESULTS
In this section, we present the results for our four research
questions.
(RQ1) What is the impact of multilingual unsafe
prompts in Southeast Asian languages on the
performance of existing guardrails?
Approach. To assess the impact of multilingual unsafe prompts in
Southeast Asian languages on the performance of existing guardrails,
we use 80,601 unsafe prompts from the SEALSBench testing set,
written in English and nine Southeast Asian (SEA) languages. We
first evaluate guardrail performance on the English prompts to
establish baselines in their familiar language. We then assess per-
formance on the non-English SEA prompts to measure the impact
of multilingual inputs. Specifically, we evaluate two state-of-the-art
language model-based guardrails: LlamaGuard [ 16 ] and OpenAI
Moderation [ 25 ]. We use Defense Success Rate (DSR) as our pri-
mary metric, which evaluates guardrail capability to defend against
unsafe prompts:
• Defense Success Rate (DSR):
\[ \mathrm{DSR} = \frac{TP}{TP + FN} \]
where $TP$ is the number of true positives (correctly detected
unsafe/jailbreak prompts) and $FN$ is the number of false
negatives (missed unsafe/jailbreak prompts).
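As a concrete reading of this definition, DSR is the recall of the unsafe class. The sketch below computes it from confusion counts; the function name and the example counts are hypothetical and chosen only to illustrate the formula.

```python
def defense_success_rate(true_positives, false_negatives):
    """DSR = TP / (TP + FN), i.e. the share of unsafe prompts that were blocked."""
    total_unsafe = true_positives + false_negatives
    if total_unsafe == 0:
        raise ValueError("DSR is undefined when there are no unsafe prompts")
    return true_positives / total_unsafe


# Hypothetical counts: 970 unsafe prompts detected, 30 missed.
dsr = defense_success_rate(true_positives=970, false_negatives=30)
print(dsr)
```

Because the denominator contains only unsafe prompts, DSR says nothing about false alarms on safe prompts.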
Figure 5: DSR result of unsafe prompts in English and SEA
languages. (Bar chart, (RQ1) Guardrail DSR: English Unsafe vs. SEA Unsafe; y-axis: Defense Success Rate (DSR); x-axis: Guardrails. English vs. SEA unsafe prompts without jailbreak attacks: OpenAI Moderation 59.74% vs. 29.4%; LlamaGuard 8B 58.78% vs. 49.63%; LlamaGuard 1B 44.7% vs. 44.56%.)
Results. Figure 5 presents the defense success rate (DSR) of the
three baseline guardrails, comparing their performance across two
scenarios: unsafe prompts in English (blue bars) versus multilingual
unsafe prompts, excluding jailbreak prompts, in Southeast Asian
languages (red bars).
LlamaGuard 8B’s DSR decreases by 9 percentage points,
declining from 58.78% to 49.63% when defending against
multilingual unsafe prompts. Similarly, OpenAI Moderation’s DSR
declines by 30 percentage points, dropping from 59.74% to 29.4%
when encountering multilingual unsafe prompts. LlamaGuard 1B, in
contrast, stays at a DSR of about 45% on both English and
multilingual unsafe prompts (44.7% and 44.56%), below LlamaGuard 8B
in both settings. These findings reveal that the effectiveness
of state-of-the-art guardrails decreases when defending against
multilingual unsafe prompts, highlighting the need for
multilingual guardrails capable of effectively defending
against such prompts.
(RQ2) What is the impact of multilingual
jailbreak prompts in Southeast Asian languages
on the performance of existing guardrails?
Approach. To assess the impact of multilingual jailbreak prompts
in Southeast Asian languages on guardrail performance, we use
8,060 unsafe prompts in English and 16,410 jailbreak prompts in
English and nine SEA languages from our SEALSBench test set.
We first evaluate guardrail performance on English unsafe prompts,
which yields the same baselines as in RQ1. We then assess perfor-
mance on multilingual jailbreak prompts to measure the effect of
multilingual jailbreak inputs, using the same baseline and metric
(DSR) as in RQ1.
Figure 6: DSR result of unsafe prompts in English and multilingual
jailbreak prompts. (Bar chart, (RQ2) Guardrail DSR: English Unsafe vs. Multilingual Jailbreaks; y-axis: Defense Success Rate (DSR); x-axis: Guardrails. English unsafe prompts without jailbreak attacks vs. multilingual jailbreak attacks: OpenAI Moderation 59.74% vs. 21.91%; LlamaGuard 8B 58.78% vs. 41.02%; LlamaGuard 1B 44.7% vs. 44.94%.)
Results. Figure 6 presents the defense success rate (DSR) of the
three baseline guardrails, comparing their performance across two
scenarios: unsafe prompts written in English without jailbreak
attacks (blue bars) versus multilingual jailbreak attacks in Southeast
Asian languages and English (red bars).
LlamaGuard 8B’s DSR decreases by 18 percentage points,
declining from 58.78% to 41.02% when defending against
multilingual jailbreak prompts. Similarly, OpenAI Moderation’s DSR
declines by 38 percentage points, dropping from 59.74% to
21.91% when encountering multilingual jailbreak prompts. As in
RQ1, LlamaGuard 1B stays at a DSR of about 45% on both English
unsafe prompts and multilingual jailbreak prompts (44.7% and
44.94%); its DSR on English unsafe prompts remains well below
that of LlamaGuard 8B. These findings show that the effectiveness
of state-of-the-art guardrails decreases when defending against
multilingual jailbreak prompts, highlighting the need for
multilingual jailbreak-aware guardrails.
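The declines reported for RQ1 and RQ2 are differences in percentage points, which is not the same as a relative decline. The sketch below recomputes both from the DSR values quoted in the text for LlamaGuard 8B and OpenAI Moderation; the dictionary layout and function name are illustrative assumptions.

```python
# DSR values (in percent) quoted in the RQ1 and RQ2 results text.
REPORTED_DSR = {
    "LlamaGuard 8B": {"english": 58.78, "sea_unsafe": 49.63, "jailbreak": 41.02},
    "OpenAI Moderation": {"english": 59.74, "sea_unsafe": 29.40, "jailbreak": 21.91},
}


def dsr_drop(before, after):
    """Return (drop in percentage points, drop relative to `before` in percent)."""
    points = before - after
    relative = 100.0 * points / before
    return points, relative


drops = {}
for guardrail, dsr in REPORTED_DSR.items():
    for setting in ("sea_unsafe", "jailbreak"):
        drops[(guardrail, setting)] = dsr_drop(dsr["english"], dsr[setting])

for (guardrail, setting), (points, relative) in drops.items():
    print(f"{guardrail:18s} {setting:11s} -{points:.2f} points ({relative:.1f}% relative)")
```

For example, LlamaGuard 8B's 9-point drop on multilingual unsafe prompts corresponds to losing roughly one sixth of its English-language DSR.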
(RQ3) How effective is our SEALGuard at
defending against multilingual unsafe and
jailbreak prompts in Southeast Asian languages?
Approach. To address this RQ, we compare our SEALGuard ap-
proach with LlamaGuard and OpenAI Moderation using the full
multilingual testing set from our SEALSBench dataset. This set
includes 169,433 safe, 80,601 unsafe, and 16,410 jailbreak prompts
written in English and nine Southeast Asian languages. We use the
same Defense Success Rate (DSR) as in RQ1 and RQ2 to evaluate
guardrail effectiveness in defending against unsafe and jailbreak
prompts. Beyond defense capability, maintaining a low false alarm
rate is also critical to avoid misclassifying safe content. Thus, we
use precision to measure the accuracy of the guardrail in avoiding
false alarms—i.e., how well it identifies only truly unsafe prompts
without misclassifying safe ones. We also use the F1-Score to capture
the overall balance between correctly detecting unsafe prompts
and minimizing false positives. Formally:
• Precision:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
where $TP$ is the number of true positives (correctly detected
unsafe/jailbreak prompts) and $FP$ is the number of false
positives (safe prompts incorrectly classified as unsafe).
• F1-Score:
\[ \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{DSR}}{\mathrm{Precision} + \mathrm{DSR}} \]
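The two measures can be written down directly from these definitions. The sketch below is illustrative: the function names are hypothetical, and the example inputs are the rounded DSR and precision reported for SEALGuard in Table 1, so the recomputed F1 matches the published value only up to rounding.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


def f1_score(precision_value, dsr_value):
    """Harmonic mean of precision and DSR (DSR plays the role of recall)."""
    if precision_value + dsr_value == 0:
        return 0.0
    return 2 * precision_value * dsr_value / (precision_value + dsr_value)


# Rounded values reported for SEALGuard: precision 98.90%, DSR 97.23%.
f1 = f1_score(0.9890, 0.9723)
print(round(f1, 4))
```

The harmonic mean penalizes imbalance: a guardrail with high precision but low DSR, or the reverse, receives a low F1-Score.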
Results. Figure 7 presents the experimental results of our SEALGuard
and the three baseline approaches according to our three
evaluation measures (i.e., DSR, Precision, and F1-Score).
Our SEALGuard achieves an F1-Score of 98%, which is 58, 52,
and 34 percentage points higher than LlamaGuard-3-1B, OpenAI
Moderation, and LlamaGuard-3-8B, respectively. In terms of
DSR, Figure 7 shows that our SEALGuard achieves the highest
DSR of 97%, while LlamaGuard-3-8B, LlamaGuard-3-1B, and OpenAI
Moderation achieve 49%, 45%, and 31%, respectively. SEALGuard
thus improves the DSR by 48, 52, and 66 percentage points. In
terms of precision, Figure 7 shows that SEALGuard achieves the
highest precision of 99%, while OpenAI Moderation, LlamaGuard-3-8B,
and LlamaGuard-3-1B achieve 96%, 91%, and 36%, respectively.
SEALGuard thus improves precision by 3, 8, and 63 percentage points.
In summary, these findings confirm that our SEALGuard ap-
proach is effective in defending against multilingual unsafe and
jailbreak prompts while maintaining a low false positive rate. These
results demonstrate that SEALGuard effectively overcomes
a key limitation of the state-of-the-art guardrail, Llama-
Guard—namely, its reduced effectiveness in defending against
multilingual unsafe and jailbreak prompts, as shown in RQ1
and RQ2.

Figure 7: (RQ3) The experimental results of our SEALGuard and the three baseline comparisons classifying safe and unsafe
(including jailbreaks) prompts. (↗) Higher F1, DSR, Precision, Accuracy = Better. Bar values:
Guardrail 	DSR 	Precision 	F1
SEALGuard 	97.23% 	98.9% 	98.05%
LlamaGuard-3-8B 	48.93% 	90.64% 	63.55%
LlamaGuard-3-1B 	44.63% 	36.35% 	40.07%
OpenAI Moderation 	30.63% 	95.84% 	46.43%
Table 1: (RQ4 Results) The performance comparisons of our
SEALGuard approach and the five variants to analyze the
contributions of the adaptation strategy and model size.
Methods 	DSR 	Precision 	F1-Score
SeaLLM-V3-7B-Chat + LoRA (SEALGuard) 	97.23 	98.90 	98.05
SeaLLM-V3-1.5B-Chat + LoRA 	96.5 	98.20 	97.34
SeaLLM-V3-7B-Chat + NeMo 	49.88 	77.17 	60.59
SeaLLM-V3-1.5B-Chat + NeMo 	89.45 	34.89 	50.20
SeaLLM-V3-7B-Chat 	15.07 	89.51 	25.80
SeaLLM-V3-1.5B-Chat 	18.48 	60.14 	28.27
(RQ4) What are the contributions of adaptation
strategies and model size to the performance of
our SEALGuard multilingual guardrail?
Approach. To address this RQ, we conduct an ablation study to
evaluate the contribution of each component in SEALGuard. Our
approach combines two key elements: the SeaLLM-v3-7B-Chat
model (7B parameters) and a LoRA-based adaptation strategy that
aligns a general-purpose language model with guardrail objectives.
To assess the effectiveness of the adaptation, we compare
our method with two variants: (1) applying the chat template in-
troduced in Section 3.1 directly to SeaLLM-v3-7B-Chat without
adaptation, and (2) using the NVIDIA NeMo toolkit with the same
model, an off-the-shelf guardrail framework that requires no fine-
tuning. To examine the impact of model size, we also compare
against a smaller model, SeaLLM-v3-1.5B-Chat. In summary, we
evaluate six models in this RQ:
• SeaLLM-V3-7B-Chat + LoRA (SEALGuard): our proposed ap-
proach by applying LoRA on a multilingual language model.
• SeaLLM-V3-1.5B-Chat + LoRA: applying LoRA on a smaller
multilingual language model to study the impact of model
size on performance.
• SeaLLM-V3-7B-Chat + NeMo: applying the NeMo framework
for adaptation to study the effect of alternative adaptation
strategies.
• SeaLLM-V3-1.5B-Chat + NeMo: applying the NeMo frame-
work on a smaller model to analyze both adaptation strategy
and model size impact.
• SeaLLM-V3-7B-Chat: no adaptation, used to isolate and eval-
uate the contribution of adaptation strategies.
• SeaLLM-V3-1.5B-Chat: no adaptation, used to evaluate the
baseline performance of a smaller model without any adaptation.
Results. Table 1 presents the performance comparison of our
SEALGuard approach and five variants to analyze the contributions
of adaptation strategies and model size.
LoRA adaptation is the key component driving the effectiveness
of our SEALGuard approach. Within SEALGuard, LoRA accounts
for 72 percentage points of the F1-Score. Specifically,
when comparing “SeaLLM-V3-7B-Chat + LoRA (SEALGuard)”
with “SeaLLM-V3-7B-Chat” (without LoRA), the F1-Score drops
from 98% to 26%, a 72-point contribution by LoRA.
In addition, comparing “SeaLLM-V3-7B-Chat + NeMo” with
“SeaLLM-V3-7B-Chat + LoRA (SEALGuard)” shows an F1-Score in-
crease from 61% to 98%, representing a 37% improvement attributed
to LoRA. Similarly, for the smaller variant of SEALGuard, “SeaLLM-
V3-1.5B-Chat + LoRA” outperforms both “SeaLLM-V3-1.5B-Chat”
and “SeaLLM-V3-7B-Chat + NeMo” by 70% and 38% in F1-Score,
respectively. These results demonstrate that LoRA is the most
effective adaptation strategy for aligning a general-purpose
language model with a guardrail. In summary, these findings
validate the design rationale of SEALGuard, showing that LoRA
adaptation is the primary driver of performance gains, while model
size plays a comparatively minor role—even models with around
1B parameters can achieve promising results with our approach.
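The ablation figures can be cross-checked against Table 1, since each F1-Score is the harmonic mean of the DSR and precision in the same row. The sketch below recomputes the F1-Scores and each variant's gap to SEALGuard in percentage points; the data come from Table 1, while the variable and function names are illustrative.

```python
# (DSR, Precision) in percent, copied from Table 1.
TABLE_1 = {
    "SeaLLM-V3-7B-Chat + LoRA (SEALGuard)": (97.23, 98.90),
    "SeaLLM-V3-1.5B-Chat + LoRA": (96.50, 98.20),
    "SeaLLM-V3-7B-Chat + NeMo": (49.88, 77.17),
    "SeaLLM-V3-1.5B-Chat + NeMo": (89.45, 34.89),
    "SeaLLM-V3-7B-Chat": (15.07, 89.51),
    "SeaLLM-V3-1.5B-Chat": (18.48, 60.14),
}


def f1_from(dsr, precision):
    """Harmonic mean of DSR and precision (both in percent)."""
    return 2 * dsr * precision / (dsr + precision)


f1 = {name: f1_from(dsr, prec) for name, (dsr, prec) in TABLE_1.items()}
reference = f1["SeaLLM-V3-7B-Chat + LoRA (SEALGuard)"]
gaps = {name: reference - value for name, value in f1.items()}

for name in TABLE_1:
    print(f"{name:40s} F1={f1[name]:6.2f}  gap to SEALGuard={gaps[name]:6.2f}")
```

The recomputed gaps of roughly 72 points (no adaptation) and 37 points (NeMo) for the 7B model agree with the values discussed above.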
6 DISCUSSION
Our experimental results confirm the effectiveness of SEALGuard in
defending against multilingual unsafe and jailbreak prompts, show-
ing substantial improvements over existing guardrails. However,
its performance across different languages, unsafe prompt types,

Figure 8: (Discussion) The experimental results for our SEALGuard and the other five baseline models classifying in category,
language, and jailbreak attacks.
and jailbreak prompts remains unclear. Thus, in this section, we
analyze SEALGuard’s performance across these three dimensions.
6.1 Performance Across Different Languages
To assess the cross-lingual robustness of our SEALGuard approach,
we analyze the performance of our SEALGuard across ten different
languages. Specifically, the evaluation includes 80,601 unsafe
prompts, 16,410 jailbreak prompts, and 169,433 safe prompts from
our SEALSBench dataset. We focus on the F1-score as it captures
both the model’s ability to defend against unsafe prompts and its
tendency to raise false alarms on safe ones.
The left part of Figure 8 shows the F1-scores of our SEAL-
Guard across the ten studied languages. SEALGuard consis-
tently achieves the highest performance in all languages,
effectively mitigating the language-specific vulnerabilities
present in existing guardrails. It also outperforms the strongest
baseline, LlamaGuard 8B, by a clear margin, as illustrated in Fig-
ure 8. This analysis confirms the cross-lingual robustness of our
SEALGuard approach.
6.2 	Performance on Different Unsafe categories
To assess the robustness of SEALGuard across different types
of unsafe content, we analyze its performance on ten different
unsafe categories. As introduced in Section 4.3, we used 10 unsafe categories
labeled as C1 to C10 in our test dataset, totaling 80,601 prompts.
We adopt DSR as the evaluation measure because this analysis
involves only unsafe prompts, allowing DSR to assess single-class
performance without interference from false positives associated
with safe prompts.
The middle part of Figure 8 presents the DSR of SEALGuard com-
pared to baseline guardrails across ten unsafe categories. LlamaGuard-
8B exhibits low Defense Success Rate (DSR below 30%) in several
categories, including C5 (Hazardous Professional Guidance), C7
(Discriminatory and Hateful Expression), and C10 (Misinformation
& Extremist Content). In contrast, SEALGuard demonstrates
consistently high performance across all categories, with
DSR exceeding 95% in every category, confirming its robust-
ness across diverse types of unsafe content.
6.3 	Performance on Different Jailbreak Prompts
To evaluate the robustness of SEALGuard across different jailbreak
categories, we analyze its performance against nine jailbreak attacks.
As described in Section 4.4, this analysis covers nine jailbreak
categories from our test dataset, totaling 16,410 prompts. We use
Defense Success Rate (DSR) as the evaluation metric, as the analy-
sis focuses solely on jailbreak prompts without any safe examples,
making DSR the appropriate measure for assessing single-class
performance.
The right part of Figure 8 presents the DSR of SEALGuard across
nine jailbreak categories. While LlamaGuard-8B shows low Defense
Success Rate (DSR below 30%) against Deep-Inception, Chameleon,
Zulu, and Dual-use jailbreak prompts, SEALGuard achieves con-
sistently high performance across all jailbreak types, with
DSR exceeding 95% in every category, confirming its robust-
ness against diverse jailbreak attacks.
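The per-language, per-category, and per-attack analyses in this section all reduce to computing DSR within groups of unsafe prompts. The sketch below shows one way to do this; the record keys (`category`, `flagged`) and the toy data are hypothetical and do not reflect the schema of the released benchmark.

```python
from collections import defaultdict


def dsr_by_group(records, key):
    """DSR per group, where every record is an unsafe or jailbreak prompt.

    `flagged` is True when the guardrail labeled the prompt as unsafe.
    """
    detected = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        detected[r[key]] += int(r["flagged"])
    return {group: detected[group] / total[group] for group in total}


# Toy predictions for two categories (illustrative values only).
toy = (
    [{"category": "C1", "flagged": True}] * 19
    + [{"category": "C1", "flagged": False}] * 1
    + [{"category": "C5", "flagged": True}] * 3
    + [{"category": "C5", "flagged": False}] * 7
)

per_category = dsr_by_group(toy, key="category")
print(per_category)
```

Passing a different key, such as a language code or an attack name, yields the other two breakdowns without changing the function.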
7 	THREATS TO VALIDITY
Threats to internal validity relate to factors within our study that
may affect the accuracy of the findings. One such threat is the
potential variation in translation accuracy when converting Eng-
lish prompts into Southeast Asian languages, which may affect
guardrails’ detection performance. To mitigate this, we rely on
consistent use of the Google Translate API and make our translation
process publicly available to promote transparency and reproducibility.
Another internal threat is the limited diversity in unsafe
and jailbreak prompts, which could bias evaluation results. To mit-
igate this, we curate SEALSBench from six diverse data sources
[ 4 , 18 , 36 , 39 , 42 , 52 ], including ten unsafe categories and nine jail-
break categories, thereby enhancing the representativeness of the
dataset.
Threats to external validity relate to the degree to which our find-
ings can be generalized to other LLM safety alignment datasets used
for evaluating guardrail performance. While our SEALGuard ap-
proach is assessed using our curated SEALSBench dataset, the
results may not necessarily generalize to other datasets. To miti-
gate this, we incorporate prompts from six diverse sources [ 4 , 18 ,
36 , 39 , 42 , 52 ] during the data collection step when constructing
our SEALSBench dataset, ensuring a broad representation of safe,
unsafe, and jailbreak prompts. The final dataset comprises over
260,000 prompts written in ten different languages.
8 	CONCLUSION
In this paper, we empirically show that LlamaGuard’s Defense
Success Rate (DSR) drops by 9% and 18% under multilingual un-
safe and jailbreak prompts, revealing a critical gap in multilingual
safety alignment for LLM software systems. To address this, we
introduce SEALGuard, a multilingual guardrail adapted via LoRA
from a general-purpose multilingual model, focusing on South-
east Asian languages. We also present SEALSBench, a benchmark
dataset of over 260K prompts across English and nine Southeast
Asian languages, covering safe, unsafe, and jailbreak scenarios.
SEALGuard achieves 97% DSR, 99% precision, and 98% F1-score,
outperforming LlamaGuard by 48%, 8%, and 34%, respectively, while
maintaining robust performance across languages and unsafe types.
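The three reported metrics can be related with a short sketch. It assumes that DSR is the fraction of unsafe or jailbreak prompts the guardrail flags (that is, recall on the unsafe class); this reading is an assumption here, although it is consistent with the reported numbers, since a precision of 0.99 and a recall of 0.97 give an F1-score of about 0.98. The `Counts` class, the function names, and the counts themselves are hypothetical and are not taken from the paper's results.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts with 'unsafe' as the positive class (hypothetical helper)."""
    tp: int  # unsafe prompts flagged as unsafe
    fp: int  # safe prompts flagged as unsafe
    fn: int  # unsafe prompts that were not flagged


def defense_success_rate(c: Counts) -> float:
    # Assumption: DSR is the share of unsafe prompts that are blocked (recall).
    return c.tp / (c.tp + c.fn)


def precision(c: Counts) -> float:
    # Share of flagged prompts that were actually unsafe.
    return c.tp / (c.tp + c.fp)


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and DSR (recall).
    p, r = precision(c), defense_success_rate(c)
    return 2 * p * r / (p + r)


# Illustrative counts only, chosen to land near the reported rates.
example = Counts(tp=970, fp=10, fn=30)

print(f"DSR:       {defense_success_rate(example):.2f}")
print(f"Precision: {precision(example):.2f}")
print(f"F1-score:  {f1_score(example):.2f}")
```

With these illustrative counts the three values round to 0.97, 0.99, and 0.98, which shows how the F1-score follows from the other two metrics under the stated assumption.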
REFERENCES
[1] Reda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, Mohamed El Amine
Seddik, Mugariya Farooq, and Hakim Hacid. 2024. Alignment with preference
optimization is all you need for llm safety. arXiv preprint arXiv:2409.07772 (2024).
[2] Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks
with perplexity. arXiv preprint arXiv:2308.14132 (2023).
[3] Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor
Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, et al.
2024. Managing extreme AI risks amid rapid progress. Science 384, 6698 (2024),
842–845.
[4] Rishabh Bhardwaj, Do Duc Anh, and Soujanya Poria. 2024. Language Models are
Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through
Task Arithmetic. arXiv preprint arXiv:2402.11746 (2024).
[5] Bruce G Buchanan and Edward A Feigenbaum. 1981. DENDRAL and Meta-
DENDRAL: Their applications dimension. In Readings in artificial intelligence.
Elsevier, 313–322.
[6] Pin-Yu Chen, Han Shen, Payel Das, and Tianyi Chen. 2025. Fundamental Safety-
Capability Trade-offs in Fine-tuning Large Language Models. arXiv preprint
arXiv:2503.20807 (2025).
[7] Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. [n. d.]. Multilingual
Jailbreak Challenges in Large Language Models. https://doi.org/10.48550/arXiv.2310.06474
arXiv:2310.06474 [cs]
[8] Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu,
Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi
Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri
Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarn-
mongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew,
Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew,
Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran,
Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min
Lin. 2025. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM.
arXiv preprint arXiv:2502.12982 (2025).
[9] Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, and Yisen Wang. 2025.
Advancing llm safe alignment with safety representation ranking. arXiv preprint
arXiv:2505.15710 (2025).
[10] Duolingo. 2025. Duolingo - The world’s best way to learn a language.
https://www.duolingo.com/.
[11] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav
Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022.
Red teaming language models to reduce harms: Methods, scaling behaviors, and
lessons learned. arXiv preprint arXiv:2209.07858 (2022).
[12] Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang,
Jiawei Han, and Yuning Mao. 2023. Mart: Improving llm safety with multi-round
automatic red-teaming. arXiv preprint arXiv:2311.07689 (2023).
[13] Suhun Han. 2025. googletrans 4.0.2. https://pypi.org/project/googletrans/.
[14] Ahmed E Hassan, Gustavo A Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, et al.
2024. Rethinking Software Engineering in the Foundation Model Era: From
Task-Driven AI Copilots to Goal-Driven AI Pair Programmers. arXiv preprint
arXiv:2404.10225 (2024).
[15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean
Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large
language models. arXiv preprint arXiv:2106.09685 (2021).
[16] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning
Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023.
Llama guard: Llm-based input-output safeguard for human-ai conversations.
arXiv preprint arXiv:2312.06674 (2023).
[17] Jailbreak Chat. 2023. Jailbreak Chat Prompt.
https://www.jailbreakchat.com/prompt/4f37a029-9dff-4862-b323-c96a5504de5d Last accessed: 2024-09-20.
[18] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen,
Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. Beavertails: Towards
improved safety alignment of llm via a human-preference dataset. Advances in
Neural Information Processing Systems 36 (2023), 24678–24704.
[19] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tat-
sunori Hashimoto. 2024. Exploiting programmatic behavior of llms: Dual-use
through standard security attacks. In 2024 IEEE Security and Privacy Workshops
(SPW). IEEE, 132–143.
[20] Taku Kudo. 2018. Subword regularization: Improving neural network translation
models with multiple subword candidates. arXiv preprint arXiv:1804.10959 (2018).
[21] Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and
Lucy Vasserman. 2022. A new generation of perspective api: Efficient multilingual
character-level transformers. In Proceedings of the 28th ACM SIGKDD conference
on knowledge discovery and data mining. 3197–3207.
[22] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han.
2023. Deepinception: Hypnotize large language model to be jailbreaker. arXiv
preprint arXiv:2311.03191 (2023).
[23] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu,
Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. 2024. Personal llm agents:
Insights and survey about the capability, efficiency and security. arXiv preprint
arXiv:2401.05459 (2024).
[24] Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye,
Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Codechameleon: Personalized
encryption framework for jailbreaking large language models. arXiv preprint
arXiv:2402.16717 (2024).
[25] Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee,
Steven Adler, Angela Jiang, and Lilian Weng. 2022. A Holistic Approach to
Undesired Content Detection. arXiv preprint arXiv:2208.03274 (2022).
[26] Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul,
Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic
approach to undesired content detection in the real world. In Proceedings of the
AAAI Conference on Artificial Intelligence, Vol. 37. 15009–15018.
[27] Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. 2024. The landscape
of emerging ai agent architectures for reasoning, planning, and tool calling: A
survey. arXiv preprint arXiv:2404.11584 (2024).
[28] Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong,
Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng,
et al. 2025. Sea-lion: Southeast asian languages in one network. arXiv preprint
arXiv:2504.05747 (2025).
[29] Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu,
Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, et al.
2023. SeaLLMs–Large Language Models for Southeast Asia. arXiv preprint
arXiv:2312.00738 (2023).
[30] Eugenio Oliveira, Klaus Fischer, and Olga Stepankova. 1999. Multi-agent systems:
which research for which applications. Robotics and Autonomous Systems 27, 1-2
(1999), 91–106.
[31] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang,
Jana Thompson, Phu Mon Htut, and Samuel R Bowman. 2021. BBQ: A hand-built
bias benchmark for question answering. arXiv preprint arXiv:2110.08193 (2021).
[32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
2017. Automatic differentiation in PyTorch. In NIPS-W.
[33] Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl,
Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason
Gabriel, et al. 2022. Characteristics of harmful text: Towards rigorous
benchmarking of language models. Advances in Neural Information Processing Systems
35 (2022), 24720–24739.
[34] Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, and
Jonathan Cohen. 2023. Nemo guardrails: A toolkit for controllable and safe
llm applications with programmable rails. arXiv preprint arXiv:2310.10501 (2023).
[35] Rico Sennrich. 2015. Neural machine translation of rare words with subword
units. arXiv preprint arXiv:1508.07909 (2015).
[36] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2023.
"Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts
on large language models. arXiv preprint arXiv:2308.03825 (2023).
[37] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024.
"Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts
on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference
on Computer and Communications Security. 1671–1685.
[38] Speak. 2025. Speak - The language learning app that gets you speaking.
https://www.speak.com/.
[39] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos
Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An
Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne
Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971 (2023).
[41] William Van Melle. 1978. MYCIN: a knowledge-based consultation program for
infectious disease diagnosis. International journal of man-machine studies 10, 3
(1978), 313–322.
[42] Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin.
2023. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint
arXiv:2308.13387 (2023).
[43] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How
does llm safety training fail? Advances in Neural Information Processing Systems
36 (2023), 80079–80110.
[44] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,
Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al.
2019. Huggingface’s transformers: State-of-the-art natural language processing.
arXiv preprint arXiv:1910.03771 (2019).
[45] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. 2024.
A survey on large language model (llm) security and privacy: The good, the bad,
and the ugly. High-Confidence Computing (2024), 100211.
[46] Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2023. Low-resource
languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446 (2023).
[47] Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2023. Low-resource
languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446 (2023).
[48] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shum-
ing Shi, and Zhaopeng Tu. 2023. Gpt-4 is too smart to be safe: Stealthy chat with
llms via cipher. arXiv preprint arXiv:2308.06463 (2023).
[49] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shum-
ing Shi, and Zhaopeng Tu. 2023. Gpt-4 is too smart to be safe: Stealthy chat with
llms via cipher. arXiv preprint arXiv:2308.06463 (2023).
[50] Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang,
Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, et al. 2024.
Seallms 3: Open foundation and chat multilingual large language models for
southeast asian languages. arXiv preprint arXiv:2407.19672 (2024).
[51] Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei,
and Dawn Song. 2025. Improving llm safety alignment with dual-objective
optimization. arXiv preprint arXiv:2503.03710 (2025).
[52] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt
Fredrikson. 2023. Universal and transferable adversarial attacks on aligned
language models. arXiv preprint arXiv:2307.15043 (2023).