X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
In this paper, we comprehensively compare existing defense methods in multi-turn attack scenarios and reveal their shortcomings in balancing defense robustness and LLM usability. We analyze this issue from the perspective of LLMs' feature space and find that previous methods fail to learn a precise boundary between safe and harmful representations because they lack an explicit formulation. To address this, we propose X-Boundary, which pushes harmful representations away from safe ones through explicit loss functions and thereby obtains a precise distinction boundary. This boundary allows harmful representations to be removed without disrupting safe ones, striking a balance between robustness against multi-turn jailbreaks and LLM usability.
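Conceptually, such an explicit formulation can be thought of as a separation term that pushes harmful representations away from safe ones plus a retention term that keeps safe representations close to those of the original model. The PyTorch sketch below illustrates this idea only; it is not the paper's exact objective, and the function name, margin, pooling, and loss weighting are assumptions.

# Illustrative sketch of an explicit separation/retention loss (not the paper's exact objective).
import torch
import torch.nn.functional as F

def boundary_losses(safe_hidden, harmful_hidden, safe_hidden_ref, margin=1.0):
    """safe_hidden / harmful_hidden: (batch, dim) pooled hidden states from the model
    being trained; safe_hidden_ref: the frozen reference model's states for the same safe inputs."""
    # Push harmful representations away from safe ones, up to a margin.
    dist = torch.cdist(harmful_hidden, safe_hidden).mean()
    separation_loss = F.relu(margin - dist)
    # Keep safe representations close to the reference model to preserve usability.
    retain_loss = F.mse_loss(safe_hidden, safe_hidden_ref)
    return separation_loss, retain_loss

# Example with random pooled hidden states (batch=4, dim=8):
safe = torch.randn(4, 8); harmful = torch.randn(4, 8); safe_ref = safe.detach().clone()
sep, retain = boundary_losses(safe, harmful, safe_ref)
loss = sep + 0.5 * retain  # the weighting here is illustrative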
conda create -n xboun python=3.10
conda activate xboun
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
sh scripts/lorra_x_boundary_llama3_8b.sh
sh scripts/lorra_x_boundary_qwen2_7b.sh
Evaluate defense against single-turn attacks in HarmBench
sh scripts/eval/eval_cb.sh $model_path
Evaluate defense against ActorAttack
sh scripts/eval/multi_round_eval.sh $model_path
Evaluate defense against RedQueen attack
sh scripts/eval/red_queen_eval.sh $model_path
sh scripts/eval/red_queen_eval_llama.sh $model_path # for llama-3
Evaluate over-refusal rate
sh scripts/eval/overrefusal_eval.sh $model_path data/test/OKTest.json
sh scripts/eval/overrefusal_eval.sh $model_path data/test/PHtest.json
sh scripts/eval/overrefusal_eval.sh $model_path data/test/ORbench_test300.json
sh scripts/eval/overrefusal_eval.sh $model_path data/test/xstest_v2_prompts.json
If you want to speed up inference, especially when using a reasoning model like DeepSeek-R1, you can set "--vlm_acc true" in your evaluation scripts to use vLLM as the inference backend.
pip install vllm==0.7.3
For the R1-distilled models, we set max_new_tokens to 8192 when evaluating single-turn safety and over-refusal, and to 2048 when evaluating under multi-turn attacks.
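As a rough illustration of how the vLLM backend and the max_new_tokens limits above might be wired together (a hedged sketch; model_path and prompts are placeholders, and the actual evaluation scripts may configure this differently):

from vllm import LLM, SamplingParams

model_path = "path/to/x_boundary_checkpoint"  # placeholder path
prompts = ["Example evaluation prompt."]      # placeholder prompts

llm = LLM(model=model_path)
# 8192 tokens for single-turn safety / over-refusal; use 2048 for multi-turn attack evals.
sampling = SamplingParams(max_tokens=8192, temperature=0.0)
outputs = llm.generate(prompts, sampling)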
After harmful representations are erased, the LLM may occasionally generate gibberish because it can no longer produce harmful content. A rule-based detector can identify such gibberish and replace it with a refusal response; this post-processing generally does not affect normal outputs. We provide a demo in R1_X_Boundary_demo.py.
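For reference, here is a minimal sketch of what such rule-based post-processing could look like (illustrative only; the actual heuristics live in R1_X_Boundary_demo.py, and the thresholds and refusal string below are assumptions):

# Illustrative rule-based gibberish detector with a refusal fallback.
import re

REFUSAL = "Sorry, I can't help with that."

def looks_like_gibberish(text: str, max_repeat: int = 8, min_alpha_ratio: float = 0.5) -> bool:
    """Flag outputs dominated by repeated fragments or non-alphanumeric noise."""
    stripped = text.strip()
    if not stripped:
        return True
    # Long runs of the same short fragment are a common gibberish pattern.
    if re.search(r"(.{1,10})\1{%d,}" % max_repeat, stripped):
        return True
    # Mostly symbols or mojibake rather than words.
    alpha_ratio = sum(c.isalnum() or c.isspace() for c in stripped) / len(stripped)
    return alpha_ratio < min_alpha_ratio

def postprocess(text: str) -> str:
    """Replace detected gibberish with a refusal; leave normal outputs untouched."""
    return REFUSAL if looks_like_gibberish(text) else text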
This repository leverages part of the code framework of Circuit Breaker.
@misc{lu2025xboundarye,
  title={X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability},
  author={Xiaoya Lu and Dongrui Liu and Yi Yu and Luxin Xu and Jing Shao},
  year={2025},
  eprint={2502.09990},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2502.09990},
}