The code repository for the paper "X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability".


AI45Lab/X-Boundary


X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Paper | Code | Models

In this paper, we comprehensively compare existing defense methods in multi-turn attack scenarios and reveal their shortcomings in balancing defense robustness and LLM usability. We analyze this issue from the perspective of the LLM's feature space and conclude that previous methods fail to learn a precise boundary between safe and harmful representations because they lack an explicit formulation of it. To address this, we propose X-Boundary, which pushes harmful representations away from safe ones through explicit loss functions and thereby obtains a clear separating boundary. This boundary allows harmful representations to be removed without disrupting safe ones, striking a balance between robustness against multi-turn jailbreaks and LLM usability.
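As a rough illustration of the idea (not the exact loss from the paper; the margin value and the choice of L2 distance are assumptions), a separation loss of this kind can be sketched in PyTorch as:

import torch
import torch.nn.functional as F

def separation_loss(safe_reps, harmful_reps, margin=1.0):
    """Illustrative hinge loss that pushes each harmful hidden state at least
    `margin` away (in L2 distance) from every safe hidden state.
    Both inputs have shape (batch, hidden_dim)."""
    dists = torch.cdist(harmful_reps, safe_reps)  # pairwise L2 distances
    return F.relu(margin - dists).mean()          # penalize pairs closer than the margin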


Snapshot of Results

[Figure: snapshot of results]

Installation

conda create -n xboun python=3.10
conda activate xboun
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
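To confirm the environment is set up correctly, a quick check with standard PyTorch calls (nothing repo-specific):

import torch
print(torch.__version__)          # expected: 2.4.0+cu118
print(torch.cuda.is_available())  # expected: True on a CUDA-capable machine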

Training

sh scripts/lorra_x_boundary_llama3_8b.sh

sh scripts/lorra_x_boundary_qwen2_7b.sh

Evaluation

Evaluate defense against single-turn attacks from HarmBench

sh scripts/eval/eval_cb.sh $model_path

Evaluate defense against ActorAttack

sh scripts/eval/multi_round_eval.sh $model_path

Evaluate defense against RedQueen attack

sh scripts/eval/red_queen_eval.sh $model_path

sh scripts/eval/red_queen_eval_llama.sh $model_path # for llama-3

Evaluate over-refusal rate

sh scripts/eval/overrefusal_eval.sh $model_path data/test/OKTest.json

sh scripts/eval/overrefusal_eval.sh $model_path data/test/PHtest.json

sh scripts/eval/overrefusal_eval.sh $model_path data/test/ORbench_test300.json

sh scripts/eval/overrefusal_eval.sh $model_path data/test/xstest_v2_prompts.json

If you want to speed up inference, especially with a reasoning model such as DeepSeek-R1, add "--vlm_acc true" to your evaluation scripts to use vLLM as the inference backend.

pip install vllm==0.7.3

For the R1-distilled models, we set max_new_tokens to 8192 when evaluating single-turn safety and over-refusal, and to 2048 when evaluating under multi-turn attacks.
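If you prefer to call vLLM directly, a minimal sketch looks like this (the model path and prompt are placeholders; the evaluation scripts above handle this internally):

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/x-boundary-model")  # placeholder path
# 8192 new tokens for single-turn safety / over-refusal, 2048 for multi-turn attacks
params = SamplingParams(temperature=0.0, max_tokens=8192)
outputs = llm.generate(["How do I secure my home Wi-Fi network?"], params)
print(outputs[0].outputs[0].text)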

Inference with X-Boundary adapter and gibberish filter

After harmful representations are erased, the LLM may occasionally generate gibberish because it can no longer produce harmful content. A rule-based detector can identify such gibberish and replace it with a refusal response; this post-processing generally does not affect normal outputs. We provide a demo in R1_X_Boundary_demo.py.
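As a rough sketch of such a filter (the thresholds and rules below are illustrative assumptions, not the exact logic in R1_X_Boundary_demo.py):

import re

REFUSAL = "Sorry, I can't help with that request."

def looks_like_gibberish(text, min_alpha_ratio=0.6, max_repeat_run=8):
    """Heuristic check: flag empty output, text with too few alphabetic/space
    characters, or a long run of a single repeated character."""
    if not text.strip():
        return True
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / len(text) < min_alpha_ratio:
        return True
    if re.search(r"(.)\1{%d,}" % max_repeat_run, text):
        return True
    return False

def postprocess(response):
    """Replace detected gibberish with a fixed refusal; pass normal text through."""
    return REFUSAL if looks_like_gibberish(response) else response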

[Figure: demo of inference with the X-Boundary adapter and gibberish filter]

Acknowledgement

This repository builds on part of the code framework of Circuit Breaker.

Citation

@misc{lu2025xboundarye,
      title={X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability}, 
      author={Xiaoya Lu and Dongrui Liu and Yi Yu and Luxin Xu and Jing Shao},
      year={2025},
      eprint={2502.09990},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2502.09990}, 
}
