- `2024-10-30`: 🤗 We provide the Model Weights and Visualization Results on HuggingFace.
- `2024-09-24`: 🚀 We provide all the Model Weights for the community.
- `2024-09-14`: 💎 We provide the Mamba-YOLO-World source code for the community.
- `2024-09-12`: We provide the Visualization Results of zero-shot inference on LVIS, generated by Mamba-YOLO-World and YOLO-World, for comparison.
This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for Mamba-YOLO-World.
- We present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion-PAN as its neck architecture.
- We introduce a State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, with O(N+1) complexity and globally guided receptive fields.
- Experiments demonstrate that our model outperforms the original YOLO-World while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
- We adopt the pre-trained Mamba-YOLO-World-S/M/L and YOLO-World-v2-S/M/L models and conduct zero-shot inference on LVIS-val2017 (COCO-val2017 images with the LVIS vocabulary, which contains 1203 categories).
- All visualization results are available on Quark (https://pan.quark.cn/s/450070c03c58) and HuggingFace (https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main/zeroshot_pictures_COCO_Comparison). You are welcome to download them and compare our Mamba-YOLO-World with the original YOLO-World across the small (S), medium (M), and large (L) size variants.
- The visualization results demonstrate that our Mamba-YOLO-World significantly outperforms YOLO-World (including YOLO-World-v2, its latest version) in accuracy and generalization across all size variants.

Zero-shot evaluation on LVIS (pre-trained models):

Model | Pre-train Data | AP<sup>mini</sup> | AP<sup>r</sup> | AP<sup>c</sup> | AP<sup>f</sup> | Weights (Quark) | Weights (HuggingFace) |
---|---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 27.7 | 19.5 | 27.0 | 29.9 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 32.8 | 27.0 | 31.9 | 34.8 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 35.0 | 29.3 | 34.2 | 36.8 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |

Zero-shot evaluation on COCO (pre-trained models):

Model | Pre-train Data | AP | AP<sub>50</sub> | AP<sub>75</sub> | Weights (Quark) | Weights (HuggingFace) |
---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 38.0 | 52.9 | 41.0 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 43.2 | 58.8 | 46.6 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 45.4 | 61.3 | 49.4 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |

Fine-tuning evaluation on COCO:

Model | Pre-train Data | AP | AP<sub>50</sub> | AP<sub>75</sub> | Weights (Quark) | Weights (HuggingFace) |
---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 46.4 | 62.5 | 50.5 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 51.4 | 68.2 | 56.1 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 54.1 | 71.1 | 59.0 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World is developed based on `torch==2.0.0`, `mamba-ssm==2.1.0`, `triton==2.1.0`, `supervision==0.20.0`, `mmcv==2.0.1`, `mmyolo==0.6.0`, and `mmdetection==3.3.0`.
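A minimal environment sketch based on the pinned versions above is shown below; the install order and the use of `openmim` for `mmcv` are assumptions, so adapt them to your CUDA setup:

```bash
# Sketch only: versions come from the dependency list above; order and wheel sources are assumptions.
pip install torch==2.0.0                       # pick the wheel matching your CUDA version
pip install triton==2.1.0 mamba-ssm==2.1.0     # mamba-ssm compiles its kernels against the installed torch
pip install supervision==0.20.0
pip install openmim && mim install mmcv==2.0.1 # openmim fetches a prebuilt mmcv wheel
pip install mmdet==3.3.0 mmyolo==0.6.0         # the PyPI package for mmdetection is "mmdet"
```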
You need to link `mmyolo` under the `third_party` directory.
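For example (the path to your local mmyolo checkout is a placeholder):

```bash
# Symlink a local mmyolo checkout into third_party/ (adjust the source path to your machine).
mkdir -p third_party
ln -s /path/to/mmyolo third_party/mmyolo
```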
We provide the details about the pre-training data in `docs/data`.
To evaluate the pre-trained models with multiple GPUs on one node:

```bash
./tools/dist_test.sh configs/mamba2_yolo_world_s.py CHECKPOINT_FILEPATH num_gpus_per_node
./tools/dist_test.sh configs/mamba2_yolo_world_m.py CHECKPOINT_FILEPATH num_gpus_per_node
./tools/dist_test.sh configs/mamba2_yolo_world_l.py CHECKPOINT_FILEPATH num_gpus_per_node
```
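For example, evaluating the small model with a downloaded checkpoint on one node with 8 GPUs (the checkpoint path is a placeholder for wherever you saved the weights):

```bash
./tools/dist_test.sh configs/mamba2_yolo_world_s.py weights/mamba_yolo_world_s.pth 8
```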
To pre-train Mamba-YOLO-World with mixed precision:

```bash
./tools/dist_train.sh configs/mamba2_yolo_world_s.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_m.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_l.py num_gpus_per_node --amp
```
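For example, pre-training the large variant on one node with 8 GPUs:

```bash
./tools/dist_train.sh configs/mamba2_yolo_world_l.py 8 --amp
```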
To fine-tune the pre-trained models on COCO:

```bash
./tools/dist_train.sh configs/mamba2_yolo_world_s_mask-refine_finetune_coco.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_m_mask-refine_finetune_coco.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_l_mask-refine_finetune_coco.py num_gpus_per_node --amp
```
We provide two demo scripts:

- `image_demo.py`: inference with images or a directory of images.
- `video_demo.py`: inference on videos.
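A possible invocation is sketched below; the positional-argument order (config, checkpoint, image path, text prompts) is an assumption, so check `python image_demo.py --help` for the actual interface:

```bash
# Assumed argument order -- verify against the script before running.
python image_demo.py configs/mamba2_yolo_world_l.py weights/mamba_yolo_world_l.pth demo/sample_images 'person,bus,dog'
```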
We sincerely thank mmyolo, mmdetection, YOLO-World, Mamba and VMamba for providing their wonderful code to the community!
```bibtex
@inproceedings{wang2025mamba,
  title={Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection},
  author={Wang, Haoxuan and He, Qingdong and Peng, Jinlong and Yang, Hao and Chi, Mingmin and Wang, Yabiao},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}
```