- `2024-10-30`: 🤗 We provide the Model Weights and Visualization Results on HuggingFace.
- `2024-09-24`: 🚀 We provide all the Model Weights for the community.
- `2024-09-14`: 💎 We provide the Mamba-YOLO-World source code for the community.
- `2024-09-12`: We provide the Visualization Results of zero-shot inference on LVIS, generated by Mamba-YOLO-World and YOLO-World, for comparison.
This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for Mamba-YOLO-World.
- We present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion-PAN as its neck architecture.
- We introduce a State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, with O(N+1) complexity and globally guided receptive fields.
- Experiments demonstrate that our model outperforms the original YOLO-World while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
- We adopt the pre-trained Mamba-YOLO-World-S/M/L and YOLO-World-v2-S/M/L models and conduct zero-shot inference on LVIS-val2017 (COCO-val2017 images with the LVIS vocabulary, which contains 1203 categories).
- All visualization results are available on Quark (https://pan.quark.cn/s/450070c03c58) and HuggingFace (https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main/zeroshot_pictures_COCO_Comparison). You are welcome to download them and compare our Mamba-YOLO-World with the original YOLO-World across the small (S), medium (M), and large (L) size variants.
- The visualization results demonstrate that our Mamba-YOLO-World significantly outperforms YOLO-World (including YOLO-World-v2, its latest version) in accuracy and generalization across all size variants.

Zero-shot evaluation on LVIS (pre-trained models):

Model | Pre-train Data | AP<sup>mini</sup> | AP<sup>r</sup> | AP<sup>c</sup> | AP<sup>f</sup> | Weights (Quark) | Weights (HuggingFace) |
---|---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 27.7 | 19.5 | 27.0 | 29.9 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 32.8 | 27.0 | 31.9 | 34.8 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 35.0 | 29.3 | 34.2 | 36.8 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |

Zero-shot evaluation on COCO (pre-trained models):

Model | Pre-train Data | AP | AP<sub>50</sub> | AP<sub>75</sub> | Weights (Quark) | Weights (HuggingFace) |
---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 38.0 | 52.9 | 41.0 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 43.2 | 58.8 | 46.6 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 45.4 | 61.3 | 49.4 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |

Fine-tuning evaluation on COCO:

Model | Pre-train Data | AP | AP<sub>50</sub> | AP<sub>75</sub> | Weights (Quark) | Weights (HuggingFace) |
---|---|---|---|---|---|---|
Mamba-YOLO-World-S | O365+GoldG | 46.4 | 62.5 | 50.5 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-M | O365+GoldG | 51.4 | 68.2 | 56.1 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World-L | O365+GoldG | 54.1 | 71.1 | 59.0 | https://pan.quark.cn/s/dce0710ffcec | https://huggingface.co/Xuan-World/Mamba-YOLO-World/tree/main |
Mamba-YOLO-World is developed based on `torch==2.0.0`, `mamba-ssm==2.1.0`, `triton==2.1.0`, `supervision==0.20.0`, `mmcv==2.0.1`, `mmyolo==0.6.0`, and `mmdetection==3.3.0`.
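A minimal environment sketch based on the pinned versions above is shown below; the install order and the use of `openmim` for `mmcv` are assumptions, so adapt them to your CUDA setup:

```bash
# Sketch only: versions come from the dependency list above; order and wheel sources are assumptions.
pip install torch==2.0.0                       # pick the wheel matching your CUDA version
pip install triton==2.1.0 mamba-ssm==2.1.0     # mamba-ssm compiles its kernels against the installed torch
pip install supervision==0.20.0
pip install openmim && mim install mmcv==2.0.1 # openmim fetches a prebuilt mmcv wheel
pip install mmdet==3.3.0 mmyolo==0.6.0         # the PyPI package for mmdetection is "mmdet"
```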
You need to link `mmyolo` under the `third_party` directory.
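For example (the path to your local mmyolo checkout is a placeholder):

```bash
# Symlink a local mmyolo checkout into third_party/ (adjust the source path to your machine).
mkdir -p third_party
ln -s /path/to/mmyolo third_party/mmyolo
```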
We provide the details about the pre-training data in `docs/data`.
To evaluate the pre-trained models with multiple GPUs on one node:

```bash
./tools/dist_test.sh configs/mamba2_yolo_world_s.py CHECKPOINT_FILEPATH num_gpus_per_node
./tools/dist_test.sh configs/mamba2_yolo_world_m.py CHECKPOINT_FILEPATH num_gpus_per_node
./tools/dist_test.sh configs/mamba2_yolo_world_l.py CHECKPOINT_FILEPATH num_gpus_per_node
```
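For example, evaluating the small model with a downloaded checkpoint on one node with 8 GPUs (the checkpoint path is a placeholder for wherever you saved the weights):

```bash
./tools/dist_test.sh configs/mamba2_yolo_world_s.py weights/mamba_yolo_world_s.pth 8
```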
To pre-train Mamba-YOLO-World with mixed precision:

```bash
./tools/dist_train.sh configs/mamba2_yolo_world_s.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_m.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_l.py num_gpus_per_node --amp
```
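For example, pre-training the large variant on one node with 8 GPUs:

```bash
./tools/dist_train.sh configs/mamba2_yolo_world_l.py 8 --amp
```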
To fine-tune the pre-trained models on COCO:

```bash
./tools/dist_train.sh configs/mamba2_yolo_world_s_mask-refine_finetune_coco.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_m_mask-refine_finetune_coco.py num_gpus_per_node --amp
./tools/dist_train.sh configs/mamba2_yolo_world_l_mask-refine_finetune_coco.py num_gpus_per_node --amp
```
We provide two demo scripts:

- `image_demo.py`: inference with images or a directory of images.
- `video_demo.py`: inference on videos.
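A possible invocation is sketched below; the positional-argument order (config, checkpoint, image path, text prompts) is an assumption, so check `python image_demo.py --help` for the actual interface:

```bash
# Assumed argument order -- verify against the script before running.
python image_demo.py configs/mamba2_yolo_world_l.py weights/mamba_yolo_world_l.pth demo/sample_images 'person,bus,dog'
```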
We sincerely thank mmyolo, mmdetection, YOLO-World, Mamba and VMamba for providing their wonderful code to the community!
```bibtex
@inproceedings{wang2025mamba,
  title={Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection},
  author={Wang, Haoxuan and He, Qingdong and Peng, Jinlong and Yang, Hao and Chi, Mingmin and Wang, Yabiao},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}
```