Yongwei Chen¹ • Yushi Lan¹ • Shangchen Zhou¹ • Tengfei Wang² • Xingang Pan¹
¹S-lab, Nanyang Technological University
²Shanghai Artificial Intelligence Laboratory
CVPR 2025
[Demo video: `sar3d.mp4`]
- 🔄 Autoregressive Modeling
- ⚡️ Ultra-fast 3D Generation (<1s)
- 🔍 Detailed Understanding
We've tested SAR3D on the following environments:
- Rocky Linux 8.10 (Green Obsidian)
  - Python 3.9.8
  - PyTorch 2.2.2
  - CUDA 12.1
  - NVIDIA H200
- Ubuntu 20.04
  - Python 3.9.16
  - PyTorch 2.0.0
  - CUDA 11.7
  - NVIDIA A6000
- Clone the repository
git clone https://github.com/cyw-3d/SAR3D.git
cd SAR3D
- Set up environment
conda env create -f environment.yml
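After the environment is created, a quick sanity check such as the sketch below (it only assumes the new conda environment is active, with PyTorch installed from `environment.yml`) confirms that PyTorch sees your GPU and reports the expected CUDA version:

```python
# Minimal sanity check for the freshly created environment.
import torch

print("PyTorch:", torch.__version__)            # e.g. 2.2.2 or 2.0.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)  # e.g. 12.1 or 11.7
    print("GPU:", torch.cuda.get_device_name(0))
```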
- Download pretrained models 📥
The pretrained models will be automatically downloaded to the `checkpoints` folder on the first run.
You can also download them manually from our model zoo:
| Model | Description | Link |
|---|---|---|
| VQVAE | Base VQVAE model | vqvae-ckpt.pt |
| Generation | Image-conditioned model | image-condition-ckpt.pth |
| Generation | Text-conditioned model | text-condition-ckpt.pth |
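If you download the checkpoints manually, you can verify they are intact by loading each file with `torch.load` and checking that a state dict comes back. The sketch below assumes the files are placed under the default `checkpoints` folder with the names listed above:

```python
# Verify manually downloaded checkpoints (assumes the default `checkpoints` folder).
import os
import torch

CKPTS = [
    "checkpoints/vqvae-ckpt.pt",             # base VQVAE model
    "checkpoints/image-condition-ckpt.pth",  # image-conditioned generation model
    "checkpoints/text-condition-ckpt.pth",   # text-conditioned generation model
]

for path in CKPTS:
    if not os.path.exists(path):
        print(f"missing: {path}")
        continue
    state = torch.load(path, map_location="cpu")
    size = len(state) if hasattr(state, "__len__") else "?"
    print(f"loaded {path}: {size} top-level entries")
```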
- Run inference 🚀
To test the model on your own images:
- Place your test images in the `test_files/test_images` folder
- Run the inference script: `bash test_image.sh`
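If your source images are not already PNGs in that folder, a small helper like the one below can copy and convert them. This is only a convenience sketch (it assumes Pillow is installed and uses a hypothetical `my_images` source folder), not part of the repo:

```python
# Convenience sketch: copy your own images into test_files/test_images as RGB PNGs.
# `my_images` is a hypothetical source folder -- point it at your own images.
import os
from PIL import Image

SRC_DIR = "my_images"
DST_DIR = "test_files/test_images"

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
        continue
    img = Image.open(os.path.join(SRC_DIR, name)).convert("RGB")
    out_path = os.path.join(DST_DIR, os.path.splitext(name)[0] + ".png")
    img.save(out_path)
    print("wrote", out_path)
```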
To test the model on your own text prompts:
- Place your test prompts in the `test_files/test_text.json` file
- Run the inference script: `bash test_text.sh`
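The exact schema of `test_files/test_text.json` is defined by the example file shipped with the repo; purely as an illustration, if it were a plain list of prompt strings you could generate it like this (adjust to match the shipped file's format):

```python
# Illustrative only: writes test_files/test_text.json assuming a plain list of prompt strings.
# Check the example file in the repo for the actual schema.
import json

prompts = [
    "a wooden chair with a curved backrest",
    "a blue ceramic teapot",
]

with open("test_files/test_text.json", "w") as f:
    json.dump(prompts, f, indent=2)
```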
The dataset is available for download on Hugging Face.
It consists of 8 splits of preprocessed data based on G-buffer Objaverse, including:
- Rendered images
- Depth maps
- Camera poses
- Text descriptions
- Normal maps
- Latent embeddings
The dataset covers over 170K unique 3D objects, augmented to more than 630K data pairs. A `dataset.json` file is provided that maps object IDs to their corresponding categories.
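One way to fetch all splits programmatically is via `huggingface_hub` (this sketch assumes the package is installed; `<HF_DATASET_REPO>` is a placeholder for the actual dataset id shown on the Hugging Face page):

```python
# Sketch: download the dataset with huggingface_hub.
# <HF_DATASET_REPO> is a placeholder -- substitute the dataset id from the Hugging Face page.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<HF_DATASET_REPO>",
    repo_type="dataset",
    local_dir="./dataset-root",
)
print("dataset downloaded to", local_path)
```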
After downloading and unzipping the dataset, you should have the following structure:
/dataset-root/
├── 1/
├── 2/
├── ...
├── 8/
│ └── 0/
│ ├── raw_image.png
│ ├── depth_alpha.jpg
│ ├── c.npy
│ ├── caption_3dtopia.txt
│ ├── normal.png
│ ├── ...
│ └── image_dino_embedding_lrm.npy
└── dataset.json
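Given this layout, each object's files can be inspected with standard tools. The sketch below (split `8`, object `0` as in the tree above; it assumes NumPy and Pillow are installed) loads the rendered image, camera parameters, caption, and DINO embedding for one object:

```python
# Sketch: inspect one object from the unzipped dataset (split "8", object "0" from the tree above).
import json
import numpy as np
from PIL import Image

obj_dir = "/dataset-root/8/0"

image = Image.open(f"{obj_dir}/raw_image.png")                   # rendered image
depth = Image.open(f"{obj_dir}/depth_alpha.jpg")                 # depth map
camera = np.load(f"{obj_dir}/c.npy")                             # camera pose
normal = Image.open(f"{obj_dir}/normal.png")                     # normal map
caption = open(f"{obj_dir}/caption_3dtopia.txt").read().strip()  # text description
dino = np.load(f"{obj_dir}/image_dino_embedding_lrm.npy")        # latent DINO embedding

with open("/dataset-root/dataset.json") as f:
    id_to_category = json.load(f)                                # object ID -> category mapping

print("image size:", image.size)
print("camera params shape:", camera.shape)
print("DINO embedding shape:", dino.shape)
print("caption:", caption)
```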
The following scripts allow you to train both image-conditioned and text-conditioned models using the dataset stored at the specified `<DATA_DIR>` location.
For image-conditioned model training:
bash train_image.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
For text-conditioned model training:
bash train_text.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
- Inference and Training Code for Image-conditioned Generation
- Dataset Release
- Inference Code for Text-conditioned Generation
- Training Code for Text-conditioned Generation
- VQVAE training code
- Code for Understanding
If you find this work useful for your research, please cite our paper:
@inproceedings{chen2024sar3d,
title={SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE},
author={Chen, Yongwei and Lan, Yushi and Zhou, Shangchen and Wang, Tengfei and Pan, Xingang},
booktitle={CVPR},
year={2025}
}