SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Yongwei Chen¹  •  Yushi Lan¹  •  Shangchen Zhou¹  •  Tengfei Wang²  •  Xingang Pan¹

¹S-lab, Nanyang Technological University
²Shanghai Artificial Intelligence Laboratory

CVPR 2025

(Demo video: sar3d.mp4)

🌟 Features

  • 🔄 Autoregressive Modeling
  • ⚡️ Ultra-fast 3D Generation (<1s)
  • 🔍 Detailed Understanding

🛠️ Installation & Usage

Prerequisites

We've tested SAR3D on the following environments:

Rocky Linux 8.10 (Green Obsidian)
  • Python 3.9.8
  • PyTorch 2.2.2
  • CUDA 12.1
  • NVIDIA H200
Ubuntu 20.04
  • Python 3.9.16
  • PyTorch 2.0.0
  • CUDA 11.7
  • NVIDIA A6000

Quick Start

  1. Clone the repository
git clone https://github.com/cyw-3d/SAR3D.git
cd SAR3D
  2. Set up environment
conda env create -f environment.yml
  3. Download pretrained models 📥

The pretrained models are automatically downloaded to the checkpoints folder during the first run.

You can also download them manually from our model zoo; see the placement sketch after the table below:

| Model | Description | Link |
|-------|-------------|------|
| VQVAE | Base VQVAE model | vqvae-ckpt.pt |
| Generation | Image-conditioned model | image-condition-ckpt.pth |
| Generation | Text-conditioned model | text-condition-ckpt.pth |
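If you download the weights manually, the following sketch shows where to place them. It assumes the inference scripts look for the files under ./checkpoints (the same folder used by the automatic download); the download path is hypothetical.

# Hypothetical placement: file names come from the model zoo table above
mkdir -p checkpoints
mv ~/Downloads/vqvae-ckpt.pt            checkpoints/
mv ~/Downloads/image-condition-ckpt.pth checkpoints/
mv ~/Downloads/text-condition-ckpt.pth  checkpoints/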
  4. Run inference 🚀

To test the model on your own images:

  1. Place your test images in the test_files/test_images folder
  2. Run the inference script:
bash test_image.sh
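For example, a minimal end-to-end sketch (the image name is hypothetical; where the results are written depends on the script's settings):

# Assumes pretrained weights are already available in ./checkpoints
cp /path/to/chair.png test_files/test_images/
bash test_image.sh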

To test the model on your own text prompts:

  1. Place your test prompts in the test_files/test_text.json file
  2. Run the inference script:
bash test_text.sh
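As a sketch, assuming the prompt file is a simple JSON list of strings (the exact schema of test_files/test_text.json may differ; check the repository's test_files folder for the expected format):

# Hypothetical prompt file format
cat > test_files/test_text.json <<'EOF'
["a wooden chair with four legs", "a blue ceramic teapot"]
EOF
bash test_text.sh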

📚 Training

Dataset

The dataset is available for download at Hugging Face.

The dataset consists of 8 splits containing preprocessed data based on G-buffer Objaverse, including:

  • Rendered images
  • Depth maps
  • Camera poses
  • Text descriptions
  • Normal maps
  • Latent embeddings

The dataset covers over 170K unique 3D objects, augmented to more than 630K data pairs. A data.json file is provided that maps object IDs to their corresponding categories.
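A hedged download sketch using the Hugging Face CLI, assuming the splits are distributed as zip archives; the dataset repo ID is a placeholder, so use the Hugging Face link above:

# <HF_DATASET_REPO> is a placeholder for the dataset repository linked above
huggingface-cli download <HF_DATASET_REPO> --repo-type dataset --local-dir ./dataset-root
cd dataset-root && unzip '*.zip'   # extract the 8 splits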

After downloading and unzipping the dataset, you should have the following structure:

/dataset-root/
├── 1/
├── 2/
├── ...
├── 8/
│   └── 0/
│       ├── raw_image.png
│       ├── depth_alpha.jpg
│       ├── c.npy
│       ├── caption_3dtopia.txt
│       ├── normal.png
│       ├── ...
│       └── image_dino_embedding_lrm.npy
└── dataset.json

Training Commands

The following scripts train the image-conditioned and text-conditioned models on the dataset stored at <DATA_DIR>.

For image-conditioned model training:

bash train_image.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>

For text-conditioned model training:

bash train_text.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
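For example, a hypothetical invocation with placeholder values (a depth-16 model, batch size 4, 8 GPUs; these are illustrative, not recommended settings):

bash train_image.sh 16 4 8 checkpoints/vqvae-ckpt.pt ./output /path/to/dataset-root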

📋 Roadmap

  • Inference and Training Code for Image-conditioned Generation
  • Dataset Release
  • Inference Code for Text-conditioned Generation
  • Training Code for Text-conditioned Generation
  • VQVAE training code
  • Code for Understanding

📝 Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{chen2024sar3d,
    title={SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE},
    author={Chen, Yongwei and Lan, Yushi and Zhou, Shangchen and Wang, Tengfei and Pan, Xingang},
    booktitle={CVPR},
    year={2025}
}
