Tabby: Tabular Data Synthesis with Language Models

Use a pre-trained LLM to generate high-fidelity tabular data!


🧠 Blog Post
📄 Paper
🐈 Demo notebook
🤗 HuggingFace: sonicc/tabby-distilgpt2-diabetes

Quick start

Follow the instructions below to install the necessary packages, then try things out with our simple pre-trained demo Tabby, or train and sample your own Tabby!

Environment Setup

Run the following:

conda create --name tabby python=3.11.4
conda activate tabby
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install tqdm matplotlib jupyter pandas scikit-learn
pip install transformers==4.44.2 accelerate datasets ucimlrepo openml bitsandbytes wandb openpyxl huggingface_hub
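
To sanity-check the setup (an optional step, not part of the original instructions), you can confirm from Python that the pinned transformers version is installed and that PyTorch can see a GPU:

# optional environment check: print library versions and CUDA visibility
import torch
import transformers
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expected: 4.44.2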

Demo

We have created the demo.ipynb notebook as a quick and easy way to try out Tabby synthesis, no fine-tuning required!
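
If you would rather fetch the pre-trained demo checkpoint outside of the notebook, the snippet below is a minimal sketch that only downloads its files from the Hugging Face Hub (loading and sampling are still best done through demo.ipynb):

# minimal sketch: download the pre-trained demo checkpoint from the Hugging Face Hub
from huggingface_hub import snapshot_download
local_dir = snapshot_download(repo_id="sonicc/tabby-distilgpt2-diabetes")
print("checkpoint files downloaded to:", local_dir)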

Working with Tabby

The data-prep.ipynb notebook downloads and formats the datasets as they were prepared in our paper. If you wish to add a new dataset, just make sure it has a file structure, config file, and train/val/test splits similar to those created in this notebook.
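
As a rough illustration only (the authoritative file layout and config schema are whatever data-prep.ipynb produces, so every filename and key below is a placeholder), preparing a new dataset amounts to something like:

# hypothetical sketch of preparing a new dataset; mirror the exact layout created by data-prep.ipynb
import json
import os
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_dataset.csv")                       # placeholder source file
train, temp = train_test_split(df, test_size=0.3, random_state=0)
val, test = train_test_split(temp, test_size=0.5, random_state=0)

os.makedirs("my_dataset", exist_ok=True)                 # placeholder directory
train.to_csv("my_dataset/train.csv", index=False)
val.to_csv("my_dataset/val.csv", index=False)
test.to_csv("my_dataset/test.csv", index=False)
with open("my_dataset/config.json", "w") as f:           # placeholder config contents
    json.dump({"columns": list(df.columns)}, f)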

The file trainplain.py is the main entry point for interacting with our codebase. It allows you to train both Tabby and non-Tabby models, with or without Plain training. Here is the command used to train our demo model:

python trainplain.py -p ./ckpt -d diabetes -mh -t -n 10

The -p flag indicates where to save the model checkpoints, training logs, and samples; the -d flag specifies the dataset; -mh applies MoE to the LM head; -t performs training (which defaults to Plain training); and -n specifies the number of samples to produce after training. There are many other flags that control various facets of training and sampling; to list them all, run

python trainplain.py --help
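
For example, assuming the flags behave exactly as described above, a non-Tabby baseline on the same dataset could be trained by simply omitting -mh (the ./ckpt-baseline path is only an example):

python trainplain.py -p ./ckpt-baseline -d diabetes -t -n 10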

Credit: Some code in the src directory is adapted from the GReaT repository.

Citation

If you use Tabby, please cite:

@article{cromp2025tabby,
  title={Tabby: Tabular Data Synthesis with Language Models},
  author={Sonia Cromp and Satya Sai Srinath Namburi GNVV and Mohammed Alkhudhayri and Catherine Cao and Samuel Guo and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2405.01147},
  year={2025},
  url={https://arxiv.org/abs/2405.01147}
}

Contact

Thanks for checking out Tabby! We'd love to hear any thoughts you might have. Feel free to contact me (Sonia Cromp) at cromp@wisc.edu.
