Welcome to the official repository for Big-Math, a large-scale, high-quality dataset designed specifically for RL training (PPO, GRPO, etc.) with large language models (LLMs).
This repository provides tools for reformulating multiple-choice questions and implementing rule-based and model-based filtering as described in the Big-Math paper.
Find the dataset on HuggingFace at https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified
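As a minimal sketch, the dataset can be pulled from the Hub with the Hugging Face `datasets` library. The helper name `load_big_math` and the `"train"` default split are assumptions made for illustration; only the dataset ID comes from the link above.

```python
DATASET_ID = "SynthLabsAI/Big-Math-RL-Verified"  # taken from the URL above


def load_big_math(split: str = "train"):
    """Download (or reuse the local cache of) the dataset and return one split.

    Hypothetical helper: assumes the Hugging Face `datasets` package is
    installed, and the "train" default is an assumption, not something
    this README states.
    """
    # Imported lazily so that defining this helper needs no third-party package.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)
```

Calling `load_big_math()` and inspecting `column_names` on the result shows which fields are available. If the Hub asks for authentication, logging in first with `huggingface-cli login` may be required.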
Warning
This repo is intended for research purposes and is therefore under constant development. Please expect major changes to the design. The primary goal of the big-math repo is to share the filtering and reformulation code used to create the Big-Math dataset and to speed the development of future datasets. The Big-Math dataset is intended only for RL training of LLMs; it does not contain rollouts.
This repo consists of 2 main directories: signals and reformulation.
The signals folder contains code used to generate signals on a dataset. The signals below can be generated using rule-based methods, model-based methods, or both:
Signal | Rule-Based | Model-Based |
---|---|---|
Hyperlink Detection | ✅ | |
Language Identification | ✅ | |
Semantic Duplicate | ✅ | |
Multiple Choice Question | ✅ | ✅ |
Multi-Part Question | ✅ | ✅ |
True/False Question | ✅ | ✅ |
Yes/No Question | ✅ | ✅ |
Proof Detection | ✅ | ✅ |
Model Solve Rate | | ✅ |
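To make the rule-based column concrete, here is a rough sketch of how a few of these signals could be computed with regular expressions. The patterns, function names, and the three-option threshold are all hypothetical and chosen for illustration; the code in the signals folder may use different rules.

```python
import re

# Hypothetical patterns for illustration; the repository's actual rules may differ.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
# Lettered options such as "(A) 3", "B. 5", or "C) 7" at the start of the text or after whitespace.
OPTION_PATTERN = re.compile(r"(?:^|\s)\(?([A-E])[\).]\s+\S")
TRUE_FALSE_PATTERN = re.compile(r"\btrue or false\b", re.IGNORECASE)


def has_hyperlink(text: str) -> bool:
    """Flag text that contains something that looks like a URL."""
    return URL_PATTERN.search(text) is not None


def looks_multiple_choice(text: str) -> bool:
    """Flag text that lists at least three distinct lettered options."""
    labels = {match.group(1) for match in OPTION_PATTERN.finditer(text)}
    return len(labels) >= 3


def is_true_false(text: str) -> bool:
    """Flag text that explicitly asks for a true/false judgment."""
    return TRUE_FALSE_PATTERN.search(text) is not None


def rule_based_signals(problem: str) -> dict[str, bool]:
    """Collect the illustrative rule-based signals for one problem statement."""
    return {
        "hyperlink": has_hyperlink(problem),
        "multiple_choice": looks_multiple_choice(problem),
        "true_false": is_true_false(problem),
    }
```

Simple heuristics like these are cheap to run over a full dataset but will misfire on some inputs, which is one reason to pair them with a model-based check where the table lists both.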
The reformulation folder contains code used to reformulate multiple-choice problems into open-ended questions.
- Python 3.10+
- Install the packages in signals/requirements.txt to generate signals on a dataset
- Install the packages in reformulation/requirements.txt to reformulate multiple-choice questions into open-ended questions
- Clone the repository:
    git clone https://github.com/SynthLabsAI/big-math.git
    cd big-math
- Install dependencies:
    pip install -r signals/requirements.txt -r reformulation/requirements.txt
See the reformulation readme for an explanation of files and usage.
See the signals readme for an explanation of files and usage.
@misc{albalak2025bigmathlargescalehighqualitymath,
title={Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models},
author={Alon Albalak and Duy Phung and Nathan Lile and Rafael Rafailov and Kanishk Gandhi and Louis Castricato and Anikait Singh and Chase Blagden and Violet Xiang and Dakota Mahan and Nick Haber},
year={2025},
eprint={2502.17387},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.17387},
}
This project is licensed under the MIT License. See the LICENSE for details.