TrainCheck

TrainCheck is a lightweight, extensible tool for runtime monitoring of “silent” bugs in deep‑learning training pipelines. Instead of waiting for a crash or a bad model, TrainCheck:

- Automatically instruments your existing training scripts (e.g., from pytorch/examples or huggingface/transformers/examples), inserting tracing hooks with minimal code changes.
- Learns precise invariants (properties that should hold during training across API calls and model updates) by analyzing executions of known-good runs.
- Catches silent issues early by checking invariants on new or modified training jobs, alerting you immediately if something didn't happen as expected (e.g., model weight inconsistency, mixed precision not applied successfully, unexpected tensor shapes). On violation, TrainCheck flags the point of divergence, so you can diagnose silent issues before they derail your model.
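
To make the idea of learning invariants from known-good runs more concrete, here is a minimal, self-contained sketch. It is not TrainCheck's implementation or trace format: the event names, the "A always happens before B" invariant template, and the helper functions are all assumptions chosen for illustration.

```python
# Toy traces: each one is the ordered list of API names seen in a training step.
# These event names are illustrative only, not TrainCheck's real trace schema.
GOOD_RUNS = [
    ["zero_grad", "forward", "backward", "step"],
    ["zero_grad", "forward", "backward", "clip_grad", "step"],
]


def ordered_pairs(trace):
    """Return every pair (a, b) where all occurrences of a precede all of b."""
    first, last = {}, {}
    for i, name in enumerate(trace):
        first.setdefault(name, i)  # index of the first occurrence
        last[name] = i             # index of the last occurrence
    return {(a, b) for a in first for b in first if a != b and last[a] < first[b]}


def infer_invariants(traces):
    """Keep only the ordering pairs that hold in every known-good trace."""
    pair_sets = [ordered_pairs(t) for t in traces]
    return set.intersection(*pair_sets) if pair_sets else set()


def find_violations(trace, invariants):
    """Return the inferred invariants that a new trace does not satisfy."""
    return sorted(invariants - ordered_pairs(trace))


invariants = infer_invariants(GOOD_RUNS)

# A modified job that silently forgot to reset gradients.
buggy_step = ["forward", "backward", "step"]
for before, after in find_violations(buggy_step, invariants):
    print(f"violated: {before} should precede {after}")
```

The buggy step neither crashes nor prints a warning on its own; only the comparison against invariants learned from good runs reveals the missing call.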

Under the hood, TrainCheck decomposes into three CLI tools:

- Instrumentor (traincheck-collect): Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
- Inference Engine (traincheck-infer): Consumes one or more trace logs from successful runs to infer low-level invariants.
- Checker (traincheck-check): Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.
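
As a rough illustration of what "wrapping with tracing logic without altering semantics" can look like, the sketch below logs each call to a function and passes arguments, return values, and exceptions through unchanged. The `traced` decorator, the `TRACE` list, and the record fields are hypothetical names for this example; the actual instrumentation performed by traincheck-collect may work quite differently.

```python
import functools
import time

TRACE = []  # in-memory stand-in for a trace log file (assumption)


def traced(func):
    """Wrap func so each call is logged; the call's behavior is unchanged."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"api": func.__name__, "time": time.time()}
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            # Log the failure, then re-raise so callers see the same exception.
            record["raised"] = type(exc).__name__
            TRACE.append(record)
            raise
        record["returned"] = type(result).__name__
        TRACE.append(record)
        return result

    return wrapper


@traced
def scale(values, factor):
    """Stand-in for a framework API call."""
    return [v * factor for v in values]


out = scale([1.0, 2.0], 0.5)
print(out, TRACE[-1]["api"], TRACE[-1]["returned"])
```

Because the wrapper returns the original result and re-raises the original exception, the wrapped program computes the same values as before while producing a record per call.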

Status

TrainCheck is under active development. Features may be incomplete and the documentation is evolving. If you give it a try, please join our 💬 Discord server or file a GitHub issue for support. Currently, the Checker operates in a semi-online mode: you invoke it against the live, growing trace output to catch silent bugs as they appear. Fully automatic monitoring is on the roadmap, and we welcome feedback and contributions from early adopters.
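
One way to picture checking against a live, growing trace is a poller that reads only the records appended since its last visit. The sketch below assumes a JSON-lines trace file with `step` and `loss` fields and a simple "loss is finite" check; these are illustrative assumptions, not the format or invariants used by traincheck-check.

```python
import json
import math
import os
import tempfile


def read_new_records(path, offset):
    """Read complete JSON lines appended since offset; return (records, new_offset)."""
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a half-written line; retry on the next poll
            offset = f.tell()
            if line.strip():
                records.append(json.loads(line))
    return records, offset


def loss_is_finite(record):
    """Toy invariant: a logged loss, if present, must be a finite number."""
    loss = record.get("loss")
    return loss is None or math.isfinite(loss)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace.jsonl")  # hypothetical trace file name
    with open(path, "w") as f:
        f.write(json.dumps({"step": 0, "loss": 2.3}) + "\n")
    first, offset = read_new_records(path, 0)

    # The training job keeps writing; the last record is still incomplete.
    with open(path, "a") as f:
        f.write(json.dumps({"step": 1, "loss": float("nan")}) + "\n")
        f.write('{"step": 2, "lo')
    second, offset = read_new_records(path, offset)

    flagged = [r["step"] for r in first + second if not loss_is_finite(r)]
    print("steps violating the invariant:", flagged)
```

Tracking a byte offset means each poll does work proportional to the new records only, and a partially written final line is left for the next poll instead of being parsed prematurely.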

Try TrainCheck

Install
Follow the Installation Guide to get TrainCheck set up on your machine.
Explore
Work through our "5‑Minute Experience with TrainCheck" tutorial. You’ll learn how to:
- Instrument a training script and collect a trace
- Automatically infer low‑level invariants
- Run the Checker in semi‑online mode to uncover silent bugs

Documentation

Please visit the TrainCheck Technical Doc.

🕵️‍♀️ OSDI AE members, please see the TrainCheck AE Guide.

