About • Contributing • How To Use • Citations • Acknowledgments • License
This GitHub repository contains the implementation of the Sheet Music Transformer (SMT), a novel model for Optical Music Recognition (OMR) beyond monophonic-level transcription. Unlike traditional approaches, which mainly adapt monophonic transcription techniques to complex score layouts, the SMT overcomes these limitations by offering a robust image-to-sequence solution for transcribing polyphonic musical scores directly from images.
If you plan to contribute, please read our Contributing Guidelines first. They explain how to set up the environment, which coding conventions to follow, how to submit pull requests, and how to write clear commit messages.
Feel free to open an issue if you have a question, idea, or discover a problem.
This project uses the uv Python dependency manager. To set up the virtual environment, simply run:
```bash
uv sync
```
If you are using Docker to run experiments, create an image with the provided Dockerfile and then start and attach to a container:
```bash
docker build -t <image_tag> .
docker run -itd --rm --gpus all --shm-size=8gb -v <repository_path>:/workspace/ <image_tag>
docker exec -it <docker_container_id> /bin/bash
```
Using the SMT to transcribe scores is very easy, thanks to the HuggingFace Transformers 🤗 library. Just run the following code and you will have the SMT up and running to transcribe excerpts!
```python
import torch
import cv2
from data_augmentation.data_augmentation import convert_img_to_tensor
from smt_model import SMTModelForCausalLM

# Load the input image and select the device
image = cv2.imread("sample.jpg")
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained SMT weights
model = SMTModelForCausalLM.from_pretrained("<model_reference>").to(device)

# Predict on a single-image batch and decode the output tokens to strings
predictions, _ = model.predict(convert_img_to_tensor(image).unsqueeze(0).to(device),
                               convert_to_str=True)

# Map the layout tokens <b>, <s> and <t> to newline, space and tab
print("".join(predictions).replace('<b>', '\n').replace('<s>', ' ').replace('<t>', '\t'))
```
Please replace `<model_reference>` with the SMT implementation that suits your purpose; currently, this project hosts two models.
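If you need to transcribe several excerpts, the same calls can be wrapped into a small helper. This is only a minimal sketch built on the snippet above; the `transcribe` function name, the example image paths, and the `<model_reference>` placeholder are illustrative.

```python
import torch
import cv2

from data_augmentation.data_augmentation import convert_img_to_tensor
from smt_model import SMTModelForCausalLM


def transcribe(image_paths, model_reference="<model_reference>"):
    """Transcribe a list of score images and return readable strings."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SMTModelForCausalLM.from_pretrained(model_reference).to(device)
    results = []
    for path in image_paths:
        tensor = convert_img_to_tensor(cv2.imread(path)).unsqueeze(0).to(device)
        predictions, _ = model.predict(tensor, convert_to_str=True)
        # Same layout-token mapping as in the snippet above
        results.append("".join(predictions)
                       .replace('<b>', '\n').replace('<s>', ' ').replace('<t>', '\t'))
    return results


# Illustrative usage with hypothetical file names
for text in transcribe(["sample.jpg", "sample_2.jpg"]):
    print(text)
```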
The datasets created to run the experiments are publicly available for replication purposes. Everything is implemented through the HuggingFace Datasets 🤗 library, so loading any of these datasets takes just one line of code:
```python
import datasets
dataset = datasets.load_dataset('antoniorv6/<dataset-name>')
```
The dataset has two columns: `image`, which contains the original image of the music excerpt (the model input), and `transcription`, which contains the corresponding `bekern` notation ground truth representing the content of that input.
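As a quick sanity check after loading, you can inspect one sample. This is a minimal sketch; the `train` split name is an assumption and may differ per dataset.

```python
import datasets

dataset = datasets.load_dataset('antoniorv6/<dataset-name>')

# Assumed split name; check dataset.keys() if it differs
sample = dataset["train"][0]
sample["image"].save("excerpt.png")    # input image of the music excerpt
print(sample["transcription"][:200])   # beginning of the bekern ground truth
```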
The experiments are tracked with the Weights & Biases API and configured through JSON config files. To replicate an experiment, run the following commands:
```bash
wandb login
uv run train.py --config_path <config-path>
```
The config files are located in the `config/` folder; each config file launches a specific experiment.
You can make your own config files to train the SMT on your own data!
> [!TIP]
> I highly recommend using your datasets in the same format provided in the HuggingFace Datasets specification to work with this model. If not, I suggest writing your own data.py file from scratch. Please refer to the Data section to learn how to structure your dataset.
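If your data is not yet in that format, one possible way to assemble a compatible dataset with the HuggingFace Datasets library is sketched below; the file paths, placeholder transcriptions, and Hub repository name are purely illustrative.

```python
import datasets

# Illustrative pairs of score images and their bekern transcriptions
examples = {
    "image": ["scores/excerpt_0.png", "scores/excerpt_1.png"],
    "transcription": ["<bekern transcription 0>", "<bekern transcription 1>"],
}

dataset = datasets.Dataset.from_dict(examples)
# Turn the path strings into an actual image column, matching the expected schema
dataset = dataset.cast_column("image", datasets.Image())

# Optionally publish it so `load_dataset` works as shown above (hypothetical repo id)
# dataset.push_to_hub("<your_username>/<dataset-name>")
```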
Currently, the available model (SMT NeXt) reports the following metrics:
| Corpus | CER (%) | SER (%) | LER (%) |
|---|---|---|---|
| GrandStaff (Ideal) | 3.9 | 5.1 | 13.1 |
| GrandStaff (Camera) | 5.3 | 6.2 | 13.5 |
| Quartets | 1.3 | 1.4 | 5.6 |
```bibtex
@InProceedings{RiosVila:ICDAR:2024,
  author="R{\'i}os-Vila, Antonio
  and Calvo-Zaragoza, Jorge
  and Paquet, Thierry",
  title="Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription",
  booktitle="Document Analysis and Recognition - ICDAR 2024",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="20--37",
  isbn="978-3-031-70552-6"
}
```
This work is part of the I+D+i PID2020-118447RA-I00 (MultiScore) project, funded by MCIN/AEI/10.13039/501100011033. Computational resources were provided by the Valencian Government and FEDER funding through IDIFEDER/2020/003.
This work is under an MIT license.