🔊 [ICLR 2025] SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

🚀 SSLAM is a self-supervised learning (SSL) framework designed to improve audio representation quality for both polyphonic (multiple overlapping sounds) and monophonic soundscapes. Unlike traditional SSL models, which focus on monophonic data, SSLAM introduces a novel source retention loss and audio mixture training, significantly improving performance on real-world polyphonic audio.

🔗 Paper | ICLR 2025 Poster: Video & Slides | OpenReview | 🤗 Models | Models (Google Drive)

📋 Table of Contents

Why SSLAM?
Key Features
Results
Inference Mode
Training Mode
Checklist
Acknowledgements
Citation

🔍 Why SSLAM?

🔊 Real-world audio is polyphonic—multiple overlapping sound sources are common in everyday environments.
❌ Existing SSL models focus on monophonic audio, limiting their ability to generalize to real-world scenarios. Their benchmarks are primarily monophonic, and their pre-training does not account for polyphonic environments.
💡 SSLAM bridges this gap by introducing self-supervised learning from audio mixtures, enabling robust learning across both monophonic and polyphonic soundscapes.

🎼 Key Features

✅ Self-Supervised Learning from Audio Mixtures (SSLAM): improves robustness to real-world polyphonic audio (multiple overlapping sounds).
✅ Source Retention Loss: preserves the integrity of each sound source, even in complex mixtures.
✅ SOTA Performance: achieves a +3.9% mAP improvement on AudioSet-2M and +9.1% on polyphonic datasets.

📊 Results

1. Standard Audio-SSL Benchmark Datasets

2. Polyphonic Datasets

🔍 Inference Mode

Note: If you are already using EAT in your evaluation/inference pipeline, you can simply replace the weights with the SSLAM weights, because the inference and evaluation code is identical to EAT's.

If not, follow the steps below for installation:

📥 Minimal Installation for Inference/Evaluation

To simplify installation and avoid dependency conflicts, we've included a cloned copy of fairseq (SSLAM_Inference/cloned_fairseq_copy/fairseq) in the repository instead of requiring a direct fairseq installation.

conda create --prefix /path/to/sslam_infer_minimal_env -y python=3.9.13

/path/to/sslam_infer_minimal_env/bin/pip install -r SSLAM_Inference/requirements_sslam_infer_minimal.txt

Important: Update the fairseq path in these files:

SSLAM_Inference/evaluation/eval.py
SSLAM_Inference/feature_extract/feature_extract.py
SSLAM_Inference/inference/inference.py

Look for the fairseq_path variable and update it to point to the included clone:

fairseq_path = '/absolute/path/to/SSLAM/SSLAM_Inference/cloned_fairseq_copy/fairseq/'
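
Before running the scripts, it can help to confirm that the path really points at a fairseq source tree. The following is a minimal sketch, assuming the included clone has the usual fairseq layout (a `fairseq/` package directory with an `__init__.py` directly under the path). The helper name `check_fairseq_path` is hypothetical and not part of this repository.

```python
import os


def check_fairseq_path(fairseq_path):
    """Return a list of problems found with fairseq_path (empty list = looks fine).

    Assumption: a valid clone contains fairseq/__init__.py directly under the path.
    """
    problems = []
    if not os.path.isabs(fairseq_path):
        problems.append("path is not absolute")
    if not os.path.isdir(fairseq_path):
        problems.append("directory does not exist")
    elif not os.path.isfile(os.path.join(fairseq_path, "fairseq", "__init__.py")):
        problems.append("no fairseq/__init__.py under this directory")
    return problems


if __name__ == "__main__":
    # Placeholder path from the instructions above; replace with your own.
    path = "/absolute/path/to/SSLAM/SSLAM_Inference/cloned_fairseq_copy/fairseq/"
    for problem in check_fairseq_path(path):
        print("fairseq_path issue:", problem)
```

An empty result only means the directory looks plausible; it does not verify that the scripts can import fairseq.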

📦 Model Weights

Model Type	Link
Pre-Trained	Download
AS2M Fine-Tuned (50.2 mAP)	Download

🚀 Using SSLAM

We provide scripts to use SSLAM in the following ways:

1. Audio Feature (Representation) Extraction Using SSLAM Encoder

cd SSLAM_Inference/scripts
bash feature_extract.sh

2. Inference on Single Audio WAV File

cd SSLAM_Inference/scripts
bash inference.sh

3. Evaluation on AudioSet-2M Evaluation Set

cd SSLAM_Inference/scripts
bash evaluate_AS2M_finetuned.sh # Reported mAP: 50.2
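
For readers unfamiliar with the metric, mean average precision (mAP) on a multi-label task is commonly computed as the per-class average precision, averaged over classes. The following pure-Python sketch illustrates that definition only; it is not the repository's evaluation code, the function names are illustrative, and library implementations may treat tied scores differently.

```python
def average_precision(labels, scores):
    """AP for one class: mean of the precision values at the rank of each positive.

    labels: 0/1 ground truth per clip; scores: predicted score per clip.
    Returns None when the class has no positive clips.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else None


def mean_average_precision(label_matrix, score_matrix):
    """Macro-average AP over classes (columns); classes without positives are skipped."""
    n_classes = len(label_matrix[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in label_matrix],
                               [row[c] for row in score_matrix])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps) if aps else None
```

On this 0 to 1 scale, the reported 50.2 mAP corresponds to 0.502.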

📈 Training Mode

This section covers self-supervised pre-training, fine-tuning, and linear evaluation.

📥 Training Installation

For training, it is better to install fairseq in editable mode:

conda create --prefix /path/to/sslam_env -y python=3.9.13 ## env used for training
/path/to/sslam_env/bin/python -m pip install pip==24.0 # downgrade pip
cd SSLAM/
git clone https://github.com/facebookresearch/fairseq.git

##IMPORTANT: Copy the Pre-Training/SSLAM_Stage2 directory to SSLAM/fairseq 
## so that the resultant path is SSLAM/fairseq/SSLAM_Stage2/.
cd fairseq/

## install all requirements apart from fairseq
/path/to/sslam_env/bin/pip install -r SSLAM_Stage2/requirements_sslam.txt
## install fairseq in editable mode
/path/to/sslam_env/bin/pip install --editable ./

🗄️ Data Preparation

We utilised AudioSet-2M (full set) for pre-training. For this phase, only the train.tsv file is required. Refer to train.tsv for AudioSet-20K to prepare the train.tsv file for your downloaded copy of AudioSet-2M.
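
As a starting point, the sketch below writes a manifest for a local folder of WAV files. It assumes the fairseq-style layout (first line: root directory; each following line: relative path, a tab, and the number of audio frames). That layout is an assumption here, so compare the output with the provided AudioSet-20K train.tsv before relying on it. The function name `write_manifest` is hypothetical.

```python
import os
import wave


def write_manifest(audio_root, out_path):
    """Write a manifest: root dir on line 1, then '<relative path><TAB><num frames>'.

    Assumption: the target format matches the provided AudioSet-20K train.tsv.
    Only .wav files readable by the standard-library wave module are listed.
    """
    audio_root = os.path.abspath(audio_root)
    rows = []
    for dirpath, _, filenames in os.walk(audio_root):
        for name in filenames:
            if not name.lower().endswith(".wav"):
                continue
            full = os.path.join(dirpath, name)
            with wave.open(full, "rb") as wav:
                n_frames = wav.getnframes()
            rows.append((os.path.relpath(full, audio_root), n_frames))
    rows.sort()  # deterministic ordering across runs
    with open(out_path, "w") as f:
        f.write(audio_root + "\n")
        for rel, n_frames in rows:
            f.write(f"{rel}\t{n_frames}\n")
    return len(rows)
```

A call such as `write_manifest("/path/to/audioset_2m_wavs", "train.tsv")` (both paths are placeholders) returns the number of files listed.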

🚀 Pre-Training

Note: This repository focuses solely on Stage 2 pre-training, which introduces our novel SSLAM pre-training strategy.

To begin Stage 2, you'll need a Stage 1 checkpoint. In our complete pre-training process, Stage 1 mirrors the approach in EAT and achieves similar performance. For convenience, we use the EAT checkpoint as the Stage 1 checkpoint.

Download the EAT epoch 10 checkpoint using the link provided by the EAT repository: EAT-base_epoch10_pt.pt.

Only the contents of the models/ folder and a few parameters in the pre-training script differ between Stage 1 and Stage 2.

cd SSLAM/fairseq/SSLAM_Stage2/scripts/
bash pretrain_stage2.sh

📌 Checklist

Inference Mode
Pre-Training

🙏 Acknowledgements

Our code is primarily based on EAT and data2vec 2.0, with additional concepts and components adapted from AudioMAE.

📜 Citation

If you find our work useful, please cite it as:

@inproceedings{alex2025sslam,
  title={{SSLAM}: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes},
  author={Tony Alex and Sara Atito and Armin Mustafa and Muhammad Awais and Philip J B Jackson},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=odU59TxdiB}
}

