Download Data | ArXiv Preprint | Reproducibility | Notebooks | Pre-encoded UNI embeddings | Citation


SurGen is split into two sub-cohorts:
- SR386 – Primary colorectal cancer (427 WSIs) with five-year survival data.
- SR1482 – Colorectal cancer cases (593 WSIs) including metastatic lesions (liver, lung, peritoneum), with full biomarker data.
Each WSI is stored in Zeiss .CZI format. For convenience, precomputed patch embeddings (extracted using the UNI foundation model) are also available.
The SurGen dataset is hosted on the EBI FTP server. You can download the Whole Slide Images (WSIs) for both sub-cohorts (SR386 and SR1482) using wget, an FTP client, or directly from the EBI website.
For most users, the easiest way to download the WSIs is via wget:
wget -r -np -nH --cut-dirs=6 ftp://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs/
wget -r -np -nH --cut-dirs=6 ftp://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR1482_WSIs/
This will download the respective data into SR386_WSIs/ and SR1482_WSIs/ folders in your current directory.
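- -r – recursive (downloads the contents of the given directory).
- -np – no parent (prevents downloading higher-level directories).
- -nH – no host (omits 'ftp.ebi.ac.uk' from the local directory structure).
- --cut-dirs=6 – skips the six leading remote directories (biostudies/ through Files/), so you get a clean directory structure without extra nested folders.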
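To make the effect of -nH and --cut-dirs=6 concrete, the following minimal sketch approximates how wget maps a remote URL to a local path. It illustrates the documented behaviour of those two flags and is not a reimplementation of wget; the function name `wget_local_path` and the filename `example_slide.czi` are hypothetical.

```python
from urllib.parse import urlparse


def wget_local_path(url, cut_dirs=0, no_host=False):
    """Approximate the local path `wget -r` would use for a remote file URL."""
    parsed = urlparse(url)
    # Split the remote path into directory components plus the filename.
    *dirs, filename = [part for part in parsed.path.split("/") if part]
    # --cut-dirs drops that many leading directory components.
    kept = dirs[cut_dirs:]
    # -nH drops the hostname prefix that wget otherwise creates.
    prefix = [] if no_host else [parsed.netloc]
    return "/".join(prefix + kept + [filename])


BASE = "ftp://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs/"
url = BASE + "example_slide.czi"  # hypothetical filename, for illustration only

print(wget_local_path(url))                          # default layout, with host and full path
print(wget_local_path(url, cut_dirs=6, no_host=True))  # layout with -nH --cut-dirs=6
```

With both flags set, only the `SR386_WSIs/` folder and the filename remain, which matches the folder layout described above.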
If you prefer to use FTP, follow these steps:
- Open a terminal and connect to the FTP server:
  ftp ftp.ebi.ac.uk
  - When prompted, enter anonymous as the username.
  - Press Enter for the password.
- Navigate to the SR386 directory:
  cd /biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs
  Or for the SR1482 directory:
  cd /biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR1482_WSIs
- Enable binary mode to correctly transfer .CZI files:
  binary
- Download all .CZI files (prompt turns off the per-file confirmation that mget would otherwise ask for):
  prompt
  mget *.czi
- Close the FTP connection:
  exit
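The same FTP steps can be scripted. The sketch below uses Python's standard `ftplib` module and assumes the server accepts anonymous logins as described above; the helper names `select_czi` and `download_czi` and the default output folder are illustrative choices, not part of the SurGen repository.

```python
from ftplib import FTP
from pathlib import Path

HOST = "ftp.ebi.ac.uk"
REMOTE_DIR = "/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs"


def select_czi(names):
    """Keep only names ending in .czi (case-insensitive), sorted."""
    return sorted(name for name in names if name.lower().endswith(".czi"))


def download_czi(remote_dir=REMOTE_DIR, out_dir="SR386_WSIs"):
    """Download every .czi file in remote_dir over anonymous FTP."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()          # no arguments means an anonymous login
        ftp.cwd(remote_dir)  # same as the `cd` step
        for name in select_czi(ftp.nlst()):
            with open(out / name, "wb") as fh:
                # retrbinary transfers in binary mode, like the `binary` step
                ftp.retrbinary(f"RETR {name}", fh.write)


# Uncomment to start the transfer (the WSIs are large files):
# download_czi()
```

To fetch the other sub-cohort, pass the SR1482 path as `remote_dir` and a matching `out_dir`.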
You can also use an FTP GUI client such as FileZilla or Cyberduck:
- Host: ftp.ebi.ac.uk
- Username: anonymous
- Port: 21
- Path: /biostudies/fire/S-BIAD/285/S-BIAD1285/Files/
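Whichever download method you use, it can be worth checking that the transfer completed. The sketch below counts .czi files in each folder and compares the totals with the cohort sizes stated above (427 and 593 WSIs). It assumes one .czi file per WSI and that the folders sit in the current directory; the function names are hypothetical.

```python
from pathlib import Path

# Cohort sizes as stated in this README; assumes one .czi file per WSI.
EXPECTED = {"SR386_WSIs": 427, "SR1482_WSIs": 593}


def count_czi(folder):
    """Count .czi files (case-insensitive) anywhere under folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1 for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() == ".czi"
    )


def check_download(root="."):
    """Return {folder_name: (found, expected)} for both sub-cohorts."""
    return {
        name: (count_czi(Path(root) / name), expected)
        for name, expected in EXPECTED.items()
    }


for name, (found, expected) in check_download().items():
    status = "OK" if found == expected else "MISMATCH"
    print(f"{name}: found {found} of {expected} .czi files [{status}]")
```

A mismatch usually means an interrupted transfer; re-running the download command is the simplest next step.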
The reproducibility directory contains step-by-step instructions to replicate the results shown in our DataNote paper. These include:
- Details on environment setup and required dependencies.
- Scripts for processing the WSIs and generating patch-level features.
- Guidelines for reproducing slide-level prediction results.
This ensures that all experiments can be reliably reproduced by other researchers using the provided dataset and embeddings.
The notebooks directory provides interactive examples for exploring the SurGen dataset and pre-extracted features:
- simple_load_wsi_tile.ipynb – Demonstrates how to interact with .CZI files in Python, including reading and viewing tiles from WSIs.
- patch_feature_extraction.ipynb – Shows how to extract patch-level features using Hugging Face models; this example uses the UNI foundation model.
- zarr_examined.ipynb – Explains the layout and usage of pre-extracted SurGen features stored in Zarr format, making it easier to integrate with downstream analysis pipelines.
These notebooks provide a practical starting point for using the dataset and applying it to various computational pathology tasks.
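Before opening the notebooks, a quick way to get a feel for a Zarr store is to read its metadata directly. The sketch below is not taken from zarr_examined.ipynb and makes no claim about the actual SurGen layout. It assumes the features are saved as a Zarr version 2 directory store, in which each array folder holds a `.zarray` JSON file; the path `example_features.zarr` is a hypothetical placeholder.

```python
import json
import os
from pathlib import Path


def describe_zarr_v2(store):
    """Map each array inside a Zarr v2 directory store to its basic metadata."""
    root = Path(store)
    arrays = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".zarray" not in filenames:
            continue  # not an array folder (may be a group or a chunk subfolder)
        meta = json.loads((Path(dirpath) / ".zarray").read_text())
        rel = Path(dirpath).relative_to(root).as_posix()
        arrays[rel] = {
            "shape": tuple(meta["shape"]),
            "chunks": tuple(meta["chunks"]),
            "dtype": meta["dtype"],
        }
    return arrays


# Hypothetical path; replace with the location of a downloaded feature store.
for name, info in describe_zarr_v2("example_features.zarr").items():
    print(name, info)
```

If the store uses a different Zarr version or a zipped container, this walker will find nothing, and opening the store with the zarr library as shown in the notebook is the better route.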
If you find this dataset or repository useful, please consider citing the following:
@article{myles2025surgen,
title={SurGen: 1020 H\&E-stained Whole Slide Images With Survival and Genetic Markers},
author={Myles, Craig and Um, In Hwa and Marshall, Craig and Harris-Birtill, David and Harrison, David J},
journal={arXiv preprint arXiv:2502.04946},
year={2025}
}
@inproceedings{myles2024leveraging,
title={Leveraging foundation models for enhanced detection of colorectal cancer biomarkers in small datasets},
author={Myles, Craig and Um, In Hwa and Harrison, David J and Harris-Birtill, David},
booktitle={Annual Conference on Medical Image Understanding and Analysis},
pages={329--343},
year={2024},
organization={Springer}
}