This repo contains the MEDI nextflow pipeline(s) for recovering food abundances and nutrient composition from metagenomic shotgun sequencing data.
It provides the following individual functionalities:
- Mapping items in FOODB to all currently available items in NCBI NIH databases and prioritizing hits
- Downloading and consolidating all full and partial assemblies, calculating ANI distances using minhashing, and annotating NCBI Taxonomy IDs for use with Kraken2
- Building the Kraken2 and BRACKEN hashes/databases, including decoy sequences
- Quantifying food DNA and per-portion nutrient composition for metagenomic samples, starting from raw FASTQ files
Warning
MEDI is built and tested on Linux only. Given the large resource requirements (500 GB of RAM), this should be the most common configuration anyway.
You will need a working miniforge or miniconda installation to start. You can follow the installation instructions here. After that, create an environment with the included conda environment file.
Start by cloning the repository and cd-ing into it.
git clone https://github.com/gibbons-lab/medi
cd medi
conda env create -n medi -f medi.yml
After that activate the environment.
conda activate medi
Tip
This will only be necessary if the provided binary does not work. You can test this by running ./bin/kraken2-report. If this returns "malformed taxonomy file", you are good. If you get errors about the ELF class or a missing libc version, you will need to recompile.
Kraken2 does not support generating reports on filtered output files by default. We provide a pre-compiled report generator from a Kraken2 fork. If this does not work you can compile it using the provided Makefile:
conda activate medi
make report
This will compile the report generator for your platform and replace the binary.
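After recompiling, you can repeat the check from the tip above to confirm the new binary works (a minimal sanity check, not part of the pipeline itself):

```bash
# Run the report generator without arguments; a complaint about a
# "malformed taxonomy file" means the binary itself runs fine on this system.
./bin/kraken2-report
```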
And you are done. If you are running this on an HPC cluster or a cloud provider, you might need to adjust your Nextflow settings for your setup.
All pipelines support a --threads parameter that defines the maximum number of threads to use for any single process.
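For example (the pipeline script and thread count below are purely illustrative; every MEDI pipeline accepts the flag):

```bash
# Cap every single process at 12 threads
nextflow run database.nf --threads 12
```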
If you run the pipelines from within the install directory, no additional setup is needed. After setting up the conda environment, there are two ways to run MEDI steps in arbitrary data directories.
First, you can create a variable holding the MEDI installation location and use it.
export MEDI=/my/install/location
cd /my/data/directory
nextflow run $MEDI/quant.nf --db=/my/medi_db
Second, you can create symlinks to the pipeline step and the bin directory.
cd /my/data/directory
ln -s /my/install/location/bin
ln -s /my/install/location/quant.nf
nextflow run quant.nf --db=/my/medi_db
Both will work the same.
Here are the full steps to build the database and run it on your data.
Note
If you are asking yourself why we don't just provide the built database for download please see the comments here.
nextflow run -resume database.nf
This will bootstrap the database from nothing, downloading all required files and performing
the matching against the current versions of NCBI GenBank and Nucleotide. You can speed up the querying by obtaining an NCBI API key and adding it to your .Rprofile
with
options(reutil.api.key="XXXXXX")
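If you prefer to set this up from the shell, something along these lines should work, assuming your R sessions read ~/.Rprofile (the key value is a placeholder):

```bash
# Append the NCBI API key option to the user-level .Rprofile
echo 'options(reutil.api.key="XXXXXX")' >> ~/.Rprofile
```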
Please also see the troubleshooting guide in case you encounter issues.
After running the previous step continue with
nextflow run build_kraken.nf --max_db_size=500
Here, --max_db_size denotes the maximum size of the database in GB. The default applies no reduction, but you can set this to a lower value, which will create a smaller but less accurate hash. Note that for good performance you will need as much RAM as the size you choose here.
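For instance, to build a reduced hash for a machine with less memory (the 250 GB value below is purely illustrative):

```bash
nextflow run build_kraken.nf --max_db_size=250
```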
Note that this step of the pipeline will not work with the -resume option. The add_* steps need to finish completely, or the pipeline needs to be restarted from the beginning. Should they finish and the later steps crash, you can trigger just the hash building using the --rebuild option, which will rebuild the database but not attempt to add sequences again.
Warning
Do not run this step with the -resume Nextflow option as this will result in a corrupted database. In case all sequence-adding steps worked and you only want to run the build step again, use the --rebuild=true option.
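So, if all add_* steps finished but the hash building itself failed, a recovery run could look like this (note that -resume is deliberately omitted):

```bash
# Rebuild only the hash; sequences are not added again
nextflow run build_kraken.nf --max_db_size=500 --rebuild=true
```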
For your own sequencing data, create a directory and use either of the setups described above. Then create a data directory and, within that, a raw folder containing your unprocessed, demultiplexed FASTQ files. So it should look like:
|- quant.nf // optional
|- bin/ // optional
|- data/
|-- raw/
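One way to set this up from the shell (the FASTQ source path and file pattern are placeholders for your own data):

```bash
cd /my/data/directory
mkdir -p data/raw
# Copy (or symlink) your demultiplexed FASTQ files into the raw folder
cp /path/to/your/fastqs/*.fastq.gz data/raw/
```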
After that you can run MEDI with
nextflow run quant.nf -resume --db=/path/to/medi_db
Where /path/to/medi_db/ should be the output directory from step (3), usually medi/data/medi_db.
Tip
For HPC clusters it can be beneficial to tune the --batchsize parameter to obtain massive performance benefits from memory-mapping. See here for more info. For single machines we recommend --batchsize 1.
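Putting this together, a run on a single machine might look like the following (paths are placeholders):

```bash
nextflow run quant.nf -resume --db=/path/to/medi_db --batchsize 1
```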
All outputs are found in the output directory, usually data/ if not specified otherwise.
| filename | description |
|---|---|
| food_abundance.csv | Food abundances for each sample in the data set. Samples with only NA entries had no detected food reads. |
| food_content.csv | Estimates for nutrient and compound abundances normalized to a representative portion of 100g. |
| multiqc_report.html | Quality control metrics for the sequencing data. |
By default, MEDI also reports bacteria, archaea, and host abundances with the same filtering and decoy steps as for the food quantification. Those can be found in the output directory (usually data/) and are named [RANK]_counts.csv, where [RANK] denotes the taxonomic rank (S - species, G - genus, D - domain).
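As an illustration, a successful run would leave files along these lines in the output directory (names follow the patterns described above; the listing is not exhaustive):

```bash
ls data/
# D_counts.csv  G_counts.csv  S_counts.csv
# food_abundance.csv  food_content.csv  multiqc_report.html
```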
- [ ] see if we can provide a reduced DB for download
- [ ] make execution more flexible
- [ ] add resource limits for individual steps for HPC clusters