
Data Storage on Xanadu

mianahom edited this page Oct 12, 2023 · 48 revisions

Data management guidelines for the Xanadu cluster

  1. What storage is available to me on Xanadu?
  2. How do I transfer data to and from Xanadu?
  3. I'm running out of space! What should I do?
  4. How do I know how much space I'm using anyway?

What storage is available to me on Xanadu?

  • Each user's home directory is allotted 2TB of storage.
  • Principal investigators may request a lab directory with additional storage.
  • Projects using Nanopore data generated in the CGI can be housed in a separate directory. Inquire if you are generating these data.
  • When you are no longer actively using data, it can be archived. Please request archive space when you are ready.
  • There is also space dedicated to shared resources. The CBC maintains local copies of many databases, such as NCBI's nr database, as well as many reference genomes, their annotations and aligner indexes (see a list here) and you can request new ones.

How do I transfer data to and from Xanadu?

  • We have a whole tutorial section devoted to this on our website!

I'm running out of space! What should I do?

Many Xanadu users will be able to save space by following these guidelines:

Keep files compressed.

  • Compress raw sequence data (fastq and fasta format) using gzip. Most bioinformatic applications can handle compressed data, and those that are limited by read/write speed will actually run faster using compressed data. Try checking the software you commonly use for options to specify input or output as gzipped files.

     # to compress a fastq file:
     gzip sequences.fastq
     # result: sequences.fastq.gz
    
     # decompress: 
     gzip -d sequences.fastq.gz
     # result: sequences.fastq
    
     # look at the first 10 lines:
     zcat sequences.fastq.gz | head
    
     # look at the last 10 lines:
     zcat sequences.fastq.gz | tail
    
     # using grep, print all lines starting with the SbfI cut site:
     zgrep "^TCGAGG" sequences.fastq.gz
     # OR:
     zcat sequences.fastq.gz | grep "^TCGAGG"
     # OR: 
     bioawk -c fastx '$seq ~ /^TCGAGG/' sequences.fastq.gz
     	# must install bioawk first
  • Compress SAM files (reference sequence alignments) into BAM files. Lots of software writes uncompressed SAM by default, but BAM files can be 5-10x smaller, depending on the type of data. For a slightly more advanced approach, try piping some of these steps together as in our variant calling tutorial.

     # load samtools
     module load samtools/1.9
    
     # compress to bam
     samtools view -S -b sequences.sam > sequences.bam
     rm sequences.sam
    
     # index for random access
     samtools index sequences.bam
     # outputs companion file: sequences.bam.bai
    
     # get reads from a genomic region:
     samtools view sequences.bam chr10:1000-2000
  • Compress VCF and BED files using bgzip and index them using tabix. Like sequence data, most bioinformatic software can handle compressed versions of common file formats. bgzip is essentially gzip, with minor modifications that allow files to be indexed by tabix for fast random access. In fact, any tabular data with genomic coordinates (sorted by coordinate order!) can be compressed and indexed by bgzip/tabix.

     # load htslib
     module load htslib/1.9
    
     # compress variants using bgzip
     bgzip variants.vcf
     # output: variants.vcf.gz
    
     # use zcat, zgrep, less as above, it's still just a gzipped file
     zcat variants.vcf.gz | head
    
     # use tabix to index
     tabix -p vcf variants.vcf.gz
     # outputs a companion index file: variants.vcf.gz.tbi
    
     # retrieve a genomic region:
     tabix variants.vcf.gz chr10:1000-2000
     # bcftools can also manipulate compressed, indexed vcf files in many useful ways. 
  • In fact, most plain-text data files (e.g. tab-delimited and csv) can be gzip-compressed. R can read gzipped files like any other file, with no special options needed.
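Putting the compression commands above together, here is a quick, self-contained sanity check (file names are hypothetical) that counts the reads in a gzipped FASTQ without ever writing an uncompressed copy to disk. A FASTQ record is always four lines, so reads = lines / 4.

```shell
# make a tiny two-read FASTQ (hypothetical demo data) and compress it
printf '@r1\nTCGAGGAC\n+\nIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > demo.fastq
gzip -f demo.fastq
# result: demo.fastq.gz

# count reads without decompressing to disk: a FASTQ record is 4 lines
echo $(( $(zcat demo.fastq.gz | wc -l) / 4 ))
# prints: 2
```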

Delete unneeded files.

  • Delete copies of databases, reference genomes and indexes: Did you know that the CBC maintains copies of commonly used databases, indexes, and reference genomes (along with their annotations and pre-computed indexes)? Instead of downloading your own copy (or five) you can point to these. You can find a list of the most recent versions of these here and you can request that we add new ones or update them. We also retain older versions for reproducibility. These are available here: /isg/shared/databases/. If you need a genome or database to be represented in your project directory, you can use symlinks, as detailed below.
  • Delete intermediate files from analysis pipelines. Does your NGS sequence analysis pipeline go something like this? Trim fastqs -> align fastqs -> sort SAM -> mark duplicates in SAM -> add read groups to SAM -> compress to bam -> index? If you do this, you may have generated 5 copies of your entire dataset. Delete all those intermediate files! Or, as above, try writing a pipeline that doesn't write them to disk in the first place.
  • Delete (or don't create) extra copies of data. If you're doing multiple analyses on a single dataset (e.g. pipeline1 and pipeline2) and each pipeline needs a copy of the data in its own project directory, you don't actually need to copy it. You can use symbolic links (a.k.a. symlinks).
     # the original copy of your data may live here: ~/rawdata/project1/
     # but you need the data to be here: ~/pipeline1/datadir and here ~/pipeline2/datadir
     # use ln to create symbolic links:
     ln -s ~/rawdata/project1 ~/pipeline1/datadir
     ln -s ~/rawdata/project1 ~/pipeline2/datadir
     # now each project directory has a symlink pointing back at ~/rawdata/project1/
     # the original data are represented 3 times in your directory structure, but there is only a single copy using disk space. 
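As a self-contained sketch (with hypothetical paths under the current directory), the following shows that a symlink itself occupies essentially no space, while the data remain fully readable through it:

```shell
# stand-in for raw data: a 1 MB file (hypothetical names)
mkdir -p rawdata_demo pipeline_demo
head -c 1048576 /dev/zero > rawdata_demo/reads.fastq

# link instead of copying
ln -s "$PWD/rawdata_demo/reads.fastq" pipeline_demo/reads.fastq

# the link is just a pointer to the original path
readlink pipeline_demo/reads.fastq

# du shows ~1M for the real data, but only a few KB for the directory of links
du -sh rawdata_demo pipeline_demo
```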

Remove old data from Xanadu.

  • Archive old data. If you're no longer actively analyzing a dataset, you can request archive space to store it in. Archived data will not count against your storage allocation. You should only do this if you have no plans to use the data in the near future, as archive space has slow read and write speeds and is unsuitable for data analysis. It's best to compress the entire project into a single tarball before archiving, as the archive storage doesn't handle complex directory structures well.

     # to create a tarball and compress a project:
     tar -czvf oldproject.tar.gz ~/old_project_directory
     	# c creates; z compresses; v is 'verbose'; f specifies tarball name. 
    
     # to extract it:
     tar -xvf oldproject.tar.gz
     	# x extracts

    If the project directory is large, you can take advantage of pigz's multithreaded compression. Here is the example above, rewritten to use pigz.

     # to create a tarball and compress a project:
     tar cf - ~/old_project_directory | pigz -9 -p 4 > oldproject.tar.gz
     	# tar flags: c creates; f specifies the tarball name ('-' writes to stdout).
     	# pigz flags: -9 requests stronger compression; -p 4 requests 4 threads.
    
     # pigz cannot parallelize decompression, so extract with tar as usual:
     tar -xvf oldproject.tar.gz
     	# x extracts

    Once you've created oldproject.tar.gz you can transfer it to an archival filesystem and remove it from your home or lab directory. There are two archival directories (similar to the home and lab directories): /archive/users and /archive/projects. Once you have been granted space there, you should use rsync to copy the data. rsync is a robust way of transferring data: it can be restarted if interrupted and it checks the integrity of each file after transfer, making it superior to cp and mv (which you should never use to move data to archive space).

     # to copy the data:
     rsync -avzh --inplace --progress --stats /home/CAM/nreid/oldproject.tar.gz /linuxshare/users/nreid/
     # this will copy oldproject.tar.gz to the nreid archive space
     # after transferring to archive, you can remove the tarball from your working directory
     rm /home/CAM/nreid/oldproject.tar.gz
  • Delete old data. If you have published a dataset, uploaded it to the Sequence Read Archive at the NCBI, and have no concrete plans to revisit it in the near future, consider removing it entirely from Xanadu.

Create symbolic links instead of duplicating files.

CAUTION:

  1. Avoid creating symlinks to directories where possible; if you do, see "How to delete symlinks?" below for how to remove them safely.
  2. Make sure to change the permissions of the directory and files as recommended below.

What are symbolic links?

Symlinks, or symbolic links, are analogous to the shortcuts created in the Windows or macOS operating systems.

Symlinks help avoid duplicating data when you analyze it through multiple pipelines or software packages, particularly when you want data files to appear in the analysis directory (whether for convenience or because a particular application demands it). Copying data can be time-consuming when files are larger than about 5GB, but that does not mean copying smaller files is fine; ideally, users should avoid copying data altogether.

The example below demonstrates how to create symlinks to multiple files located in a source data directory; the symlinks will be created in the analysis directory.

Let's examine the contents of the source directory /labs/CBC/day1/array/fastqData

#list contents of source directory
ls -l /labs/CBC/day1/array/fastqData/

-rwxrwxr-x 1 vsingh wegrzynlab 25362757 Jan  6  2021 sample01.fastq.gz
-rwxrwxr-x 1 vsingh wegrzynlab 25330891 Jan  6  2021 sample02.fastq.gz
-rwxrwxr-x 1 vsingh wegrzynlab 25354638 Jan  6  2021 sample03.fastq.gz
...
-rwxrwxr-x 1 vsingh wegrzynlab 25348974 Jan  6  2021 sample45.fastq.gz
-rwxrwxr-x 1 vsingh wegrzynlab  1722422 Jan  6  2021 sample46.fastq.gz
# there are 47 fastq files; this is a truncated listing
    
# change permissions on your source directory and files from -rwxrwxr-x to -r-xr-xr-x
chmod -R 555 /labs/CBC/day1/array/fastqData

# let's check that the desired permissions are set
ls -l /labs/CBC/day1/array/fastqData

-r-xr-xr-x 1 vsingh wegrzynlab 25328825 Jan  6  2021 sample00.fastq.gz
-r-xr-xr-x 1 vsingh wegrzynlab 25362757 Jan  6  2021 sample01.fastq.gz
-r-xr-xr-x 1 vsingh wegrzynlab 25330891 Jan  6  2021 sample02.fastq.gz
...
-r-xr-xr-x 1 vsingh wegrzynlab 25348974 Jan  6  2021 sample45.fastq.gz
-r-xr-xr-x 1 vsingh wegrzynlab  1722422 Jan  6  2021 sample46.fastq.gz

Changing the permissions to -r-xr-xr-x is strongly recommended, as it prevents accidental deletion of the files.

The destination directory is /labs/CBC/analysis/rawdata/.

Now, let's create symlinks

ANALYSIS_DIRECTORY="/labs/CBC/analysis/rawdata/"

# create a variable FILELIST that holds the full paths of all input fastq.gz files
FILELIST=$(ls -1 -d /labs/CBC/day1/array/fastqData/*.fastq.gz)

# loop that iterates through the input fastq.gz files and creates a symlink for each
for EACH in ${FILELIST}
do
    LINKNAME=$(basename ${EACH})   # strip the path, keeping just the file name
    ln -s ${EACH} ${ANALYSIS_DIRECTORY}/${LINKNAME}
done

Now check that the symlinks were created:

ls -l /labs/CBC/analysis/rawdata/
lrwxrwxrwx 1 vsingh wegrzynlab 48 Sep 15 11:42 sample00.fastq.gz -> /labs/CBC/day1/array/fastqData/sample00.fastq.gz
lrwxrwxrwx 1 vsingh wegrzynlab 48 Sep 15 11:42 sample01.fastq.gz -> /labs/CBC/day1/array/fastqData/sample01.fastq.gz
lrwxrwxrwx 1 vsingh wegrzynlab 48 Sep 15 11:42 sample02.fastq.gz -> /labs/CBC/day1/array/fastqData/sample02.fastq.gz
...
lrwxrwxrwx 1 vsingh wegrzynlab 48 Sep 15 11:42 sample45.fastq.gz -> /labs/CBC/day1/array/fastqData/sample45.fastq.gz
lrwxrwxrwx 1 vsingh wegrzynlab 48 Sep 15 11:42 sample46.fastq.gz -> /labs/CBC/day1/array/fastqData/sample46.fastq.gz

The l in the permissions field lrwxrwxrwx shows that these are symlinks, as does the last field (e.g. sample46.fastq.gz -> /labs/CBC/day1/array/fastqData/sample46.fastq.gz), which shows where each symlink's source file is located.

You can use these symlinks in your analysis as regular files.
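As a minimal, self-contained check (hypothetical names, runnable anywhere), tools like zcat read through symlinks transparently:

```shell
# a gzipped file in a source directory, and a symlink to it in an analysis directory
mkdir -p source_demo analysis_demo
printf 'ACGT\n' | gzip > source_demo/sample.txt.gz
ln -s "$PWD/source_demo/sample.txt.gz" analysis_demo/sample.txt.gz

# zcat follows the link to the real file
zcat analysis_demo/sample.txt.gz
# prints: ACGT
```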

How to delete symlinks?

You can remove symlinks like any other file with rm. However, there is one tricky case to be aware of: if you have created a symlink to a directory instead of a file, it's easy to accidentally delete the contents of that directory.

For example, consider a directory named files that you create a symlink to:

ln -s files/ linkdir

If later you wish to remove linkdir, you must type:

rm linkdir

to remove the symlink. If instead you include a trailing slash:

rm linkdir/

You will get an error: rm: cannot remove 'linkdir/': Is a directory. If you use TAB autocompletion, that trailing slash will be added automatically. If you then try:

rm -r linkdir/

which we usually use to remove directories, the symlink will not be deleted, but every file in the linked directory will be, which is probably not what you wanted.

So, to sum up, when deleting symlinks to directories:

WRONG methods

rm linkdir/
rm -r linkdir/

CORRECT method

rm linkdir
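You can verify this behavior safely in a scratch directory (hypothetical names):

```shell
# set up a directory with a file in it, plus a symlink to the directory
mkdir -p files_demo
touch files_demo/a.txt
ln -s files_demo linkdir_demo

# correct: removes only the symlink
rm linkdir_demo

# the original directory and its contents are untouched
ls files_demo
# prints: a.txt
```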

How do I know how much space I'm using anyway?

Use du. Go to your home directory and type:

du -sh

This will give the size of all data housed in that directory. Type:

du -sh *

To list storage for each subdirectory.
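To find the biggest space consumers quickly, you can pipe du through sort -h so the largest entries appear last. A small self-contained demonstration, using hypothetical directory names:

```shell
# build two demo subdirectories of very different sizes
mkdir -p du_demo/big du_demo/small
head -c 1048576 /dev/zero > du_demo/big/file
head -c 10 /dev/zero > du_demo/small/file

# human-readable sizes, sorted smallest to largest
du -sh du_demo/* | sort -h
# the last line is the largest directory (du_demo/big)
```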

For lab directories, check this site, which is updated daily.