seqio
is a python library (3.3+ only) for highly optimized reading/writing of raw (i.e. unaligned) NGS data in a variety of formats:
- FASTA
- FASTQ
- Interleaved FASTQ
- SAM/BAM/CRAM
- Common compression formats (gz, bz2, xz), which are handled automatically (via xphyle)
seqio
is currently in development (pre-alpha).
seqio
achieves significant performance gains over similar libraries (e.g. screed
) in three ways:
- Implements critical portions of code in C (via Cython).
- Maintains data as bytes by default, rather than perform expensive string encoding/decoding operations.
- Uses system-level compression/decompression (via xphyle) when possible.
The generated sequences can be either immutable or mutable. With immutable sequences, slicing always returns a new sequence, and modifications to the sequence and qualities are not supported. With mutable sequences, modifications are applied in-place and each modification is recorded as an Edit
.
# Here we will filter out reads with low mean quality
# or which contain an adapter sequence, and write the
# retained reads to an interleaved, compressed fastq file
from seqio import reader, writer
fq = ('reads{}.fq.gz'.format(i) for i in (1,2))
adapter = b'AGATCGAAT'
writer = writer('output.fq.gz', interleaved=True)
for read1, read2 in reader(*fq):
# qualities can also be retrieved as a list of
# phred-scale integers
mean_qual = sum(read1.qualities_phred(base=33)) / len(read1)
# the following is equivalent and a bit faster:
from seqio.util import mean_quality
mean_qual = mean_quality(read1)
# for most operations (e.g. comparisons), byte array
# sequence and qualities can be used directly
has_adapter = read1.sequence.startswith(adapter)
# the following is equivalent and will take care of
# string conversion for us, but is slightly slower
from seqio.util import has_prefix
has_adapter = has_prefix(read1, 'AGATCGAAT')
# names are stored as unicode strings
print(read1.name)
# sequences and qualities are stored as byte arrays,
# so explicit conversion to string is required
print("{}\n{}\n\n{}\n{}".format(
read1.sequence_str, read1.qualities_str,
read2.sequence_str, read2.qualities_str))
print("Mean read1 quality: {:.3d}".format(mean_qual))
print("Contains adapter? {}".format(has_adapter))
if mean_qual >= 30 and not has_adapter:
writer.write(read1, read2)
These are all installable via pip:
- Cython 0.25+ (only required at build-time)
- xphyle 1.0+
- pysam (only required for SAM/BAM/CRAM support)
- Look at dnaio - maybe combine forces? https://github.com/marcelm/dnaio/
- Add FAST5 support (using poretools)
- Add xphyle plugins for sequence-specific compression formats:
- Treat BAM and CRAM as compression formats using native samtools/sambamba if available
- fqzcomp, quip, scalce http://www.nature.com.ezproxy.nihlibrary.nih.gov/nmeth/journal/v13/n12/pdf/nmeth.4037.pdf
- https://github.com/COMBINE-lab/quark
- https://github.com/rcanovas/libCSAM
- https://github.com/Zhuzxlab/LW-FQZip2
- Casava/BCL support might be useful to some; will wait to see if this is requested.
- Check out similar library in Rust: https://github.com/onecodex/needletail
- FaStore support? https://github.com/refresh-bio/FaStore
- Check out skbio.io: http://scikit-bio.org/docs/0.4.1/index.html
- Look at using pybam instead of pysam
- Look at using the numpy-based data structures in biotite