More than 65535 cigar opcodes #40
We should implement this feature. I would prefer to keep |
We do not use CSI indexes within HTSJDK and Picard, so I would have to think about this in our production pipeline, even as we are moving to CRAM. My apologies for not having a definitive yes/no answer at the moment. |
Do you trust and use the bin field though, or recompute it on the fly? |
Looking at our code, it currently reads the |
So looking through our code, we could also not use the |
As various CIGAR-related loops and API functions use int, the practical limit without further changes is 2^31-1 rather than 2^32-1. As bam_init1() uses calloc(), the unused field is initialised to 0 and we don't need to bother re-zeroing it in sam_read1() et al. Until the BAM format is extended to deal with it (samtools/hts-specs#40), trying to write >64K CIGAR operations to a BAM file remains an error. Implements and fixes #437. (Update htslib/sam.h copyright notice for changes in 2015 and 2016. MinGW does not have EOVERFLOW.)
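As a rough sketch of the guard that commit describes (not the actual htslib code), a writer that keeps the CIGAR count in 32 bits in memory still has to refuse anything that will not fit BAM's 16-bit n_cigar_op field:

```c
#include <errno.h>
#include <stdint.h>

/* Sketch only, not the htslib implementation: the in-memory CIGAR count is
 * 32-bit, but BAM's on-disk n_cigar_op field is uint16_t, so a BAM writer
 * must reject anything above 65535 until the format itself is extended. */
static int check_bam_cigar_count(uint32_t n_cigar)
{
    if (n_cigar > UINT16_MAX) {
        errno = EOVERFLOW;   /* not available on MinGW, as the commit notes */
        return -1;           /* caller reports "too many CIGAR operations" */
    }
    return 0;
}
```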
This is now an active issue with ONT data; I've provided a dataset in FASTA that will routinely run into this limit here when aligned with BWA-MEM: |
Thanks for the real world examples Nick. The htslib fix John made above solves this for SAM and CRAM, but not BAM due to file format limitations. I verified this on a CRAM file with 178801 cigar operations and the round trip from SAM -> CRAM -> SAM is identical in the latest develop branch. Following discussions on samtools-devel several years back https://sourceforge.net/p/samtools/mailman/message/30668698/ I implemented this out of necessity in Staden "io_lib", so scramble can read and write BAM records with more than 64k entries too. I'm not advocating using that, but it's a practical demonstration that the BAM spec could (and should?) be updated if there is pressure. In the interim, have you thought of just using CRAM or compressed SAM? Note that you can use CRAM without needing to specify a reference, which may be more efficient with lots of reference differences anyway.
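Something along the lines of the following htslib sketch would do it (recent htslib API; the file names and the lack of error handling are illustrative only):

```c
#include <htslib/hts.h>
#include <htslib/sam.h>

/* Illustrative only: copy SAM to CRAM with reference-based compression
 * disabled, so no external reference is needed. */
int main(void)
{
    samFile *in  = sam_open("in.sam", "r");
    samFile *out = sam_open("out.cram", "wc");     /* "wc" = write CRAM */
    hts_set_opt(out, CRAM_OPT_NO_REF, 1);          /* store sequences verbatim */

    sam_hdr_t *hdr = sam_hdr_read(in);
    sam_hdr_write(out, hdr);

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0)
        sam_write1(out, hdr, b);

    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(out);
    sam_close(in);
    return 0;
}
```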
|
A query on the proposed extra FLAG field. Is this only the BAM format we are changing, or is it SAM also? If SAM, do we therefore expose it? Eg a record with 100,000 cigar ops could have the SAM FLAG as >= 32768. It feels like this is an internal BAM limitation and it should be in the BAM part of the spec and absent in the SAM part, as this is a binary encoding issue only. |
I would define the FLAG bit as reserved for BAM use as described, and say the bit should be clear in SAM. |
Given that we're likely to see more and more of this going forward, would it be crazy to consider a non-backwards compatible change to the BAM spec along with rev'ing the major version number? I think that would be much preferable in the long run to having a flag field that is reserved for BAM use and whose meaning is questionable for SAM/CRAM use. The latter means that all programs that convert between formats need to be smart about modifying the records prior to writing. I would much prefer to see this done with a version bump in the spec tied to widening the |
We have maintained backward and nearly forward compatibility of BAM for over 8 years. In my opinion, we should keep it this way as long as there is a solution -- I am always on the compatibility side. Quite a few tools specifically depend on an old version of samtools or use bamtools, native golang, javascript etc to parse BAMs. @jkbonfield's solution will keep them usable, which I think is very important. |
I guess I'm not sure I follow. Won't all those implementations break given @jkbonfield's solution and a BAM file that actually contains > 65535 cigar operations for at least one record? Unless those implementations (perhaps conditionally on a flag bit) know to use the 16-bits of The only kind of backwards compatibility it yields is that files that don't contain any cigars with ncigar > 65535 would still parse normally. I realize that's not a small fraction of files out there, but the point is that existing implementations would fail in bizarre ways given long cigars with reusing the At the very least I think it would make sense to version the specification, and re-write the section that contains the current definition of |
Yes, forward compatibility is only achievable when there are fewer than 65536 CIGAR operations, but this is the whole point: we want BAMs without long CIGARs usable by all the old tools, including tools relying on 3rd-party BAM parsing libraries written in other languages.
I don't think this is possible without breaking compatibility. If I am right,
|
Yes the order was incorrect to be a 32-bit integer (neither big nor little nor even pdp endian). It's all in that email discussion from the original post. I think that also shows it was Heng's idea originally, not mine. I just implemented it. :) Frankly if we were going to break binary compatibility for ALL data, then we'd change lots of BAM things (eg swapping cigar and name around so we don't have to play games to avoid unaligned memory accesses), but it's just a fool's errand. We may as well start again with a new format or work on improving the next CRAM iteration rather than changing the BAM format in a non-compatible manner. The only problem I see with using a flag is that you never really know whether you'll hit a record that old tools can't cope with until you find it, unlike changing the format. Hence I'd say we should bump the minor version number and leave the major version as is, keeping the file backwards compatible and stating that any tools that explicitly want to conform to the new minor version have to cope with the additional flag. I did produce a test file using Scramble to see how samtools/htslib copes; it coped badly. It started decoding the remaining CIGAR ops as sequence, emitted a bunch of binary qualities and then returned 0 exit status for good measure. So we know this will break things horribly IF we encounter more than 64k cigar ops, but it's already hopelessly broken in that scenario anyway (re: impossible). |
I feel a bit stupid asking this, but how can I find out which version of the SAM spec I'm reading? I know @HD has a VN tag, but looking at several copies of SAMv1.pdf from my archives I haven't seen a minor revision number *specified* in the document even though it goes through regular updates. The 2014 copy has examples (not specification) of HD with VN:1.5 and VN:1.3; the 2016 copy has just VN:1.5 (in examples) but the document has a lot of changes since 2014.
Could we clearly state in the first few header lines of the document what the SAM/BAM version number is?
Thanks, Colin
|
Agreed it's unclear. It would be friendlier for it to be in the main title of the document. I suspect the version with examples containing different HD lines was an accident, but clearly this is best avoided by putting the actual version at the top rather than in an example. Largely the changes boil down to two things - substantive changes in file contents and/or meanings, and minor language adjustments or clarifications. The latter do not change the minor version number as the file format is unchanged; however, it would be good if there were some sub-revision counter somewhere so you know which era of document you are reading. Note also that SAM auxiliary tags have now been moved to their own document. Previously, adding new tags would likely have meant increasing the SAM minor version number. It gets more confusing with BAM, though. BAM has a magic number "BAM\1" and no minor version number at all. I assume it is expected to be gleaned from the "HD" header, but this isn't a required field. For the purposes of this proposed change, it will be challenging unless we go down the route of losing forwards compatibility and switching to "BAM\2". The SAM specification should not, IMO, be adjusted to indicate the size change of ncigar or to add the flag, as this is purely an internal BAM representation and not a change needed to the SAM specification itself (which is entirely textual). Mind you, the same is true for the bin field too - it's entirely a BAM optimisation and not in SAM. |
A consumer of BAM files that is not programmed to handle long CIGARs will break when it gets to the first record with a long CIGAR. I feel that's a bit late and very nasty.
Changing the magic number might be an easy way to tag the BAM as having the extended-length CIGAR option. Producers of extended-length CIGARs just need to set the new magic number. Those doing SAM to BAM that don't know at the outset whether they have long CIGARs could take an extra option to allow long CIGARs and generate the new magic string. These programs need to be updated anyway.
Consumers won't recognise the new magic number unless they've been updated to handle it.
|
@colinhercus The problem with setting the magic number that way is that producers have to know in advance whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read. There are other options we could consider, rather than just going with the one solution proposed. E.g. we might consider that if you have > 65k cigar operations, the right way to handle it is to push the remaining cigar operations into an optional tag, and either terminate the cigar field with a soft-clip that covers the remaining bases, or a new symbolic cigar operator that would a) inform up-to-date tools that there is more cigar elsewhere, and b) cause existing tools to terminate with an error because they don't recognize the cigar operator. Re: ordering of bits and the bin field, I think I wasn't clear enough. I was suggesting two things: 1. If there are situations in which the bin field cannot be relied upon in all cases, the Sanger folks already don't seem to trust it, and it can be re-computed from the record, I would prefer going forward to not have bin in the format description. 2. I was suggesting re-writing the document to collapse those two descriptions of 32-bit chunks of memory to a single description of a 64-bit chunk of memory, and then to document more clearly how each byte or word is used, following the solution you guys outlined to use two 16-bit words for ncigar.
|
If we were to bump the BAM version number then I'd say the default should be the current BAM\1 value (unless the input data was BAM\2, in which case retain that), with command line switches to specifically request the newer format. That keeps the vast bulk of data in a well supported format. It's still problematic though. There is certainly merit in using auxiliary fields to work around this issue in a completely compatible manner - a nice idea. However it migrates the problem from being BAM specific to also being SAM and CRAM oriented, unless we simply patch the issue in BAM only: when BAM encounters a SAM cigar it cannot handle, it stores it in a new tag, and on decode it reverses this so that it is transparent to the users. This can also make it transparent to the APIs. Regarding bin, it cannot be relied on in some cases at least; eg when using CSI because you have an extra-large chromosome. That means it's unreliable and best off computed on the fly using whatever index algorithm you have. We've had subtle bugs as well (cough, my bugs in cram!) which filled it out incorrectly in a few boundary cases. |
@timfennell
"@colinhercus <https://github.com/colinhercus> The problem with setting the magic number that way is that producers have to know *in advance* whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read."
This is why I suggested adding an option to these programs so that they knew in advance. Obviously if the option wasn't set then an error "CIGAR exceeds 65535" would be reported. Some producers could default to the new format and always produce BAM\2, and their BAMs wouldn't be accepted by consumers that don't recognise the new format.
The advantage is that consumers that haven't been updated will not recognise the magic number and will not accept the file, rather than crashing when they hit the first read with the new format.
The same undefined behaviour will happen with any consumer that doesn't understand the new format, whatever format you choose, and it's unrealistic to think all consumers will be updated promptly.
|
I am reviving this thread again. With the release of NA12878 ultra-long nanopore reads, long CIGARs are becoming more frequent. After some thought, I now prefer the following solution, as suggested by @tfenne. When converting an alignment with >65535 CIGAR operations to a record in a BAM file, we mark the alignment as unmapped (flag 0x4), set An alternative to [EDIT for clarity] |
@lh3 you said in SAM, but I presume you mean in BAM here?: "For an alignment with a long CIGAR in SAM, we mark the alignment as unmapped (flag 0x4), set ..." i.e. There is no change when such long mapped reads are stored in SAM format which has no limits on the number of CIGAR operators. [update - fixed] |
@peterjc see edit for clarity. |
@lh3 that's better now, thanks! |
Thumbs up to this idea. I think it's a workable way of addressing the problem in BAM. For SAM and CRAM no change needs to be made. I agree CG:B,I is appropriate as this field should never be present in SAM or CRAM, only BAM (we could even state that in the specification) in which case the normal BAM specific encoding is the natural way to store it. As per the SAM spec:
If [flag] 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ and bits 0x2, 0x100 and 0x800
By setting flag 0x4 we aren't technically breaking the specification, but we are adding some meaning to those fields, as the flag bit may later be removed, resurrecting those other fields once more. We'd need some footnote to explain this caveat. |
Unmapped will not work, since various formats (CRAM?) will throw away mapping information if they see an unmapped read... so you'll lose reference, position, strands... not good.
So I'm seeing a problem where we first write to BAM and then convert to CRAM. There will need to be some logic introduced to capture the difference in behavior. Can't we leave the read as mapped (which is more correct) but use a null CIGAR? Is that legal?
|
!? Are you saying that IGV already uses the CIGAR in the |
No, it doesn't. I mean IGV recognizes EDIT: displaying something, albeit clearly wrong, is better than displaying nothing on the screen. |
I think the main thing is depth, mpileup and similar tools. They'll omit the soft-clipped read, essentially acting as if the long reads had been filtered out. Mostly this won't give a different call, but it may. Realistically it will only lose differences outright on poor-quality calls, but it may also turn a call with qual 31 into qual 29 and make it vanish on a filter of QUAL >= 30; such is the nature of using hard thresholds. The chances are slim and probably irrelevant, but you can never tell when it mattered. As Heng points out, there are several other pitfalls that have the same impact, such as running an older copy of the aligner or simply an inappropriate one for long reads. Basically, using old tools is equivalent to filtering out the long reads first, but with the added nuance of (probably) not realising you're doing it. Using BAM2 is the equivalent of all your old tools failing hard, forcing you to either upgrade if available or manually remove the long cigars and generate BAM1 before running the old tools again. |
My preference is to use entirely long cigars or no long cigars to avoid silently dropping just the long alignments. |
Update to my proposal with inputs from James and Shaun:
With this solution:
Of course, you can still make up examples that lead to silent loss of data, but as I said above, there are much easier ways to make much bigger mistakes, for example by using an old mapper. We should not be obsessed with "absolute robustness at any cost". Some costs, like the risk of fragmenting the BAM ecosystem without careful planning, are too high for us to afford. The above is the best solution. |
@sjackman I don't object. I will leave header tag and/or clipping every read to you and others after we settle on full clipping. We will move forward quickly after that. |
Meta-status:
|
EBI had archiveBAM. It was proposed in early 2011 but seems not to be used any more? ArchiveBAM was researcher oriented. Even though it didn't fly, I think the same idea might work in the clinical world, where we care about robustness a lot more. We could have clinicalSAM, a strict subset of SAM with additional restrictions such as mandated HD-SO, SQ-M5, RG-PL, no non-standard tags, etc. We ought to implement a tool that validates and converts to clinicalSAM. GA4GH can then set it as a standard for clinical sequence data. Research and clinical worlds are quite different. From the discussion here, it seems to me that a single standard won't work well for both worlds at the same time. ClinicalSAM might be one of the potential solutions. Thoughts?
Derek from PacBio (author of bamtools and the most significant contributor to pbbam) is largely neutral. Generally, vendors and archives rely on public standards and implementations. Their interest is to make their tools/data seamlessly interchangeable with the rest of the world. It is good to hear from them, but this is not critical to the discussion in my opinion. |
Not speaking officially for PacBio, of course. But yes, you nailed it, Heng. Our interest is mostly that the issue is solved and standardized. Something we can code to and ship data/software out with minimal concern. Unless/until I hear differently from upstairs, the specific implementation details are less critical. |
In preparation for the Thursday conference call I've produced a summary of the two main proposals for a solution to this problem. It's long and has had a round of review, so it's a Google doc: https://docs.google.com/document/d/1PzH9k5umvraI_ounpcKcUdawj7Ufs-N-jul77xDs1dI The aim of writing this document is primarily to list the solutions and their associated risks so that we can discuss our preferred implementation on Thursday from a basis of knowledge rather than gut feelings or unsupported conjecture. Assessing risk likelihood and severity is of course subjective, although I've tried to be neutral in this, but irrespective of that, just listing the potential problems and ways of mitigating them is important so we can make our implementation as robust as possible. Also note that most likelihoods are conditional - conditional on not updating software and obviously conditional on being one of the affected people with extra-long cigars present in their data. |
I have commented on an earlier version of this document. Some of my concerns have been addressed. Here are some different views on estimating likelihood and severity, as I can't comment on the document any more. The expected number of ultra-long read users who actually make a mistake by using older software will be 10s at most. In the next year or two, the expected number of users who may see BAM2, even if unnecessarily, is 100s at least, likely 1000s - a couple of orders of magnitude higher. I strongly disagree with the severity ratings because many of them can't be measured and are too subjective. You can consider the following scenarios:
You can objectively compare the severity of these scenarios, at least the first 3; 1 is the lowest. Now we consider what may happen when you break the BAM ecosystem.
How do you compare a broken pipeline vs changing a mapper? How do you weigh waiting for several hours to downgrade BAM2 vs using an old version of software? The severity of points 5–8 is highly subjective and varies a lot between users. To me, the severity of 1 is the lowest and 2–4 is lower than 5–8, but the document clearly thinks differently. |
A quick summary for the benefit of those not on the conference call: We discussed this at length on the GA4GH conference call and I'm pleased to say that, 4 years after the initial problem(!), we came to a unanimous agreement between the 3 SAM spec maintainers (albeit definitely not unanimous within the audience). Our solution is to go ahead with the CG tag (#227). While the concept has now been accepted, we still need to review the text in the specification itself, and some code PRs need updating to incorporate the "N" cigar idea. Hopefully this will not be a lengthy process. (Keeping this issue open until spec PR #227 is merged.) |
@jkbonfield who are the current three spec maintainers? Kudos to them for a unanimous decision. |
@nh13: See MAINTAINERS.md. |
This commit addresses samtools#40. It adds the optional tag `CG` and explains the workaround for storing alignments with >65535 CIGAR operations in BAM files. The proposal is implemented in samtools/htslib#560.
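As a rough illustration of the convention that commit documents (recent htslib versions undo it transparently inside sam_read1(), so the helper below is only a sketch with a made-up name):

```c
#include <htslib/sam.h>

/* Sketch of the CG-tag convention: the stored CIGAR is the two-op
 * placeholder kSmN (k = read length, m = reference length consumed)
 * and the real CIGAR lives in a CG:B,I auxiliary tag. */
static int uses_cg_long_cigar(const bam1_t *b)
{
    const uint32_t *cig = bam_get_cigar(b);
    if (b->core.n_cigar != 2) return 0;
    if (bam_cigar_op(cig[0]) != BAM_CSOFT_CLIP ||
        bam_cigar_oplen(cig[0]) != (uint32_t)b->core.l_qseq) return 0;
    if (bam_cigar_op(cig[1]) != BAM_CREF_SKIP) return 0;
    return bam_aux_get(b, "CG") != NULL;   /* real CIGAR is in this tag */
}
```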
Resolved in #227 some time ago. |
htsjdk supports this now samtools/htsjdk@a00b958 |
There was discussion on the samtools-dev mailing list about this last year:
http://sourceforge.net/p/samtools/mailman/message/30672431/
The crux of the discussion there was to reuse the 16-bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator (sketched below).
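A reader-side sketch of that packing (this proposal was never adopted; the 0x8000 indicator bit is only the hypothetical suggestion from that thread):

```c
#include <stdint.h>

/* Hypothetical decoding of the proposal above: reuse the 16-bit bin field
 * as the high half of the CIGAR count, signalled by the top FLAG bit. */
static uint32_t decode_ncigar(uint16_t flag, uint16_t bin, uint16_t n_cigar_op)
{
    uint32_t n = n_cigar_op;
    if (flag & 0x8000)               /* hypothetical "long CIGAR" indicator */
        n |= (uint32_t)bin << 16;    /* bin carries bits 16-31 of the count */
    return n;
}
```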
There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely a placeholder to track the problem and collate ideas/fixes.
Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away with actual sequence reads too. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but also higher indel rates, leading to substantially more CIGAR opcodes. A 320Kb read with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k CIGAR ops. (Those aren't figures for real technologies, but it's not inconceivable.)
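Spelling out that back-of-the-envelope estimate (using the hypothetical figures above):

```latex
\frac{320\,000\ \text{bases}}{10\ \text{bases per indel}}
  \times 2\ \frac{\text{ops}}{\text{indel}}
  = 64\,000\ \text{CIGAR ops},
\qquad 64\,000 \approx 2^{16}-1 = 65\,535
```

i.e. a single such read already sits within a couple of percent of the 16-bit ceiling.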