Add "chromosome" column to edges #11
Comments
Note that where chromosomes are duplicated, we might (or might not) want to give the copies different chromosome IDs. Essentially, the chromosome ID marks which segment of the genome you would normally recombine with (and this is really a continuous property, especially in cases such as #15). It may be better to have a separate node for each chromosome instead, and to group the nodes together somehow into a haploid genome.
It would be great to support this! I prefer the option of adding chromosome information to the nodes table instead of the edges. Since genetic material transfer between different chromosomes is not a common event to simulate, I think adding a column to the edges just for this purpose isn't efficient. On the other hand, if we are simulating multiple chromosomes, we will need to keep track of which nodes correspond to which chromosome anyway, in order to calculate properties like chromosome length. Having separate nodes for each chromosome is intuitive to me since we effectively do that with tsinfer already (inferring trees for each chromosome arm, then stitching them together).
I'm not sure that adding a single integer column to the edge table will impact efficiency, to be honest. But let's see what Jerome and Ben think. There was discussion about this in tskit, e.g. at tskit-dev/msprime#848 (comment). If we do use nodes, we will need another layer (another table?) between the individuals and nodes tables, which ties the nodes together into a haploid genome. I guess this could be an integer column in the nodes table.
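To make the nodes-based alternative concrete, here is a minimal sketch in plain Python (not the tskit API). It assumes each node row carries two hypothetical integer columns, `genome` and `chromosome`, and groups node IDs into haploid genomes. The class `ChromNode` and the function `group_by_genome` are illustrative names only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ChromNode:
    """Hypothetical node row: one node per chromosome copy."""
    individual: int   # row in the individuals table
    genome: int       # assumed extra column: which haploid genome of the individual
    chromosome: int   # assumed extra column: which chromosome this node represents


def group_by_genome(nodes: List[ChromNode]) -> Dict[Tuple[int, int], List[int]]:
    """Map (individual, genome) to the node IDs that make up that haploid genome."""
    genomes: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for node_id, node in enumerate(nodes):
        genomes[(node.individual, node.genome)].append(node_id)
    return dict(genomes)


# One diploid individual with two chromosomes: four nodes in total.
nodes = [
    ChromNode(individual=0, genome=0, chromosome=0),
    ChromNode(individual=0, genome=0, chromosome=1),
    ChromNode(individual=0, genome=1, chromosome=0),
    ChromNode(individual=0, genome=1, chromosome=1),
]
haploid_genomes = group_by_genome(nodes)
```

In this layout the grouping has to be recomputed or stored somewhere, which is the "extra layer" between individuals and nodes mentioned above.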
Fair point; that discussion is helpful, thanks for the link. One downside of using the nodes table is that we already have the headache of how to store duplicate sample nodes in local tree sequences! The problem there is similar: we have to decide whether to add complexity to the edges table or the nodes table of local tree sequences.
After some thought, I have settled on implementing chromosomes as two extra columns in the edges table, rather than as separate nodes. The main reason is that a "genome" (consisting of multiple chromosomes) is a coherent thing that is mostly passed about as a unit. Separate chromosomes cannot survive and lead an independent life of their own: they mostly have to be part of a whole genome. The downside, however, is that during meiosis the chromosomes can be treated as independent units, which means that e.g. autopolyploidy may be a little tricky to simulate (see #15 (comment)). I think we can get around this with specific MRCA hacks, though, and the alternative is much worse. So I'm closing this.
I suddenly thought, we might well want to have material moving between different chromosomes. The "obvious" way to do that is to have a `parent_chromosome` and `child_chromosome` field in the "edges" table. If these are all `-1`, we can assume we are using the "default" chromosome (whatever that is), and can then convert it to the normal tskit format.
Edit: an alternative option would be for each node to represent a different chromosome. I'm not sure which is better.
Further edit: we have decided on extra edges columns.
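As a rough illustration of the edges-based proposal, the sketch below uses plain Python rather than the tskit API. It adds `parent_chromosome` and `child_chromosome` fields to an edge row, with `-1` meaning the default chromosome, and drops them only when every edge uses the default. The names `ChromEdge`, `uses_default_chromosome`, and `to_plain_edges` are hypothetical, and the plain `(left, right, parent, child)` tuple stands in for a normal tskit edge row.

```python
from dataclasses import dataclass
from typing import List, Tuple

DEFAULT_CHROMOSOME = -1  # sentinel from the issue text: -1 means the "default" chromosome


@dataclass(frozen=True)
class ChromEdge:
    """Hypothetical edge row: the usual edge fields plus two extra columns."""
    left: float
    right: float
    parent: int
    child: int
    parent_chromosome: int = DEFAULT_CHROMOSOME
    child_chromosome: int = DEFAULT_CHROMOSOME


def uses_default_chromosome(edges: List[ChromEdge]) -> bool:
    """True if every edge has both chromosome fields set to -1."""
    return all(
        e.parent_chromosome == DEFAULT_CHROMOSOME
        and e.child_chromosome == DEFAULT_CHROMOSOME
        for e in edges
    )


def to_plain_edges(edges: List[ChromEdge]) -> List[Tuple[float, float, int, int]]:
    """Drop the chromosome columns, but only when they carry no information."""
    if not uses_default_chromosome(edges):
        raise ValueError("edges reference non-default chromosomes; cannot convert")
    return [(e.left, e.right, e.parent, e.child) for e in edges]


# Two edges on the default chromosome convert cleanly.
default_edges = [ChromEdge(0.0, 10.0, 2, 0), ChromEdge(0.0, 10.0, 2, 1)]
plain = to_plain_edges(default_edges)

# An edge moving material from chromosome 1 (parent) to chromosome 0 (child) blocks conversion.
mixed_edges = default_edges + [
    ChromEdge(0.0, 5.0, 3, 2, parent_chromosome=1, child_chromosome=0)
]
```

Refusing to convert when any edge is non-default is one possible design choice; the thread does not say how such edges should be handled.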