Initial gene-ε implementation #975

tomwhite · 2022-12-07T10:47:20Z

This is a draft of a gene-ε implementation (#692).

It passes validation tests showing that it produces the same results as the R implementation on simulated and real data. These datasets are too large to add to the repo though (>50MB), so they should be downloaded or generated in tests.

Also, I'm not sure yet what the best way to organise the tests is - for coverage we'd like to run some of these tests as unit tests. We run validation tests in CI for PC Relate, to check that the analysis can run with the latest versions of the tools, but I'm not sure if we need to do that for gene-ε.

On the question of computing an LD matrix - this change assumes that you have one to feed into the genee function.

Implementation-wise, this uses sgkit's windowing machinery to calculate per-gene stats (in particular #974). The dataframe approach suggested in https://github.com/pystatgen/sgkit/issues/692#issuecomment-961844677 is an alternative we could add later, depending on performance and scalability. I think there would be a way to convert sgkit window variables into group keys that could be used in a dataframe group by.
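As a rough illustration of that last idea, the sketch below turns window boundaries into one group key per variant and passes them to a pandas group-by. It assumes non-overlapping, half-open windows described by `window_start` and `window_stop` arrays; the helper `window_group_keys`, the toy data, and the use of a sum as the per-window statistic are hypothetical and are not taken from this PR.

```python
import numpy as np
import pandas as pd


def window_group_keys(window_start, window_stop, n_variants):
    """Return one window index per variant, or -1 for variants outside every window.

    Assumes windows are half-open [start, stop) and do not overlap.
    """
    keys = np.full(n_variants, -1, dtype=np.int64)
    for i, (start, stop) in enumerate(zip(window_start, window_stop)):
        keys[start:stop] = i
    return keys


# Toy data: three windows (think "genes") over ten variants.
window_start = np.array([0, 3, 5])
window_stop = np.array([3, 5, 9])
stat = np.arange(10, dtype=float)  # a made-up per-variant statistic

keys = window_group_keys(window_start, window_stop, len(stat))
df = pd.DataFrame({"window": keys, "stat": stat})

# Drop variants that fall in no window, then aggregate per window.
per_window = df[df["window"] >= 0].groupby("window")["stat"].sum()
```

Overlapping windows would need a different encoding (for example, one row per variant-window pair), since a single key column can only assign each variant to one group.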

This needs docs too - I'm submitting this as an early draft in case anyone wants to have a look.

jeromekelleher · 2022-12-07T12:58:21Z

Implementation looks great - very concise and straightforward. A couple of things struck me on a quick scan:

What's the implication of the dependency on chiscore? Is it on conda-forge? How much of a headache is having it going to be?
The test files (I saw some > 20MB) seem far too big for unit test data. Can we make smaller versions for unit tests? @timothymillar seems to have come up with some good workflows here for comparing with R packages.

hammer · 2022-12-07T15:34:37Z

Awesome!

@ravwojdyla has also done some work on comparing with R packages.

tomwhite · 2022-12-07T15:39:38Z

> What's the implication of the dependency on chiscore? Is it on conda-forge? How much of a headache is having it going to be?

It could be a headache, because although chiscore is pure Python, it has a dependency on chi2comb, which has native code. It doesn't have Python 3.10 packages, and hasn't been updated on conda-forge for a few years...

hammer · 2022-12-07T15:46:43Z

All roads lead to @horta! Maybe he knows of an alternative to chiscore for this use case?

tomwhite · 2022-12-12T14:52:03Z

> All roads lead to @horta! Maybe he knows of an alternative to chiscore for this use case?

Indeed! I've opened limix/chi2comb-py#8 to request Python 3.10 wheels.

I think it might be a good idea to only require chiscore if you are using genee. Dask does this, for example. This is what it might look like here: pystatgen/sgkit@cf0294e
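A minimal sketch of the optional-dependency pattern described above: the import is deferred until the feature is actually used, and a missing package produces an actionable error. The names `require_optional` and `genee_sketch` are hypothetical and do not reflect the code in the linked commit.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency on first use, or fail with a clear hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'; "
            "install it to use this feature."
        ) from e


def genee_sketch():
    # Hypothetical stand-in for a function that needs chiscore.
    # The import happens here rather than at package import time, so users
    # who never call this function do not need chiscore installed.
    chiscore = require_optional("chiscore", "genee")
    return chiscore  # a real implementation would call into the package here
```

With this layout, importing the top-level package succeeds on any platform, and only callers of the chiscore-backed function see the installation error.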

tomwhite · 2022-12-13T15:44:44Z

In the latest update I've used the simulation code in the genee repo to generate a smaller dataset, suitable for unit tests.

I've removed the validation tests now that there are unit tests. These can be added separately.

I've also added some documentation - it's pretty barebones and I'd welcome any improvements or expansion by anyone who understands the stats better than I do 😄

There's still a difference with the reference implementation regarding a case where the first mixture component with the largest variance is used if it's >50% of SNPs. I'm not sure how to code (or test) this. Perhaps it could be done later.

horta · 2022-12-14T09:39:44Z

Does someone want access to the chiscore repository to push and merge? I'm running behind on a lot of stuff...

tomwhite · 2022-12-14T11:24:27Z

> Does someone want access to the chiscore repository to push and merge? I'm running behind on a lot of stuff...

Thanks, that would be great @horta. I think we need to build 3.10 wheels for both chiscore and chi2comb-py.

horta · 2022-12-15T10:15:00Z

Thanks! I've added @tomwhite to the chi2comb* and chiscore* repos.

codecov-commenter · 2022-12-19T12:28:38Z

Codecov Report

Merging #975 (58bbf8a) into main (0a51f90) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #975   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           41        43    +2     
  Lines         4294      4382   +88     
=========================================
+ Hits          4294      4382   +88

Impacted Files	Coverage Δ
sgkit/__init__.py	`100.00% <100.00%> (ø)`
sgkit/stats/genee.py	`100.00% <100.00%> (ø)`
sgkit/stats/genee_momentchi2py.py	`100.00% <100.00%> (ø)`
sgkit/stats/ld.py	`100.00% <100.00%> (ø)`
sgkit/io/dataset.py	`100.00% <0.00%> (ø)`
sgkit/io/vcf/__init__.py	`100.00% <0.00%> (ø)`
sgkit/io/vcf/vcf_reader.py	`100.00% <0.00%> (ø)`
sgkit/io/vcf/vcf_writer.py	`100.00% <0.00%> (ø)`

tomwhite · 2022-12-19T12:48:47Z

39e3be9 includes fixes to skip running on Python 3.10. Fixing chiscore and chi2comb-py to produce 3.10 wheels will take some time, and we probably don't want to gate this PR on that.

tomwhite · 2022-12-19T16:36:59Z

> All roads lead to @horta! Maybe he knows of an alternative to chiscore for this use case?

I found momentchi2, a pure-Python alternative to chiscore. I've managed to get it to pass the unit and validation tests, see 5fc9529.

I think for simplicity we should just use that.

jeromekelleher · 2023-01-09T10:04:12Z

Doesn't look like momentchi2 is on conda-forge, so it's still not headache free I'm afraid.

How much actual code are we using from these repos I wonder? If it comes down to something fairly simple it may be easier to just reimplement locally...

tomwhite · 2023-01-09T10:44:58Z

There are pure-Python wheels on PyPI, so it should be relatively easy to install from a conda env, no?

> How much actual code are we using from these repos I wonder? If it comes down to something fairly simple it may be easier to just reimplement locally...

The actual methods are quite small, and we could have a copy of the hbe method for example, but we'd have to copy tests and maybe some utility code too.

jeromekelleher · 2023-01-09T14:28:21Z

> There are pure-Python wheels on PyPI, so it should be relatively easy to install from a conda env, no?

Conda-forge prefers you to have only conda-forge dependencies, so if we were to do this the right way we'd have to package this upstream package for conda-forge too.

You could probably work around it, but it does complicate packaging.

benjeffery · 2023-01-10T11:15:23Z

> The actual methods are quite small, and we could have a copy of the hbe method for example, but we'd have to copy tests and maybe some utility code too.

If there are no license issues this is probably less work in the long run, for us and the users.

tomwhite · 2023-01-16T14:19:46Z

Added a copy of hbe to avoid the momentchi2 dependency.
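For context, `hbe` refers to the Hall-Buckley-Eagleson approximation: the distribution of a positively weighted sum of chi-squared(1) variables is approximated by a shifted and scaled chi-squared distribution that matches the first three cumulants. The sketch below only illustrates the idea and is not the code copied into sgkit. The function names are hypothetical, and the incomplete gamma function is computed with a simple power series so that the example has no dependencies (with SciPy available, `scipy.stats.gamma.cdf` would be the usual choice).

```python
import math


def regularized_lower_gamma(a: float, x: float, tol: float = 1e-13) -> float:
    """Regularized lower incomplete gamma P(a, x) via its power series.

    Adequate for moderate x in a sketch; not tuned for very large x.
    """
    if x <= 0.0:
        return 0.0
    term = total = 1.0
    n = 0
    while term > tol * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))


def hbe_cdf(coeff, x):
    """Approximate P(sum_i coeff[i] * Z_i**2 <= x) for independent standard normal Z_i.

    Assumes all coefficients are positive.
    """
    # First three cumulants of the weighted sum.
    k1 = sum(coeff)
    k2 = 2.0 * sum(c**2 for c in coeff)
    k3 = 8.0 * sum(c**3 for c in coeff)
    # Degrees of freedom chosen so that the skewness matches.
    nu = 8.0 * k2**3 / k3**2
    # Map x onto the scale of a chi-squared(nu) variable (matching mean and variance).
    y = math.sqrt(2.0 * nu / k2) * (x - k1) + nu
    # chi-squared(nu) CDF at y equals P(nu / 2, y / 2).
    return regularized_lower_gamma(nu / 2.0, y / 2.0)


# Upper-tail probability for a made-up statistic and made-up weights.
p_value = 1.0 - hbe_cdf([0.5, 0.3, 0.2], 2.5)
```

When all weights are equal to 1, the approximation reduces to the exact chi-squared distribution with as many degrees of freedom as there are weights, which is a convenient sanity check.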

jeromekelleher

LGTM

See comment about copyright notice though

sgkit/stats/genee_momentchi2py.py

tomwhite · 2023-01-24T10:03:19Z

This still needs a changelog entry. Unless there are any objections I'll rebase, squash, and add a changelog entry before merging - hopefully today or tomorrow.

jeromekelleher

LGTM

tomwhite force-pushed the genee-2022 branch from cf0294e to a103b32 Compare December 13, 2022 15:44

tomwhite marked this pull request as ready for review December 13, 2022 15:44

tomwhite mentioned this pull request Dec 13, 2022

Add gene-ε validation tests #977

Open

tomwhite force-pushed the genee-2022 branch from a103b32 to 454db39 Compare December 13, 2022 16:51

tomwhite force-pushed the genee-2022 branch from 0d54dfc to 39e3be9 Compare December 19, 2022 12:31

tomwhite force-pushed the genee-2022 branch from fca4dfd to 5fc9529 Compare December 19, 2022 16:08

tomwhite added this to the 0.6.0 milestone Jan 3, 2023

jeromekelleher approved these changes Jan 17, 2023

This was referenced Jan 18, 2023

Conda forge package deanbodenham/momentchi2py#1

Open

Copyright clarification deanbodenham/momentchi2py#2

Open

jeromekelleher approved these changes Jan 25, 2023

Extract map_windows_as_dataframe helper function

fd6ae5a

Add gene-ε

dd78c0f

tomwhite force-pushed the genee-2022 branch from 0b280fa to dd78c0f Compare January 25, 2023 14:29

tomwhite merged commit fb22e01 into sgkit-dev:main Jan 25, 2023

tomwhite deleted the genee-2022 branch January 25, 2023 14:43

tomwhite mentioned this pull request Jan 29, 2024

genee review: sgkit implementation limitations #1180

Open
