Skip to content

Initial gene-ε implementation #975

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 25, 2023
Merged

Conversation

tomwhite
Copy link
Collaborator

@tomwhite tomwhite commented Dec 7, 2022

This is a draft of a gene-ε implementation (#692).

It passes validation tests showing that it produces the same results as the R implementation on simulated and real data. These datasets are too large to add to the repo though (>50MB), so they should be downloaded or generated in tests.

Also, I'm not sure yet what the best way to organise the tests is - for coverage we'd like to run some of these tests as unit tests. We run validation tests in CI for PC Relate, to check that the analysis can run with the latest versions of the tools, but I'm not sure if we need to do that for gene-ε.

On the question of computing an LD matrix - this change assumes that you have one to feed into the genee function.

Implementation-wise, this uses sgkit's windowing machinery to calculate per-gene stats (in particular #974). The dataframe approach suggested in https://github.com/pystatgen/sgkit/issues/692#issuecomment-961844677 is an alternative we could add later, depending on performance and scalability. I think there would be a way to convert sgkit window variables into group keys that could be used in a dataframe group by.

This needs docs too - I'm submitting this as an early draft in case anyone wants to have a look.

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Dec 7, 2022

Implementation looks great - very concise and straightforward. A couple of things struck me on a quick scan:

  • What's the implication of the dependency on chiscore? Is on conda-forge? How much of a headache is having it going to be?
  • The test files (I saw some > 20MB) seem far too big for unit test data. Can we make smaller versions for unit tests? @timothymillar seem to have come up with some good workflows here for comparing with R packages.

@hammer
Copy link
Contributor

hammer commented Dec 7, 2022

Awesome!

@ravwojdyla also has done some work to compare to R packages

@tomwhite
Copy link
Collaborator Author

tomwhite commented Dec 7, 2022

What's the implication of the dependency on chiscore? Is on conda-forge? How much of a headache is having it going to be?

It could be a headache, because although chiscore is pure-python, it has a dependency on chi2comb, which has native code. It doesn't have Python 3.10 packages, and hasn't been updated on conda-forge for a few years...

@hammer
Copy link
Contributor

hammer commented Dec 7, 2022

All roads lead to @horta! Maybe he knows of an alternative to chiscore for this use case?

@tomwhite
Copy link
Collaborator Author

All roads lead to @horta! Maybe he knows of an alternative to chiscore for this use case?

Indeed! I've opened limix/chi2comb-py#8 to request Python 3.10 wheels.

I think it might be a good idea to only require chiscore if you are using genee. Dask does this, for example. This is what it might look like here: pystatgen/sgkit@cf0294e

@tomwhite
Copy link
Collaborator Author

In the latest update I've used the simulation code in the genee repo to generate a smaller dataset, suitable for unit tests.

I've removed the validation tests now that there are unit tests. These can be added separately.

I've also added some documentation - it's pretty barebones and I'd welcome any improvements or expansion by anyone who understands the stats better than I do 😄

There's still a difference with the reference implementation regarding a case where the first mixture component with the largest variance is used if it's >50% of SNPs. I'm not sure how to code (or test) this. Perhaps it could be done later.

@tomwhite tomwhite marked this pull request as ready for review December 13, 2022 15:44
@horta
Copy link
Collaborator

horta commented Dec 14, 2022

Does someone wants access to chiscore repository to push and merge? I'm running behind in a lot of stuff...

@tomwhite
Copy link
Collaborator Author

Does someone wants access to chiscore repository to push and merge? I'm running behind in a lot of stuff...

Thanks, that would be great @horta. I think we need to build 3.10 wheels for both chiscore and chi2comb-py.

@horta
Copy link
Collaborator

horta commented Dec 15, 2022

Thanks! I've added @tomwhite to chicomb* and chi2score* repos.

@codecov-commenter
Copy link

codecov-commenter commented Dec 19, 2022

Codecov Report

Merging #975 (58bbf8a) into main (0a51f90) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #975   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           41        43    +2     
  Lines         4294      4382   +88     
=========================================
+ Hits          4294      4382   +88     
Impacted Files Coverage Δ
sgkit/__init__.py 100.00% <100.00%> (ø)
sgkit/stats/genee.py 100.00% <100.00%> (ø)
sgkit/stats/genee_momentchi2py.py 100.00% <100.00%> (ø)
sgkit/stats/ld.py 100.00% <100.00%> (ø)
sgkit/io/dataset.py 100.00% <0.00%> (ø)
sgkit/io/vcf/__init__.py 100.00% <0.00%> (ø)
sgkit/io/vcf/vcf_reader.py 100.00% <0.00%> (ø)
sgkit/io/vcf/vcf_writer.py 100.00% <0.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@tomwhite
Copy link
Collaborator Author

39e3be9 includes fixes to skip running on Python 3.10. Fixing chiscore and chi2comb-py to produce 3.10 wheels will take some time, and we probably don't want to gate this PR on that.

@tomwhite
Copy link
Collaborator Author

All roads lead to @horta! Maybe he knows of an alternative to chiscore for this use case?

I found momentchi2, a pure-Python alternative to chiscore. I've managed to get it to pass the unit and validation tests, see 5fc9529.

I think for simplicity we should just use that.

@tomwhite tomwhite added this to the 0.6.0 milestone Jan 3, 2023
@jeromekelleher
Copy link
Collaborator

Doesn't look like momentchi2 is on conda-forge, so it's still not headache free I'm afraid.

How much actual code are we using from these repos I wonder? If it comes down to something fairly simple it may be easier to just reimplement locally...

@tomwhite
Copy link
Collaborator Author

tomwhite commented Jan 9, 2023

There are pure python wheels on PyPi, so it should be relatively easy to install from a conda env, no?

How much actual code are we using from these repos I wonder? If it comes down to something fairly simple it may be easier to just reimplement locally...

The actual methods are quite small, and we could have a copy of the hbe method for example, but we'd have to copy tests and maybe some utility code too.

@jeromekelleher
Copy link
Collaborator

There are pure python wheels on PyPi, so it should be relatively easy to install from a conda env, no?

Conda-forge prefers you to have only conda-forge dependencies, so if we were to do this the right way we'd have to package this upstream package for conda-forge too.

You could probably work around it, but it does complicate packaging.

@benjeffery
Copy link
Collaborator

The actual methods are quite small, and we could have a copy of the hbe method for example, but we'd have to copy tests and maybe some utility code too.

If there are no license issues this is probably less work in the long run, for us and the users.

@tomwhite
Copy link
Collaborator Author

Added a copy of hbe to avoid the momentchi2 dependency.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

See comment about copyright notice though

@tomwhite
Copy link
Collaborator Author

This still needs a changelog entry. Unless there are any objections I'll rebase, squash, and add a changelog entry before merging - hopefully today or tomorrow.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tomwhite tomwhite merged commit fb22e01 into sgkit-dev:main Jan 25, 2023
@tomwhite tomwhite deleted the genee-2022 branch January 25, 2023 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants