-
Notifications
You must be signed in to change notification settings - Fork 35
Initial gene-ε implementation #975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implementation looks great - very concise and straightforward. A couple of things struck me on a quick scan:
|
Awesome! @ravwojdyla also has done some work to compare to R packages |
It could be a headache, because although chiscore is pure-python, it has a dependency on chi2comb, which has native code. It doesn't have Python 3.10 packages, and hasn't been updated on conda-forge for a few years... |
All roads lead to @horta! Maybe he knows of an alternative to |
Indeed! I've opened limix/chi2comb-py#8 to request Python 3.10 wheels. I think it might be a good idea to only require |
In the latest update I've used the simulation code in the genee repo to generate a smaller dataset, suitable for unit tests. I've removed the validation tests now that there are unit tests. These can be added separately. I've also added some documentation - it's pretty barebones and I'd welcome any improvements or expansion by anyone who understands the stats better than I do 😄 There's still a difference with the reference implementation regarding a case where the first mixture component with the largest variance is used if it's >50% of SNPs. I'm not sure how to code (or test) this. Perhaps it could be done later. |
Does someone wants access to chiscore repository to push and merge? I'm running behind in a lot of stuff... |
Thanks, that would be great @horta. I think we need to build 3.10 wheels for both chiscore and chi2comb-py. |
Thanks! I've added @tomwhite to chicomb* and chi2score* repos. |
Codecov Report
@@ Coverage Diff @@
## main #975 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 41 43 +2
Lines 4294 4382 +88
=========================================
+ Hits 4294 4382 +88
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
39e3be9 includes fixes to skip running on Python 3.10. Fixing chiscore and chi2comb-py to produce 3.10 wheels will take some time, and we probably don't want to gate this PR on that. |
I found I think for simplicity we should just use that. |
Doesn't look like momentchi2 is on conda-forge, so it's still not headache free I'm afraid. How much actual code are we using from these repos I wonder? If it comes down to something fairly simple it may be easier to just reimplement locally... |
There are pure python wheels on PyPi, so it should be relatively easy to install from a conda env, no?
The actual methods are quite small, and we could have a copy of the |
Conda-forge prefers you to have only conda-forge dependencies, so if we were to do this the right way we'd have to package this upstream package for conda-forge too. You could probably work around it, but it does complicate packaging. |
If there are no license issues this is probably less work in the long run, for us and the users. |
Added a copy of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
See comment about copyright notice though
This still needs a changelog entry. Unless there are any objections I'll rebase, squash, and add a changelog entry before merging - hopefully today or tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This is a draft of a gene-ε implementation (#692).
It passes validation tests showing that it produces the same results as the R implementation on simulated and real data. These datasets are too large to add to the repo though (>50MB), so they should be downloaded or generated in tests.
Also, I'm not sure yet what the best way to organise the tests is - for coverage we'd like to run some of these tests as unit tests. We run validation tests in CI for PC Relate, to check that the analysis can run with the latest versions of the tools, but I'm not sure if we need to do that for gene-ε.
On the question of computing an LD matrix - this change assumes that you have one to feed into the
genee
function.Implementation-wise, this uses sgkit's windowing machinery to calculate per-gene stats (in particular #974). The dataframe approach suggested in https://github.com/pystatgen/sgkit/issues/692#issuecomment-961844677 is an alternative we could add later, depending on performance and scalability. I think there would be a way to convert sgkit window variables into group keys that could be used in a dataframe group by.
This needs docs too - I'm submitting this as an early draft in case anyone wants to have a look.