
DXSM data simulation #127


Status: Open · wants to merge 11 commits into main

Conversation

@willdumm (Contributor) commented Mar 28, 2025

This PR adds a couple of functions to framework that allow simulation using DASM and DNSM models, and it refactors the code that adjusts codon probabilities by selection factors into separate functions that are reused by both the simulation code and the loss computation code.

One outstanding TODO that came up was how to handle selection factor predictions for the parent AA. When we evaluate a crepe on a sequence, we set wild type AA selection factors to NaN, reflecting the intuition that they're not used. However, in training we do use the wild type selection factor to modify the neutral codon probabilities. Simulation requires making some choice about what selection factor we should use to modify synonymous codon probabilities. Currently, I do the same thing we do in training, just using the prediction from the model.
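
For concreteness, here is a minimal sketch of that choice, assuming selection factors arrive as a per-site, per-amino-acid tensor and codons are indexed 0-63; the helper name and shapes are hypothetical, not netam's actual API:

```python
import torch

# Hypothetical sketch of the choice described above: multiply the neutral codon
# probabilities at one site by the model's selection factor for the amino acid
# each codon encodes, keeping the model's own prediction for the parent
# (wild-type) amino acid rather than NaN, then renormalize.  Names and shapes
# are assumptions for illustration, not netam's actual API.
def adjust_site_codon_probs(neutral_codon_probs, aa_selection_factors, codon_to_aa):
    """neutral_codon_probs: (64,) neutral probabilities for one codon site.
    aa_selection_factors: (20,) selection factors for this site, including the
        entry for the wild-type amino acid.
    codon_to_aa: (64,) long tensor mapping codon index to amino-acid index
        (stop codons would be handled separately, e.g. zapped to zero).
    """
    adjusted = neutral_codon_probs * aa_selection_factors[codon_to_aa]
    return adjusted / adjusted.sum()
```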

@willdumm requested a review from Copilot, March 28, 2025 22:59
@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR introduces new simulation tests and updates several modules to improve the calculation of neutral codon probabilities and model selection factors for the DXSM data simulation. Key changes include:

  • Adding a new test suite in tests/test_simulation.py for simulation functionality.
  • Implementing a new function neutral_codon_probs_of_seq in netam/molevol.py and using it in dasm and framework modules.
  • Updating the selection_factors_of_aa_str method in netam/models.py to include a zap_diagonal parameter.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:
  • tests/test_simulation.py: New tests for validating simulation predictions and output consistency.
  • netam/molevol.py: Added neutral_codon_probs_of_seq, with error reporting via print statements and clamping logic.
  • netam/models.py: Extended selection_factors_of_aa_str with a zap_diagonal parameter to control output behavior.
  • netam/framework.py: Updated to leverage the new molevol routines for improved codon probability adjustments.
  • netam/dasm.py: Replaced duplicated logic with calls to the neutrals function; added padding to neutral probabilities.
  • netam/codon_table.py: Introduced STOP_CODON_INDICATOR and STOP_CODON_ZAPPER to facilitate codon masking operations (see the sketch after this list).
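
For concreteness, a minimal sketch of what such stop-codon constants could look like, assuming codons are indexed 0-63 in lexicographic ACGT order; this is an illustration, not necessarily how netam/codon_table.py defines them:

```python
import itertools
import torch

# Hypothetical sketch: codons indexed 0..63 in lexicographic order over "ACGT".
CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
STOP_CODONS = {"TAA", "TAG", "TGA"}

# 1.0 at stop-codon indices, 0.0 elsewhere.
STOP_CODON_INDICATOR = torch.tensor(
    [1.0 if codon in STOP_CODONS else 0.0 for codon in CODONS]
)

# Added to per-codon logits, this sends stop codons to -inf so they receive
# zero probability after normalization.
STOP_CODON_ZAPPER = torch.zeros(64)
STOP_CODON_ZAPPER[STOP_CODON_INDICATOR.bool()] = float("-inf")
```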
Comments suppressed due to low confidence (3)

netam/molevol.py:434

  • [nitpick] Consider using a logging framework instead of print statements for error diagnostics to improve maintainability and integration with production logging systems.
print("Found a non-finite neutral_codon_prob")

netam/models.py:626

  • [nitpick] Update the function docstring for selection_factors_of_aa_str to document the purpose and effects of the new 'zap_diagonal' parameter.
def selection_factors_of_aa_str(self, aa_sequence: Tuple[str, str], zap_diagonal=True) -> Tensor:

netam/dasm.py:80

  • [nitpick] Verify that padding neutral codon probabilities with the fixed value 1e-8 is appropriate for all cases and does not interfere with downstream normalization or probability calculations.
F.pad(neutral_codon_probs, (0, 0, 0, pad_len), value=1e-8)

@willdumm marked this pull request as ready for review, April 2, 2025 22:14
@willdumm requested review from Copilot and matsen, April 2, 2025 22:14
@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR enhances the simulation functionality by adding new functions and tests for DASM and DNSM models, while refactoring codon probability adjustments for better reuse.

  • Added comprehensive tests in tests/test_simulation.py for neutral probability, selection factor evaluation, and sequence sampling.
  • Introduced new functions in netam/molevol.py and netam/framework.py to handle neutral codon probabilities and codon sampling, and updated netam/models.py to adjust the selection factor computation.
  • Updated dasm.py and codon_table.py to integrate the new changes and constants for stop codon processing.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:
  • tests/test_simulation.py: New tests covering neutral probabilities, selection factors, and sequence sampling.
  • netam/molevol.py: Added neutral_codon_probs_of_seq and adjust_codon_probs_by_aa_selection_factors with extra error handling.
  • netam/models.py: Updated selection_factors_of_aa_str to support a new zap_diagonal parameter.
  • netam/framework.py: Introduced codon_probs_of_parent_seq and sample_sequence_from_codon_probs to support simulation.
  • netam/dasm.py: Refactored update_neutral_probs to use the new neutrals function from molevol.py.
  • netam/codon_table.py: Added STOP_CODON_INDICATOR and STOP_CODON_ZAPPER constants to centralize stop codon handling.
Comments suppressed due to low confidence (2)

netam/molevol.py:423

  • Double-check that slicing the repeated mask with len(nt_parent) produces the correct nucleotide-level mask, since each amino acid normally expands to 3 nucleotides and this may lead to misalignment if not handled carefully (see the sketch after these comments).
nt_mask = mask.repeat_interleave(3)[: len(nt_parent)]

netam/models.py:648

  • [nitpick] Update the docstring for selection_factors_of_aa_str to document the new zap_diagonal parameter and clarify its purpose to avoid confusion.
if zap_diagonal and self.hyperparameters["output_dim"] >= 20:
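
The mask-expansion concern above can be sanity-checked directly; a small sketch with made-up lengths:

```python
import torch

# Each amino-acid position covers 3 nucleotides, so a per-AA mask is expanded
# by repeating every entry 3 times; the trailing slice guards against a parent
# nucleotide sequence that is shorter than 3 * (number of AA mask positions),
# e.g. because of padding at the amino-acid level.  Lengths here are made up.
aa_mask = torch.tensor([True, True, False, True])  # 4 AA positions
nt_parent = "ATGGCTTAA"                            # 9 nucleotides (3 codons)

nt_mask = aa_mask.repeat_interleave(3)[: len(nt_parent)]
assert nt_mask.tolist() == [True, True, True, True, True, True, False, False, False]
```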

@matsen (Contributor) commented Apr 2, 2025

Re: the TODO, was I wrong about #94?

@willdumm (Contributor, Author) commented Apr 2, 2025

You were not wrong at the time with DDSM, but I don't think that's how things are done now. See e.g. here:

log_preds = (

If I'm reading it correctly, after we squish things with the selection factors, we fix the probability of the parent codon so that the codon probability distribution sums to 1. That leaves all the other synonymous target codons with probabilities that were modified by the selection factor of the parent amino acid.
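
Read that way, the renormalization step amounts to something like the following sketch (names and shapes are my assumptions, not the code linked above):

```python
import torch

# Hypothetical sketch of fixing the parent codon's probability so that the
# per-site codon distribution sums to 1 after selection factors are applied.
# `adjusted` holds selection-adjusted probabilities for all 64 codons at one
# site; `parent_codon_idx` is the index of the parent codon at that site.
def renormalize_via_parent_codon(adjusted, parent_codon_idx):
    adjusted = adjusted.clone()
    others = adjusted.sum() - adjusted[parent_codon_idx]
    # Give the parent codon whatever probability mass is left over, clamped so
    # the result stays a valid probability even if the other codons sum past 1.
    adjusted[parent_codon_idx] = torch.clamp(1.0 - others, min=0.0)
    return adjusted
```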

Incidentally that stop codon zapper should take care of stop codons, but despite that I did once see stop codons simulated in one of my tests...

@willdumm (Contributor, Author) commented Apr 3, 2025

To summarize our discussion earlier, we noticed that the DASM paper specifies that no selection factor should be applied to the probabilities of codons that code for the parent amino acid. So, we will try setting the wild-type selection factor to 1 in the model forward function.
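
A minimal sketch of that change, assuming the model produces per-site, per-amino-acid selection factors and that the parent amino-acid index at each site is available; this is not the actual forward code:

```python
import torch

# Hypothetical sketch: in the forward pass, overwrite the selection factor for
# the parent (wild-type) amino acid at each site with 1, so that no selection
# is applied to codons coding for the parent amino acid.
def set_wt_selection_to_one(selection_factors, wt_aa_indices):
    """selection_factors: (L, 20) per-site, per-amino-acid selection factors.
    wt_aa_indices: (L,) long tensor of parent amino-acid indices."""
    selection_factors = selection_factors.clone()
    site_idx = torch.arange(selection_factors.shape[0])
    selection_factors[site_idx, wt_aa_indices] = 1.0
    return selection_factors
```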

@matsen (Contributor) left a comment

Wow, this is awesome. Let's talk it over tomorrow.

Quoted diff context:
A tensor of shape (B, L, out_features) representing the log level
of selection for each possible amino acid at each site.
"""
result = self.predict(self.represent(amino_acid_indices, mask))
Review comment (Contributor): This seems important and I don't totally get it. Let's talk it over.

Review comment (Contributor): Oh, I think I get it, but we can still chat.
