
DXSM data simulation #127


Status: Open · wants to merge 11 commits into main

Conversation

@willdumm (Contributor) commented Mar 28, 2025

This PR adds a couple of functions to framework that allow simulation using DASM and DNSM models, and it refactors the code that adjusts codon probabilities by selection factors into separate functions that are reused by both the simulation code and the loss computation code.

One outstanding TODO that came up was how to handle selection factor predictions for the parent AA. When we evaluate a crepe on a sequence, we set wild type AA selection factors to NaN, reflecting the intuition that they're not used. However, in training we do use the wild type selection factor to modify the neutral codon probabilities. Simulation requires making some choice about what selection factor we should use to modify synonymous codon probabilities. Currently, I do the same thing we do in training, just using the prediction from the model.
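
For concreteness, here is a minimal sketch of that choice, assuming selection factors arrive as a per-site, per-amino-acid tensor and codons are indexed 0-63; the helper name and shapes are hypothetical, not netam's actual API:

```python
import torch

# Hypothetical sketch of the choice described above: multiply the neutral codon
# probabilities at one site by the model's selection factor for the amino acid
# each codon encodes, keeping the model's own prediction for the parent
# (wild-type) amino acid rather than NaN, then renormalize.  Names and shapes
# are assumptions for illustration, not netam's actual API.
def adjust_site_codon_probs(neutral_codon_probs, aa_selection_factors, codon_to_aa):
    """neutral_codon_probs: (64,) neutral probabilities for one codon site.
    aa_selection_factors: (20,) selection factors for this site, including the
        entry for the wild-type amino acid.
    codon_to_aa: (64,) long tensor mapping codon index to amino-acid index
        (stop codons would be handled separately, e.g. zapped to zero).
    """
    adjusted = neutral_codon_probs * aa_selection_factors[codon_to_aa]
    return adjusted / adjusted.sum()
```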

@willdumm requested a review from Copilot, March 28, 2025 22:59
@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR introduces new simulation tests and updates several modules to improve the calculation of neutral codon probabilities and model selection factors for the DXSM data simulation. Key changes include:

  • Adding a new test suite in tests/test_simulation.py for simulation functionality.
  • Implementing a new function neutral_codon_probs_of_seq in netam/molevol.py and using it in dasm and framework modules.
  • Updating the selection_factors_of_aa_str method in netam/models.py to include a zap_diagonal parameter.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:
  • tests/test_simulation.py: New tests for validating simulation predictions and output consistency.
  • netam/molevol.py: Added neutral_codon_probs_of_seq, with error reporting via print statements and clamping logic.
  • netam/models.py: Extended selection_factors_of_aa_str with a zap_diagonal parameter to control output behavior.
  • netam/framework.py: Updated to leverage the new molevol routines for improved codon probability adjustments.
  • netam/dasm.py: Replaced duplicated logic with calls to the neutrals function; added padding to neutral probabilities.
  • netam/codon_table.py: Introduced STOP_CODON_INDICATOR and STOP_CODON_ZAPPER to facilitate codon masking operations (see the sketch after this list).
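
For concreteness, a minimal sketch of what such stop-codon constants could look like, assuming codons are indexed 0-63 in lexicographic ACGT order; this is an illustration, not necessarily how netam/codon_table.py defines them:

```python
import itertools
import torch

# Hypothetical sketch: codons indexed 0..63 in lexicographic order over "ACGT".
CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
STOP_CODONS = {"TAA", "TAG", "TGA"}

# 1.0 at stop-codon indices, 0.0 elsewhere.
STOP_CODON_INDICATOR = torch.tensor(
    [1.0 if codon in STOP_CODONS else 0.0 for codon in CODONS]
)

# Added to per-codon logits, this sends stop codons to -inf so they receive
# zero probability after normalization.
STOP_CODON_ZAPPER = torch.zeros(64)
STOP_CODON_ZAPPER[STOP_CODON_INDICATOR.bool()] = float("-inf")
```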
Comments suppressed due to low confidence (3)

netam/molevol.py:434

  • [nitpick] Consider using a logging framework instead of print statements for error diagnostics to improve maintainability and integration with production logging systems.
print("Found a non-finite neutral_codon_prob")

netam/models.py:626

  • [nitpick] Update the function docstring for selection_factors_of_aa_str to document the purpose and effects of the new 'zap_diagonal' parameter.
def selection_factors_of_aa_str(self, aa_sequence: Tuple[str, str], zap_diagonal=True) -> Tensor:

netam/dasm.py:80

  • [nitpick] Verify that padding neutral codon probabilities with the fixed value 1e-8 is appropriate for all cases and does not interfere with downstream normalization or probability calculations.
F.pad(neutral_codon_probs, (0, 0, 0, pad_len), value=1e-8)

@willdumm marked this pull request as ready for review, April 2, 2025 22:14
@willdumm requested review from Copilot and matsen, April 2, 2025 22:14
@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR enhances the simulation functionality by adding new functions and tests for DASM and DNSM models, while refactoring codon probability adjustments for better reuse.

  • Added comprehensive tests in tests/test_simulation.py for neutral probability, selection factor evaluation, and sequence sampling.
  • Introduced new functions in netam/molevol.py and netam/framework.py to handle neutral codon probabilities and codon sampling, and updated netam/models.py to adjust the selection factor computation.
  • Updated dasm.py and codon_table.py to integrate the new changes and constants for stop codon processing.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:
  • tests/test_simulation.py: New tests covering neutral probabilities, selection factors, and sequence sampling.
  • netam/molevol.py: Added neutral_codon_probs_of_seq and adjust_codon_probs_by_aa_selection_factors with extra error handling.
  • netam/models.py: Updated selection_factors_of_aa_str to support a new zap_diagonal parameter.
  • netam/framework.py: Introduced codon_probs_of_parent_seq and sample_sequence_from_codon_probs to support simulation.
  • netam/dasm.py: Refactored update_neutral_probs to use the new neutrals function from molevol.py.
  • netam/codon_table.py: Added STOP_CODON_INDICATOR and STOP_CODON_ZAPPER constants to centralize stop codon handling.
Comments suppressed due to low confidence (2)

netam/molevol.py:423

  • Double-check that slicing the repeated mask with len(nt_parent) produces the correct nucleotide-level mask, since each amino acid normally expands to 3 nucleotides and this may lead to misalignment if not handled carefully (see the sketch after these comments).
nt_mask = mask.repeat_interleave(3)[: len(nt_parent)]

netam/models.py:648

  • [nitpick] Update the docstring for selection_factors_of_aa_str to document the new zap_diagonal parameter and clarify its purpose to avoid confusion.
if zap_diagonal and self.hyperparameters["output_dim"] >= 20:
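
The mask-expansion concern above can be sanity-checked directly; a small sketch with made-up lengths:

```python
import torch

# Each amino-acid position covers 3 nucleotides, so a per-AA mask is expanded
# by repeating every entry 3 times; the trailing slice guards against a parent
# nucleotide sequence that is shorter than 3 * (number of AA mask positions),
# e.g. because of padding at the amino-acid level.  Lengths here are made up.
aa_mask = torch.tensor([True, True, False, True])  # 4 AA positions
nt_parent = "ATGGCTTAA"                            # 9 nucleotides (3 codons)

nt_mask = aa_mask.repeat_interleave(3)[: len(nt_parent)]
assert nt_mask.tolist() == [True, True, True, True, True, True, False, False, False]
```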

@matsen (Contributor) commented Apr 2, 2025

Re: the TODO, was I wrong about #94?

@willdumm (Contributor, Author) commented Apr 2, 2025

You were not wrong at the time with DDSM, but I don't think that's how things are done now. See e.g. here:

log_preds = (

If I'm reading it correctly, after we squish things with the selection factors, we fix the probability of the parent codon so that the codon probability distribution sums to 1. That leaves all the other synonymous target codons with probabilities that were modified by the selection factor of the parent amino acid.
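
Read that way, the renormalization step amounts to something like the following sketch (names and shapes are my assumptions, not the code linked above):

```python
import torch

# Hypothetical sketch of fixing the parent codon's probability so that the
# per-site codon distribution sums to 1 after selection factors are applied.
# `adjusted` holds selection-adjusted probabilities for all 64 codons at one
# site; `parent_codon_idx` is the index of the parent codon at that site.
def renormalize_via_parent_codon(adjusted, parent_codon_idx):
    adjusted = adjusted.clone()
    others = adjusted.sum() - adjusted[parent_codon_idx]
    # Give the parent codon whatever probability mass is left over, clamped so
    # the result stays a valid probability even if the other codons sum past 1.
    adjusted[parent_codon_idx] = torch.clamp(1.0 - others, min=0.0)
    return adjusted
```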

Incidentally that stop codon zapper should take care of stop codons, but despite that I did once see stop codons simulated in one of my tests...

@willdumm (Contributor, Author) commented Apr 3, 2025

To summarize our discussion earlier, we noticed that the DASM paper specifies that no selection factor should be applied to the probabilities of codons that code for the parent amino acid. So, we will try setting the wild-type selection factor to 1 in the model forward function.
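
A minimal sketch of that change, assuming the model produces per-site, per-amino-acid selection factors and that the parent amino-acid index at each site is available; this is not the actual forward code:

```python
import torch

# Hypothetical sketch: in the forward pass, overwrite the selection factor for
# the parent (wild-type) amino acid at each site with 1, so that no selection
# is applied to codons coding for the parent amino acid.
def set_wt_selection_to_one(selection_factors, wt_aa_indices):
    """selection_factors: (L, 20) per-site, per-amino-acid selection factors.
    wt_aa_indices: (L,) long tensor of parent amino-acid indices."""
    selection_factors = selection_factors.clone()
    site_idx = torch.arange(selection_factors.shape[0])
    selection_factors[site_idx, wt_aa_indices] = 1.0
    return selection_factors
```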

@matsen (Contributor) left a comment

Wow, this is awesome. Let's talk it over tomorrow.

Quoted diff context:
A tensor of shape (B, L, out_features) representing the log level
of selection for each possible amino acid at each site.
"""
result = self.predict(self.represent(amino_acid_indices, mask))
Review comment (Contributor): This seems important and I don't totally get it. Let's talk it over.

Review comment (Contributor): Oh, I think I get it, but we can still chat.
