
Shot noise based analysis #64

Open

wants to merge 15 commits into master
Conversation

@HazenBabcock (Contributor) commented May 4, 2020

This is more of a proposal than a pull request. Thoughts and suggestions are appreciated.

If you know your camera's offset and gain then you can convert camera ADU to real units (photoelectrons). This is very useful because, if you assume Poisson statistics and have a reasonable estimator for the image background, you can estimate the significance of every pixel in the image in units of sigma. This automatically accounts for the fact that peaks of a given height in areas of high background are less reliable than peaks of the same height in areas of low background. You can also filter out dim pixels with a metric that has some statistical basis; for example, you could ignore all pixels whose significance is below 6 sigma. This makes decoding much faster, because you then only need to consider the 10-20% of pixel traces that have at least one pixel above the sigma threshold.
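The ADU-to-sigma conversion described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the global median background estimate is a crude stand-in for the real background estimator, and the function name and parameters are assumptions.

```python
import numpy as np

def pixel_significance(image_adu, offset, gain):
    """Per-pixel significance in units of sigma, assuming Poisson statistics.

    Camera ADU are converted to photoelectrons with
    electrons = (ADU - offset) / gain, where `offset` and `gain` are
    per-camera calibration values. The image median serves here as a
    (very crude) global background estimate; a real pipeline would use
    a spatially varying estimator.
    """
    electrons = (image_adu - offset) / gain
    background = np.median(electrons)       # crude global background estimate
    noise = np.sqrt(max(background, 1.0))   # Poisson: variance equals the mean
    return (electrons - background) / noise
```

A pixel trace would then be kept only if at least one bit exceeds the chosen threshold, e.g. `mask = pixel_significance(img, offset, gain) > 6.0`.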

This seems to work reasonably well, at least on simulated data (242 barcodes, 22 bits, 7 values). However, it only goes as far as the Decode step; I wasn't sure how to plug the results into AdaptiveBarcodeFilters, or whether that would even be a good idea. It's also possible that it wouldn't work as well on 1000- or 10k-gene experiments, which use many more bits. Also note that I removed the deconvolution step, so Preprocess now only estimates the pixel significance and Optimize is only used to determine the chromatic correction factors.

A graph showing the performance of the SNB approach versus current MERlin at a range of sigma thresholds (4.0 - 12.0). The SNB numbers are the results from DecodeSNB. The current MERlin numbers are the results from AdaptiveFilterBarcodes. Note that the true positives and false positives are plotted on very different scales.

[Image: true_false_plot]

emanuega and others added 15 commits April 17, 2020 16:12
…lculating pixel significance from foreground and background image estimates.
…ling as this is not necessary if the gain is set correctly.
…tionPreprocess into Preprocess so that they can also be used by EstimatePixelSignificance.
…nsions_in_decode

Remove list comprehensions to improve performance.
…simulated images, a Gaussian with zero offset and sigma 1.0.
@pep8speaks commented May 4, 2020

Hello @HazenBabcock! Thanks for opening this PR.

Line 97:34: E251 unexpected spaces around keyword / parameter equals
Line 99:39: E251 unexpected spaces around keyword / parameter equals
Line 270:43: E251 unexpected spaces around keyword / parameter equals
Line 272:34: E251 unexpected spaces around keyword / parameter equals
Line 274:39: E251 unexpected spaces around keyword / parameter equals

Line 471:57: E251 unexpected spaces around keyword / parameter equals

Line 71:81: E501 line too long (85 > 80 characters)

Comment last updated at 2020-05-04 18:13:49 UTC

@codecov bot commented May 4, 2020

Codecov Report

Merging #64 into master will decrease coverage by 1.40%.
The diff coverage is 44.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
- Coverage   87.32%   85.91%   -1.41%     
==========================================
  Files          57       57              
  Lines        5103     5225     +122     
==========================================
+ Hits         4456     4489      +33     
- Misses        647      736      +89     
Impacted Files Coverage Δ
merlin/analysis/optimize.py 68.72% <25.00%> (-6.96%) ⬇️
merlin/util/imagefilters.py 50.00% <35.71%> (-50.00%) ⬇️
merlin/util/decoding.py 67.25% <46.51%> (-8.31%) ⬇️
merlin/analysis/preprocess.py 65.57% <48.33%> (-19.31%) ⬇️
merlin/analysis/decode.py 79.54% <63.63%> (-3.65%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfd8aba...99d5bbd. Read the comment docs.

@HazenBabcock (Contributor, Author)

A complete example is available here in the 2020-05-05-simulation.zip file.

@emanuega (Owner) commented May 9, 2020

I like the overall approach of coming up with new metrics to distinguish between correctly and incorrectly called barcodes. I have a few comments that I hope can help as you think about the next iteration.

First, it may be difficult to know the correct pixel significance threshold to use a priori for a given dataset. It's not clear that, for example, a 6 sigma threshold will always be better than a 4 sigma threshold, or that a 6 sigma threshold will produce consistent results across a range of datasets.

Additionally, I expect you are able to reject many of the pixels based on significance because the training data is fairly sparse. As the transcript density increases and a larger fraction of pixels contain a transcript, I expect the speed improvement would decrease. Another concern is that without deconvolution the density limit may be lower, since you will have more overlap between the PSFs of nearby transcripts.

Finally, it is likely still useful to account for the different intensities of different imaging rounds. For example, if the 750 and 650 laser powers are set so that the spots in the 650 image are twice as bright as those in the 750 image, the unit vector will be distorted and won't be as close to the correct barcode as it would be if the pixel intensity vector were scaled to account for the difference in intensities. It seems that in the current version this change in laser power would also make the 650 spots more likely to be significant, which might introduce some bias. I expect other statistics could be calculated that take into account estimates of the intensity distributions for '1' vs '0' bits.
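The per-round intensity scaling discussed above can be sketched in a few lines. This is an illustration only: the per-bit median is used here as a stand-in scale factor (MERlin optimizes its scale factors iteratively), and the function name is an assumption.

```python
import numpy as np

def rescale_bits(pixel_traces):
    """Rescale each bit so imaging rounds of different brightness
    contribute equally before unit-vector normalization.

    pixel_traces: (n_pixels, n_bits) array of intensities. The per-bit
    median is an illustrative stand-in for an optimized scale factor.
    """
    scale = np.median(pixel_traces, axis=0)
    scaled = pixel_traces / scale
    # Normalize to unit vectors for comparison against codebook barcodes.
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)
    return scaled / np.maximum(norms, 1e-12)
```

Without the rescaling step, a bit imaged at twice the laser power would pull every pixel vector toward its axis and distort the distance to the correct barcode.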

I am interested in how well the pixel significance, or other statistics, corresponds with how accurately the barcode can be identified in real data. In real data it is possible the incorrectly identified barcodes can arise from nonspecific fluorescence background rather than just camera noise, so it is possible that incorrect barcodes could still correspond with pixels that are deemed significant. However, I think it is worth exploring more how much extra information the pixel significance yields to discriminate correct from incorrect barcodes in real data.

I typically try to evaluate how well a measured property distinguishes incorrect barcodes from correct barcodes by looking at the distribution of blank barcodes relative to that parameter. Blank barcodes do not code for any transcript, so any time they are detected it is known to be an error. For example, in the 2 dimensional histogram shown below (from Xia et al, PNAS, 2019), the blank barcodes are enriched in a region of large vector distance (the distance between the normalized pixel vector and the nearest normalized barcode in the codebook) and lower intensity (the L2 norm of the pixel vector).

[Image: 2D histogram of barcode intensity vs. vector distance, with blank barcodes enriched at large vector distance and low intensity (Xia et al., PNAS, 2019)]

This suggests that we can achieve higher barcode-calling accuracy by keeping the barcodes in the region of short vector distance and high intensity and excluding the barcodes at larger vector distances and lower intensities. By setting a threshold on the fraction of blanks ("blank fraction threshold") in each of the histogram bins, we can tune the trade-off between detection efficiency and misidentification rate. For example, if we only select barcodes that fall within histogram bins containing 0 blanks, we can have high confidence that we are not miscalling many barcodes, but we also likely exclude correctly called barcodes that fall in bins containing 1 or 2 blanks. By varying the blank fraction threshold, you can trace out a curve showing the trade-off between detection efficiency and misidentification rate, as shown below (the RNA-encoding barcode misidentification rate is estimated as (the mean count per blank control barcode per cell) / (the mean count per RNA-encoding barcode per cell)).

[Image: detection efficiency vs. misidentification rate over a range of blank fraction thresholds]

The current filtering is based on a three-dimensional histogram of three parameters: vector distance, intensity, and area (the number of pixels assigned to the barcode). If pixel significance provides additional discriminating power between correctly and incorrectly identified barcodes, I would expect that similar filtering with a four-dimensional histogram that additionally includes pixel significance (mean, min, or max) would shift the curve of detection efficiency vs. misidentification rate upward, so that more barcodes could be detected at a given misidentification rate.
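The blank-fraction filtering described above can be sketched on two features (it extends directly to three or four dimensions with `np.histogramdd`). This is an illustrative sketch, not MERlin's AdaptiveFilterBarcodes implementation; the function name and parameters are assumptions.

```python
import numpy as np

def blank_fraction_filter(features, is_blank, bins=20, max_blank_fraction=0.0):
    """Keep barcodes that fall in histogram bins whose blank fraction
    is at or below a threshold.

    features: (n_barcodes, 2) array, e.g. [vector distance, intensity].
    is_blank: boolean array marking barcodes decoded to blank codes.
    """
    # Histogram all barcodes and blank barcodes on the same bin grid.
    h_all, ex, ey = np.histogram2d(features[:, 0], features[:, 1], bins=bins)
    h_blank, _, _ = np.histogram2d(features[is_blank, 0],
                                   features[is_blank, 1], bins=(ex, ey))
    # Per-bin fraction of blank barcodes (0 where the bin is empty).
    frac = np.divide(h_blank, h_all, out=np.zeros_like(h_all), where=h_all > 0)
    keep_bin = frac <= max_blank_fraction
    # Map each barcode back to its bin and apply the per-bin decision.
    ix = np.clip(np.digitize(features[:, 0], ex) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(features[:, 1], ey) - 1, 0, bins - 1)
    return keep_bin[ix, iy]
```

Sweeping `max_blank_fraction` from 0 upward traces out the detection-efficiency vs. misidentification-rate curve described above.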

@HazenBabcock HazenBabcock mentioned this pull request May 22, 2020
@r3fang commented May 22, 2020

Well, I am estimating barcode accuracy in a slightly different way. Based on the pixel intensity versus distance plot shown in the Xia et al. paper, intuitively a "correct" pixel should have a higher intensity and a lower distance to the codebook. Using a few randomly selected FOVs, I trained a machine learning model (an SVM with a linear or RBF kernel) to distinguish correct versus blank pixels. More importantly, this model is able to assign a probability, or confidence, to each pixel of being a "correct" one. This basically converts two variables (intensity and distance) into a single probability. During the barcode extraction step, I calculated the log likelihood of a barcode as sum(log(1-p)), in which p is the probability for each pixel assigned to the barcode. Finally, I rank barcodes by this likelihood and threshold them using the blank barcodes to a 5% misidentification rate.
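The approach described above might look roughly like the following sketch. The feature choice (intensity, vector distance), the SVM, and the sum(log(1-p)) score follow the comment; everything else (function names, scikit-learn as the SVM implementation, the synthetic training-set construction) is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def train_pixel_classifier(intensity, distance, is_coding):
    """Train an SVM on (intensity, vector distance) features to
    separate coding pixels from blank pixels.

    is_coding: boolean labels; True for pixels decoded to coding barcodes.
    probability=True enables Platt-scaled per-pixel probabilities.
    """
    X = np.column_stack([intensity, distance])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, is_coding)
    return clf

def barcode_log_likelihood(clf, pixel_features):
    """Score a barcode from its assigned pixels as sum(log(1 - p)),
    with p the classifier's probability that each pixel is "correct"
    (the scoring quoted in the comment above)."""
    p_correct = clf.predict_proba(pixel_features)[:, list(clf.classes_).index(True)]
    return np.sum(np.log1p(-p_correct))
```

Barcodes would then be ranked by this score, with the threshold chosen so that the blank barcodes imply roughly a 5% misidentification rate.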

@r3fang commented May 22, 2020

I was hoping to assign a confidence level to every detected barcode instead of doing adaptive thresholding.

@HazenBabcock (Contributor, Author)

@r3fang Since you don't have ground truth, it seems to me that your confidence model is based on the pipeline output?

@r3fang commented May 22, 2020

> @r3fang Since you don't have ground truth, it seems to me that your confidence model is based on the pipeline output?

The model was trained to separate the blank pixels (not barcodes) from coding pixels. It is not the direct output from MERlin; I generated the training set on my own. The model trained on one dataset using one codebook seems to also work well on another dataset.

@r3fang commented May 22, 2020

Of course, the blank pixels do not represent all false pixels.
