
Shot noise based analysis #64

Open

wants to merge 15 commits into master
Conversation

@HazenBabcock (Contributor) commented May 4, 2020

This is more of a proposal than a pull request. Thoughts and suggestions are appreciated.

If you know your camera's offset and gain then you can convert camera ADU to real units (photoelectrons). This is very useful because, if you assume Poisson statistics and have a reasonable estimator for the image background, you can estimate the significance of every pixel in the image in units of sigma. This automatically accounts for the fact that peaks of a given height in areas of high background are less reliable than peaks of the same height in areas of low background. You can also filter out dim pixels with a metric that has some statistical basis; for example, you could ignore all pixels whose significance is below 6 sigma. This makes decoding much faster, because you then only need to consider the 10-20% of pixel traces that have at least one pixel above the sigma threshold.
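The ADU-to-sigma conversion described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the global median background estimate is a crude stand-in for the real background estimator, and the function name and parameters are assumptions.

```python
import numpy as np

def pixel_significance(image_adu, offset, gain):
    """Per-pixel significance in units of sigma, assuming Poisson statistics.

    Camera ADU are converted to photoelectrons with
    electrons = (ADU - offset) / gain, where `offset` and `gain` are
    per-camera calibration values. The image median serves here as a
    (very crude) global background estimate; a real pipeline would use
    a spatially varying estimator.
    """
    electrons = (image_adu - offset) / gain
    background = np.median(electrons)       # crude global background estimate
    noise = np.sqrt(max(background, 1.0))   # Poisson: variance equals the mean
    return (electrons - background) / noise
```

A pixel trace would then be kept only if at least one bit exceeds the chosen threshold, e.g. `mask = pixel_significance(img, offset, gain) > 6.0`.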

This seems to work reasonably well, at least on simulated data (242 barcodes, 22 bits, 7 values). However, it only goes as far as the Decode step; I wasn't sure how to plug the results into AdaptiveBarcodeFilters, or whether that would even be a good idea. It's also possible that it wouldn't work as well on 1000- or 10k-gene experiments, which use many more bits. Also note that I removed the deconvolution step, so Preprocess now only estimates the pixel significance and Optimize is only used to determine the chromatic correction factors.

A graph showing the performance of the SNB approach versus current MERlin at a range of sigma thresholds (4.0 - 12.0). The SNB numbers are the results from DecodeSNB. The current MERlin numbers are the results from AdaptiveFilterBarcodes. Note that the true positives and false positives are plotted on very different scales.

[Image: true_false_plot]

emanuega and others added 15 commits April 17, 2020 16:12
…lculating pixel significance from foreground and background image estimates.
…ling as this is not necessary if the gain is set correctly.
…tionPreprocess into Preprocess so that they can also be used by EstimatePixelSignificance.
…nsions_in_decode

Remove list comprehensions to improve performance.
…simulated images, a Gaussian with zero offset and sigma 1.0.
@pep8speaks commented May 4, 2020

Hello @HazenBabcock! Thanks for opening this PR.

Line 97:34: E251 unexpected spaces around keyword / parameter equals
Line 99:39: E251 unexpected spaces around keyword / parameter equals
Line 270:43: E251 unexpected spaces around keyword / parameter equals
Line 272:34: E251 unexpected spaces around keyword / parameter equals
Line 274:39: E251 unexpected spaces around keyword / parameter equals

Line 471:57: E251 unexpected spaces around keyword / parameter equals

Line 71:81: E501 line too long (85 > 80 characters)

Comment last updated at 2020-05-04 18:13:49 UTC

@codecov bot commented May 4, 2020

Codecov Report

Merging #64 into master will decrease coverage by 1.40%.
The diff coverage is 44.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
- Coverage   87.32%   85.91%   -1.41%     
==========================================
  Files          57       57              
  Lines        5103     5225     +122     
==========================================
+ Hits         4456     4489      +33     
- Misses        647      736      +89     
Impacted Files Coverage Δ
merlin/analysis/optimize.py 68.72% <25.00%> (-6.96%) ⬇️
merlin/util/imagefilters.py 50.00% <35.71%> (-50.00%) ⬇️
merlin/util/decoding.py 67.25% <46.51%> (-8.31%) ⬇️
merlin/analysis/preprocess.py 65.57% <48.33%> (-19.31%) ⬇️
merlin/analysis/decode.py 79.54% <63.63%> (-3.65%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfd8aba...99d5bbd. Read the comment docs.

@HazenBabcock (Contributor, Author)

A complete example is available here in the 2020-05-05-simulation.zip file.

@emanuega (Owner) commented May 9, 2020

I like the overall approach of coming up with new metrics to distinguish between correctly and incorrectly called barcodes. I have a few comments that I hope can help as you think about the next iteration.

First, it may be difficult to know the correct pixel significance threshold to use a priori for a given dataset. It's not clear that, for example, a 6 sigma threshold will always be better than a 4 sigma threshold, or that a 6 sigma threshold will produce consistent results across a range of datasets.

Additionally, I expect you are able to reject many of the pixels based on significance because the training data is fairly sparse. As the transcript density increases and a larger fraction of pixels contain a transcript, I expect the speed improvement would decrease. Another concern is that without deconvolution the density limit may be lower, since you will have more overlap between the PSFs of nearby transcripts.

Finally, it is likely still useful to account for the different intensities of different imaging rounds. For example, if the 750 and 650 laser powers are set so that the spots in the 650 image are twice as bright as those in the 750 image, the unit vector will be distorted and won't be as close to the correct barcode as it would be if the pixel intensity vector were scaled to account for the difference in intensities. It seems that in the current version this change in laser power would also make the 650 spots more likely to be significant, which might introduce some bias. I expect other statistics could be calculated that take into account estimates of the intensity distributions for '1' vs '0' bits.
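The per-round intensity scaling discussed above can be sketched in a few lines. This is an illustration only: the per-bit median is used here as a stand-in scale factor (MERlin optimizes its scale factors iteratively), and the function name is an assumption.

```python
import numpy as np

def rescale_bits(pixel_traces):
    """Rescale each bit so imaging rounds of different brightness
    contribute equally before unit-vector normalization.

    pixel_traces: (n_pixels, n_bits) array of intensities. The per-bit
    median is an illustrative stand-in for an optimized scale factor.
    """
    scale = np.median(pixel_traces, axis=0)
    scaled = pixel_traces / scale
    # Normalize to unit vectors for comparison against codebook barcodes.
    norms = np.linalg.norm(scaled, axis=1, keepdims=True)
    return scaled / np.maximum(norms, 1e-12)
```

Without the rescaling step, a bit imaged at twice the laser power would pull every pixel vector toward its axis and distort the distance to the correct barcode.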

I am interested in how well the pixel significance, or other statistics, corresponds with how accurately the barcode can be identified in real data. In real data it is possible the incorrectly identified barcodes can arise from nonspecific fluorescence background rather than just camera noise, so it is possible that incorrect barcodes could still correspond with pixels that are deemed significant. However, I think it is worth exploring more how much extra information the pixel significance yields to discriminate correct from incorrect barcodes in real data.

I typically try to evaluate how well a measured property distinguishes incorrect barcodes from correct barcodes by looking at the distribution of blank barcodes relative to that parameter. Blank barcodes do not code for any transcript, so any time they are detected it is known to be an error. For example, in the 2 dimensional histogram shown below (from Xia et al, PNAS, 2019), the blank barcodes are enriched in a region of large vector distance (the distance between the normalized pixel vector and the nearest normalized barcode in the codebook) and lower intensity (the L2 norm of the pixel vector).

[Image: 2D histogram of barcode intensity vs. vector distance, with blank barcodes enriched at large vector distance and low intensity (Xia et al., PNAS, 2019)]

This suggests that we can achieve higher barcode-calling accuracy by keeping the barcodes in the region of short vector distance and high intensity and excluding the barcodes at larger vector distances and lower intensities. By setting a threshold on the fraction of blanks ("blank fraction threshold") in each of the histogram bins, we can tune the trade-off between detection efficiency and misidentification rate. For example, if we only select barcodes that fall within histogram bins containing 0 blanks, we can have high confidence that we are not miscalling many barcodes, but we also likely exclude correctly called barcodes that fall in bins containing 1 or 2 blanks. By varying the blank fraction threshold, you can trace out a curve showing the trade-off between detection efficiency and misidentification rate, as shown below (the RNA-encoding barcode misidentification rate is estimated as (the mean count per blank control barcode per cell) / (the mean count per RNA-encoding barcode per cell)).

[Image: detection efficiency vs. misidentification rate over a range of blank fraction thresholds]

The current filtering is based on a three-dimensional histogram of three parameters: vector distance, intensity, and area (the number of pixels assigned to the barcode). If pixel significance provides additional discriminating power between correctly and incorrectly identified barcodes, I would expect that similar filtering with a four-dimensional histogram that additionally includes pixel significance (mean, min, or max) would shift the curve of detection efficiency vs. misidentification rate upward, so that more barcodes could be detected at a given misidentification rate.
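The blank-fraction filtering described above can be sketched on two features (it extends directly to three or four dimensions with `np.histogramdd`). This is an illustrative sketch, not MERlin's AdaptiveFilterBarcodes implementation; the function name and parameters are assumptions.

```python
import numpy as np

def blank_fraction_filter(features, is_blank, bins=20, max_blank_fraction=0.0):
    """Keep barcodes that fall in histogram bins whose blank fraction
    is at or below a threshold.

    features: (n_barcodes, 2) array, e.g. [vector distance, intensity].
    is_blank: boolean array marking barcodes decoded to blank codes.
    """
    # Histogram all barcodes and blank barcodes on the same bin grid.
    h_all, ex, ey = np.histogram2d(features[:, 0], features[:, 1], bins=bins)
    h_blank, _, _ = np.histogram2d(features[is_blank, 0],
                                   features[is_blank, 1], bins=(ex, ey))
    # Per-bin fraction of blank barcodes (0 where the bin is empty).
    frac = np.divide(h_blank, h_all, out=np.zeros_like(h_all), where=h_all > 0)
    keep_bin = frac <= max_blank_fraction
    # Map each barcode back to its bin and apply the per-bin decision.
    ix = np.clip(np.digitize(features[:, 0], ex) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(features[:, 1], ey) - 1, 0, bins - 1)
    return keep_bin[ix, iy]
```

Sweeping `max_blank_fraction` from 0 upward traces out the detection-efficiency vs. misidentification-rate curve described above.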

@HazenBabcock HazenBabcock mentioned this pull request May 22, 2020
@r3fang commented May 22, 2020

Well, I am estimating barcode accuracy in a slightly different way. Based on the pixel intensity versus distance plot shown in the Xia et al. paper, intuitively a "correct" pixel should have a higher intensity and a lower distance to the codebook. Using a few randomly selected FOVs, I trained a machine learning model (an SVM with a linear or RBF kernel) to distinguish correct versus blank pixels. More importantly, this model is able to assign a probability, or confidence, to each pixel of being a "correct" one. This basically converts two variables (intensity and distance) into a single probability. During the barcode extraction step, I calculated the log likelihood of a barcode as sum(log(1-p)), in which p is the probability for each pixel assigned to the barcode. Finally, I rank barcodes by this likelihood and threshold them using the blank barcodes to a 5% misidentification rate.
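The approach described above might look roughly like the following sketch. The feature choice (intensity, vector distance), the SVM, and the sum(log(1-p)) score follow the comment; everything else (function names, scikit-learn as the SVM implementation, the synthetic training-set construction) is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def train_pixel_classifier(intensity, distance, is_coding):
    """Train an SVM on (intensity, vector distance) features to
    separate coding pixels from blank pixels.

    is_coding: boolean labels; True for pixels decoded to coding barcodes.
    probability=True enables Platt-scaled per-pixel probabilities.
    """
    X = np.column_stack([intensity, distance])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, is_coding)
    return clf

def barcode_log_likelihood(clf, pixel_features):
    """Score a barcode from its assigned pixels as sum(log(1 - p)),
    with p the classifier's probability that each pixel is "correct"
    (the scoring quoted in the comment above)."""
    p_correct = clf.predict_proba(pixel_features)[:, list(clf.classes_).index(True)]
    return np.sum(np.log1p(-p_correct))
```

Barcodes would then be ranked by this score, with the threshold chosen so that the blank barcodes imply roughly a 5% misidentification rate.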

@r3fang commented May 22, 2020

I was hoping to assign a confidence level to every detected barcode instead of doing adaptive thresholding.

@HazenBabcock (Contributor, Author)

@r3fang Since you don't have ground truth, it seems to me that your confidence model is based on the pipeline output?

@r3fang commented May 22, 2020

> @r3fang Since you don't have ground truth, it seems to me that your confidence model is based on the pipeline output?

The model was trained to separate the blank pixels (not barcodes) from coding pixels. It is not the direct output from MERlin; I generated the training set on my own. The model trained on one dataset using one codebook seems to also work well on another dataset.

@r3fang commented May 22, 2020

Of course, the blank pixels do not represent all false pixels.
