@@ -7,7 +7,7 @@
 
 \usage{
 correctedContact(data, iterations=50, exclude.local=1, ignore.low=0.02,
-    winsor.high=0.02, average=TRUE, dispersion=0.05)
+    winsor.high=0.02, average=TRUE, dist.correct=FALSE)
 }
 
 \arguments{
@@ -17,7 +17,7 @@ correctedContact(data, iterations=50, exclude.local=1, ignore.low=0.02,
 \item{ignore.low}{a numeric scalar, indicating the proportion of low-abundance bins to ignore}
 \item{winsor.high}{a numeric scalar indicating the proportion of high-abundance bin pairs to winsorize}
 \item{average}{a logical scalar specifying whether counts should be averaged across libraries}
-\item{dispersion}{a numeric scalar for use in computing the average count in \code{\link{mglmOneGroup}}}
+\item{dist.correct}{a logical scalar indicating whether to correct for distance effects}
 }
 
 \value{
@@ -26,6 +26,7 @@ A list with several components.
 \item{\code{truth}:}{a numeric vector containing the true interaction probabilities for each bin pair}
 \item{\code{bias}:}{a numeric vector of biases for all bins}
 \item{\code{max}:}{a numeric vector containing the maximum fold-change change in biases at each iteration}
+\item{\code{trend}:}{a numeric vector specifying the fitted value for the distance-dependent trend, if \code{dist.correct=TRUE}}
 }
 
 If \code{average=FALSE}, each component is a numeric matrix instead.
 Each column of the matrix contains the specified information for each library in \code{data}.
@@ -34,31 +35,39 @@ Each column of the matrix contains the specified information for each library in
 \details{
 This function implements the iterative correction procedure described by Imakaev \emph{et al.} in their 2012 paper.
 Briefly, this aims to factorize the count for each bin pair into the bias for the anchor bin, the bias for the target bin and the true interaction probability.
-The probability sums to 1 across all bin pairs for a given bin.
 The bias represents the ease of sequencing/mapping/other for that genomic region.
 
 The \code{data} argument should be generated by taking the output of \code{\link{squareCounts}} after setting \code{filter=1}.
 Filtering should be avoided as counts in low-abundance bin pairs may be informative upon summation for each bin.
 For example, a large count sum for a bin may be formed from many bin pairs with low counts.
-Removal of those bin pairs would result in the loss of per-bin information.
+Removal of those bin pairs would result in loss of information.
 
+For \code{average=TRUE}, if multiple libraries are used to generate \code{data}, an average count will be computed for each bin pair across all libraries using \code{\link{mglmOneGroup}}.
+The average count will then be used for correction.
+Otherwise, correction will be performed on the counts for each library separately.
+
+The maximum step size in the output can be used as a measure of convergence.
+Ideally, the step size should approach 1 as iterations pass.
+This indicates that the correction procedure is converging to a single solution, as the maximum change to the computed biases is decreasing.
+}
+
+\section{Additional parameter settings}{
 Some robustness is provided by winsorizing out strong interactions with \code{winsor.high} to ensure that they do not overly influence the computed biases.
+This is useful for removing spikes around repeat regions or due to PCR duplication.
 Low-abundance bins can also be removed with \code{ignore.low} to avoid instability during correction, though this will result in \code{NA} values in the output.
 
 Local bin pairs can be excluded as these are typically irrelevant to long-range interactions.
 They are also typically very high-abundance and may have excessive weight during correction, if not removed.
 This can be done by removing all bin pairs where the difference between the anchor and target indices is less than \code{exclude.local}.
 
-For \code{average=TRUE}, if multiple libraries are used to generate \code{data}, an average count will be computed for each bin pair across all libraries using \code{\link{mglmOneGroup}} with the specified \code{dispersion}.
-The average count will then be used for correction.
-Otherwise, correction will be performed on the counts for each library separately.
-
-The maximum step size in the output can be used as a measure of convergence.
-Ideally, the step size should approach 1 as iterations pass.
-This indicates that the correction procedure is converging to a single solution, as the maximum change to the computed biases is decreasing.
+If \code{dist.correct=TRUE}, abundances will be adjusted for distance-dependent effects.
+This is done by computing residuals from the fitted distance-abundance trend, using the \code{filterTrended} function.
+These residuals are then used for iterative correction, such that local interactions will not always have higher contact probabilities.
 
-% True signals are continuous variables and have limited use in count-based statistical frameworks.
-% You need to compute the bias for each one to get the offset.
+Ideally, the probability sums to unity across all bin pairs for a given bin (ignoring \code{NA} entries).
+This is complicated by winsorizing of high-abundance interactions and removal of local interactions.
+These interactions are not involved in correction, but are still reported in the output \code{truth}.
+As a result, the sum may not equal unity, i.e., values are not strictly interpretable as probabilities.
 }
 
 \examples{
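The iterative correction loop described in the `\details` text above (factorize each count into an anchor bias, a target bias and a true interaction probability, tracking the maximum fold change in the biases per iteration) can be sketched outside R. Below is a minimal, hypothetical NumPy sketch of an Imakaev-style correction on a dense symmetric contact matrix. It is not the diffHic implementation, and it omits winsorizing (`winsor.high`), low-abundance removal (`ignore.low`) and exclusion of local bin pairs (`exclude.local`):

```python
import numpy as np

def iterative_correction(counts, iterations=50):
    """Hypothetical sketch of Imakaev-style iterative correction on a
    dense symmetric n x n contact matrix; not the diffHic implementation."""
    truth = np.asarray(counts, dtype=float).copy()
    n = truth.shape[0]
    bias = np.ones(n)
    max_step = []
    for _ in range(iterations):
        # Marginal sum for each bin; at convergence these are all equal.
        marginals = truth.sum(axis=1)
        step = marginals / marginals[marginals > 0].mean()
        step[step == 0] = 1.0  # leave empty bins untouched
        # Divide the step out of both the anchor and target dimensions,
        # accumulating it into the per-bin biases.  This preserves the
        # invariant counts[i, j] == bias[i] * bias[j] * truth[i, j].
        truth /= step[:, None]
        truth /= step[None, :]
        bias *= step
        # Maximum fold change in the biases; approaches 1 at convergence.
        max_step.append(max(step.max(), 1.0 / step[step > 0].min()))
    return {"truth": truth, "bias": bias, "max": np.array(max_step)}
```

On a noiseless rank-one matrix this converges within a couple of iterations; the returned `max` vector plays the same role as the `max` component in the function's output, approaching 1 as the biases stabilize.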
@@ -84,6 +93,23 @@ stuff <- correctedContact(data, average=FALSE)
 head(stuff$truth)
 head(stuff$bias)
 head(stuff$max)
+
+# Creating an offset matrix, for use in glmFit.
+anchor.bias <- stuff$bias[anchors(data, id=TRUE),]
+target.bias <- stuff$bias[targets(data, id=TRUE),]
+offsets <- log(anchor.bias * target.bias)
+difference <- log(stuff$truth) - (log(counts(data)) - offsets) # effective function of offset in GLMs.
+stopifnot(all(is.na(difference) | difference < 1e-8))
+
+# Adjusting for distance, and computing offsets with trend correction.
+stuff <- correctedContact(data, average=FALSE, dist.correct=TRUE)
+head(stuff$truth)
+head(stuff$trend)
+offsets <- log(stuff$bias[anchors(data, id=TRUE),]) +
+    log(stuff$bias[targets(data, id=TRUE),]) +
+    stuff$trend/log2(exp(1))
+difference <- log(stuff$truth) - (log(counts(data)) - offsets)
+stopifnot(all(is.na(difference) | difference < 1e-8))
 }
 
 \author{Aaron Lun}
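Two pieces of arithmetic in the added example lines are easy to miss: dividing the trend by `log2(exp(1))` converts a log2-scale value to the natural-log scale expected by GLM offsets, and the `stopifnot()` checks rest on the factorization count = anchor bias × target bias × truth. A small Python sketch with made-up numbers (not real Hi-C data) confirming both identities:

```python
import math

# Base conversion used in the example: dividing a log2-scale value by
# log2(e) yields the natural logarithm, since log2(x) / log2(e) == ln(x).
x = 7.5
assert abs(math.log2(x) / math.log2(math.e) - math.log(x)) < 1e-12

# Offset identity behind the stopifnot() checks: if
# count = anchor.bias * target.bias * truth, then
# log(truth) == log(count) - offset, with offset = log(anchor.bias * target.bias).
# Illustrative numbers only, not real Hi-C data.
anchor_bias, target_bias, truth = 1.3, 0.8, 0.05
count = anchor_bias * target_bias * truth
offset = math.log(anchor_bias * target_bias)
assert abs(math.log(truth) - (math.log(count) - offset)) < 1e-12
```

This is why subtracting the offset matrix from the log-counts recovers the log-truth in the example, up to numerical error and the winsorized or excluded entries.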