zinck
is a novel knockoff-based framework specifically designed for microbiome data analysis. Microbiome datasets are often high-dimensional, compositional, and zero-inflated, posing unique challenges for statistical analysis and interpretation. zinck
addresses these challenges by employing a flexible generative model that effectively captures the zero-inflation and complex dependence structure characteristic of microbial communities. This approach allows for simultaneous variable selection and false discovery rate (FDR) control, making zinck
particularly suited for taxonomic variable selection in microbiome studies.
Visit the Zinck Website for some reproducible analyses from our paper and to explore its workflow!
Before installing zinck
, ensure your system meets these requirements:
- R version 4.0.0 or higher (version 4.1.3 recommended)
- Rtools (for Windows users)
- A C++ compiler compatible with R
zinck
relies on several R packages. Most dependencies are automatically installed when you install zinck
. However, special attention is needed for rstan
and StanHeaders
due to version compatibility.
dplyr
reshape2
knockoff
glmnet
randomForest
caret
rstan
(version 2.21.7 or 2.21.8 recommended)StanHeaders
(version 2.26.25)stats
fitdistrplus
ggplot2
MLmetrics
phyloseq
GUniFrac
kosel
gridExtra
zinLDA
knitr
rmarkdown
To use zinck
, it's critical to install specific versions of rstan
and StanHeaders
. Follow these steps:
-
Remove Older Versions: If you have different versions of
rstan
orStanHeaders
installed, remove them using:remove.packages(c("rstan", "StanHeaders"))
-
Install
StanHeaders
2.26.25:packageurl <- "https://cran.r-project.org/src/contrib/Archive/StanHeaders/StanHeaders_2.26.25.tar.gz" install.packages(packageurl, repos=NULL, type="source")
-
Install
rstan
2.21.8:packageurl <- "https://cran.r-project.org/src/contrib/Archive/rstan/rstan_2.21.8.tar.gz" install.packages(packageurl, repos=NULL, type="source")
Note: Installation from source requires Rtools (Windows) or the appropriate development tools for other operating systems.
-
Verify Installation: Ensure that the correct versions of
rstan
andStanHeaders
are installed by running:packageVersion("rstan") # Should return 2.21.8 packageVersion("StanHeaders") # Should return 2.26.25
After ensuring that the correct versions of rstan
and StanHeaders
are installed, you can proceed with installing zinck
.
-
Install
devtools
Package: If not already installed, you needdevtools
to installzinck
from GitHub.if (!require("devtools")) install.packages("devtools")
-
Install
zinck
:devtools::install_github("ghoshstats/zinck", build_vignettes = TRUE)
Once zinck
is successfully installed, load it in R to begin your analysis:
library(zinck)
To get started with zinck, explore the package vignettes and documentation:
# View available vignettes
browseVignettes(package = "zinck")
To get started with zinck, explore the package documentation:
# View package documentation
?zinck
After installation, you can verify and get a feel for the package's capabilities through a practical example. This guide walks you through loading a dataset included with zinck, selecting samples, generating knockoffs, and performing variable selection. This example utilizes the count.Rdata
dataset included in the zinck package. Ensure it's loaded correctly into your R session.
# Load the dataset
load(data/count.Rdata)
# Simulation Design
generate_data <- function(p, seed) {
dcount <- count[, order(decreasing = TRUE, colSums(count, na.rm = TRUE), apply(count, 2L, paste, collapse = ''))] # Ordering the columns with decreasing abundance
# Randomly sampling patients from 574 observations
set.seed(seed)
norm_count <- count / rowSums(count)
col_means <- colMeans(norm_count > 0)
indices <- which(col_means > 0.2)
sorted_indices <- indices[order(col_means[indices], decreasing = TRUE)]
if (p %in% c(200, 300, 400)) {
dcount <- count[, sorted_indices][, 1:p]
sel_index <- sort(sample(1:nrow(dcount), 500))
dcount <- dcount[sel_index, ]
original_OTU <- dcount + 0.5
seq_depths <- rowSums(original_OTU)
Pi <- sweep(original_OTU, 1, seq_depths, "/")
n <- nrow(Pi)
col_abundances <- colMeans(Pi)
##### Generating continuous responses ######
set.seed(1)
signal_indices <- sample(1:min(p, 200), 30, replace = FALSE) # Randomly selecting 30 indices for signal injection
signals <- (2 * rbinom(30, 1, 0.5) - 1) * runif(30, 6, 12)
kBeta <- numeric(p)
kBeta[signal_indices] <- signals / sqrt(col_abundances[signal_indices])
eps <- rnorm(n, mean = 0, sd = 1)
Y <- Pi^2 %*% (kBeta / 2) + Pi %*% kBeta + eps
##### Generating binary responses #####
set.seed(1)
signals <- (2 * rbinom(30, 1, 0.5) - 1) * runif(30, 30, 50)
kBeta[signal_indices] <- signals / sqrt(col_abundances[signal_indices])
pr <- 1 / (1 + exp(-(Pi^2 %*% (kBeta / 2) + Pi %*% kBeta)))
Y_bin <- rbinom(n, 1, pr)
######### Generate a copy of X #########
X <- matrix(0, nrow = nrow(Pi), ncol = ncol(Pi))
nSeq <- seq_depths
# Loop over each row to generate the new counts based on the multinomial distribution
set.seed(1)
for (i in 1:nrow(Pi)) {
X[i, ] <- rmultinom(1, size = nSeq[i], prob = Pi[i, ])
}
} else {
print("Enter p within 200 to 400")
}
colnames(X) <- colnames(Pi)
return(list(Y = Y, X = X, Y_bin = Y_bin, signal_indices = signal_indices))
}
ntaxa = 200 # Change to p = 300, 400 accordingly.
X <- generate_data(p=ntaxa, seed=1)$X
Y1 <- generate_data(p=ntaxa, seed=1)$Y
# Fit the zinck model
fit <- fit.zinck(X,num_clusters = 18,method="ADVI",seed=1,boundary_correction = TRUE, prior_ZIGD = TRUE)
# Extract model parameters
beta <- fit[["beta"]]
theta <- fit[["theta"]]
# Generate knockoff features
X_tilde <- zinck::generateKnockoff(X,theta,beta,seed=1) ## getting the knockoff copy
# Perform variable selection
cts_rf <- suppressWarnings(zinck.filter(X,X_tilde,Y1,model="Random Forest",fdr=0.2,ntrees=5000,offset=1,mtry=400,seed=15,metric = "Accuracy", rftuning = TRUE))
index_est <- cts_rf[["selected"]]
index <- generate_data(p=ntaxa,seed=1)$signal_indices
# Evaluating model performance
TP <- sum(index_est %in% index) # True Positives
FP <- sum(!index_est %in% index) # False Positives
FN <- length(index) - TP # False Negatives
estimated_FDR <- FP / (FP + TP) # Evaluating the empirical False Discovery Rate
estimated_power <- TP / (TP + FN) # Evaluating the empirical Power or TPR
print(paste("Estimated FDR:", estimated_FDR))
print(paste("Estimated Power:", estimated_power))
Codes used to replicate both simulation and real data analyses are in the folder Codes/
.
There are two types of simulation studies:
(1) Non-parametric Simulations -- where the microbiome data generating process is unknown and involves subsetting from the CRC species level data (https://github.com/zellerlab/crc_meta), also saved in the repository as count.Rdata
. The number of taxa is varied from 200 to 400 for both continuous and binary outcome types. The empirical detection powers along with FDRs are recorded for a range of target FDR thresholds. The codes to replicate the analysis can be found in Codes/Non-parametric Simulations.R
.
(2) Parametric Simulations -- where the microbiome data generating process is known that is, either generated from a Dirichlet Multinomial (DM) setting or a Logistic Normal Multinomial (LNM) setting. The number of taxa is again varied from 200 to 400 and the empirical detection powers along with FDRs are recorded for a range of target FDR thresholds. The codes to replicate the analysis can be found in Codes/Parametric Simulations.R
.
(1) CRC data analyses -- The codes to reproduce the Feature Statistics and the Venn Diagram representing the number of biomarkers detected for the CRC
data can be found in Codes/CRC-analysis.R
.
(2) IBD data analyses -- The codes to reproduce the Feature Statistics and the Venn Diagram representing the number of biomarkers detected for the IBD
data can be found in Codes/IBD-analysis.R
.
Furthermore, the codes replicating the Leave-one-study-out prediction analysis for both CRC
and IBD
studies can be found in Codes/Prediction-analysis.R
The heatmaps to compare the quality of knockoffs for zinck
and other methods can be replicated using the codes in Codes/Heatmaps.R
.
For any issues or further assistance, consult the zinck pdf manual saved as zinck-manual.pdf
or visit the GitHub repository's Issues section.