JohnSpJun/scGSEA

scGSEA

Description: scGSEA is an extension of ssGSEA tailored for single-cell data analysis. It addresses the sparsity of single-cell data by using a normalization method and a scoring metric chosen to minimize variability. With scGSEA, scientists can explore and interpret pathway activity and functional alterations within heterogeneous populations of cells.

Authors: John Jun; UCSD - Mesirov Lab, UCSD

Contact: Forum Link.

Parameters

| Parameter Group | Name | Description | Default Value |
| --- | --- | --- | --- |
| Input Files | gex_file * | File containing raw counts or mRNA abundance estimates | |
| Input Files | gene_set_database_file * | Gene sets in GMT format | |
| Input Files | chip_file | Chip file used for conversion to gene symbols | |
| Input Files | output_file_name * | Basename to use for output file | scGSEA_scores |
| Cell Grouping Data * (only use one of the two) | metacell_data_label | Metadata label for cell grouping (metacell) information; clustering data | seurat_clusters |
| Cell Grouping Data * (only use one of the two) | metacell_data_file | Metadata file for cell grouping (metacell) information; clustering data | |
| Multi-threading | n_cpu | Number of CPUs to utilize for parallel computing | 3 |

* Required

Input Files

  1. gex_file
    This is a file containing unnormalized gene expression data as raw read counts or estimated RNA abundance. The scGSEA module supports multiple input file formats: Seurat RDS, H5seurat, and H5ad, as well as 10x Market Exchange (MEX) and HDF5 (h5). For a Seurat object, the $RNA@counts slot is used. For an AnnData object, the raw.X slot is used.

    • If you come across the following message in the stderr.txt file, please verify that the input file contains unnormalized raw counts data.  
      The raw counts matrix was not composed of integer values. This may represent an issue with the processing pipeline. Please be advised...
    • If you used kallisto or salmon.alevin for alignment, you can disregard the message about the raw counts data not being in integer format; these tools generate estimated RNA abundances, which may be non-integer values.
    • For 10x MEX file format, please compress the folder containing the three files (barcodes.tsv, matrix.mtx, features.tsv) and supply the .zip file.
  2. gene_set_database_file

    • This parameter’s drop-down allows you to select gene sets from the Molecular Signatures Database (MSigDB) on the GSEA website. This drop-down provides access to only the most current (2023) version of MSigDB. You can also upload your own gene set file(s) in GMT format.
    • If you want to use files from an earlier version of MSigDB you will need to download them from the archived releases on the GSEA website.
  3. chip_file
    This parameter’s drop-down allows you to select CHIP files from the Molecular Signatures Database (MSigDB) on the GSEA website. This drop-down provides access to only the most current version (2023) of MSigDB. How do I choose a chip file?

  4. output_file_name
    The prefix used for the names of the output GCT and CSV files. The default output prefix is scGSEA_scores. The output CSV and GCT files each contain a gene set x metacell matrix of enrichment scores.
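
As a minimal sketch of the 10x MEX packaging step described for gex_file, the following Python snippet compresses a folder into a .zip archive using only the standard library. The function name `zip_mex_folder` is hypothetical. The three required file names come from the description above. Whether the module expects the files at the top level of the archive or inside the folder is not stated here, so this sketch assumes the folder is kept as the top-level entry.

```python
import zipfile
from pathlib import Path

# File names listed in the gex_file description for 10x MEX input.
REQUIRED_MEX_FILES = ("barcodes.tsv", "matrix.mtx", "features.tsv")


def zip_mex_folder(folder, zip_path):
    """Compress a 10x MEX folder into a .zip archive (hypothetical helper)."""
    folder = Path(folder).resolve()
    missing = [name for name in REQUIRED_MEX_FILES if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing MEX files in {folder}: {missing}")

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED_MEX_FILES:
            # Assumption: keep the folder as the top-level entry,
            # e.g. "my_sample/matrix.mtx".
            archive.write(folder / name, arcname=f"{folder.name}/{name}")
    return Path(zip_path)
```

For example, `zip_mex_folder("my_sample", "my_sample.zip")` would produce an archive that can be supplied as gex_file; the paths here are placeholders.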

Cell Grouping Data

  1. metacell_data_label
    The name of the metadata label for cell grouping information within the input Seurat/AnnData object. This label is used to access the cell grouping information that determines how cells are aggregated into metacells. The default value for this parameter is seurat_clusters, which is the metadata label for the slot that stores the cell-to-cluster mapping generated by Seurat's FindClusters function. If your grouping is stored elsewhere, provide the metadata label for the slot that holds it.
  2. metacell_data_file
    If your input file is in 10x HDF5 or 10x MEX format, a separate cell grouping data file (tab-delimited .txt file) must be supplied here. The first column, "Name", contains cell names and the second column, "Metacell", contains metacell (cell group) names. The grouping information in this file is used to aggregate cells prior to computing scGSEA scores. Therefore, if you have 10x HDF5 or 10x MEX formatted files and do not have a metacell data file, first cluster the cells using a method of your choice to produce one.
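
The metacell data file layout described above can be sketched in Python as follows. This is an illustration only: the function name `write_metacell_file`, the barcodes, and the cluster labels are hypothetical, and the sketch assumes that "Name" and "Metacell" are written as a header row.

```python
import csv


def write_metacell_file(cell_to_group, path):
    """Write a tab-delimited file with "Name" and "Metacell" columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumption: the two column names appear as a header row.
        writer.writerow(["Name", "Metacell"])
        for cell_name, group in cell_to_group.items():
            writer.writerow([cell_name, group])


# Hypothetical cell barcodes and cluster labels, for illustration only.
example_grouping = {
    "AAACCTGAGAAACCAT-1": "cluster_0",
    "AAACCTGAGAAACCGC-1": "cluster_0",
    "AAACCTGAGAAACCTA-1": "cluster_1",
}
```

Calling `write_metacell_file(example_grouping, "metacell_data.txt")` would write a four-line file (one header row plus one row per cell) that maps each cell to its group.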

Multi-Threading

  1. n_cpu
    The number of CPUs to utilize for parallel computing. The scGSEA package parallelizes the computation of enrichment scores by dividing it into n_cpu subprocesses. The default value for this parameter is 3.
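
The general idea of dividing work across n_cpu subprocesses can be sketched with Python's standard multiprocessing module. This is not the scGSEA implementation: all function names are hypothetical, `score_chunk` is a placeholder rather than the scGSEA scoring metric, and splitting by gene set is an assumption about how the work might be partitioned.

```python
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal parts."""
    items = list(items)
    n_chunks = max(1, min(n_chunks, len(items)))
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def score_chunk(gene_set_names):
    """Placeholder for per-chunk work; NOT the scGSEA scoring metric."""
    return {name: float(len(name)) for name in gene_set_names}


def score_in_parallel(gene_set_names, n_cpu=3):
    """Run score_chunk on up to n_cpu chunks in separate worker processes."""
    chunks = split_into_chunks(gene_set_names, n_cpu)
    with Pool(processes=len(chunks)) as pool:
        partial_results = pool.map(score_chunk, chunks)
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged
```

When running `score_in_parallel` from a script, call it under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start workers by spawning a new interpreter.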

Output Files

  1. <output_file_name>.csv
    This is a gene set x metacell matrix of scGSEA scores in CSV format.
  2. <output_file_name>.gct
    This is a gene set x metacell matrix of scGSEA scores in GCT format.
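
As a rough sketch of how the GCT output could be read downstream, the following Python function parses a GCT file using only the standard library. It assumes the module writes GCT version 1.2 (a version line, a dimensions line, then a header row with Name and Description columns), which is not stated above. The function name `read_gct`, the gene set names, metacell labels, and scores in the example are made up for illustration.

```python
import csv
import io


def read_gct(handle):
    """Parse a GCT v1.2 file into (column_names, {row_name: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    version = next(reader)[0].strip()
    if version != "#1.2":
        raise ValueError(f"Unsupported GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in next(reader)[:2])
    header = next(reader)
    columns = header[2:]  # skip the Name and Description columns
    scores = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        scores[row[0]] = [float(value) for value in row[2:]]
    if len(scores) != n_rows or len(columns) != n_cols:
        raise ValueError("Matrix dimensions do not match the GCT header")
    return columns, scores


# Illustrative content only; these are not real scGSEA results.
example_gct = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tcluster_0\tcluster_1\n"
    "GENE_SET_A\tna\t0.12\t-0.30\n"
    "GENE_SET_B\tna\t0.05\t0.41\n"
)
columns, scores = read_gct(io.StringIO(example_gct))
```

To read a real output file, pass an open file handle instead, for example `read_gct(open("scGSEA_scores.gct", newline=""))`.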

For more details, please refer to the full documentation.
