GENEVA (GENe Expression Variance Analysis) is a semi-automated framework for exploring public RNA-seq datasets. It allows researchers to identify RNA-seq datasets that contain modulating conditions for a gene or a gene signature. For a given gene, GENEVA identifies the most relevant datasets by analyzing the variance of the gene's expression, then visualizes those datasets for detailed manual analysis. GENEVA is scalable and agnostic to study design.
GENEVA uses the uniformly processed RNA-seq data from the ARCHS4 website (https://amp.pharm.mssm.edu/archs4/download.html). As of Oct 9, 2020, the downloaded data include gene-level counts for 286,650 samples from 9,124 datasets (GEO series), together with their metadata. GENEVA transforms the gene count data into percentile ranks, which reduces the influence of library size, batch effects, and extreme values.
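The percentile rank transformation can be illustrated with a minimal sketch. The source does not state how ranks are computed, so the details here are assumptions: counts are ranked within one sample across genes, ties share their average rank, and the result is scaled to (0, 100]. The function name `percentile_rank` is hypothetical and is not part of GENEVA.

```python
def percentile_rank(counts):
    """Convert one sample's raw gene counts to percentile ranks.

    Assumption: ranking is done within a sample, ties receive the
    average of the ranks they span, and ranks are scaled to (0, 100].
    """
    n = len(counts)
    # Indices of genes sorted by ascending count.
    order = sorted(range(n), key=lambda i: counts[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and counts[order[j + 1]] == counts[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [100.0 * r / n for r in ranks]


# Two hypothetical samples with the same profile but a 10x deeper library.
shallow = [0, 5, 5, 1000]
deep = [0, 50, 50, 10000]
same_ranks = percentile_rank(shallow) == percentile_rank(deep)
```

Because only the ordering of counts matters, the two samples above receive identical percentile ranks, which is why a rank transform is insensitive to library size and to the magnitude of extreme values.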
For any given gene (gene X), GENEVA prioritizes the datasets in which the expression of gene X has a large variance. At the same time, GENEVA controls for the overall heterogeneity of the samples, so that it prioritizes datasets in which gene X is specifically modulated by experimental conditions rather than by tissue type differences. In addition, GENEVA embeds the metadata into a numerical space and prioritizes datasets with high correlations between gene X expression and the metadata. This allows GENEVA to identify datasets in which gene X is regulated by experimental conditions rather than by randomness or unexplained factors.
GENEVA concatenates the metadata of each sample into a single string, including the title, tissue type, and other characteristics (e.g. demographics, time points, treatment, genetic information, and disease status). GENEVA then calculates the pairwise Levenshtein distance between the strings that belong to the same study (GEO series). Finally, GENEVA applies multidimensional scaling to the pairwise Levenshtein distances and embeds the strings into 2-dimensional space for visualization and downstream analysis.
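The metadata embedding described above can be sketched as follows. This is an illustration under stated assumptions, not GENEVA's actual code: the source says only "multidimensional scaling", and this sketch uses classical (Torgerson) MDS via NumPy as one possible choice. The function names `levenshtein` and `embed_metadata` and the example strings are hypothetical.

```python
import numpy as np


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embed_metadata(strings, n_components=2):
    """Embed metadata strings of one study into n_components dimensions.

    Assumption: classical MDS on the pairwise Levenshtein distances.
    """
    n = len(strings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(strings[i], strings[j])
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can occur for non-Euclidean distances; clip them.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Hypothetical concatenated metadata strings for four samples of one study.
samples = [
    "lung control rep1",
    "lung control rep2",
    "lung IFN treated rep1",
    "lung IFN treated rep2",
]
coords = embed_metadata(samples)  # array of shape (4, 2)
```

In this toy study the two control samples differ by a single character, so they land close together in the embedding, while the treated samples sit farther away. This is the property the 2-D plot relies on to reveal experimental groups.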
For a given gene in a given dataset, GENEVA calculates the variance of the gene's expression across samples (VARg). GENEVA measures the overall heterogeneity of the samples by calculating the average variance of all genes (VARm). GENEVA then fits a regression with the expression of the gene as the dependent variable and the embedded metadata as the independent variables (expression ~ first embed dimension + second embed dimension). The coefficient of determination of this regression (R2) represents the association between the expression of the gene and the embedded metadata. The product of VARg and R2 represents the variance of the gene explained by the embedded metadata. The GENEVA score is defined as VARg × R2 / VARm.
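The score defined above can be computed with a short sketch. Several details are assumptions rather than statements about GENEVA's implementation: the regression is ordinary least squares with an intercept, variances are sample variances (`ddof=1`), and a gene with zero variance gets a score of 0. The function name `geneva_score` and the toy data are hypothetical.

```python
import numpy as np


def geneva_score(expr, gene_idx, embedding):
    """Compute VARg * R2 / VARm for one gene in one dataset.

    expr:      genes x samples array of rank-transformed expression
    gene_idx:  row index of the query gene
    embedding: samples x 2 array of embedded metadata
    """
    expr = np.asarray(expr, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    y = expr[gene_idx]
    var_g = y.var(ddof=1)                    # VARg: variance of the query gene
    var_m = expr.var(axis=1, ddof=1).mean()  # VARm: mean variance of all genes
    if var_m == 0:
        return 0.0
    # OLS: expression ~ intercept + first embed dimension + second embed dimension
    design = np.column_stack([np.ones(len(y)), embedding])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = ((y - design @ coef) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return var_g * r2 / var_m


# Toy dataset: 3 genes x 6 samples, two experimental groups of 3 samples.
embedding = np.array([[-1, 0.1], [-1, -0.1], [-1, 0.0],
                      [1, 0.1], [1, -0.1], [1, 0.0]])
expr = np.array([
    [20, 20, 20, 80, 80, 80],  # follows the group split
    [50, 50, 50, 50, 50, 50],  # constant across samples
    [40, 60, 50, 60, 40, 50],  # varies, but unrelated to the embedding
])
scores = [geneva_score(expr, g, embedding) for g in range(3)]
```

In this toy example only the first gene receives a high score: it has a large variance and that variance is fully explained by the embedded metadata. The third gene varies but is uncorrelated with both embedding dimensions, so its R2 and therefore its score are essentially zero.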
If the user queries a single gene, the bar plot shows the expression level (rank-transformed count data) of the gene. If the user queries a signature, the bar plot shows the enrichment score of the signature.
GENEVA embeds the metadata of the samples into 2-dimensional numerical space. Samples with similar experimental conditions will be close to each other in the 2-D plot. The 2-D plot allows users to visually identify experimental groups and to relate each group to the expression value. The color of each sample reflects the expression level of the gene (rank-transformed count data) or the enrichment score of the signature.
For scientific and technical queries, contact the Butte Lab at UCSF.
Manuscript in Preparation