Gene Expression Visualization Tool

Motivation

One use case for a cancer data visualization tool is comparing gene expression across normal and cancerous tissue. The intended end user is a scientist looking for genes that are upregulated in cancerous tissue and could serve as drug targets. Such a scientist may use the tool in two ways.

First, they may already have a gene in mind and want to see how that gene's expression levels are distributed across multiple normal tissues and cancer indications. This could be visualized as a set of boxplots for the gene of interest, one per tissue or cancer indication, which lets them judge whether the gene may serve as a good target. For example, if the gene is overexpressed in a specific cancerous tissue and expressed at lower levels in normal tissue, a therapeutic drug directed at it could act only on tissue that overexpresses the gene.

Second, the scientist may not have a specific gene in mind and could use the tool for exploration. They may filter genes by their own criteria and then select a cancer indication of interest. The resulting visualization would show, across all genes matching the filter criteria, which genes are most overexpressed in that cancer type compared to normal tissue. This could be represented with a line plot that puts the ratio of average cancer expression to normal expression on one axis and the ratio of prevalence of cancer overexpression to normal expression on the other. From this visualization, the scientist could isolate the genes with both the highest average cancer overexpression and the highest prevalence of cancer overexpression, giving them a solid foundation for further research into the best gene targets for therapeutic drugs against their cancer type.

Background

Data

The data come from the TOIL dataset, a UCSC project that recomputes expression levels across samples from both cancer and normal tissues. The original studies (TCGA and TARGET for tumor tissue and GTEx for normal tissue) computed their expression values through different processing pipelines, so this recomputation was necessary to make values comparable across studies. TOIL contains multiple datasets, but the primary ones we use are "RSEM norm_count", which contains the actual gene expression values, and "TCGA GTEX main categories" and "TCGA TARGET GTEX selected phenotypes", which contain metadata about each sample.

The biggest bias consideration is whether the expression results were harmonized correctly between the different studies. If they were not, comparing results across studies would be meaningless. However, UCSC Toil built a single, reliable pipeline through which all samples were recomputed (at a cost of $1.30 per sample). Beyond this, the data raise no other major concerns: all samples were collected from consenting patients, and the data cannot be traced back to the identity of the individuals who gave samples.

Little cleaning or processing was needed because the data is already well organized. The main step was aggregating the different datasets (expression levels and sample metadata) so that they are all accessible together. This was done in Python by loading the datasets into Pandas DataFrames and concatenating them.
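The aggregation step could look roughly like the sketch below. It is only an illustration: the tiny in-memory tables, the gene names, the sample IDs, and the column names (`category`, `study`) are hypothetical stand-ins, and it assumes the expression matrix has genes as rows and samples as columns while both metadata tables are indexed by sample ID.

```python
import pandas as pd

# Hypothetical stand-in for the expression matrix (genes as rows, samples as columns).
expression = pd.DataFrame(
    {"TCGA-01": [10.2, 3.1], "GTEX-01": [4.5, 3.0], "TARGET-01": [9.8, 2.7]},
    index=["GENE_A", "GENE_B"],
)

# Hypothetical stand-ins for the two metadata tables, indexed by sample ID.
categories = pd.DataFrame(
    {"category": ["TCGA Breast Invasive Carcinoma", "GTEX Breast", "TARGET Neuroblastoma"]},
    index=["TCGA-01", "GTEX-01", "TARGET-01"],
)
phenotypes = pd.DataFrame(
    {"study": ["TCGA", "GTEX", "TARGET"]},
    index=["TCGA-01", "GTEX-01", "TARGET-01"],
)

# Transpose so each row is one sample, then concatenate column-wise.
# join="inner" keeps only the samples present in all three tables.
combined = pd.concat([expression.T, categories, phenotypes], axis=1, join="inner")

print(combined)
```

With one row per sample, each expression value sits next to the metadata for the same sample, which is the shape the visualizations need for coloring and filtering.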

Raw Data

Demo Video

Report

Final Report

Acknowledgements



This visualization lets the user select two genes and see how correlated their expression is. Once graphed, it shows the expression data for both genes on a scatterplot. Red points represent TCGA data, yellow points TARGET data, and green points GTEX data, distinguishing the different data sources. Clicking a point shows the ID of the sample it comes from as well as the exact expression counts for both genes.
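As a rough sketch of the data behind such a scatterplot, the snippet below maps each sample's study to the colors described above and computes a Pearson correlation as one possible numeric summary of "how correlated" two genes are. The choice of Pearson, the function name `gene_correlation`, and the sample table are assumptions made for illustration, not a description of the tool's actual code.

```python
import pandas as pd

# Color scheme taken from the description above.
STUDY_COLORS = {"TCGA": "red", "TARGET": "yellow", "GTEX": "green"}

# Hypothetical per-sample table: one row per sample, one column per gene.
samples = pd.DataFrame(
    {
        "GENE_A": [1.0, 2.0, 3.0, 4.0],
        "GENE_B": [2.1, 3.9, 6.2, 7.8],
        "study": ["TCGA", "TCGA", "TARGET", "GTEX"],
    },
    index=["S1", "S2", "S3", "S4"],
)


def gene_correlation(df: pd.DataFrame, gene_x: str, gene_y: str) -> float:
    """Pearson correlation between two gene columns (assumed metric)."""
    return df[gene_x].corr(df[gene_y])


r = gene_correlation(samples, "GENE_A", "GENE_B")

# One color per point, in the same order as the rows of `samples`.
point_colors = samples["study"].map(STUDY_COLORS)

print(f"Pearson r = {r:.3f}")
print(point_colors.tolist())
```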



This visualization lets the user select a gene and the disease whose expression they want to examine. It then shows gene expression counts for all normal tissue types (colored green) alongside the expression of the tumor tissue (highlighted in red). Hovering over a tissue type shows the exact maximum expression for that tissue. Clicking on any tissue also highlights, in black, the points in the scatterplot that belong to the same tissue type.
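The per-tissue values shown on hover could be derived along the lines of the sketch below. The table layout, the `tissue` and `is_tumor` columns, and the function name `tissue_summary` are hypothetical; the sketch only assumes that every sample carries a tissue label and a flag separating tumor from normal samples.

```python
import pandas as pd

# Hypothetical per-sample table for a single gene.
samples = pd.DataFrame(
    {
        "GENE_A": [4.5, 5.1, 3.2, 9.8, 10.4],
        "tissue": [
            "Breast",
            "Breast",
            "Lung",
            "Breast Invasive Carcinoma",
            "Breast Invasive Carcinoma",
        ],
        "is_tumor": [False, False, False, True, True],
    }
)


def tissue_summary(df: pd.DataFrame, gene: str, disease: str) -> pd.DataFrame:
    """Maximum expression per tissue: all normal tissues plus the chosen disease."""
    normal = df[~df["is_tumor"]]
    tumor = df[df["is_tumor"] & (df["tissue"] == disease)]
    subset = pd.concat([normal, tumor])
    summary = subset.groupby("tissue")[gene].max().to_frame("max_expression")
    # Normal tissues are green, the selected tumor tissue is red.
    summary["color"] = ["red" if t == disease else "green" for t in summary.index]
    return summary


summary = tissue_summary(samples, "GENE_A", "Breast Invasive Carcinoma")
print(summary)
```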