TempO-SeqR data analysis tool: case study

How do I analyze my gene expression data?
 

If you have done RNA-Seq before, you may have run into this issue: analyzing RNA-Seq gene expression data often requires substantial computational resources and specialist bioinformatics knowledge. We’re aware of this problem! With TempO-Seq, an extraction-free, highly multiplexed gene expression platform, we want to make it simpler and more user-friendly not only to get to the data, but also to work with the data once you have it. That is why we developed TempO-SeqR.

TempO-SeqR is a simple online tool for sequencing alignment, quality control, and analysis of TempO-Seq data. The tool is based on a graphical user interface (GUI) and requires no coding knowledge.

 
What this means is that you can easily access, align, and analyze your TempO-Seq data in an afternoon without needing an informaticist.
 

To illustrate how easy TempO-SeqR is to use, we present a case study describing a step-by-step analysis of a colorectal cancer FFPE TempO-Seq dataset, going from raw data all the way to pathway analysis. Like most data analysis pipelines for next-generation sequencing data, the case study is split into the following steps:

  1. Alignment

  2. Quality control

  3. Differential gene expression analysis


Please note: the type of analysis will always depend on your data. This example can be used as a guideline, but other considerations may apply depending on your own experimental design.


 

Step 1: Alignment - more like a lookup table

 

Let’s start with aligning the data. TempO-Seq gene expression analysis is highly multiplexed, allowing for up to 6,144 samples per sequencing run. Each sample has a unique tag, which allows “demultiplexing” by sample; this is not so different from regular RNA-Seq. The Illumina software demultiplexes the samples, resulting in a folder of FASTQ files. To align, use the TempO-SeqR alignment tool to browse to the folder containing your FASTQ files, select the files you want to align, and click align. It’s that simple.
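To give a feel for why this step is "more like a lookup table", here is a minimal Python sketch of counting reads by looking each one up in a table of known probe sequences. It is an illustration only, not TempO-SeqR's actual aligner: the probe sequences, gene labels, and the `count_reads` function are all made up for this example, and a real aligner would typically also tolerate sequencing errors rather than require exact matches.

```python
from collections import Counter

# Hypothetical probe table (made-up sequences and gene labels).
# Because the assay targets a known set of probes, each read only has to be
# matched against this table rather than against a whole genome.
PROBE_TO_GENE = {
    "ACGTACGTAC": "GENE_A",
    "TTGCAAGGCT": "GENE_B",
    "GGATCCATTG": "GENE_C",
}


def count_reads(reads, probe_to_gene):
    """Count reads per gene by exact dictionary lookup.

    Reads that match no probe are tallied under "unmatched".
    """
    counts = Counter()
    for read in reads:
        gene = probe_to_gene.get(read, "unmatched")
        counts[gene] += 1
    return counts


# Four made-up reads from one demultiplexed sample.
reads = ["ACGTACGTAC", "TTGCAAGGCT", "ACGTACGTAC", "NNNNNNNNNN"]
counts = count_reads(reads, PROBE_TO_GENE)
print(dict(counts))
```

Repeating this for every sample gives one column of counts per sample, which is the kind of counts table the later steps work from.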

Here is a preview of what this alignment set-up looks like in TempO-SeqR (Figure 1):

 
Figure 1. Preview of how the file alignment is set up in TempO-SeqR.

 

And, just as a comparison, here are the alignment times for RNA-Seq and TempO-Seq data on a cluster and on a laptop (Figures 2a and 2b):

 
Figure 2a. RNA-Seq vs. TempO-Seq alignment comparison on cluster.


Figure 2b. RNA-Seq vs. TempO-Seq alignment comparison on laptop.



Step 2: Quality control

 


Step 2.1: Correlation plots

After aligning our sequencing data, we now have a raw matrix of counts per gene per sample. Before we identify differentially expressed genes, let’s first look at the quality of our dataset.

Correlation plots tell you how similar two samples are. Between replicates, you want correlation plots to have a high R² (R-squared). If this isn’t the case, there may have been some variability in the preparation of the samples. Conversely, between biologically different samples, the correlation will be lower, as it should be given the biological differences.
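If you are curious what sits behind such a plot, the sketch below computes R-squared (the squared Pearson correlation) between pairs of samples in plain Python. The counts are made up, and the log2(count + 1) transform is our assumption for this illustration; TempO-SeqR may scale the data differently before plotting.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return (cov * cov) / (var_x * var_y)


def log_counts(counts):
    """log2(count + 1), so a few very highly expressed genes do not dominate."""
    return [math.log2(c + 1) for c in counts]


# Made-up raw counts for five genes in three samples.
replicate_1 = [10, 200, 35, 0, 1500]
replicate_2 = [12, 190, 40, 1, 1450]   # technical replicate of the same sample
other_sample = [300, 5, 35, 80, 20]    # biologically different sample

r2_replicates = r_squared(log_counts(replicate_1), log_counts(replicate_2))
r2_different = r_squared(log_counts(replicate_1), log_counts(other_sample))

print(f"R-squared between replicates:        {r2_replicates:.3f}")
print(f"R-squared between different samples: {r2_different:.3f}")
```

With these toy numbers, the replicate pair gives a much higher R-squared than the pair of different samples, which is the pattern you want to see in your own data.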

To illustrate a high R², we’ve loaded some of our reference lysate TempO-Seq controls (brain universal reference lysates BURL1 and BURL2) in Figure 3. You can see that there is very little variability within replicates, while there is much more variability between different samples.

 
Figure 3. Scatter plots for reference lysate controls BURL1 and 2 show that variability within replicates (A) is low (i.e., correlation, or R-squared, is high), while variability between samples is higher (correlation, or R-squared, is lower) (B).


 



Step 2.2: Principal component analysis

Great, our correlations are looking good. Next, let’s look at the principal components of our data. Principal component analysis (PCA) is a dimensionality reduction method commonly used for identifying outliers and clusters in your data. Essentially, PCA identifies the components that explain most of the variability in your data. Ideally, these components reflect biological differences, but you can also use PCA to identify grouping and potential outliers.
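For readers who want to see the idea in code, here is a small, self-contained Python sketch that finds the first principal component of a toy dataset using power iteration on the gene covariance matrix. The expression values and sample groups are made up, and this is not how TempO-SeqR computes its PCA; in practice you would normally use an established numerical library.

```python
import math


def center_columns(rows):
    """Subtract each gene's mean across samples (rows are samples)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance(centered):
    """Gene-by-gene sample covariance matrix of already-centered data."""
    n = len(centered)
    genes = len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(genes)]
        for i in range(genes)
    ]


def first_component(cov, iterations=200):
    """Approximate the top eigenvector and eigenvalue by power iteration."""
    size = len(cov)
    v = [1.0 / math.sqrt(size)] * size  # simple deterministic starting vector
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(c * x for c, x in zip(row, v)) for row in cov]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return v, eigenvalue


# Made-up log-scale expression values: 4 samples (rows) x 3 genes (columns).
samples = [
    [5.0, 2.0, 7.0],  # normal 1
    [5.2, 2.1, 6.9],  # normal 2
    [8.0, 2.0, 3.0],  # tumor 1
    [8.3, 1.9, 3.2],  # tumor 2
]

centered = center_columns(samples)
cov = covariance(centered)
pc1, eigenvalue = first_component(cov)

# Project each sample onto PC1 to get its coordinate on the PCA plot's first axis.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
# Fraction of total variance captured by PC1.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print("PC1 scores:", [round(s, 2) for s in scores])
print(f"Variance explained by PC1: {explained:.1%}")
```

In this toy example the two normal samples land on one side of PC1 and the two tumor samples on the other. The sign of a principal component is arbitrary, so only the separation matters, not which group ends up positive.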

Here, we are showing PCA of all of our samples (Figure 4). You can see that our controls cluster together and separately from all experimental samples as expected.

 
Figure 4. PCA plot of all samples.


 

In the second PCA plot, which shows only our case study colorectal cancer FFPE dataset (Figure 5), we see clustering by tumor subtype.

 
Figure 5. PCA of only the colorectal cancer FFPE dataset.

 


Based on the experimental design, we expected these samples to cluster in this way. You will probably have a similar experimental setup.

 

 

Step 3: Finding differentially expressed genes between groups

 

Now that we’ve confirmed that the different tumor subtypes cluster separately, let’s identify which genes are differentially expressed. This is likely the main question you wanted to address with your gene expression experiment. Luckily, TempO-SeqR makes this step really easy.

You can select your samples of interest for the two groups and carry out differential expression (DE) analysis. Our DE analysis is based on DESeq2. There are many other methods, but DESeq2 is one of the most commonly used, and you can read more about it here: Love et al., 2014. In the example below, we compare normal tissue (circled in green in the PCA, Figure 5) to a tumor section of FFPE tissue (circled in red in the PCA, Figure 5).

This will give you a list of differentially expressed genes. In Figure 6 below, the blue dots with a positive log2 fold change are upregulated in the tumor, while the blue dots with a negative log2 fold change are downregulated.
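To make the log2 fold change concrete, the short Python sketch below computes it from mean normalized counts and labels each gene as up- or downregulated. This is a simplified illustration and not DESeq2 itself, which fits a statistical model to estimate fold changes and adjusted p-values. The gene labels, counts, adjusted p-values, pseudocount, and cutoffs are all assumptions made for this example.

```python
import math


def log2_fold_change(tumor_mean, normal_mean, pseudocount=1.0):
    """log2 ratio of tumor to normal; the pseudocount avoids division by zero."""
    return math.log2((tumor_mean + pseudocount) / (normal_mean + pseudocount))


def classify(log2fc, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene using hypothetical fold-change and adjusted p-value cutoffs."""
    if padj >= padj_cutoff or abs(log2fc) < lfc_cutoff:
        return "not significant"
    return "upregulated in tumor" if log2fc > 0 else "downregulated in tumor"


# Made-up values: (gene label, mean count in tumor, mean count in normal, adjusted p-value)
genes = [
    ("GENE_A", 799, 99, 0.001),   # (799 + 1) / (99 + 1) = 8, so log2 fold change is 3
    ("GENE_B", 24, 399, 0.002),   # (24 + 1) / (399 + 1) = 1/16, so log2 fold change is -4
    ("GENE_C", 120, 110, 0.60),   # small change and high adjusted p-value
]

results = {}
for name, tumor_mean, normal_mean, padj in genes:
    lfc = log2_fold_change(tumor_mean, normal_mean)
    results[name] = (lfc, classify(lfc, padj))
    print(f"{name}: log2 fold change = {lfc:.2f}, {results[name][1]}")
```

A positive log2 fold change of 1 means the gene is expressed at twice the level in the tumor, and a value of -1 means half the level, so the plot treats up- and downregulation symmetrically.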

 
Figure 6. Differential expression of genes.



Summary

 

In this case study, we used an example dataset and performed a TempO-SeqR analysis from raw data (alignment) to hypothesis generation. We hope you found this demonstration useful and that it illustrates how you could use TempO-SeqR for your research. If you have any questions or would like to request a quotation for TempO-Seq, including use of TempO-SeqR, please don’t hesitate to reach out.