RNA-seq Data

Note: this lab borrows heavily from a Carpentries incubator module on visualization.

RNA-Seq data

To start out with, let’s explore the data provided:

# A tibble: 32,428 × 20
   gene    sample  expre…¹ organ…²   age sex   infec…³ strain  time tissue mouse
   <chr>   <chr>     <dbl> <chr>   <dbl> <chr> <chr>   <chr>  <dbl> <chr>  <dbl>
 1 Asl     GSM254…    1170 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 2 Apod    GSM254…   36194 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 3 Cyp2d22 GSM254…    4060 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 4 Klk6    GSM254…     287 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 5 Fcrls   GSM254…      85 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 6 Slc2a4  GSM254…     782 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 7 Exd2    GSM254…    1619 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 8 Gjc2    GSM254…     288 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
 9 Plp1    GSM254…   43217 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
10 Gnb4    GSM254…    1071 Mus mu…     8 Fema… Influe… C57BL…     8 Cereb…    14
# … with 32,418 more rows, 9 more variables: ENTREZID <dbl>, product <chr>,
#   ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
#   gene_biotype <chr>, phenotype_description <chr>,
#   hsapiens_homolog_associated_gene_name <chr>, l2exp <dbl>, and abbreviated
#   variable names ¹expression, ²organism, ³infection

Take a look at the distribution of the expression levels:

The data should be transformed. Transforming on the \(\log_{10}\) scale is easy in ggplot2,

but we often use a \(log_2\) scale when looking at expression data:

Building plots iteratively

We will now draw a scatter plot with two continuous variables and the geom_point() function. This graph will represent the \(\log_2\) fold changes of expression comparing time 8 versus time 0, and time 4 versus time 0. To this end, we first need to compute the means of the log-transformed expression values by gene and time, then the log fold changes by subtracting the mean log expressions between time 8 and time 0 and between time 4 and time 0.

We could also add color, but it begins to get a little messy with the large number of categories we have.

Over-plotting can be an issue with large datasets like this. One solution is using hexbin:geom_hex().

Box plots

We can use boxplots to visualize the distribution of gene expressions within each sample:

Adding points to box plots can give us a better understanding of the underlying distributions.

We could also use geom_violin to better see the distribution of points. Try coloring the points by time as well.

Line plots

Let’s calculate the mean expression per duration of the infection for the 10 genes having the highest log fold changes comparing time 8 versus time 0. We can do this either by coloring each gene or using facet_wrap.

Now we would like to split the line in each plot by the sex of the mice.

Let’s do something similar and create a plot that depicts how the average expression of each chromosome changes through the duration of infection.

The facet_wrap geometry extracts plots into an arbitrary number of dimensions to allow them to cleanly fit on one page. On the other hand, the facet_grid geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column).

Let’s modify the previous plot to compare how the mean gene expression of males and females has changed through time: