Acknowledgements

Material created by Peter Mac Data Science.

Objectives

  • Demonstrate how to create stripcharts with ggplot2’s geom_jitter()
  • Demonstrate how faceting can be used to easily create multiple plots with facet_wrap()
  • Demonstrate how tidyr’s gather() can be used to convert from wide to long (tidy) format
  • Demonstrate dplyr’s select() to choose columns and filter() to choose rows
  • Demonstrate use of dplyr’s case_when() instead of multiple ifelse()s
  • Demonstrate the powerful pipe %>%

Introduction

In this tutorial, we will learn some R through creating stripcharts to visualise results from an RNA-seq experiment.

Stripcharts can be used to visualise the expression of genes by groups. They are used in RNA-seq analyses, for example, to check the expression of the top differentially expressed genes (or favourite genes). They enable us to see if the replicate samples within groups have similar expression values for genes and to compare expression values between groups. In RNA-seq we plot the normalised count values.

RNA-seq dataset

We will again use the published RNA-seq data from the Nature Cell Biology paper by Fu et al. 2015. This study examined expression in basal and luminal cells from mice at different stages (virgin, pregnant and lactating).

Here we will work with the normalised counts for all 12 samples.

Packages

We have already seen some tidyverse packages: readr for reading in files, ggplot2 for plotting and dplyr for manipulating tables. Today we will use those packages again, aswell as another really useful tidyverse package called tidyr that can be used to tidy data, such as convert from wide format into long (tidy) format.

Loading the data

First let’s open a new R script. From the top menu in RStudio: File > New File > R Script. Let’s save it as stripcharts.R.

library(tidyverse)

Next we load in our data, the RNA-seq normalised counts. The file we will use is tab-separated, so again we will use the read_tsv() function from the tidyverse readr package to read it in. We will store the contents of our file in an object called norm_counts.

norm_counts <- read_tsv("data/limma-voom_normalised_counts.tsv.gz")

Let’s take a look at the norm_counts data.

norm_counts

How many rows and columns are in norm_counts?

Stripcharts of multiple genes

We can make stripcharts to view the expression of multiple genes. Let’s plot the expression of the top 10 most significantly differentially expressed genes in luminal cells from the pregnant mice versus the luminal cells from the lactating mice. These are the genes with the smallest adjusted P values (adj.P.value). In the volcano plot tutorial we showed how to get the symbols for these top 10 genes. Here we will show how to create an object with the symbols manually. Remember, in R we use c() to combine multiple values. This top10_syms is an R data structure called a vector.

top10_syms <- c("Csn1s2b", "Slc25a1", "Slc34a2", "Atp2b2", "Acacb", "Slc30a2", "Elovl5", "Egf", "Ceacam10", "Pmvk")

Then we extract the normalised counts information for these genes from the norm_counts file.

Filtering rows with filter()

We can use dplyr’s filter() to filter rows. Let’s take a look at how filter() works.

To use filter() we specify the data and the column(s) we want to use to filter with our criteria.

If we wanted to filter for rows (genes) in the MCL1.DG sample that have expression above a threshold (e.g. 5) we would write below.

filter(norm_counts, MCL1.DG > 5)

If we wanted to filter for genes that have an expression value above 5 in both the basal virgin samples (MCL1.DG and MCL1.DH) we specify both columns and the criteria. Note that we need to specify the threshold for each column e.g MCL1.DG > 5 & MCL1.DH > 5 and not MCL1.DG & MCL1.DH > 5.

filter(norm_counts, MCL1.DG > 5 & MCL1.DH > 5)

To filter this dataset for the rows that contain the Csn1s2b gene we would write below. Note that we use a == when we want to test if a value matches exactly.

filter(norm_counts, SYMBOL == "Csn1s2b")

If we wanted to filter for 2 genes we could write that as below. Here we use “|” which means or as we want the row if the SYMBOL column contains Csn1s2b or Slc25a1.

filter(norm_counts, SYMBOL == "Csn1s2b" | SYMBOL == "Slc25a1")

Exercise

  • Try to filter for all rows containing casein in the GENENAME column. Hint: Take a look at the help for str_detect() a function from the tidyverse stringr package.

If we want to filter the rows for our top 10 genes. We use SYMBOL %in% top10_syms which means we want all the rows where the SYMBOL column contains one of the gene symbols in top10_syms. Remember, in R we use %in% to test if a value is in a set of values.

top10_counts <- filter(norm_counts, SYMBOL %in% top10_syms)

Take a look.

top10_counts

Selecting columns with select()

We don’t need the ENTREZID and GENENAME columns, we are only going to use the SYMBOL column and the counts so we can use select() to remove these columns.

filter() is used to choose rows, to choose columns we use select(). Let’s take a look at how select() works.

If we wanted to select the gene symbol column we would write below.

select(top10_counts, SYMBOL)

If we wanted the SYMBOL column and the MCL1.DG sample column we would write below.

select(top10_counts, SYMBOL, MCL1.DG)

We can select column ranges with :.

select(top10_counts, ENTREZID:GENENAME)

There are also useful helper functions you can use inside select(), such as contains(), starts_with() and ends_with().

Exercise

  • Try to select all (and only) the counts columns. Hint: There is more than one way to do it.

We can also use select() to remove columns by specifying a “-”" before the column name(s).

Let’s remove the ENTREZID and GENENAME columns so we only keep the SYMBOL column and the sample expression values.

top10_counts <- select(top10_counts, -ENTREZID, -GENENAME)
top10_counts

The base R way of selecting columns

As mentioned in the tidyverse tutorial here, in base R we can select columns using $ or [], for example, to select the SYMBOL column we could use top10_counts$SYMBOL or top10_counts[, "SYMBOL"]. The $ operator works well for single columns, but for multiple columns it quickly starts to get cumbersome as we need to use the [] operator and c() for combining the required columns. The column names also need quotation marks. For example, to access both the SYMBOL and GENENAME columns we would use top10_counts[, c("SYMBOL", "GENENAME")] whereas with tidyverse’s dplyr it is a lot more intuitive select(top10_counts, SYMBOL, GENENAME).

Converting wide into tidy format with gather()

To more easily plot with ggplot2 we need to change the data into “tidy data”. There are 3 rules of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

In our expression table there should be just one column containing all expression values instead of multiple columns with counts for each sample. We can use tidyr’s gather() to easily change the format into long format.

gather() will reformat specified columns into two new columns, “key” and “value”. The “key” column will contain the specified column names, and the “value” column will contain the specified column values. For our data, our sample ids are the column names we will use and the expression values are the values in these columns. We tell gather what we want the new key and value columns to be called. We will give the key column the name “Sample” and the value column the name “Norm_counts” (as they are our normalised count values). gather() uses the same methods as select() to choose the columns so to say we want to reformat all the sample columns we could write below.

# change to tidy data format (all expression values in one column)
top10_counts <- gather(top10_counts, key=Sample, value=Norm_counts, starts_with("MCL"))

Take a look.

top10_counts

Take a closer look with View()

View(top10_counts)

We can also specify the columns we don’t want gather to reformat if that’s easier and let gather format the rest. For example, the code below would reformat all columns except SYMBOL i.e. all the MCL columns. It would produce the same result as when we specified the columns we wanted with starts_with("MCL").

gather(top10_counts, key=Sample, value=Norm_counts, -SYMBOL)

Exercise

  • Try running gather() on the norm_counts object and save it as an object called testing (i.e. run testing <- gather(norm_counts). What do you think of the output? Can you improve it so that there is a column with sample ids and a column with counts.
  • spread() is the opposite of gather(). Try running spread() on the top10_counts object and see if you can regenerate the table with samples in columns.

Adding column with mutate() and case_when()

We want to plot and compare the expression in the groups, so we use mutate()to add a column called “Group” to say what group each sample belongs to. Here we use dplyr’s case_when() to say if the same id matches are conditions then add the appropriate group name into the Group column. We could use ifelse() as we did in the volcano plot tutorial. However, when there are many conditions to test, as there are here, case_when() is easier to use.

top10_counts <- mutate(top10_counts, Group=case_when(
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       ))

Take a look at the data again with View()

View(top10_counts)

Exercise

  • Try to use mutate() and case_when() to add a column called CellType. This column should contain the value basal if the Group column contains the word basal, or luminal if the Group contains luminal. Save it as an object called testing. Hint: Use str_detect() inside case_when().

Creating stripcharts with geom_jitter()

Now we can make a stripchart. We plot the Group on the X axis and the Norm_counts on the y axis. We will use + geom_jitter() to create a jitter plot. A jitter plot is similar to a scatter plot. Why do we not just use a scatter plot? Let’s take a look.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_point()

Some of the points are overlapping so we use geom_jitter() to add a small amount of random variation to the location of each point so they don’t overlap.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter()

Creating multiple plots with facet_wrap()

The points are no longer overlapping, however, this is all the genes in one plot. ggplot2 has a really useful feature called faceting that we can use. facet_wrap() will create plots for every value in a column in our data. We would like a stripchart of expression values for each gene so we add + facet_wrap(~SYMBOL) to say we want to a plot for each value in the SYMBOL column. There is also facet_grid() which is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL)

We can change the number of rows (nrows) or columns (ncols) to balance the plot. Let’s try 2 rows (nrow=2).

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)

We can add colour=Group to say we want to colour by the groups (the Group column we added).

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, colour=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)

We can add + theme(axis.text.x = element_text(angle = 90) to make the x axis labels vertical so they don’t overlap.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, colour=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 

Ordering categories along an axis

The groups have been plotted in alphabetical order on the x axis, however, we may want to change the order. We may prefer to plot the groups in order of stage, for example, basal virgin, basal pregnant, basal lactate, luminal virgin, luminal pregnant, luminal lactate. In the volcano plot tutorial we showed how to change the order of items in the legend with breaks= into the scale layer. Let’s try that here.

First let’s make an object with the group order that we want.

group_order <- c("basal virgin", "basal pregnant", "basal lactate", "luminal virgin", "luminal pregnant", "luminal lactate")

Then let’s add this group_order into breaks=.

ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, colour=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_colour_discrete(breaks=group_order)

That reordered the legend but notice that it didn’t reorder the groups on the x axis. If we want to reorder groups along an axis with ggplot we need to make the column with the groups into an R data type called a factor. Factors in R are a special data type used to specify categories. The names of the categories are called the factor levels. Factors are the only R data type with levels, other data types, such as character and numeric, do not. For information on R data types see here

We’ll add another column called “Group_f” where we’ll make the Group column into a factor and specify what order we want the levels of the factor.

top10_counts <- mutate(top10_counts, Group_f=factor(Group, levels=group_order))

Take a look with View().

View(top10_counts)

The Group and the Group_f column look the same but take a look by typing top10_counts.

top10_counts

Notice that the Group column has <chr> under the heading, that indicates is a character data type, while the Group_f column has <fct> under the heading, indicating it is a factor data type. The str() command that we saw in first plots tutorial is useful to check the data types in objects.

str(top10_counts)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   120 obs. of  5 variables:
 $ SYMBOL     : chr  "Slc25a1" "Pmvk" "Egf" "Slc30a2" ...
 $ Sample     : chr  "MCL1.DG" "MCL1.DG" "MCL1.DG" "MCL1.DG" ...
 $ Norm_counts: num  4.881 3.618 -0.89 0.951 -0.229 ...
 $ Group      : chr  "basal virgin" "basal virgin" "basal virgin" "basal virgin" ...
 $ Group_f    : Factor w/ 6 levels "basal virgin",..: 1 1 1 1 1 1 1 1 1 1 ...

str() shows us Group_f column is a Factor with 6 levels (categories).

How can we see what these levels are and what order they’re in? Are they in the order that we want?

To see the levels and their order we can use levels(). But we can’t use levels() on our top10_counts object as if we do we get “NULL”.

levels(top10_counts)
NULL

We need to give levels() just the values in the Group_f column.

Maybe we might think we could use select() to select the Group_f column and then check the levels. We’ll save the output in an object called testing so we can test what happens.

# Extract the Group_f column
testing <- select(top10_counts, Group_f)
testing
levels(testing)
NULL

We still get “NULL” with levels(). Instead of extracting the column, we need to extract the values out of column format and give just the values to levels(). We can do that with dplyr’s pull().

# Pull the values out of the Group_f column
testing <- pull(top10_counts, Group_f)
testing
  [1] basal virgin     basal virgin     basal virgin     basal virgin     basal virgin     basal virgin    
  [7] basal virgin     basal virgin     basal virgin     basal virgin     basal virgin     basal virgin    
 [13] basal virgin     basal virgin     basal virgin     basal virgin     basal virgin     basal virgin    
 [19] basal virgin     basal virgin     basal pregnant   basal pregnant   basal pregnant   basal pregnant  
 [25] basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant  
 [31] basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant  
 [37] basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal lactate    basal lactate   
 [43] basal lactate    basal lactate    basal lactate    basal lactate    basal lactate    basal lactate   
 [49] basal lactate    basal lactate    basal lactate    basal lactate    basal lactate    basal lactate   
 [55] basal lactate    basal lactate    basal lactate    basal lactate    basal lactate    basal lactate   
 [61] luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin  
 [67] luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin  
 [73] luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin  
 [79] luminal virgin   luminal virgin   luminal pregnant luminal pregnant luminal pregnant luminal pregnant
 [85] luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant
 [91] luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant
 [97] luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal lactate  luminal lactate 
[103] luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate 
[109] luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate 
[115] luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate 
Levels: basal virgin basal pregnant basal lactate luminal virgin luminal pregnant luminal lactate
# Check the factor levels
levels(testing)
[1] "basal virgin"     "basal pregnant"   "basal lactate"    "luminal virgin"   "luminal pregnant"
[6] "luminal lactate" 

Now we can see what the factor levels are and their order.

However, in this case, to check the factor levels, it might be simpler to use the base R method, as we can use $ to accessing the values in a column. This takes the format object$columnname e.g. top10_counts$Group_f

top10_counts$Group_f
  [1] basal virgin     basal virgin     basal virgin     basal virgin     basal virgin     basal virgin    
  [7] basal virgin     basal virgin     basal virgin     basal virgin     basal virgin     basal virgin    
 [13] basal virgin     basal virgin     basal virgin     basal virgin     basal virgin     basal virgin    
 [19] basal virgin     basal virgin     basal pregnant   basal pregnant   basal pregnant   basal pregnant  
 [25] basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant  
 [31] basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal pregnant  
 [37] basal pregnant   basal pregnant   basal pregnant   basal pregnant   basal lactate    basal lactate   
 [43] basal lactate    basal lactate    basal lactate    basal lactate    basal lactate    basal lactate   
 [49] basal lactate    basal lactate    basal lactate    basal lactate    basal lactate    basal lactate   
 [55] basal lactate    basal lactate    basal lactate    basal lactate    basal lactate    basal lactate   
 [61] luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin  
 [67] luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin  
 [73] luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin   luminal virgin  
 [79] luminal virgin   luminal virgin   luminal pregnant luminal pregnant luminal pregnant luminal pregnant
 [85] luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant
 [91] luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal pregnant
 [97] luminal pregnant luminal pregnant luminal pregnant luminal pregnant luminal lactate  luminal lactate 
[103] luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate 
[109] luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate 
[115] luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate  luminal lactate 
Levels: basal virgin basal pregnant basal lactate luminal virgin luminal pregnant luminal lactate

So levels(top10_counts$Group_f) will access the values in the Group_f column directly, giving us the same output as we would get with levels(pull(top10_counts, Group_f)), but it is easier to read.

levels(top10_counts$Group_f)
[1] "basal virgin"     "basal pregnant"   "basal lactate"    "luminal virgin"   "luminal pregnant"
[6] "luminal lactate" 

The levels are in the order that we want, so we can now change our plot to use the “Group_f” column instead of Group column (change x= and colour=).

ggplot(data=top10_counts, mapping=aes(x=Group_f, y=Norm_counts, colour=Group_f)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 

Note that both the legend and the x axis now have the groups in the order that we want.

These are the top genes in the comparison of luminal cells from pregnant vs lactating mice and this type of plot enables us to see what the expression values look like in all the groups. We can see that some genes, such as Pmvk, have more similar expression across all the groups than others, such as Csn1s2b.

Notice that the genes have also been plotted in alphabetical order in the facets. If we wanted to plot these genes in the order of most signficant, then we need to make symbol column into a factor as we did for the groups.

Exercise

  • Plot the genes in order of the significance. Hint: Use mutate() to add a column called SYMBOL_f containing SYMBOL as a factor with the levels in the order in top10_syms. Then remake the plot using the new SYMBOL_f column in facet_wrap() instead of SYMBOL.

Creating workflows using %>%

One of the most useful and powerful things about the tidyverse is dplyr’s pipe operator %>%. This enables you to chain commmands together into workflows. For example, we could chain together the commands we ran earlier on our top10_counts object. Note that like the + we use to add layers to a ggplot, %>% goes at the end of the line and this ‘pipes’ the output into the next command. If we use the %>% we also don’t need to specify the data inside the individual commands, we only need to specify it at the beginning e.g. instead of filter(norm_counts, SYMBOL %in% top10_syms) we use norm_counts %>% filter(SYMBOL %in% top10_syms). We can also pipe outputs directly into ggplot(). The pipe is very useful but some advice on when not to use the pipe is provided by the tidyverse creator Hadley Wickham in the excellent R for Data Science book.

norm_counts %>%
  filter(SYMBOL %in% top10_syms) %>%                             # filter rows for the top 10 genes
  select(-ENTREZID, -GENENAME) %>%                               # remove ENTREZID and GENENAME columns
  gather(key=Sample, value=Norm_counts, starts_with("MCL"))  %>%                    # convert from wide to long (tidy) format
  mutate(Group=case_when(                                        # add a column to specify groups
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       )) %>%
  mutate(Group_f=factor(Group, levels=group_order)) %>%           # convert Group column to factor data type to specify ordering
  ggplot(mapping=aes(x=Group_f, y=Norm_counts, colour=Group_f)) + # make stripcharts faceted by gene
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 

We only have two points per group here but if there were more points you could overlay an error bar with geom_errorbar() or combine a geom_boxplot() with the geom_jitter(), as shown for the stripcharts in the tutorial here.

Exercise

  • Starting from norm_counts make a workflow using %>% that creates stripcharts for the genes Trp53, Brca1, Brca2.

Key Points

  • We can choose rows with dplyr’s filter()
  • We can choose columns with dplyr’s select()
  • We can test for multiple conditions with dplyr’s case_when()
  • We can convert from wide to long (tidy) format with tidyr’s gather()
  • We can use geom_jitter() to make stripcharts
  • We can make multiple plots based on values in a column with facet_wrap()
  • We can create workflows with dplyr’s pipe %>%
---
title: "Introduction to R"
author: "Maria Doyle"
date: "`r format(Sys.time(), '%d %B %Y')`"
output: 
  html_notebook:
    toc: yes
    toc_float: yes
    toc_depth: 4
subtitle: visualising RNA-seq data with stripcharts
---

#### Acknowledgements
Material created by Peter Mac Data Science.

## Objectives
  
* Demonstrate how to create stripcharts with ggplot2's `geom_jitter()`
* Demonstrate how faceting can be used to easily create multiple plots with `facet_wrap()` 
* Demonstrate how tidyr's `gather()` can be used to convert from wide to long (tidy) format
* Demonstrate dplyr's `select()` to choose columns and `filter()` to choose rows
* Demonstrate use of dplyr's `case_when()` instead of multiple `ifelse()`s
* Demonstrate the powerful pipe `%>%`


## Introduction

In this tutorial, we will learn some R through creating stripcharts to visualise results from an RNA-seq experiment.

Stripcharts can be used to visualise the expression of genes by groups. They are used in RNA-seq analyses, for example, to check the expression of the top differentially expressed genes (or favourite genes). They enable us to see if the replicate samples within groups have similar expression values for genes and to compare expression values between groups. In RNA-seq we plot the normalised count values.

### RNA-seq dataset

We will again use the published RNA-seq data from the Nature Cell Biology paper by [Fu et al. 2015](https://www.ncbi.nlm.nih.gov/pubmed/25730472). This study examined expression in basal and luminal cells from mice at different stages (virgin, pregnant and lactating).

Here we will work with the normalised counts for all 12 samples.
![](images/mouse_exp.png)

### Packages

We have already seen some **tidyverse** packages: **readr** for reading in files, **ggplot2** for plotting and **dplyr** for manipulating tables. Today we will use those packages again, aswell as another really useful tidyverse package called **tidyr** that can be used to tidy data, such as convert from wide format into long (tidy) format.


## Loading the data

First let's open a new R script. From the top menu in RStudio: `File > New File > R Script`.
Let's save it as `stripcharts.R`.

```{r, message=FALSE}
library(tidyverse)
```

Next we load in our data, the RNA-seq normalised counts. The file we will use is tab-separated, so again we will use the `read_tsv()` function from the tidyverse readr package to read it in. We will store the contents of our file in an object called `norm_counts`.

```{r results = 'hide'}
norm_counts <- read_tsv("data/limma-voom_normalised_counts.tsv.gz")
```

Let's take a look at the `norm_counts` data.

```{r}
norm_counts
```

How many rows and columns are in `norm_counts`?

## Stripcharts of multiple genes

We can make stripcharts to view the expression of multiple genes. Let's plot the expression of the top 10 most significantly differentially expressed genes in luminal cells from the pregnant mice versus the luminal cells from the lactating mice. These are the genes with the smallest adjusted P values (adj.P.value). In the volcano plot tutorial we showed how to get the symbols for these top 10 genes. Here we will show how to create an object with the symbols manually. Remember, in R we use `c()` to combine multiple values. This `top10_syms` is an R data structure called a **vector**.

```{r}
top10_syms <- c("Csn1s2b", "Slc25a1", "Slc34a2", "Atp2b2", "Acacb", "Slc30a2", "Elovl5", "Egf", "Ceacam10", "Pmvk")
```

Then we extract the normalised counts information for these genes from the `norm_counts` file.

## Filtering rows with `filter()`

We can use dplyr's `filter()` to filter rows. Let's take a look at how `filter()` works. 

To use `filter()` we specify the data and the column(s) we want to use to filter with our criteria. 

If we wanted to filter for rows (genes) in the MCL1.DG sample that have expression above a threshold (e.g. 5) we would write below.
```{r}
filter(norm_counts, MCL1.DG > 5)
```

If we wanted to filter for genes that have an expression value above 5 in *both* the basal virgin samples (MCL1.DG and MCL1.DH) we specify both columns and the criteria. Note that we need to specify the threshold for *each* column e.g `MCL1.DG > 5 & MCL1.DH > 5` and not `MCL1.DG & MCL1.DH > 5`.
```{r}
filter(norm_counts, MCL1.DG > 5 & MCL1.DH > 5)
```


To filter this dataset for the rows that contain the Csn1s2b gene we would write below. Note that we use a `==` when we want to test if a value matches exactly.
```{r}
filter(norm_counts, SYMBOL == "Csn1s2b")
```
If we wanted to filter for 2 genes we could write that as below. Here we use "|" which means or as we want the row if the SYMBOL column contains Csn1s2b or Slc25a1.

```{r}
filter(norm_counts, SYMBOL == "Csn1s2b" | SYMBOL == "Slc25a1")
```

#### Exercise

* Try to filter for all rows containing casein in the GENENAME column. Hint: Take a look at the help for `str_detect()` a function from the tidyverse **stringr** package.

If we want to filter the rows for our top 10 genes. We use `SYMBOL %in% top10_syms` which means we want all the rows where the SYMBOL column contains one of the gene symbols in `top10_syms`. Remember, in R we use `%in%` to test if a value is in a set of values.

```{r}
top10_counts <- filter(norm_counts, SYMBOL %in% top10_syms)
```

Take a look.

```{r}
top10_counts
```


## Selecting columns with `select()`

We don't need the ENTREZID and GENENAME columns, we are only going to use the SYMBOL column and the counts so we can use `select()` to remove these columns.

`filter()` is used to choose rows, to choose columns we use `select()`. Let's take a look at how `select()` works.

If we wanted to select the gene symbol column we would write below.

```{r}
select(top10_counts, SYMBOL)
```

If we wanted the SYMBOL column and the MCL1.DG sample column we would write below.

```{r}
select(top10_counts, SYMBOL, MCL1.DG)
```

We can select column ranges with `:`.

```{r}
select(top10_counts, ENTREZID:GENENAME)
```

There are also useful helper functions you can use inside `select()`, such as `contains()`, `starts_with()` and `ends_with()`.

#### Exercise

* Try to select all (and only) the counts columns. Hint: There is more than one way to do it.

We can also use `select()` to remove columns by specifying a "-"" before the column name(s).

Let's remove the ENTREZID and GENENAME columns so we only keep the SYMBOL column and the sample expression values.

```{r}
top10_counts <- select(top10_counts, -ENTREZID, -GENENAME)
top10_counts
```

> The base R way of selecting columns 
> 
> As mentioned in the tidyverse tutorial 
> [here](https://rawgit.com/bioinformatics-core-shared-training/r-intermediate/master/2.dplyr-intro-live-coding-script.html), 
> in base R we can select columns using `$` or `[]`, for example, to select the SYMBOL column we could use 
> `top10_counts$SYMBOL` or `top10_counts[, "SYMBOL"]`. The `$` operator works well for single columns, but for multiple columns
> it quickly starts to get cumbersome as we need to use the `[]` operator and `c()` for combining the required columns. The column 
> names also need quotation marks. For example, to access both the SYMBOL and GENENAME 
> columns we would use `top10_counts[, c("SYMBOL", "GENENAME")]` whereas with tidyverse's dplyr it is a lot more intuitive 
> `select(top10_counts, SYMBOL, GENENAME)`.


## Converting wide into tidy format with `gather()`

To more easily plot with ggplot2 we need to change the data into ["tidy data"](https://en.wikipedia.org/wiki/Tidy_data). There are 3 rules of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell. 
  
In our expression table there should be just *one column containing all expression values* instead of multiple columns with counts for each sample. We can use tidyr's `gather()` to easily change the format into long format.

`gather()` will reformat specified columns into two new columns, "key" and "value". The "key" column will contain the *specified column names*, and the "value" column will contain the *specified column values*. For our data, our sample ids are the column names we will use and the expression values are the values in these columns. We tell gather what we want the new key and value columns to be called. We will give the key column the name "Sample" and the value column the name "Norm_counts" (as they are our normalised count values). `gather()` uses the same methods as `select()` to choose the columns so to say we want to reformat all the sample columns we could write below.

```{r}
# change to tidy data format (all expression values in one column)
top10_counts <- gather(top10_counts, key=Sample, value=Norm_counts, starts_with("MCL"))
```

Take a look.
```{r}
top10_counts
```

Take a closer look with `View()`
```{r eval=FALSE}
View(top10_counts)
```

We can also specify the columns we *don't* want gather to reformat if that's easier and let gather format the rest. For example, the code below would reformat all columns except SYMBOL i.e. all the MCL columns. It would produce the same result as when we specified the columns we wanted with `starts_with("MCL")`.

```{r eval=FALSE}
gather(top10_counts, key=Sample, value=Norm_counts, -SYMBOL)
```

#### Exercise

 * Try running `gather()` on the `norm_counts` object and save it as an object called `testing` (i.e. run `testing <- gather(norm_counts)`. What do you think of the output? Can you improve it so that there is a column with sample ids and a column with counts.
 * `spread()` is the opposite of `gather()`. Try running `spread()` on the `top10_counts` object and see if you can regenerate the table with samples in columns.

## Adding column with `mutate()` and `case_when()`

We want to plot and compare the expression in the groups, so we use `mutate()`to add a column called "Group" to say what group each sample belongs to.  Here we use dplyr's `case_when()` to say if the same id matches are conditions then add the appropriate group name into the Group column. We could use `ifelse()` as we did in the volcano plot tutorial. However, when there are many conditions to test, as there are here, `case_when()` is easier to use.

```{r}
top10_counts <- mutate(top10_counts, Group=case_when(
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       ))
```

Take a look at the data again with `View()`
```{r eval=FALSE}
View(top10_counts)
```


#### Exercise

* Try to use `mutate()` and `case_when()` to add a column called CellType. This column should contain the value basal if the Group column contains the word basal, or luminal if the Group contains luminal. Save it as an object called `testing`. Hint: Use `str_detect()` inside `case_when()`.


## Creating stripcharts with `geom_jitter()`

Now we can make a stripchart. We plot the Group on the X axis and the Norm_counts on the y axis. We will use `+ geom_jitter()` to create a jitter plot. A jitter plot is similar to a scatter plot. Why do we not just use a scatter plot? Let's take a look. 

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_point()
```

Some of the points are overlapping so we use `geom_jitter()` to add a small amount of random variation to the location of each point so they don't overlap.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter()
```

## Creating multiple plots with `facet_wrap()`

The points are no longer overlapping, however, this is all the genes in one plot. ggplot2 has a really useful feature called faceting that we can use. `facet_wrap()` will create plots for every value in a column in our data. We would like a stripchart of expression values for each gene so we add ` + facet_wrap(~SYMBOL)` to say we want to a plot for each value in the SYMBOL column. There is also `facet_grid()` which is most useful when you have two discrete variables, and all combinations of the variables exist in the data.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL)
```

We can change the number of rows (`nrows`) or columns (`ncols`) to balance the plot. Let's try 2 rows (`nrow=2`).

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)
```


We can add `colour=Group` to say we want to colour by the groups (the Group column we added).

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, colour=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2)
```

We can add `+ theme(axis.text.x = element_text(angle = 90)` to make the x axis labels vertical so they don't overlap.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, colour=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 
```

## Ordering categories along an axis

The groups have been plotted in alphabetical order on the x axis, however, we may want to change the order. We may prefer to plot the groups in order of stage, for example, basal virgin, basal pregnant, basal lactate, luminal virgin, luminal pregnant, luminal lactate. In the volcano plot tutorial we showed how to change the order of items in the legend with `breaks=` into the scale layer. Let's try that here.

First let's make an object with the group order that we want.
```{r}
group_order <- c("basal virgin", "basal pregnant", "basal lactate", "luminal virgin", "luminal pregnant", "luminal lactate")
```

Then let's add this `group_order` into `breaks=`.

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group, y=Norm_counts, colour=Group)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_colour_discrete(breaks=group_order)
```

That reordered the legend but notice that it didn't reorder the groups on the x axis. If we want to reorder groups along an axis with ggplot we need to make the column with the groups into an R data type called a **factor**. Factors in R are a special data type used to specify categories. The names of the categories are called the factor **levels**. Factors are the only R data type with levels, other data types, such as character and numeric, do not. For information on R data types see [here](https://lucyleeow.github.io/BaseR_Intro/session-1.html#data-types)

We'll add another column called "Group_f" where we'll make the Group column into a factor and specify what order we want the levels of the factor.

```{r}
top10_counts <- mutate(top10_counts, Group_f=factor(Group, levels=group_order))
```

Take a look with `View()`.
```{r eval=FALSE}
View(top10_counts)
```

The Group and the Group_f column look the same but take a look by typing `top10_counts`.

```{r}
top10_counts
```

Notice that the Group column has `<chr>` under the heading, that indicates is a character data type, while the Group_f column has `<fct>` under the heading, indicating it is a factor data type. The `str()` command that we saw in first plots tutorial is useful to check the data types in objects.

```{r}
str(top10_counts)
```

`str()` shows us Group_f column is a Factor with 6 levels (categories). 

How can we see what these levels are and what order they're in? Are they in the order that we want? 

To see the levels and their order we can use `levels()`. But we can't use `levels()` on our `top10_counts` object as if we do we get "NULL".
```{r}
levels(top10_counts)
```

We need to give `levels()` just the values in the Group_f column.

Maybe we might think we could use `select()` to select the Group_f column and then check the levels. We'll save the output in an object called `testing` so we can test what happens.

```{r}
# Extract the Group_f column
testing <- select(top10_counts, Group_f)
testing
```


```{r}
levels(testing)
```

We still get "NULL" with `levels()`. Instead of extracting the column, we need to extract the values out of column format and give just the values to `levels()`. We can do that with dplyr's `pull()`.

```{r}
# Pull the values out of the Group_f column
testing <- pull(top10_counts, Group_f)
testing
```

```{r}
# Check the factor levels
levels(testing)
```

Now we can see what the factor levels are and their order.

However, in this case, to check the factor levels, it might be simpler to use the base R method, as we can use `$` to accessing the values in a column. This takes the format `object$columnname` e.g. `top10_counts$Group_f`

```{r}
top10_counts$Group_f
```

So `levels(top10_counts$Group_f)` will access the values in the Group_f column directly, giving us the same output as we would get with `levels(pull(top10_counts, Group_f))`, but it is easier to read.

```{r}
levels(top10_counts$Group_f)
```

The levels are in the order that we want, so we can now change our plot to use the "Group_f" column instead of Group column (change `x=` and `colour=`).

```{r}
ggplot(data=top10_counts, mapping=aes(x=Group_f, y=Norm_counts, colour=Group_f)) +
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 
```

Note that both the legend and the x axis now have the groups in the order that we want.

These are the top genes in the comparison of luminal cells from pregnant vs lactating mice and this type of plot enables us to see what the expression values look like in all the groups. We can see that some genes, such as Pmvk, have more similar expression across all the groups than others, such as Csn1s2b.

Notice that the genes have also been plotted in alphabetical order in the facets. If we wanted to plot these genes in the order of most signficant, then we need to make symbol column into a factor as we did for the groups.


#### Exercise

* Plot the genes in order of the significance. Hint: Use `mutate()` to add a column called SYMBOL_f containing SYMBOL as a factor with the levels in the order in `top10_syms`. Then remake the plot using the new SYMBOL_f column in `facet_wrap()` instead of SYMBOL.


## Creating workflows using `%>%`

One of the most useful and powerful things about the tidyverse is dplyr's pipe operator `%>%`. This enables you to chain commmands together into workflows. For example, we could chain together the commands we ran earlier on our `top10_counts` object. Note that like the `+` we use to add layers to a ggplot, `%>%` goes at the end of the line and this 'pipes' the output into the next command. If we use the `%>%` we also don't need to specify the data inside the individual commands, we only need to specify it at the beginning e.g. instead of `filter(norm_counts, SYMBOL %in% top10_syms)` we use `norm_counts %>% filter(SYMBOL %in% top10_syms)`. We can also pipe outputs directly into `ggplot()`. The pipe is very useful but some advice on when not to use the pipe is provided by the tidyverse creator Hadley Wickham in the excellent [R for Data Science book](https://r4ds.had.co.nz/pipes.html#when-not-to-use-the-pipe).

```{r}
norm_counts %>%
  filter(SYMBOL %in% top10_syms) %>%                             # filter rows for the top 10 genes
  select(-ENTREZID, -GENENAME) %>%                               # remove ENTREZID and GENENAME columns
  gather(key=Sample, value=Norm_counts, starts_with("MCL"))  %>%                    # convert from wide to long (tidy) format
  mutate(Group=case_when(                                        # add a column to specify groups
        Sample %in% c("MCL1.DG", "MCL1.DH")  ~ "basal virgin",
        Sample %in% c("MCL1.DI", "MCL1.DJ")  ~ "basal pregnant",
        Sample %in% c("MCL1.DK", "MCL1.DL")  ~ "basal lactate",
        Sample %in% c("MCL1.LA", "MCL1.LB")  ~ "luminal virgin",
        Sample %in% c("MCL1.LC", "MCL1.LD")  ~ "luminal pregnant",
        Sample %in% c("MCL1.LE", "MCL1.LF")  ~ "luminal lactate"
       )) %>%
  mutate(Group_f=factor(Group, levels=group_order)) %>%           # convert Group column to factor data type to specify ordering
  ggplot(mapping=aes(x=Group_f, y=Norm_counts, colour=Group_f)) + # make stripcharts faceted by gene
  geom_jitter() +
  facet_wrap(~SYMBOL, nrow=2) +
  theme(axis.text.x = element_text(angle = 90)) 
```
We only have two points per group here but if there were more points you could overlay an error bar with `geom_errorbar()` or combine a `geom_boxplot()` with the `geom_jitter()`, as shown for the stripcharts in the tutorial [here](https://www.bioinformatics.babraham.ac.uk/training/ggplot_course/Introduction%20to%20ggplot.pdf).

#### Exercise

* Starting from `norm_counts` make a workflow using `%>%` that creates stripcharts for the genes Trp53, Brca1, Brca2.


## Key Points
- We can choose rows with dplyr's `filter()`
- We can choose columns with dplyr's `select()`
- We can test for multiple conditions with dplyr's `case_when()`
- We can convert from wide to long (tidy) format with tidyr's `gather()`
- We can use `geom_jitter()` to make stripcharts
- We can make multiple plots based on values in a column with `facet_wrap()`
- We can create workflows with dplyr's pipe `%>%`
