1 Introduction

We’ll perform single cell RNA-seq analysis on mice dorsal striatum tissue. These data include a subset of cells studied on ‘Molecular Architecture of the Mouse Nervous System’ (Zeisel et al. 2018). The data have been taken from this repository stored in PanglaoDB, a web repository of scRNA-seq data (Franzén, Gan, and Björkegren 2019). These data were obtained with 10x Chromium protocol sequenced by Illumina HiSeq 2500.

The main workflow is also heavily inspired by Roman Hillje github repo scRNA-seq analysis workflow (Hillje 2020).

1.1 Dependencies

First of all we load all necessary dependencies plus some useful plotting functions stored in another file. We’ll mainly use the Seurat package (Stuart et al. 2019), which focuses on single cell QC and explorative analysis. More or less we’ll follow the standard Seurat pipeline (see Seurat tutorial) with some additions.

library(SingleCellExperiment)
library(Seurat)
library(scDblFinder)
library(Matrix)
### 
library(cluster)
library(tidyverse)
library(here)
library(patchwork)
library(ggExtra)
library(ggrepel)
library(gghalves)
library(patchwork)
library(viridis)
library(cowplot)


# my own useful functions - see funs.r
source("R/funs.R")

1.2 Data loading

To load the already prepared count table, we downloaded the Rdata file containg the sparse matrix from PanglaoDB. Then, we load it as a standard R object. In this matrix rows are gene, columns are cells and counts are not normalized. To only keep gene symbol we’ll substitute everything after “ENS”. Furthermore, we also filter this matrix with the same filters shown in the Panglao repo (counted reads > 1000 and genes expression over all cells > 0)

load(here("data", "SRA667466_SRS3059955.sparse.RData"))

rownames(sm) <- gsub("_ENS.*", "", rownames(sm))

sm_red <- sm[, colSums(sm) > 1000]
sm_red <- sm_red[rowSums(sm_red) != 0, ]

There are 3 duplicates genes, but Seurat will take care of that. The full matrix has these dimensions: 26190, 11591 (genes, cells). The panglao reduced one has 25618, 3444 (as in the dataset description).

After that, we use the raw counts to initialize a Seurat object. This type of object has useful slots and methods that will be useful in further analysis. Everytime a Seurat object is initialized a metadata slot is also created. It can be accessed through Seurat[[]] or Seurat@meta.data. This is basically a dataframe containing useful information about the cells in our dataset. We’ll append two columns to this df which are percentage of mitochondrial reads for each cell (percent.mt) and novelty (computed as number of genes / number of reads). This measures range from 0 to 1 and is related to how deep we sequenced. Low values means that every new read will probably will be assigned to an already discovered gene whereas high values would mean that for each new reads a new gene will be discovered. This can be useful in detecting low complexity cells (few expressed genes that are easily saturated). We do the same fore the panglao reduced data. To ease our laptop life we also remove the raw matrix from the environment

sm_seurat <- CreateSeuratObject(counts = sm, project = "sc_proj")
sm_seurat[["percent.mt"]] <- PercentageFeatureSet(sm_seurat, pattern = "^mt-")
sm_seurat[["novelty"]] <- sm_seurat[[]]$nFeature_RNA/sm_seurat[[]]$nCount_RNA

panglao <- CreateSeuratObject(counts = sm_red, project = "panglao")
panglao[["percent.mt"]] <- PercentageFeatureSet(panglao, pattern = "^mt-")
# there are only 17 annotated genes but there are 37 genes in the mt
panglao[["novelty"]] <- panglao[[]]$nFeature_RNA/panglao[[]]$nCount_RNA

rm(sm, sm_red)

2 Data pre-processing

To preprocess single cell RNA-seq data on Seurat we’ll follow the standard pipeline, but we’ll also try to analyze the doublets influence on our dataset. The main steps will be:

  1. Doublet analysis
  2. QC analysis
  3. Counts normalization
  4. Detection of highly variable genes
  5. Gene scaling

2.1 Doublet analysis

In order to remove doublets from our dataset we follow the same procedure as in (Hillje 2020) using the ScDblFinder package (Germain 2020). This package finds doublets in single cell data by creating artificial doublets and looking at their prevalence in the neighborhood of each cell.

This function is inspired by (Hillje 2020).

dbl <- invisible(colData(scDblFinder(as.SingleCellExperiment(sm_seurat), verbose = FALSE)))
sm_seurat$multiplet_class <- dbl$scDblFinder.class
sm_seurat$doublet_score <- dbl$scDblFinder.score

rm(dbl)

doublet_analysis(sm_seurat@meta.data, col = c("#ffc313", "#c4e538"))
Doublet analysis regarding the main QC covariates

Doublet analysis regarding the main QC covariates

We can see that the proportion of inferred doublets in our dataset is 10.84% of the cells. Also the different distribution of qc covariates is evident.

We then remove the doublet cells.

sm_seurat <- subset(sm_seurat, subset = multiplet_class == "singlet")

2.2 QC analysis

QC analysis will be performed on 4 main covariates:

  1. Number of counts per cell
  2. Number of genes per cell
  3. Mitochondrial fraction of genes
  4. Novelty

The filtering is made by playing with thresholds of these covariates. Cells with few genes and high mito percentage may be dead cells with cytoplasmic mRNA leaked out. Whereas cells with exceptionally high number of reads may be doublets. Low novelty may represent low complexity cells as blood red cells. Nevertheless, we must be careful with these threshold as we could exclude informative cells population such as cells involved in respiratory processes (high mito) or quiescient populations (low reads) (Luecken and Theis 2019).

We can examine the distribution of these covariates with few different visualizations.

cont <- list(scale_color_viridis(discrete = FALSE), guides(color = guide_colourbar(barwidth = 0.2, 
    barheight = 10, title.position = "left", title.theme = element_text(angle = 90, 
        hjust = 0.5))))

scatter_full1 <- scatter_plot(sm_seurat@meta.data, "nCount_RNA", "nFeature_RNA", 
    "percent.mt") + cont
scatter_full2 <- scatter_plot(sm_seurat@meta.data, "nCount_RNA", "percent.mt", "nFeature_RNA") + 
    cont

scatter_full1 | scatter_full2
Scatterplot of QC covariates of full dataset, red bars represent 10th percentile

Scatterplot of QC covariates of full dataset, red bars represent 10th percentile

We can observe that, as expected, the number of genes and the number of transcritpts are positevely correlated. However, there seems to be two main trajectories. This could be due to Neurons (our majority class presumably) vs all other cells. Also, we can see a small population of cells with low novelty values (lots of reads and few genes), this may be due to low complexity cells.

scatter_panglao1 <- scatter_plot(panglao@meta.data, "nCount_RNA", "nFeature_RNA", 
    "percent.mt") + cont

scatter_panglao2 <- scatter_plot(panglao@meta.data, "nCount_RNA", "percent.mt", "nFeature_RNA") + 
    cont
scatter_panglao1 | scatter_panglao2
Scatterplot of QC covariates of panglao data, red bars represent 10th percentile

Scatterplot of QC covariates of panglao data, red bars represent 10th percentile

The Panglao data show more or less the same trend, however many cells (nCount<1000) were filtered out.

We can also visualize QC covariates distribution with violin plots

t <- list(theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank()), 
    scale_color_manual(values = "#FDA7DF"), scale_fill_manual(values = "#B53471"))

p1 <- vln_plot(sm_seurat@meta.data, "orig.ident", "nCount_RNA", log = T, jitter = T) + 
    t

p2 <- vln_plot(sm_seurat@meta.data, "orig.ident", "nFeature_RNA", log = T, jitter = T) + 
    t

patch1 <- p1 | p2
patch1 + plot_annotation(tag_levels = "A")
Violinplot of number of reads and genes distribution in whole df (A) and panglao df (B)

Violinplot of number of reads and genes distribution in whole df (A) and panglao df (B)

We can also visualize their distribution with a parallel coordinates plot. This kind of plot is useful to visualize how may covariates change with respect to one another. Given that ther’s no grouping in the data, colors are given based on the quartiles of the first covariates in order to bettere visualize their relationships. Furthermore, this function is not performed on all the cells given that it’s reallly slow. Maybe using base r graphics or other optimized plotting libraries might help. Features are scaled from 0 to 100 in order to have the same range.

p3 <- par_coor(head(sm_seurat@meta.data, 1000))

p4 <- par_coor(head(panglao@meta.data, 1000))

patch2 <- p3 | p4
patch2 + plot_annotation(tag_levels = "A")
Parallell coordinates plot of whole df (A) and Panglao df (B) covariates, only 1000 cells taken because this function is quite slow

Parallell coordinates plot of whole df (A) and Panglao df (B) covariates, only 1000 cells taken because this function is quite slow

We can more or less draw the same conclusions as before.

In order to decide the filters we wrote a high level function that analyse the dataset tryin to capture all its possible interesting features. (Heavily inspired by (Luecken and Theis 2019))

prelim_analysis(sm_seurat)
Preliminary analysis of whole dataset, red bars represent 10th quantile

Preliminary analysis of whole dataset, red bars represent 10th quantile

2.3 Final datasets

After visualizing this variables we decided to continue the analysis on three different datasets.

  1. Loosely filtered
  2. Strictly filtered
  3. Panglao criteria
loose <- subset(sm_seurat, subset = nFeature_RNA < 4000 & nFeature_RNA > 50 & nCount_RNA > 
    70 & nCount_RNA < 10000 & percent.mt < 18 & novelty > 0.3)

strict <- subset(sm_seurat, subset = nFeature_RNA < 3000 & nFeature_RNA > 100 & nCount_RNA > 
    1000 & nCount_RNA < 8000 & percent.mt < 12 & novelty > 0.4)

rm(sm_seurat)

So we obtain three datasets with these abundance of cells:

  1. Loose: 7441 cells
  2. Strict: 2482 cells
  3. Panglao: 3444 cells

2.4 View

Loose

prelim_analysis(loose, depth_threshold = c(70, 10000), genes_threshold = c(50, 4000), 
    mit_threshold = 18, novelty_threshold = 0.3)
Preliminary analysis of whole dataset, red bars represent 10th quantile

Preliminary analysis of whole dataset, red bars represent 10th quantile

Loose filters are:

  • number of genes between 50 and 4000
  • number of reads between 70 and 10000
  • mito percentage < 18%
  • novelty > 0.3

Strict

prelim_analysis(strict, depth_threshold = c(1000, 8000), genes_threshold = c(100, 
    3000), mit_threshold = 12, novelty_threshold = 0.4)
Preliminary analysis of whole dataset, red bars represent 10th quantile

Preliminary analysis of whole dataset, red bars represent 10th quantile

Strict filters are:

  • number of genes between 100 and 3000
  • number of reads between 1000 and 8000
  • mito percentage < 12%
  • novelty > 0.3

Panglao

prelim_analysis(panglao, depth_threshold = 1000)
Preliminary analysis of whole dataset, red bars represent 10th quantile

Preliminary analysis of whole dataset, red bars represent 10th quantile

Here we used Panglao criteria:

  • number of reads > 1000

2.5 Counts normalization

To normalize counts we’ll follow the standard Seurat procedure using NormalizeData() function. By default this function normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default). This normalization method is basically ‘counts per million’ normalization as done in bulk exoression analyisi. After this the counts are log-transformed (log(x+1)). This is useful because it’s the canonical way in which cheanges in epression are measured, reduce the skewness of the data and mitigate the mean variance relationship (Luecken and Theis 2019).

loose <- NormalizeData(loose)
strict <- NormalizeData(strict)
panglao <- NormalizeData(panglao)

Normalized data are in seurat[['RNA']]@data whereas raw counts are in sm_seurat[['RNA']]@counts

2.6 Detection of highly variable genes

We then perform feature selection, this means trying to decide which are the most informative genes. We’ll use the Seurat function FindVariableFeatures() with default options, only increasing the final number of genes from 2000 to 3000. This function models the mean and variance relationship of each gene.

loose <- FindVariableFeatures(loose, selection.method = "vst", nfeatures = 3000)
strict <- FindVariableFeatures(strict, selection.method = "vst", nfeatures = 3000)
panglao <- FindVariableFeatures(panglao, selection.method = "vst", nfeatures = 3000)

2.7 Results

Loose

varfeat(loose) + labs(subtitle = "Loose Variable Features")
Variable features of Loose datasets

Variable features of Loose datasets

Strict

varfeat(strict) + labs(subtitle = "Strict Variable Features")
Variable features of Strict datasets

Variable features of Strict datasets

Panglao

varfeat(panglao) + labs(subtitle = "Panglao Variable Features")
Variable features of Panglao datasets

Variable features of Panglao datasets

2.8 Gene scaling

Finally, we scale the counts so each gene has mean across cells is 0 and variance across cells is 1. This is a debated step as dimensionality reduction methods like PCA often requires it so to not being excessively driven by genes with higher ranges of expression. However, by doing this we assume that the magnitude of difference in expression is not informative (Luecken and Theis 2019).

We’ll do that only on the genes we previously found. Otherwise our laptop will have a hard time and after all our further analysis won’t be affected.

The results are in Seurat[['RNA']]@scale.data.

loose <- ScaleData(loose)
strict <- ScaleData(strict)
panglao <- ScaleData(panglao)

3 Principal Component Analysis

Principal Component Analysis (PCA) is often used to reduce dimensionality of the dataset. Reducing the dimensions of the original matrix is essential as single cell data are inherently low-dimensionals (Heimberg et al. 2016). PCA is a linear algorithm to do this, therefore, it’s not suited for visualizing these kind of data in low dimensions. However, it is often used to summarize data to perform further analysis, as in this case. In fact, the next steps will be computed on N reduced dataset, where N is the number of top principal components kept. This parameter is commonly estimated through elbow technique.

Furthermore, PCA is also used for visualization, given that it can project multidimensional features spaces in 2 or 3 PCs. Even if it does not perfrom well on this kind of data it’ might be ’s a good idea to visualize the first dimensions.

loose <- RunPCA(loose, features = VariableFeatures(object = loose))
strict <- RunPCA(strict, features = VariableFeatures(object = strict))
panglao <- RunPCA(panglao, features = VariableFeatures(object = panglao))

3.1 Results

Loose

PC_elbow(loose, threshold = 15, xmax = 40) + ggtitle("Loose")
Elbow plot of PCA components of Loose datasets

Elbow plot of PCA components of Loose datasets

dimplot_custom(loose, "nFeature_RNA", "pca") + cont + ggtitle("Loose")
First two PCS of Loose dataset, colored by the number of genes

First two PCS of Loose dataset, colored by the number of genes

Strict

PC_elbow(strict, threshold = 15, xmax = 40) + ggtitle("Strict")
Elbow plot of PCA components of Strict datasets

Elbow plot of PCA components of Strict datasets

dimplot_custom(strict, "nFeature_RNA", "pca") + cont + ggtitle("Strict")
First two PCS of Strict dataset, colored by the number of genes

First two PCS of Strict dataset, colored by the number of genes

Panglao

PC_elbow(panglao, threshold = 15, xmax = 40) + ggtitle("Panglao")
Elbow plot of PCA components of Panglao datasets

Elbow plot of PCA components of Panglao datasets

dimplot_custom(panglao, "nFeature_RNA", "pca") + cont + ggtitle("Panglao")
First two PCS of Panglao dataset, colored by the number of genes

First two PCS of Panglao dataset, colored by the number of genes

From now on we’ll use the first 15 PCs, this was decided through some trials. Anyway, there seems to be agreement that is better to overestimate rather than using too few PCs.

4 Clustering

To cluster our cells we’ll follow the Seurat functions FindNeighbors() and FindClusters(). Briefly, FindNeighbors() computes clustering starting from a KNN graph (with euclidean distances) in the reduced PCA spaces (15 dimensions). Then, each edge of the graph is weighted based on the overlap of local neighbors (using Jaccard similarity)

Finally, the FindClusters() function by default implements the Louvain algorithm to iteratively group cells together. This function has a resolution parameter which sets the granularity of the clustering. Lower resolution lower granularity. It’s suggested to set this parameter to 0.4-1.2 for datasets around 3000 cells.

In our case we’ll use 4 different values based on the dimensions of our datasets, then we’ll explore which resolution produced a bettere clustering.

invisible({
    capture.output({
        
        npcs = 15
        ### HERE CHOOSE NUMBER OF DIMENSIONS
        
        loose <- FindNeighbors(loose, dims = 1:npcs)
        loose <- FindClusters(loose, resolution = c(0.8, 1, 2, 4))
        
        strict <- FindNeighbors(strict, dims = 1:npcs)
        strict <- FindClusters(strict, resolution = c(0.4, 0.8, 1, 1.2))
        
        panglao <- FindNeighbors(panglao, dims = 1:npcs)
        panglao <- FindClusters(panglao, resolution = c(0.8, 1, 1.2, 2))
        
        
    })
})

4.1 Results

To understand which resolution worked best for our purposes we can visualize the number of clusters for each value or the distribution of reads/genes. Furthermore, we can compute the silhouette for each different resolution value. This is done following the instructions at this GitHub page.

Loose

cluster_abundance(loose, res = T) + scale_fill_viridis(discrete = TRUE) + theme(axis.text.x = element_blank())
Abundance of clusters in loosely filtered data based on resolution value

Abundance of clusters in loosely filtered data based on resolution value

cluster_density(loose, "nCount_RNA", res = TRUE) + scale_fill_viridis(discrete = TRUE) + 
    scale_color_viridis(discrete = TRUE)
Number of reads distribution for each cluster divided by resolution value on loosely filtered data

Number of reads distribution for each cluster divided by resolution value on loosely filtered data

silhouette_analysis(loose, npcs = npcs)
Silhouette analysis for each resolution on loosely filtered data

Silhouette analysis for each resolution on loosely filtered data

We can see that by increasing the resolution the number of clusters increases. The distribution of number of genes in clusters varies quite a bit with some clusters with low counts and some with high variability. The silhouette analysis suggests that despite the high number of cells a lower value of resolution seems to work best. For this reason we’ll go on with the clusters defined with res = 0.8.

Idents(loose) <- "RNA_snn_res.0.8"
loose@meta.data$seurat_clusters <- loose@meta.data$RNA_snn_res.0.8

Strict

We then perform the same analysis on our strictly filtered dataset in order to decide which resolution is the best.

cluster_abundance(strict, res = T) + scale_fill_viridis(discrete = TRUE) + theme(axis.text.x = element_blank())
Abundance of clusters in strictly filtered data based on resolution value

Abundance of clusters in strictly filtered data based on resolution value

cluster_density(strict, "nCount_RNA", res = TRUE) + scale_fill_viridis(discrete = TRUE) + 
    scale_color_viridis(discrete = TRUE)
Number of reads distribution for each cluster divided by resolution value on strictly filtered data

Number of reads distribution for each cluster divided by resolution value on strictly filtered data

silhouette_analysis(strict, npcs = npcs)
Silhouette analysis for each resolution on strictly filtered data

Silhouette analysis for each resolution on strictly filtered data

Here we decided to use res = 0.8 in order to try to not oversimplify the groups (with res = 0.4) even if, probably, with these filters we may have lost some cells sub-population. Neverthless the mean silhouette value is similar.

Idents(strict) <- "RNA_snn_res.0.8"
strict@meta.data$seurat_clusters <- strict@meta.data$RNA_snn_res.0.8

Panglao

Finally, we do the same on panglao data.

cluster_abundance(panglao, res = T) + scale_fill_viridis(discrete = TRUE) + theme(axis.text.x = element_blank())
Abundance of clusters in panglao data based on resolution value

Abundance of clusters in panglao data based on resolution value

cluster_density(panglao, "nCount_RNA", res = TRUE) + scale_fill_viridis(discrete = TRUE) + 
    scale_color_viridis(discrete = TRUE)
Number of reads distribution for each cluster divided by resolution value on panglao data

Number of reads distribution for each cluster divided by resolution value on panglao data

silhouette_analysis(panglao, npcs = npcs)
Silhouette analysis for each resolution on panglao data

Silhouette analysis for each resolution on panglao data

For these data we decided to proceed with res = 1.

Idents(panglao) <- "RNA_snn_res.1"
panglao@meta.data$seurat_clusters <- panglao@meta.data$RNA_snn_res.1

5 Cell cycle scoring

To assess how the cell cycle was influencing our dataset we decided to infer the stage for each cell using the CellCycleScoring() function. This function assign a score for each cell representing the probability of being in phase S, G2M or G1. It manages to do that thanks to a list of phase specific marker genes stored in cc.genes. We’ll only explore how this variable is distributed in our data but we won’t regress out cell cycle information.

cc.genes <- lapply(cc.genes, str_to_title)

loose <- CellCycleScoring(loose, s.features = cc.genes$s.genes, g2m.features = cc.genes$g2m.genes)
strict <- CellCycleScoring(strict, s.features = cc.genes$s.genes, g2m.features = cc.genes$g2m.genes)
panglao <- CellCycleScoring(panglao, s.features = cc.genes$s.genes, g2m.features = cc.genes$g2m.genes)

loose@meta.data <- loose@meta.data %>% mutate(Phase = factor(Phase))
strict@meta.data <- strict@meta.data %>% mutate(Phase = factor(Phase))
panglao@meta.data <- panglao@meta.data %>% mutate(Phase = factor(Phase))

5.1 Results

Loose

l1 <- phase_prop(loose) + labs(subtitle = "Loose")

l2 <- phase_prop(loose, by.cluster = T) + labs(subtitle = "Loose")

l1 + l2 + plot_layout(widths = c(1, 3)) + plot_annotation(tag_levels = "A")
A. Distribution of phase specific cells in Loose dataset. B. Distribution in each cluster

A. Distribution of phase specific cells in Loose dataset. B. Distribution in each cluster

Strict

# 
s1 <- phase_prop(strict) + labs(subtitle = "Strict")

s2 <- phase_prop(strict, by.cluster = T) + labs(subtitle = "Strict")

s1 + s2 + plot_layout(widths = c(1, 3)) + plot_annotation(tag_levels = "A")
A. Distribution of phase specific cells in Strict dataset. B. Distribution in each cluster

A. Distribution of phase specific cells in Strict dataset. B. Distribution in each cluster

Panglao

pp1 <- phase_prop(panglao) + labs(subtitle = "Panglao")

pp2 <- phase_prop(panglao, by.cluster = T) + labs(subtitle = "Panglao")

pp1 + pp2 + plot_layout(widths = c(1, 3)) + plot_annotation(tag_levels = "A")
A. Distribution of phase specific cells in Panglao dataset. B. Distribution in each cluster

A. Distribution of phase specific cells in Panglao dataset. B. Distribution in each cluster

We can see that more or less the proportion of cell phases are the same between datsets and between clusters. Except for few peculiar clusters with almost no cells in G1 phase (12 in loose, 9-11 in strict and 9-12 in panglao). We’ll try to figure out the reason behind this, they may represent specific proliferating cells.

6 Non-linear dimensionality reduction

Non linear dimensionality algortihms are often used to visualize complex data in 2 dimensions. The most common techiniques emplyed in scRNA-seq analysis are t-distributed stochastic neighbour embedding (tSNE) (Maaten and Hinton 2008) and Uniform Approximation and Projection method (UMAP) (McInnes, Healy, and Melville 2018). The first tends to capture local affinity rather than global structure. This characteristic is controllled by the perplexity parameter, at lower perplexity the projected data will be spread out with little preservation of global structure. UMAP is a new algorithm able to scale up well with increasing data dimensions and to preserve global structure. UMAP main parameter is the number of neighbors used to construct the initial graph. Low values will cause to focus on local structures while higher values will result in a representation of the big picture.

We’ll run both of these algorithms on our datasets with default Seurat parameters.

loose <- RunUMAP(loose, dims = 1:npcs)
loose <- RunTSNE(loose, dims = 1:npcs)

strict <- RunUMAP(strict, dims = 1:npcs)
strict <- RunTSNE(strict, dims = 1:npcs)

panglao <- RunUMAP(panglao, dims = 1:npcs)
panglao <- RunTSNE(panglao, dims = 1:npcs)

For each dataset we’ll perform a visual analysis in the newly reduced space:

  • Continous vars contains the scatterplot in the reduced dimensions where cells are colored by number of reads, mito percentage and novelty.
  • Phase plot represent the cells in redcude dimensions colore by cellphase
  • Phase facet is the same plot as before but faceted by phase in order to visualize overlapping points
  • Clusters represents cells colored by cluster in the reduced dimensions

6.1 Results

Loose

loose_dimred <- dimred_analysis(loose)

Continuous vars

loose_dimred$cont

Phase

loose_dimred$phase

Phase faceted

loose_dimred$phase_facet

Clusters

loose_dimred$clust

Strict

strict_dimred <- dimred_analysis(strict)

Continuous vars

strict_dimred$cont

Phase

strict_dimred$phase

Phase faceted

strict_dimred$phase_facet

Clusters

strict_dimred$clust

Panglao

panglao_dimred <- dimred_analysis(panglao)

Continuous vars

panglao_dimred$cont

Phase

panglao_dimred$phase

Phase faceted

panglao_dimred$phase_facet

Clusters

panglao_dimred$clust

7 Cell type markers

We can find differently expressed genes of single clusters compared to all other cells. This can be done through the FindAllMArkers() function. This function returns a dataframe of genes for each cluster with associated statistics like p-value and average log fold change. The default is to do a wilcoxon test against all other groups.

loose.markers <- FindAllMarkers(loose, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
strict.markers <- FindAllMarkers(strict, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
panglao.markers <- FindAllMarkers(panglao, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

7.1 Genes

We can visualize the expression of this gene in our cells by coloring each cell by the intensity of its expression or by doing a violinplot of the gene expression in each cluster.

Loose

FeaturePlot(loose, features = top_spread(loose.markers, 1)$rank1, reduction = "umap", 
    cols = c("#e7e1ef", "#b71540"), coord.fixed = TRUE)
Feature plot of the first DE gene (by avg logFC) for each cluster

Feature plot of the first DE gene (by avg logFC) for each cluster

VlnPlot(loose, features = top_spread(loose.markers, 1)$rank1, slot = "counts", log = TRUE, 
    pt.size = 0)
Violin plot of the same genes

Violin plot of the same genes

Strict

FeaturePlot(strict, features = top_spread(strict.markers, 1)$rank1, reduction = "umap", 
    cols = c("#e7e1ef", "#b71540"), coord.fixed = TRUE)
Feature plot of the first DE gene (by avg logFC) for each cluster

Feature plot of the first DE gene (by avg logFC) for each cluster

VlnPlot(strict, features = top_spread(strict.markers, 1)$rank1, slot = "counts", 
    log = TRUE, pt.size = 0)
Violin plot of the same genes

Violin plot of the same genes

Panglao

FeaturePlot(panglao, features = top_spread(panglao.markers, 1)$rank1, reduction = "umap", 
    cols = c("#e7e1ef", "#b71540"), coord.fixed = TRUE)
Feature plot of the first DE gene (by avg logFC) for each cluster

Feature plot of the first DE gene (by avg logFC) for each cluster

VlnPlot(panglao, features = top_spread(panglao.markers, 1)$rank1, slot = "counts", 
    log = TRUE, pt.size = 0)
Violin plot of the same genes

Violin plot of the same genes

7.2 Save progress

saveRDS(loose, file = "data/loose.rds")
saveRDS(loose.markers, file = "data/loose_markers.rds")
saveRDS(strict, file = "data/strict.rds")
saveRDS(strict.markers, file = "data/strict_markers.rds")
saveRDS(panglao, file = "data/panglao.rds")
saveRDS(panglao.markers, file = "data/panglao_markers.rds")

7.3 Type assignment

To assign each cluster to its type we’ll search on three different database of markers gene and we’ll cross information to be as sure as possible of the assigned cell type. Panglao Gene Search, Dropviz (Saunders et al. 2018) and Mousebrain Gene Search (Zeisel et al. 2018).

We’ll use a little function panglao_searcher() that helps us copypasting and assign the new type (in this case I saved the data in another file and run this script on the saved data). This chunk won’t be evaluated!

ol <- "Oligodendrocytes"
n <- "Neurons"
u <- "Unknown"
a <- "Astrocytes"
m <- "Microglia"
e <- "Endothelial cells"
opc <- "Oligodendrocytes parental cells"
i <- "Interneurons"
p <- "Polydendrocytes"
er <- "Erythroid-like and erythroid precursor cells"
ep <- "Ependymal cells"
msp <- "Medium spiny neurons"
ipc <- "Intermediate progenitor cells"
int <- "Interneurons"
dspn <- "Direct spiny neurons"
ispn <- "Indirect spiny neurons"

loose_df <- top_spread(loose.markers, 3)
loose_df <- panglao_searcher(loose_df, collapse = ",", append = "cell_type")
loose_df <- loose_df %>% mutate_all(as.factor)

saveRDS(loose_df, "data/loose_type.rds")

strict_df <- top_spread(strict.markers, 3)
strict_df <- panglao_searcher(strict_df, collapse = ",", append = "cell_type")
strict_df <- strict_df %>% mutate_all(as.factor)

saveRDS(strict_df, "data/strict_type.rds")

panglao_df <- top_spread(panglao.markers, 3)
panglao_df <- panglao_searcher(panglao_df, collapse = ",", append = "cell_type")
panglao_df <- panglao_df %>% mutate_all(as.factor)

saveRDS(panglao_df, "data/panglao_type.rds")

We load a dataframe containing the associated cell type for each cluster and associated cell type. We join this dataset with our Seurat object metadata, then, we set the cell type as new identities.

loose_df <- readRDS("data/loose_type.rds")
strict_df <- readRDS("data/strict_type.rds")
panglao_df <- readRDS("data/panglao_type.rds")

l_id <- as.vector(loose_df$cell_type)
names(l_id) <- levels(loose)
loose <- RenameIdents(loose, l_id)

loose@meta.data <- loose@meta.data %>% left_join(loose_df, by = c(seurat_clusters = "cluster"))

s_id <- as.vector(strict_df$cell_type)
names(s_id) <- levels(strict)
strict <- RenameIdents(strict, s_id)

strict@meta.data <- strict@meta.data %>% left_join(strict_df, by = c(seurat_clusters = "cluster"))

p_id <- as.vector(panglao_df$cell_type)
names(p_id) <- levels(panglao)
panglao <- RenameIdents(panglao, p_id)

panglao@meta.data <- panglao@meta.data %>% left_join(panglao_df, by = c(seurat_clusters = "cluster"))

7.4 Results

Loose

l <- celltype_analysis(loose, "cell_type")

Clust

l$clust
Cells labeled by cell type in reduced dimensions

Cells labeled by cell type in reduced dimensions

Clust faceted

l$clust_facet
Cells labeled by cell type in reduced dimensions faceted by cell type

Cells labeled by cell type in reduced dimensions faceted by cell type

Strict

s <- celltype_analysis(strict, "cell_type")

Clust

s$clust
Cells labeled by cell type in reduced dimensions

Cells labeled by cell type in reduced dimensions

Clust faceted

s$clust_facet
Cells labeled by cell type in reduced dimensions faceted by cell type

Cells labeled by cell type in reduced dimensions faceted by cell type

Panglao

p <- celltype_analysis(panglao, "cell_type")

Clust

p$clust
Cells labeled by cell type in reduced dimensions

Cells labeled by cell type in reduced dimensions

Clust faceted

p$clust_facet
Cells labeled by cell type in reduced dimensions faceted by cell type

Cells labeled by cell type in reduced dimensions faceted by cell type

8 SI

loose_mrk <- loose.markers %>% left_join(loose_df, by = "cluster") %>% select(-contains("rank")) %>% 
    group_by(cluster) %>% top_n(6, wt = avg_logFC) %>% ungroup()

xlsx::write.xlsx(as.data.frame(loose_mrk), "data/SI_Mutti.xlsx", sheetName = "loose_markers", 
    row.names = FALSE)

loose_n <- loose@meta.data %>% group_by(cell_type, seurat_clusters) %>% tally() %>% 
    ungroup()

xlsx::write.xlsx(as.data.frame(loose_n), "data/SI_Mutti.xlsx", sheetName = "loose_n_by_clust", 
    row.names = FALSE, append = TRUE)

strict_mrk <- strict.markers %>% left_join(strict_df, by = "cluster") %>% select(-contains("rank")) %>% 
    group_by(cluster) %>% top_n(6, wt = avg_logFC) %>% ungroup()

xlsx::write.xlsx(as.data.frame(strict_mrk), "data/SI_Mutti.xlsx", sheetName = "strict_markers", 
    row.names = FALSE, append = TRUE)

strict_n <- strict@meta.data %>% group_by(cell_type, seurat_clusters) %>% tally() %>% 
    ungroup()

xlsx::write.xlsx(as.data.frame(strict_n), "data/SI_Mutti.xlsx", sheetName = "strict_n_by_clust", 
    row.names = FALSE, append = TRUE)

panglao_mrk <- panglao.markers %>% left_join(panglao_df, by = "cluster") %>% select(-contains("rank")) %>% 
    group_by(cluster) %>% top_n(6, wt = avg_logFC) %>% ungroup()

xlsx::write.xlsx(as.data.frame(panglao_mrk), "data/SI_Mutti.xlsx", sheetName = "panglao_markers", 
    row.names = FALSE, append = TRUE)

panglao_n <- panglao@meta.data %>% group_by(cell_type, seurat_clusters) %>% tally() %>% 
    ungroup()

xlsx::write.xlsx(as.data.frame(panglao_n), "data/SI_Mutti.xlsx", sheetName = "panglao_n_by_clust", 
    row.names = FALSE, append = TRUE)

9 Session

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] cowplot_1.1.0               viridis_0.5.1              
 [3] viridisLite_0.3.0           gghalves_0.1.0             
 [5] ggrepel_0.8.2               ggExtra_0.9                
 [7] patchwork_1.0.1             here_0.1                   
 [9] forcats_0.5.0               stringr_1.4.0              
[11] dplyr_1.0.2                 purrr_0.3.4                
[13] readr_1.3.1                 tidyr_1.1.2                
[15] tibble_3.0.3                ggplot2_3.3.2              
[17] tidyverse_1.3.0             cluster_2.1.0              
[19] Matrix_1.2-18               scDblFinder_1.2.0          
[21] Seurat_3.2.2                SingleCellExperiment_1.10.1
[23] SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
[25] matrixStats_0.56.0          Biobase_2.48.0             
[27] GenomicRanges_1.40.0        GenomeInfoDb_1.24.0        
[29] IRanges_2.22.2              S4Vectors_0.26.1           
[31] BiocGenerics_0.34.0         rmdformats_0.3.7           
[33] knitr_1.30                 

loaded via a namespace (and not attached):
  [1] reticulate_1.16           tidyselect_1.1.0         
  [3] htmlwidgets_1.5.1         grid_4.0.2               
  [5] BiocParallel_1.22.0       Rtsne_0.15               
  [7] munsell_0.5.0             codetools_0.2-16         
  [9] ica_1.0-2                 statmod_1.4.34           
 [11] scran_1.16.0              future_1.19.1            
 [13] miniUI_0.1.1.1            withr_2.3.0              
 [15] colorspace_1.4-1          highr_0.8                
 [17] rstudioapi_0.11           ROCR_1.0-11              
 [19] tensor_1.5                rJava_0.9-13             
 [21] listenv_0.8.0             labeling_0.3             
 [23] GenomeInfoDbData_1.2.3    polyclip_1.10-0          
 [25] farver_2.0.3              rprojroot_1.3-2          
 [27] vctrs_0.3.4               generics_0.0.2           
 [29] xfun_0.17                 randomForest_4.6-14      
 [31] R6_2.4.1                  ggbeeswarm_0.6.0         
 [33] rsvd_1.0.3                locfit_1.5-9.4           
 [35] bitops_1.0-6              spatstat.utils_1.17-0    
 [37] assertthat_0.2.1          promises_1.1.1           
 [39] scales_1.1.1              beeswarm_0.2.3           
 [41] gtable_0.3.0              globals_0.13.0           
 [43] goftest_1.2-2             xlsx_0.6.4.2             
 [45] rlang_0.4.7               splines_4.0.2            
 [47] lazyeval_0.2.2            broom_0.7.0              
 [49] yaml_2.2.1                reshape2_1.4.4           
 [51] abind_1.4-5               modelr_0.1.8             
 [53] backports_1.1.10          httpuv_1.5.4             
 [55] tools_4.0.2               bookdown_0.20            
 [57] ellipsis_0.3.1            RColorBrewer_1.1-2       
 [59] ggridges_0.5.2            Rcpp_1.0.5               
 [61] plyr_1.8.6                zlibbioc_1.34.0          
 [63] RCurl_1.98-1.2            rpart_4.1-15             
 [65] deldir_0.1-29             pbapply_1.4-3            
 [67] zoo_1.8-8                 haven_2.3.1              
 [69] fs_1.5.0                  magrittr_1.5             
 [71] data.table_1.13.0         lmtest_0.9-38            
 [73] reprex_0.3.0              RANN_2.6.1               
 [75] fitdistrplus_1.1-1        hms_0.5.3                
 [77] xlsxjars_0.6.1            mime_0.9                 
 [79] evaluate_0.14             xtable_1.8-4             
 [81] readxl_1.3.1              gridExtra_2.3            
 [83] compiler_4.0.2            scater_1.16.2            
 [85] KernSmooth_2.23-17        crayon_1.3.4             
 [87] htmltools_0.5.0           mgcv_1.8-31              
 [89] later_1.1.0.1             lubridate_1.7.9          
 [91] DBI_1.1.0                 formatR_1.7              
 [93] dbplyr_1.4.4              MASS_7.3-53              
 [95] rappdirs_0.3.1            cli_2.0.2                
 [97] igraph_1.2.5              pkgconfig_2.0.3          
 [99] plotly_4.9.2.1            xml2_1.3.2               
[101] vipor_0.4.5               dqrng_0.2.1              
[103] XVector_0.28.0            rvest_0.3.6              
[105] digest_0.6.25             sctransform_0.3          
[107] RcppAnnoy_0.0.16          spatstat.data_1.4-3      
[109] rmarkdown_2.3             cellranger_1.1.0         
[111] leiden_0.3.3              uwot_0.1.8               
[113] edgeR_3.30.3              DelayedMatrixStats_1.10.1
[115] shiny_1.5.0               lifecycle_0.2.0          
[117] nlme_3.1-148              jsonlite_1.7.1           
[119] BiocNeighbors_1.6.0       limma_3.44.3             
[121] fansi_0.4.1               pillar_1.4.6             
[123] lattice_0.20-41           fastmap_1.0.1            
[125] httr_1.4.2                survival_3.2-3           
[127] glue_1.4.2                spatstat_1.64-1          
[129] png_0.1-7                 stringi_1.5.3            
[131] blob_1.2.1                BiocSingular_1.4.0       
[133] irlba_2.3.3               future.apply_1.6.0       

References

Franzén, Oscar, Li-Ming Gan, and Johan LM Björkegren. 2019. “PanglaoDB: A Web Server for Exploration of Mouse and Human Single-Cell Rna Sequencing Data.” Database 2019.

Germain, Pierre-Luc. 2020. ScDblFinder: ScDblFinder. https://github.com/plger/scDblFinder.

Heimberg, Graham, Rajat Bhatnagar, Hana El-Samad, and Matt Thomson. 2016. “Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing.” Cell Systems 2 (4): 239–50.

Hillje, R. 2020. “ScRNA-Seq Analysis Workflow.” GitHub Repository. https://romanhaa.github.io/projects/scrnaseq_workflow/; GitHub.

Luecken, Malte D, and Fabian J Theis. 2019. “Current Best Practices in Single-Cell Rna-Seq Analysis: A Tutorial.” Molecular Systems Biology 15 (6): e8746.

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using T-Sne.” Journal of Machine Learning Research 9 (Nov): 2579–2605.

McInnes, L., J. Healy, and J. Melville. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” ArXiv E-Prints, February. http://arxiv.org/abs/1802.03426.

Saunders, Arpiar, Evan Z Macosko, Alec Wysoker, Melissa Goldman, Fenna M Krienen, Heather de Rivera, Elizabeth Bien, et al. 2018. “Molecular Diversity and Specializations Among the Cells of the Adult Mouse Brain.” Cell 174 (4): 1015–30.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.

Zeisel, Amit, Hannah Hochgerner, Peter Lönnerberg, Anna Johnsson, Fatima Memic, Job Van Der Zwan, Martin Häring, et al. 2018. “Molecular Architecture of the Mouse Nervous System.” Cell 174 (4): 999–1014.