Whole Human Brain 10x RNA-seq gene expression data (part 2)

Whole Human Brain 10x RNA-seq gene expression data (part 2)#

In this notebook we’ll explore some of the gene expression and combine it with the cell metadata we showed in part 1.

You need to be either connected to the internet or connected to a cache that has the WHB data downloaded already and have already downloaded the example data via the getting started notebook to run this notebook.

import pandas as pd
import numpy as np
import anndata
from pathlib import Path
import matplotlib.pyplot as plt

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you have downloaded the data in your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20241130/manifest.json'

Create the expanded cell metadata as was done previously in the cluster annotation tutorial and part 1 of the 10X tutorial.

# Load the cell metadata.
cell = abc_cache.get_metadata_dataframe(
    directory='WHB-10Xv3',
    file_name='cell_metadata',
    dtype={'cell_label': str}
)
cell.set_index('cell_label', inplace=True)
print("Number of cells = ", len(cell))

# Load the cluster memembership metadata and combine the data with the cell data.
membership = abc_cache.get_metadata_dataframe(
    directory='WHB-taxonomy',
    file_name='cluster_to_cluster_annotation_membership'
)

term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership.groupby(['cluster_alias', 'cluster_annotation_term_set_name'])['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[term_sets['name']] # order columns
cluster_details.fillna('Other', inplace=True)

cluster_details.sort_values(['supercluster', 'cluster', 'subcluster'], inplace=True)
cluster_colors = membership.groupby(['cluster_alias', 'cluster_annotation_term_set_name'])['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[term_sets['name']]
cluster_colors.sort_values(['supercluster', 'cluster', 'subcluster'], inplace=True)
cluster_colors

roi = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='region_of_interest_structure_map')
roi.set_index('region_of_interest_label', inplace=True)
roi.rename(columns={'color_hex_triplet': 'region_of_interest_color'},
           inplace=True)

del membership
del term_sets

cell_extended = cell.join(cluster_details, on='cluster_alias')
cell_extended = cell_extended.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended = cell_extended.join(roi[['region_of_interest_color']], on='region_of_interest_label')

del cluster_details
del cluster_colors
del roi

cell_extended.head(5)

Number of cells =  3369219

	cell_barcode	barcoded_cell_sample_label	library_label	feature_matrix_label	entity	brain_section_label	library_method	donor_label	donor_sex	dataset_label	...	abc_sample_id	subcluster	cluster	supercluster	neurotransmitter	subcluster_color	cluster_color	supercluster_color	neurotransmitter_color	region_of_interest_color
cell_label
10X386_2:CATGGATTCTCGACGG	CATGGATTCTCGACGG	10X386_2	LKTX_210825_01_B01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	...	b3d3a6c0-3dc5-4738-af4c-96cc5ff6033f	URL_312_20	URL_312	Upper rhombic lip	VGLUT1	#4CB941	#97B8C8	#80BAED	#2BDFD1	#5D6CB2
10X383_5:TCTTGCGGTGAATTGA	TCTTGCGGTGAATTGA	10X383_5	LKTX_210818_02_E01	WHB-10Xv3-Neurons	nuclei	H19.30.002.BS.94	10Xv3	H19.30.002	M	WHB-10Xv3	...	0a7ed99b-d5eb-4c50-b95e-eb31b448d2c7	URL_312_20	URL_312	Upper rhombic lip	VGLUT1	#4CB941	#97B8C8	#80BAED	#2BDFD1	#5D6CB2
10X386_2:CTCATCGGTCGAGCAA	CTCATCGGTCGAGCAA	10X386_2	LKTX_210825_01_B01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	...	458f6409-af8d-4c7c-ac9b-0b0d9b73b36d	URL_312_17	URL_312	Upper rhombic lip	VGLUT1	#C85E40	#97B8C8	#80BAED	#2BDFD1	#5D6CB2
10X378_8:TTGGATGAGACAAGCC	TTGGATGAGACAAGCC	10X378_8	LKTX_210809_01_H01	WHB-10Xv3-Neurons	nuclei	H19.30.002.BS.93	10Xv3	H19.30.002	M	WHB-10Xv3	...	ff6b2d95-3752-49bf-9421-f17107538868	URL_312_18	URL_312	Upper rhombic lip	VGLUT1	#61C1C2	#97B8C8	#80BAED	#2BDFD1	#517DBE
10X387_7:TGAACGTAGTATTCCG	TGAACGTAGTATTCCG	10X387_7	LKTX_210825_02_G01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	...	589a1c5c-839b-480d-babc-4098c467211b	URL_312_16	URL_312	Upper rhombic lip	VGLUT1	#45328F	#97B8C8	#80BAED	#2BDFD1	#5D6CB2

5 rows × 25 columns

Single cell transcriptomes#

The ~3 million cell dataset of WHB has been divided into 2 expression matrices: one with Neurons and one with Nonneurons. Each matrix file is formatted as an annadata, h5ad file with minimal metadata.

Below we show some interactions with data from the 10X expression matricies in the WHB dataset. For a deeper dive into how to access specific gene data from the expression matricies, take a look at general_acessing_10x_snRNASeq_tutorial.ipynb. Below we will use precomputed metadata from these matricies to look at gene expression both in relation to different neurotransmitters and locations across the brain.

First, we list the available metadata in the WHB-10Xv3 dataset again. The two files we will be using in this tutorial are the gene metadata and the example_genes_all_cells_expression table.

abc_cache.list_metadata_files('WHB-10Xv3')

['anatomical_division_structure_map',
 'cell_metadata',
 'donor',
 'example_genes_all_cells_expression',
 'gene',
 'region_of_interest_structure_map']

The table below holds metadata for all genes sequenced in the dataset.

gene = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='gene')
gene.set_index('gene_identifier', inplace=True)
print("Number of genes = ", len(gene))
gene.head(5)

Number of genes =  59357

	gene_symbol	biotype	name
gene_identifier
ENSG00000000003	TSPAN6	protein_coding	tetraspanin 6
ENSG00000000005	TNMD	protein_coding	tenomodulin
ENSG00000000419	DPM1	protein_coding	dolichyl-phosphate mannosyltransferase subunit...
ENSG00000000457	SCYL3	protein_coding	SCY1 like pseudokinase 3
ENSG00000000460	C1orf112	protein_coding	chromosome 1 open reading frame 112

We’ll skip accessing these data from the expression matricies specifically for now, however, users can learn how to access specific genes from the released expression matricies in thegeneral_acessing_10x_snRNASeq_tutorial notebook.

The precomputed table below contains expressions for the genes SLC17A6, SLC17A7, SLC32A1, PTPRC, PLP1, AQP4, and TTR for all cells across the the WHB dataset. We then join this with our previously created, cell_extended pandas DataFrame from this tutorial.

example_cells_with_genes = abc_cache.get_metadata_dataframe(
    directory='WHB-10Xv3',
    file_name='example_genes_all_cells_expression'
).set_index('cell_label')
example_cells_with_genes.head()

example_genes_all_cells_expression.csv: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 233M/233M [00:09<00:00, 24.5MMB/s]

	PTPRC	SLC17A6	SLC32A1	SLC17A7	TTR	PLP1	AQP4
cell_label
10X386_2:CATGGATTCTCGACGG	0.0	0.0	0.0	0.000000	13.652457	0.000000	0.0
10X383_5:TCTTGCGGTGAATTGA	0.0	0.0	0.0	0.000000	10.484976	0.000000	0.0
10X386_2:CTCATCGGTCGAGCAA	0.0	0.0	0.0	0.000000	14.006844	0.000000	0.0
10X378_8:TTGGATGAGACAAGCC	0.0	0.0	0.0	9.348821	0.000000	9.348821	0.0
10X387_7:TGAACGTAGTATTCCG	0.0	0.0	0.0	0.000000	12.046448	0.000000	0.0

cell_extended_with_genes = cell_extended.join(example_cells_with_genes)

Example use cases#

In this section, we show a use case with the example genes SLC17A6, SLC17A7, SLC32A1, PTPRC, PLP1, AQP4, and TTR. These genes were selected because they were presented in Siletti et al. 2023 as markers genes for glutamatergic (SLC17A6, SLC17A7) and GABAergic (SLC32A1) neurons, immune cells (PTPRC), oligodendrocytes (PLP1), astrocytes (AQP4), and choroid plexus (TTR). “Marker genes” have much higher expression in the specified cell type or anatomic structure when compared to all other cells, and in many cases are functionally relevant for those cell types. Additional context about these genes, and additional genes of interest can be found in Siletti et al. 2023.

We define a helper functions aggregate_by_metadata to compute the average expression for a given catergory and later plot_umap to plot cells in a UMAP colorized by metadata or expression values similar to what was used in part 1.

def aggregate_by_metadata(df, gnames, value, sort = False):
    grouped = df.groupby(value)[gnames].mean()
    if sort:
        grouped = grouped.sort_values(by=gnames[0], ascending=False)
    return grouped

Expression of selected genes in the whole brain#

The helper function below creates a heatmap showing the relation between various parameters in the combined cell metadata and the genes we loaded.

def plot_heatmap(df, fig_width=8, fig_height=4, cmap=plt.cm.magma_r) :

    arr = df.to_numpy()

    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    im = ax.imshow(arr, cmap=cmap, aspect='auto', vmin=0, vmax=6)
    xlabs = df.columns.values
    ylabs = df.index.values

    ax.set_xticks(range(len(xlabs)))
    ax.set_xticklabels(xlabs)

    ax.set_yticks(range(len(ylabs)))
    res = ax.set_yticklabels(ylabs)
    
    return im

Below, we plot the expression of the genes selected in seven genes. We show the genes vs neurotransmitter type/idendity.

agg = aggregate_by_metadata(cell_extended_with_genes, example_cells_with_genes.columns, 'neurotransmitter')
res = plot_heatmap(agg, 8, 8)
plt.show()

../_images/0692cebca6f48779e867cae20ded63db2e51346f45816ead63fbe8502a6b194a.png

Below is by dissection region of interest shows that each of these genes are associated with distinct regions.

agg = aggregate_by_metadata(cell_extended_with_genes, example_cells_with_genes.columns, 'anatomical_division_label')
res = plot_heatmap(agg, 8, 8)
plt.show()

../_images/a8ae84c0e790ab0f65048cc6a7fbfcfd64094a1cacb3b7f1716bfb920261a00f.png

And finally by supercluster from the taxonomy.

agg = aggregate_by_metadata(cell_extended_with_genes, example_cells_with_genes.columns, 'supercluster')
res = plot_heatmap(agg, 8, 12)
plt.show()

../_images/0c283b0e6a73769ce1c3371b02f7f61466ec5d31a869bdd2059c3806198177ae.png

We can also visualize the relationship these genes and their location in the UMAP. Note again as in part one that Neurons and Non-neurons should be plotted seperately.

def plot_umap(xx, yy, cc=None, val=None, fig_width=8, fig_height=8, cmap=None):

    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    if cmap is not None :
        plt.scatter(xx, yy, s=0.5, c=val, marker='.', cmap=cmap)
    elif cc is not None :
        plt.scatter(xx, yy, s=0.5, color=cc, marker='.')
        
    ax.axis('equal')
    # ax.set_xlim(-18, 27)
    # ax.set_ylim(-18, 27)
    ax.set_xticks([])
    ax.set_yticks([])
    
    return fig, ax

neurons_subsampled = cell_extended_with_genes[cell_extended_with_genes['feature_matrix_label'] == 'WHB-10Xv3-Neurons'][::10]
non_neurons_subsampled = cell_extended_with_genes[cell_extended_with_genes['feature_matrix_label'] == 'WHB-10Xv3-Nonneurons'][::10]

fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], val=neurons_subsampled['SLC17A6'], cmap=plt.cm.magma_r)
res = ax.set_title("Neuron: Gene SLC17A6")
plt.show()
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], val=non_neurons_subsampled['SLC17A6'], cmap=plt.cm.magma_r)
res = ax.set_title("Non-Neuron: Gene SLC17A6")
plt.show()

../_images/17563170dabe403a5830feb6e7450130fcd710c3c6cce1c2c811682a1c5cad4e.png

../_images/9b8d74cc46c35a0ccbe276a2987e0252d12b6f7510c974a8d14ffa690ca3b2a1.png

fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], val=neurons_subsampled['SLC32A1'], cmap=plt.cm.magma_r)
res = ax.set_title("Neuron: Gene SLC32A1")
plt.show()
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], val=non_neurons_subsampled['SLC32A1'], cmap=plt.cm.magma_r)
res = ax.set_title("Non-Neuron: Gene SLC32A1")
plt.show()

../_images/29b6b22592613ba48e775f901d056cf7b0a3223445b75543eb01b393300955a4.png

../_images/3a95a8ab050ddc2bcbb3080e56ed7544d313a03340b4d5a1b13b0f05c6884402.png

fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], val=neurons_subsampled['PLP1'], cmap=plt.cm.magma_r)
res = ax.set_title("Neuron: Gene PLP1")
plt.show()
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], val=non_neurons_subsampled['PLP1'], cmap=plt.cm.magma_r)
res = ax.set_title("Non-Neuron: Gene PLP1")
plt.show()

../_images/34da82dbb528678d9e6c92740ecf5e09fff86fc1510acbbd873e66271d6b8145.png

../_images/70f361b5177f46a17581f5909032bbe8c408b54c3536f1df9690867db280179f.png

fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], val=neurons_subsampled['AQP4'], cmap=plt.cm.magma_r)
res = ax.set_title("Neuron: Gene AQP4")
plt.show()
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], val=non_neurons_subsampled['AQP4'], cmap=plt.cm.magma_r)
res = ax.set_title("Non-Neuron: Gene AQP4")
plt.show()

../_images/7cdcaecd41ecea9ec66ccae0bc46b79c1c8783bb05c8a587c3b6e365c77d6a3f.png

../_images/91b55349d8fd4bfde653a30d29d8bafc72aa8549ee0fefbce82beab301d01a41.png

(Plots for SLC17A7, PTPRC, and TTR could be generated the same way, but aren’t shown.)

Whole Human Brain 10x RNA-seq gene expression data (part 2)

Contents

Whole Human Brain 10x RNA-seq gene expression data (part 2)#

Single cell transcriptomes#

Example use cases#

Expression of selected genes in the whole brain#