Whole Human Brain (WHB) 10x RNA-seq gene expression data (part 1)

The purpose of this set of notebooks is to provide an overview of the data, the file organization, and how to combine data and metadata through example use cases.

To run this notebook, you need either an internet connection or access to a cache that already contains the downloaded WHB data.

The notebook presented here shows quick visualizations from precomputed metadata in the atlas. For examples of accessing the expression matrices, specifically selecting genes from expression matrices, see the general_acessing_10x_snRNASeq_tutorial.ipynb tutorial/example.

For full details of the data, see Siletti et al. 2023.

import os
import pandas as pd
from pathlib import Path
import numpy as np
import anndata
import time
import matplotlib.pyplot as plt

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a pandas DataFrame from the stored csv file. See the getting_started notebook for more details on using the cache, including how to install it if you have not already done so.

Change the download_base variable to the location where you have downloaded the data on your system.

download_base = Path('../../abc_download_root')
abc_cache = AbcProjectCache.from_s3_cache(download_base)
abc_cache.current_manifest
'releases/20240330/manifest.json'

Data overview

Cell metadata

Essential cell metadata is stored as a dataframe. Each row represents one cell indexed by a cell label. The cell label is the concatenation of barcode and name of the sample. In this context, the sample is the barcoded cell sample that represents a single load into one port of the 10x Chromium. Note that cell barcodes are only unique within a single barcoded cell sample and that the same barcode can be reused. The barcoded cell sample label or name is unique in the database.
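To make the labeling scheme concrete, here is a minimal sketch on a toy table. It assumes the label has the form "<barcoded cell sample label>:<barcode>", which is what the cell labels previewed later in this notebook suggest; the reuse of one barcode in a second sample is a made-up illustration, not taken from the data.

```python
import pandas as pd

# Toy rows. The barcodes and sample labels are borrowed from the table preview
# in this notebook; the reuse of the first barcode in sample 10X383_5 is hypothetical.
toy = pd.DataFrame({
    'cell_barcode': ['CATGGATTCTCGACGG', 'CATGGATTCTCGACGG', 'TCTTGCGGTGAATTGA'],
    'barcoded_cell_sample_label': ['10X386_2', '10X383_5', '10X383_5'],
})

# Assumption: cell_label = "<sample label>:<barcode>".
toy['cell_label'] = toy['barcoded_cell_sample_label'] + ':' + toy['cell_barcode']

# A barcode alone can repeat across samples; the combined label does not.
barcode_is_unique = toy['cell_barcode'].is_unique
label_is_unique = toy['cell_label'].is_unique
print("barcode unique:", barcode_is_unique)
print("cell_label unique:", label_is_unique)
```

This is why the cell table is indexed by cell_label rather than by cell_barcode.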

Each cell is associated with a library label, library method, donor label, donor sex, dissection region_of_interest_label, the corresponding coarse anatomical division label and the feature_matrix_label identifying which data package this cell is part of.

Further, each cell is associated with a cluster alias identifying the cluster the cell belongs to, and with the (x, y) coordinates of the cell in the UMAP shown in Figure 1B of Siletti et al. The neurons and non-neurons in the cells table have overlapping (x, y) coordinates and should be plotted separately.

Below, we load the first of the metadata used in this tutorial. This pattern of loading metadata is repeated throughout the tutorials.

abc_cache.list_metadata_files('WHB-10Xv3')
['anatomical_division_structure_map',
 'cell_metadata',
 'donor',
 'example_genes_all_cells_expression',
 'gene',
 'region_of_interest_structure_map']
abc_cache.list_metadata_files('WHB-taxonomy')
['cluster',
 'cluster_annotation_term',
 'cluster_annotation_term_set',
 'cluster_to_cluster_annotation_membership']
cell = abc_cache.get_metadata_dataframe(
    directory='WHB-10Xv3',
    file_name='cell_metadata',
    dtype={'cell_label': str}
)
cell.set_index('cell_label', inplace=True)
print("Number of cells = ", len(cell))
cell.head(5)
cell.columns
Number of cells =  3369219
Index(['cell_barcode', 'barcoded_cell_sample_label', 'library_label',
       'feature_matrix_label', 'entity', 'brain_section_label',
       'library_method', 'donor_label', 'donor_sex', 'dataset_label', 'x', 'y',
       'cluster_alias', 'region_of_interest_label',
       'anatomical_division_label'],
      dtype='object')

We can use the pandas groupby function to see how many unique items are associated with each field, and list them out if the number of items is small.

def print_column_info(df):
    
    for c in df.columns:
        grouped = df[[c]].groupby(c).count()
        members = ''
        if len(grouped) < 30:
            members = str(list(grouped.index))
        print("Number of unique %s = %d %s" % (c, len(grouped), members))
print_column_info(cell)
Number of unique cell_barcode = 2205155 
Number of unique barcoded_cell_sample_label = 606 
Number of unique library_label = 606 
Number of unique feature_matrix_label = 2 ['WHB-10Xv3-Neurons', 'WHB-10Xv3-Nonneurons']
Number of unique entity = 1 ['nuclei']
Number of unique brain_section_label = 57 
Number of unique library_method = 1 ['10Xv3']
Number of unique donor_label = 4 ['H18.30.001', 'H18.30.002', 'H19.30.001', 'H19.30.002']
Number of unique donor_sex = 2 ['F', 'M']
Number of unique dataset_label = 1 ['WHB-10Xv3']
Number of unique x = 3369219 
Number of unique y = 3369219 
Number of unique cluster_alias = 3313 
Number of unique region_of_interest_label = 109 
Number of unique anatomical_division_label = 14 ['Amygdaloid complex', 'Basal forebrain', 'Basal nuclei', 'Cerebellum', 'Cerebral cortex', 'Claustrum', 'Extended amygdala', 'Hippocampus', 'Hypothalamus', 'Midbrain', 'Myelencephalon', 'Pons', 'Spinal cord', 'Thalamus']
cell.groupby('dataset_label')[['x']].count()
x
dataset_label
WHB-10Xv3 3369219
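The per-column summary above can also be collected into a table instead of printed. The following is a sketch only: summarize_columns and its max_listed parameter are hypothetical names introduced here, and the toy data stands in for the cell table.

```python
import pandas as pd

def summarize_columns(df, max_listed=30):
    """Return one row per column: the number of distinct non-null values,
    plus the sorted values themselves when there are fewer than max_listed."""
    rows = []
    for col in df.columns:
        values = df[col].dropna().unique()
        rows.append({
            'column': col,
            'n_unique': len(values),
            'members': sorted(values) if len(values) < max_listed else None,
        })
    return pd.DataFrame(rows).set_index('column')

# Toy stand-in for the cell metadata table.
toy = pd.DataFrame({
    'donor_sex': ['F', 'M', 'M', 'F'],
    'library_method': ['10Xv3'] * 4,
    'x': [0.1, 0.2, 0.3, 0.4],
})
summary = summarize_columns(toy, max_listed=3)
print(summary)
```

Returning a DataFrame makes it easy to compare summaries before and after a join, for example.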

We can also create a pivot table to associate each cell with terms at each cell type classification level. To do this we need to load multiple other metadata tables and join them into the main cells table. See the cluster annotation tutorial for more details.

membership = abc_cache.get_metadata_dataframe(
    directory='WHB-taxonomy',
    file_name='cluster_to_cluster_annotation_membership'
)
membership_groupby = membership.groupby(['cluster_alias', 'cluster_annotation_term_set_name'])
membership.head(5)
cluster_annotation_term_label cluster_annotation_term_set_label cluster_alias cluster_annotation_term_name cluster_annotation_term_set_name number_of_cells color_hex_triplet
0 CS202210140_494 CCN202210140_SUBC 0 URL_297_0 subcluster 34 #7E807A
1 CS202210140_495 CCN202210140_SUBC 1 URL_308_1 subcluster 220 #C54945
2 CS202210140_496 CCN202210140_SUBC 2 URL_308_2 subcluster 187 #5232B7
3 CS202210140_497 CCN202210140_SUBC 3 URL_308_3 subcluster 246 #31BEBA
4 CS202210140_498 CCN202210140_SUBC 4 URL_308_4 subcluster 188 #C8A9BC
term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[term_sets['name']] # order columns
cluster_details.fillna('Other', inplace=True)
cluster_details.sort_values(['supercluster', 'cluster', 'subcluster'], inplace=True)
cluster_details.head(5)
cluster_annotation_term_set_name subcluster cluster supercluster neurotransmitter
cluster_alias
2461 Amex_153_2461 Amex_153 Amygdala excitatory VGLUT1 VGLUT2
2462 Amex_153_2462 Amex_153 Amygdala excitatory VGLUT1 VGLUT2
2463 Amex_153_2463 Amex_153 Amygdala excitatory VGLUT1 VGLUT2
2464 Amex_153_2464 Amex_153 Amygdala excitatory VGLUT1 VGLUT2
2465 Amex_153_2465 Amex_153 Amygdala excitatory VGLUT1 VGLUT2
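The groupby, first, unstack pattern used to build cluster_details can be easier to follow on a small table. The sketch below uses made-up term names; only the column names match the membership table.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (cluster, classification level).
membership_toy = pd.DataFrame({
    'cluster_alias': [0, 0, 0, 1, 1, 1],
    'cluster_annotation_term_set_name': ['subcluster', 'cluster', 'supercluster'] * 2,
    'cluster_annotation_term_name': ['A_1_0', 'A_1', 'Type A', 'A_1_1', 'A_1', 'Type A'],
})

# Take the first term name in each (cluster, level) group, then move the
# level from the row index into the columns.
wide = (
    membership_toy
    .groupby(['cluster_alias', 'cluster_annotation_term_set_name'])
    ['cluster_annotation_term_name']
    .first()
    .unstack()
)

# unstack orders the new columns alphabetically; reorder from fine to coarse.
wide = wide[['subcluster', 'cluster', 'supercluster']]
print(wide)
```

The result has one row per cluster_alias and one column per classification level, which is the shape needed to join onto the cell table.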
cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[term_sets['name']]
cluster_colors.sort_values(['supercluster', 'cluster', 'subcluster'], inplace=True)
cluster_colors.head(5)
cluster_annotation_term_set_name subcluster cluster supercluster neurotransmitter
cluster_alias
2218 #374B8A #062463 #003380 #2BDFD1
2216 #4BCBC6 #062463 #003380 #2BDFD1
2217 #83C943 #062463 #003380 #2BDFD1
2219 #BC4440 #062463 #003380 #2BDFD1
2220 #CB7ABA #062463 #003380 #2BDFD1
roi = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='region_of_interest_structure_map')
roi.set_index('region_of_interest_label', inplace=True)
roi.rename(columns={'color_hex_triplet': 'region_of_interest_color'},
           inplace=True)
roi.head(5)
structure_identifier structure_symbol structure_name region_of_interest_color
region_of_interest_label
Human A13 DHBA:10202 A13 caudal division of OFCi (area 13) #CAB781
Human A14 DHBA:10196 A14r rostral subdivision of area 14 #B8A26D
Human A14 DHBA:10197 A14c caudal subdivision of area 14 #B8A26D
Human A19 DHBA:10272 PSC peristriate cortex (area 19) #D14D46
Human A1C DHBA:10236 A1C primary auditory cortex (core) #D670A0
cell_extended = cell.join(cluster_details, on='cluster_alias')
cell_extended = cell_extended.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended = cell_extended.join(roi[['region_of_interest_color']], on='region_of_interest_label')
cell_extended.head(5)
cell_barcode barcoded_cell_sample_label library_label feature_matrix_label entity brain_section_label library_method donor_label donor_sex dataset_label ... anatomical_division_label subcluster cluster supercluster neurotransmitter subcluster_color cluster_color supercluster_color neurotransmitter_color region_of_interest_color
cell_label
10X386_2:CATGGATTCTCGACGG CATGGATTCTCGACGG 10X386_2 LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei H19.30.001.CX.51 10Xv3 H19.30.001 M WHB-10Xv3 ... Myelencephalon URL_312_20 URL_312 Upper rhombic lip VGLUT1 #4CB941 #97B8C8 #80BAED #2BDFD1 #5D6CB2
10X383_5:TCTTGCGGTGAATTGA TCTTGCGGTGAATTGA 10X383_5 LKTX_210818_02_E01 WHB-10Xv3-Neurons nuclei H19.30.002.BS.94 10Xv3 H19.30.002 M WHB-10Xv3 ... Myelencephalon URL_312_20 URL_312 Upper rhombic lip VGLUT1 #4CB941 #97B8C8 #80BAED #2BDFD1 #5D6CB2
10X386_2:CTCATCGGTCGAGCAA CTCATCGGTCGAGCAA 10X386_2 LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei H19.30.001.CX.51 10Xv3 H19.30.001 M WHB-10Xv3 ... Myelencephalon URL_312_17 URL_312 Upper rhombic lip VGLUT1 #C85E40 #97B8C8 #80BAED #2BDFD1 #5D6CB2
10X378_8:TTGGATGAGACAAGCC TTGGATGAGACAAGCC 10X378_8 LKTX_210809_01_H01 WHB-10Xv3-Neurons nuclei H19.30.002.BS.93 10Xv3 H19.30.002 M WHB-10Xv3 ... Pons URL_312_18 URL_312 Upper rhombic lip VGLUT1 #61C1C2 #97B8C8 #80BAED #2BDFD1 #517DBE
10X387_7:TGAACGTAGTATTCCG TGAACGTAGTATTCCG 10X387_7 LKTX_210825_02_G01 WHB-10Xv3-Neurons nuclei H19.30.001.CX.51 10Xv3 H19.30.001 M WHB-10Xv3 ... Myelencephalon URL_312_16 URL_312 Upper rhombic lip VGLUT1 #45328F #97B8C8 #80BAED #2BDFD1 #5D6CB2

5 rows × 24 columns
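The joins above rely on two DataFrame.join options: on, which matches a column of the left table against the index of the right table, and rsuffix, which renames right-hand columns whose names collide with existing ones. A toy sketch with made-up values:

```python
import pandas as pd

# Toy cell table indexed by cell label, with a cluster_alias column.
cells_toy = pd.DataFrame(
    {'cluster_alias': [10, 11, 10]},
    index=pd.Index(['cell_a', 'cell_b', 'cell_c'], name='cell_label'),
)
# Two lookup tables indexed by cluster_alias that share a column name.
names_toy = pd.DataFrame(
    {'supercluster': ['Type A', 'Type B']},
    index=pd.Index([10, 11], name='cluster_alias'),
)
colors_toy = pd.DataFrame(
    {'supercluster': ['#111111', '#222222']},
    index=pd.Index([10, 11], name='cluster_alias'),
)

# First join adds the name column; the second would collide on 'supercluster',
# so rsuffix renames the incoming column to 'supercluster_color'.
joined = cells_toy.join(names_toy, on='cluster_alias')
joined = joined.join(colors_toy, on='cluster_alias', rsuffix='_color')
print(joined)
```

Only colliding columns receive the suffix, which is why region_of_interest_color had to be renamed explicitly before its join.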

print_column_info(cell_extended)
Number of unique cell_barcode = 2205155 
Number of unique barcoded_cell_sample_label = 606 
Number of unique library_label = 606 
Number of unique feature_matrix_label = 2 ['WHB-10Xv3-Neurons', 'WHB-10Xv3-Nonneurons']
Number of unique entity = 1 ['nuclei']
Number of unique brain_section_label = 57 
Number of unique library_method = 1 ['10Xv3']
Number of unique donor_label = 4 ['H18.30.001', 'H18.30.002', 'H19.30.001', 'H19.30.002']
Number of unique donor_sex = 2 ['F', 'M']
Number of unique dataset_label = 1 ['WHB-10Xv3']
Number of unique x = 3369219 
Number of unique y = 3369219 
Number of unique cluster_alias = 3313 
Number of unique region_of_interest_label = 109 
Number of unique anatomical_division_label = 14 ['Amygdaloid complex', 'Basal forebrain', 'Basal nuclei', 'Cerebellum', 'Cerebral cortex', 'Claustrum', 'Extended amygdala', 'Hippocampus', 'Hypothalamus', 'Midbrain', 'Myelencephalon', 'Pons', 'Spinal cord', 'Thalamus']
Number of unique subcluster = 3313 
Number of unique cluster = 461 
Number of unique supercluster = 31 
Number of unique neurotransmitter = 20 ['CHOL VGLUT1 VGLUT2', 'CHOL VGLUT3', 'DA VGLUT2', 'GABA', 'GABA HDC', 'GABA HDC VGLUT2', 'GABA VGLUT2', 'GABA VGLUT3', 'GLY', 'GLY VGLUT2', 'HDC', 'HDC VGLUT2', 'Other', 'SER VGLUT3', 'VGLUT1', 'VGLUT1 VGLUT2', 'VGLUT1 VGLUT2 VGLUT3', 'VGLUT2', 'VGLUT2 VGLUT3', 'VGLUT3']
Number of unique subcluster_color = 3313 
Number of unique cluster_color = 461 
Number of unique supercluster_color = 31 
Number of unique neurotransmitter_color = 20 ['#196AA5', '#2252C2', '#22A5BB', '#2B39DF', '#2B93DF', '#2BDFD1', '#3F38B8', '#423B77', '#4F90B2', '#4FE3AB', '#666666', '#6B0C48', '#8BAD78', '#8C4F7F', '#8C7063', '#95369C', '#C6525E', '#FF3358', '#FF553D', '#FF7621']
Number of unique region_of_interest_color = 67 
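One thing worth checking after joins like these is the row count. The preview of region_of_interest_structure_map earlier in this notebook lists Human A14 twice (once per structure), and a left join against a lookup table with a repeated index can return more rows than the left table had. The sketch below shows the effect on toy data and one possible remedy, keeping the first row per label; whether that choice is appropriate for the real table is an assumption to verify.

```python
import pandas as pd

cells_toy = pd.DataFrame(
    {'region_of_interest_label': ['Human A13', 'Human A14', 'Human A14']},
    index=pd.Index(['cell_a', 'cell_b', 'cell_c'], name='cell_label'),
)
# 'Human A14' maps to two structures, mirroring the previewed rows of the map.
roi_toy = pd.DataFrame(
    {
        'structure_symbol': ['A13', 'A14r', 'A14c'],
        'region_of_interest_color': ['#CAB781', '#B8A26D', '#B8A26D'],
    },
    index=pd.Index(['Human A13', 'Human A14', 'Human A14'],
                   name='region_of_interest_label'),
)

# Joining against the repeated index duplicates every 'Human A14' cell.
naive = cells_toy.join(roi_toy[['region_of_interest_color']],
                       on='region_of_interest_label')

# Keep one color row per label before joining (assumes repeats share a color).
roi_colors = roi_toy.loc[~roi_toy.index.duplicated(keep='first'),
                         ['region_of_interest_color']]
deduped = cells_toy.join(roi_colors, on='region_of_interest_label')

print(len(cells_toy), len(naive), len(deduped))
```

Comparing len(cell) with len(cell_extended) is a quick way to see whether this applies to the full data.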

UMAP spatial embedding

Now that we’ve merged the cluster metadata into the main cells data, we can plot the Uniform Manifold Approximation and Projection (UMAP) for all the cells in the dataset using information from the clusters. UMAP is a dimension reduction technique that can be used for visualizing and exploring high-dimensional datasets. The x, y columns of the cell metadata table give each cell’s coordinates in the all-cells UMAP shown in Figure 1B of the manuscript. Note that the (x, y) coordinates for neurons and non-neurons overlap and should be plotted separately.

We define a small helper function, plot_umap, to visualize the cells on the UMAP. In this example we will plot the cells colored by: dissection region of interest, neurotransmitter identity, supercluster, cluster, and subcluster. For ease of demonstration, we subsample the cells by a factor of 10 (keeping every tenth row) to reduce processing time.

def plot_umap(xx, yy, cc=None, val=None, fig_width=8, fig_height=8, cmap=None):

    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    if cmap is not None:
        plt.scatter(xx, yy, s=0.5, c=val, marker='.', cmap=cmap)
    elif cc is not None:
        plt.scatter(xx, yy, s=0.5, color=cc, marker='.')
        
    ax.axis('equal')
    ax.set_xticks([])
    ax.set_yticks([])
    
    return fig, ax
neurons_subsampled = cell_extended[cell_extended['feature_matrix_label'] == 'WHB-10Xv3-Neurons'][::10]
non_neurons_subsampled = cell_extended[cell_extended['feature_matrix_label'] == 'WHB-10Xv3-Nonneurons'][::10]
print("n neurons to plot:", len(neurons_subsampled))
print("n non-neurons to plot:", len(non_neurons_subsampled))
n neurons to plot: 303681
n non-neurons to plot: 107871
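Taking every tenth row is simple but depends on how the table happens to be ordered. As an alternative sketch, not what this notebook does, a seeded random sample avoids that dependence, and sampling within each group keeps the neuron to non-neuron proportions exact. The toy table and the 10% fraction are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cell_extended with 700 "neurons" and 300 "non-neurons".
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'x': rng.normal(size=1000),
    'y': rng.normal(size=1000),
    'feature_matrix_label': ['WHB-10Xv3-Neurons'] * 700 + ['WHB-10Xv3-Nonneurons'] * 300,
})

# Strided subsample, as used in this notebook: every tenth row in table order.
strided = toy[::10]

# Seeded random subsample of the whole table: reproducible, order-independent.
sampled = toy.sample(frac=0.1, random_state=0)

# Seeded random subsample taken separately within each feature matrix.
per_group = toy.groupby('feature_matrix_label').sample(frac=0.1, random_state=0)

print(len(strided), len(sampled))
print(per_group['feature_matrix_label'].value_counts().to_dict())
```

Fixing random_state keeps the plotted subset identical between runs.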
fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], cc=neurons_subsampled['region_of_interest_color'])
res = ax.set_title("Neurons: Dissection Region Of Interest")
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], cc=non_neurons_subsampled['region_of_interest_color'])
res = ax.set_title("Non-neurons: Dissection Region Of Interest")
[Figures: UMAPs of neurons and non-neurons colored by dissection region of interest]
fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], cc=neurons_subsampled['neurotransmitter_color'])
res = ax.set_title("Neurons: Neurotransmitter Identity")
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], cc=non_neurons_subsampled['neurotransmitter_color'])
res = ax.set_title("Non-neurons: Neurotransmitter Identity")
[Figures: UMAPs of neurons and non-neurons colored by neurotransmitter identity]
fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], cc=neurons_subsampled['supercluster_color'])
res = ax.set_title("Neuron Cell Types: Supercluster")
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], cc=non_neurons_subsampled['supercluster_color'])
res = ax.set_title("Non-neuron Cell Types: Supercluster")
[Figures: UMAPs of neurons and non-neurons colored by supercluster]
fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], cc=neurons_subsampled['cluster_color'])
res = ax.set_title("Neuron Cell Types: Cluster")
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], cc=non_neurons_subsampled['cluster_color'])
res = ax.set_title("Non-neuron Cell Types: Cluster")
[Figures: UMAPs of neurons and non-neurons colored by cluster]
fig, ax = plot_umap(neurons_subsampled['x'], neurons_subsampled['y'], cc=neurons_subsampled['subcluster_color'])
res = ax.set_title("Neuron Cell Types: Subcluster")
fig, ax = plot_umap(non_neurons_subsampled['x'], non_neurons_subsampled['y'], cc=non_neurons_subsampled['subcluster_color'])
res = ax.set_title("Non-neuron Cell Types: Subcluster")
[Figures: UMAPs of neurons and non-neurons colored by subcluster]
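The plots above carry no legend, so the colors cannot be read back to category names. For levels with few categories (for example supercluster or neurotransmitter), one possible approach is to build legend handles from the distinct name and color pairs. This is a sketch on toy data with made-up names; it is not practical for the thousands of subclusters.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Patch

# Toy stand-in for the name and color columns of cell_extended.
toy = pd.DataFrame({
    'x': [0.0, 1.0, 2.0, 3.0],
    'y': [0.0, 1.0, 0.5, 1.5],
    'supercluster': ['Type A', 'Type A', 'Type B', 'Type B'],
    'supercluster_color': ['#003380', '#003380', '#80BAED', '#80BAED'],
})

# One (name, color) pair per category, in order of first appearance.
pairs = toy[['supercluster', 'supercluster_color']].drop_duplicates()
handles = [
    Patch(color=color, label=name)
    for name, color in zip(pairs['supercluster'], pairs['supercluster_color'])
]

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(toy['x'], toy['y'], s=20, color=toy['supercluster_color'].tolist())
legend = ax.legend(handles=handles, title='supercluster', fontsize='small')
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```

The same handles could be passed to ax.legend on the axes returned by plot_umap.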

In part 2 we’ll focus on gene data, including using the UMAP to plot where genes are expressed across the different clusterings.