Consensus Whole Mouse Brain #1: Clustering and Annotations

The Consensus Whole Mouse Brain (WMB) integrated taxonomy is built upon two publicly released, transcriptionally defined WMB taxonomies: one derived from single-cell RNA sequencing (scRNA-seq) data and the other from single-nucleus RNA sequencing (snRNA-seq) data. The Allen Institute for Brain Science (AIBS) taxonomy, based on over 4 million scRNA-seq profiles, defines 5,322 clusters organized hierarchically into 34 classes, 338 subclasses, and 1,201 supertypes. The set of data products for this release can be found here. In parallel, the Broad Institute taxonomy, constructed from 4.4 million snRNA-seq profiles, defines 16 classes, 223 metaclusters, and 5,030 clusters. Integrating these two large-scale taxonomies into a unified framework represents a natural and impactful next step, enabling a consensus view of cell types across the entire mouse brain and benefiting the broader neuroscience community.

To generate this consensus taxonomy, we applied the AIBS Quality Control (QC) and post-integration QC pipelines, retaining 7,651,713 cells and nuclei. Integration of scRNA-seq and snRNA-seq data was performed using scVI, with subsampling by original clusters to mitigate sampling imbalance across cell types and brain regions, followed by projection of all remaining cells into a shared latent space. The same iterative clustering strategy used in the AIBS taxonomy was applied in a hierarchical manner: globally, at nine neighborhood levels, and across eight finer group levels. The resulting comprehensive taxonomy comprises a hierarchically arranged set of cell types with 9 neighborhoods, 43 classes, 414 subclasses, 1,386 supertypes, and 6,721 clusters. A detailed cell type annotation table accompanies the taxonomy, including hierarchical membership, anatomical localization, and neurotransmitter identity. All associated metadata is publicly available as an AWS Public Dataset hosted on Amazon S3 and through the Allen Brain Cell Atlas Access (abc_atlas_access) package.
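
The subsampling step described above can be illustrated with a minimal pandas sketch. This is not the pipeline used to build the taxonomy; the helper `subsample_by_cluster`, the column name `original_cluster`, and the cap of 3 cells per cluster are all assumptions chosen for illustration. It only shows the general idea of capping the number of cells drawn from each cluster so that abundant cell types do not dominate.

```python
import pandas as pd


def subsample_by_cluster(cells: pd.DataFrame,
                         cluster_col: str,
                         max_per_cluster: int,
                         seed: int = 0) -> pd.DataFrame:
    """Keep at most `max_per_cluster` randomly chosen cells per cluster."""
    # Shuffle once, then keep the first `max_per_cluster` rows of each group.
    shuffled = cells.sample(frac=1, random_state=seed)
    return shuffled.groupby(cluster_col, sort=False).head(max_per_cluster)


# Toy data (hypothetical): cluster 'a' is heavily over-represented.
toy_cells = pd.DataFrame({
    'cell_label': [f'cell_{i}' for i in range(10)],
    'original_cluster': ['a'] * 7 + ['b'] * 2 + ['c'],
}).set_index('cell_label')

toy_subsample = subsample_by_cluster(
    toy_cells, 'original_cluster', max_per_cluster=3
)
print(toy_subsample['original_cluster'].value_counts())
```

Small clusters are kept whole while large clusters are capped, which is one simple way to reduce sampling imbalance before fitting an integration model.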

Below we explore this taxonomy and combine it with cell and other metadata, visualizing the data in a 2D embedding and computing summary statistics.

%matplotlib inline

import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, Optional

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object downloads data requested by the user, tracks which files have already been downloaded to your local system, and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a Pandas DataFrame. See the getting_started notebook for more details on using the cache including installing the package.

Change the download_base variable to where you would like to download the data in your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(
    download_base
)

abc_cache.current_manifest
'releases/20251031/manifest.json'

Data overview

Below we list the files available with the Consensus Mouse data release. There are three main packages: Consensus-WMB-AIBS-10X, Consensus-WMB-Macosko-10X, and Consensus-WMB-integrated-taxonomy.

For Consensus-WMB-AIBS-10X there are no new expression matrices: all gene expression is the same as in the previous WMB release, and we reuse those files. See the gene expression tutorial for more details on using the gene expression data. This release provides updated cell metadata. While there is significant overlap between the previous WMB AIBS release and this Consensus WMB release, a fraction of the cells in this metadata are not in the previous release, and vice versa.
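
One way to quantify such an overlap is to compare the `cell_label` indexes of the two metadata tables. The sketch below uses small made-up indexes rather than the real releases; the variable names `previous_labels` and `consensus_labels` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the cell_label index of each release.
previous_labels = pd.Index(
    ['cell_a', 'cell_b', 'cell_c', 'cell_d'], name='cell_label'
)
consensus_labels = pd.Index(
    ['cell_b', 'cell_c', 'cell_d', 'cell_e'], name='cell_label'
)

# Cells present in both releases.
shared = previous_labels.intersection(consensus_labels)
# Cells only in the previous release, and only in the consensus release.
only_previous = previous_labels.difference(consensus_labels)
only_consensus = consensus_labels.difference(previous_labels)

print("shared:", len(shared))
print("only in previous:", list(only_previous))
print("only in consensus:", list(only_consensus))
```

With the real data, the same three calls could be applied to the indexes of the two `cell_metadata` DataFrames once both are loaded.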

Below we list the available metadata files for WMB-AIBS in the consensus release. The expression matrices are part of the previously released WMB-10Xv2 and WMB-10Xv3 packages.

print("Consensus-WMB-AIBS-10X: metadata \n\t",
      abc_cache.list_metadata_files(directory='Consensus-WMB-AIBS-10X'))
Consensus-WMB-AIBS-10X: metadata 
	 ['cell_metadata', 'donor', 'example_gene_expression', 'library', 'value_sets']

Next we list the available expression matrix and metadata files in the WMB-Macosko directories. The expression matrices are divided similarly to those in the WMB-AIBS release, that is, by coarse brain region. The metadata are structured similarly to those in the WMB-AIBS portion of the data listed above.

print("Consensus-WMB-Macosko-10X: gene expression data (h5ad)\n\t",
      abc_cache.list_expression_matrix_files(directory='Consensus-WMB-Macosko-10X'))
print("Consensus-WMB-Macosko-10X: metadata (csv)\n\t",
      abc_cache.list_metadata_files(directory='Consensus-WMB-Macosko-10X'))
Consensus-WMB-Macosko-10X: gene expression data (h5ad)
	 ['Macosko-10X-CB/log2', 'Macosko-10X-CB/raw', 'Macosko-10X-HPF/log2', 'Macosko-10X-HPF/raw', 'Macosko-10X-HY/log2', 'Macosko-10X-HY/raw', 'Macosko-10X-Isocortex/log2', 'Macosko-10X-Isocortex/raw', 'Macosko-10X-MB/log2', 'Macosko-10X-MB/raw', 'Macosko-10X-MY-Pons-BS/log2', 'Macosko-10X-MY-Pons-BS/raw', 'Macosko-10X-OLF/log2', 'Macosko-10X-OLF/raw', 'Macosko-10X-PAL/log2', 'Macosko-10X-PAL/raw', 'Macosko-10X-STR/log2', 'Macosko-10X-STR/raw', 'Macosko-10X-TH/log2', 'Macosko-10X-TH/raw']
Consensus-WMB-Macosko-10X: metadata (csv)
	 ['cell_metadata', 'donor', 'example_gene_expression', 'gene', 'library', 'value_sets']

Finally, we list the metadata files that make up the consensus taxonomy. This data includes 2d projections for all cells as well as the cells in each neighborhood. We’ll show how to join the taxonomy with the cell metadata files listed above later in this notebook.

print("Consensus-WMB-integrated-taxonomy: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='Consensus-WMB-integrated-taxonomy'))
Consensus-WMB-integrated-taxonomy: metadata (csv)
	 ['HY-EA-Glut-GABA_cell_2d_embedding_coordinates', 'MB-GABA_cell_2d_embedding_coordinates', 'MB-Glut-Dopa-Sero_cell_2d_embedding_coordinates', 'NN-IMN_cell_2d_embedding_coordinates', 'P-MY-CB-GABA_cell_2d_embedding_coordinates', 'P-MY-CB-Glut_cell_2d_embedding_coordinates', 'Pallium-Glut_cell_2d_embedding_coordinates', 'Subpallium-GABA_cell_2d_embedding_coordinates', 'TH-EPI-Glut_cell_2d_embedding_coordinates', 'cell_2d_embedding_coordinates', 'cell_to_cluster_membership', 'cluster', 'cluster_annotation_term', 'cluster_annotation_term_set', 'cluster_to_cluster_annotation_membership']

Cell metadata

Below we load the metadata for each cell in both the WMB-Macosko and WMB-AIBS portions of the data. These tables contain basic information: the cell’s ID, its barcode and barcoded_cell_sample (if available), the library the cell comes from, and two columns identifying the h5ad file in which the cell’s gene expression is located.

macosko_cell_metadata = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-Macosko-10X',
    file_name='cell_metadata',
    dtype={'cell_label': str}
).set_index('cell_label')
print("Number of cells = ", len(macosko_cell_metadata))
macosko_cell_metadata.head()
cell_metadata.csv: 100%|██████████| 506M/506M [01:31<00:00, 5.55MMB/s]    
Number of cells =  3736281
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label
cell_label
pBICCNsMMrBSL1aiM007d190529_ACTTCCGGTGGTCCCA ACTTCCGGTGGTCCCA NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS
pBICCNsMMrBSL1aiM007d190529_AACCTTTGTTAAGTCC AACCTTTGTTAAGTCC NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS
pBICCNsMMrBSL1aiM007d190529_TTTCCTCTCACCGGTG TTTCCTCTCACCGGTG NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS
pBICCNsMMrBSL1aiM007d190529_CACACAACATCATCCC CACACAACATCATCCC NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS
pBICCNsMMrBSL1aiM007d190529_ACTATCTCAGTTAAAG ACTATCTCAGTTAAAG NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS
aibs_cell_metadata = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-AIBS-10X',
    file_name='cell_metadata',
    dtype={'cell_label': str}
).set_index('cell_label')
print("Number of cells = ", len(aibs_cell_metadata))
aibs_cell_metadata.head()
cell_metadata.csv: 100%|██████████| 375M/375M [01:04<00:00, 5.79MMB/s]    
Number of cells =  3915432
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label
cell_label
GCGAGAAGTTAAGGGC-410_B05 GCGAGAAGTTAAGGGC 410_B05 L8TX_201030_01_C12 WMB-10Xv3 WMB-10Xv3-HPF
AATGGCTCAGCTCCTT-411_B06 AATGGCTCAGCTCCTT 411_B06 L8TX_201029_01_E10 WMB-10Xv3 WMB-10Xv3-HPF
AACACACGTTGCTTGA-410_B05 AACACACGTTGCTTGA 410_B05 L8TX_201030_01_C12 WMB-10Xv3 WMB-10Xv3-HPF
CACAGATAGAGGCGGA-410_A05 CACAGATAGAGGCGGA 410_A05 L8TX_201029_01_A10 WMB-10Xv3 WMB-10Xv3-HPF
GATCGTATCGAATCCA-411_B06 GATCGTATCGAATCCA 411_B06 L8TX_201029_01_E10 WMB-10Xv3 WMB-10Xv3-HPF

We can use the pandas groupby function to count the unique values in each field and list them when the number of unique values is small.

def print_column_info(df):
    
    for c in df.columns:
        grouped = df[[c]].groupby(c).count()
        members = ''
        if len(grouped) < 30:
            members = str(list(grouped.index))
        print("Number of unique %s = %d %s" % (c, len(grouped), members))
print_column_info(pd.concat([aibs_cell_metadata, macosko_cell_metadata]))
Number of unique cell_barcode = 3580247 
Number of unique barcoded_cell_sample_label = 781 
Number of unique library_label = 1434 
Number of unique dataset_label = 3 ['Consensus-WMB-Macosko-10X', 'WMB-10Xv2', 'WMB-10Xv3']
Number of unique feature_matrix_label = 33 
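
As a compact cross-check of the counts above, pandas also provides `DataFrame.nunique`, which counts distinct non-null values per column. The sketch below runs on a small made-up table (the values are hypothetical), not on the real cell metadata.

```python
import pandas as pd

# Toy metadata (hypothetical values) with one missing sample label.
toy_metadata = pd.DataFrame({
    'cell_barcode': ['AAAC', 'AAAG', 'AAAC', 'AAAT'],
    'dataset_label': ['WMB-10Xv2', 'WMB-10Xv3',
                      'Consensus-WMB-Macosko-10X', 'WMB-10Xv3'],
    'barcoded_cell_sample_label': ['s1', 's2', None, 's2'],
})

# Number of distinct non-null values in each column.
unique_counts = toy_metadata.nunique()
print(unique_counts)

# List the members of a column when there are only a few of them.
dataset_members = sorted(toy_metadata['dataset_label'].dropna().unique())
print(dataset_members)
```

Like the groupby-based helper, `nunique` ignores missing values by default, so the two approaches should agree on columns that contain NaN entries.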

Library and Donor metadata

Next we load metadata associated with each dataset’s libraries and donors.

Below we load the library metadata. The primary information we’ll be using from these tables are the anatomical region the sample originated from and the id of the donor the library came from.

macosko_library = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-Macosko-10X',
    file_name='library'
).set_index('library_label')
macosko_library.head()
library.csv: 100%|██████████| 27.7k/27.7k [00:00<00:00, 226kMB/s]
region_of_interest_acronym anatomical_division_label donor_label
library_label
pBICCNsMMrACAiF019d210630A1 ACA Isocortex F019
pBICCNsMMrACAiF019d210630A2 ACA Isocortex F019
pBICCNsMMrACAiF019d210630A3 ACA Isocortex F019
pBICCNsMMrACAiF019d210630A4 ACA Isocortex F019
pBICCNsMMrACAiF019d210630A5 ACA Isocortex F019
aibs_library = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-AIBS-10X',
    file_name='library'
).set_index('library_label')
aibs_library.head()
library.csv: 100%|██████████| 59.0k/59.0k [00:00<00:00, 443kMB/s]
library_method alignment_id region_of_interest_acronym anatomical_division_label donor_label
library_label
L8TX_171026_01_A04 10Xv2 1186619234 MOp Isocortex Snap25-IRES2-Cre;Ai14-352353
L8TX_171026_01_A05 10Xv2 1178482616 MOp Isocortex Snap25-IRES2-Cre;Ai14-352356
L8TX_171026_01_B04 10Xv2 1178483191 MOp Isocortex Snap25-IRES2-Cre;Ai14-352353
L8TX_171026_01_B05 10Xv2 1178482921 MOp Isocortex Snap25-IRES2-Cre;Ai14-352356
L8TX_171026_01_C05 10Xv2 1186619314 MOp Isocortex Snap25-IRES2-Cre;Ai14-352357

Finally, we’ll load the donor metadata. This provides a formalized column recording where the sample originated (Macosko or AIBS) and the sex of the donor. For the WMB-AIBS data, we also have the age of the donor at death.

macosko_donor = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-Macosko-10X',
    file_name='donor'
).set_index('donor_label')
macosko_donor.head()
donor.csv: 100%|██████████| 1.29k/1.29k [00:00<00:00, 16.3kMB/s]
donor_sex origin_dataset
donor_label
1F1 Female WMB-Macosko
1F3 Female WMB-Macosko
1F5 Female WMB-Macosko
1F6 Female WMB-Macosko
1M1 Male WMB-Macosko
aibs_donor = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-AIBS-10X',
    file_name='donor'
).set_index('donor_label')
aibs_donor.head()
donor.csv: 100%|██████████| 16.6k/16.6k [00:00<00:00, 188kMB/s]
donor_sex donor_age origin_dataset
donor_label
Gad2-IRES-Cre;Ai14-529270 Male 61 days WMB-AIBS
Gad2-IRES-Cre;Ai14-529271 Male 66 days WMB-AIBS
Gad2-IRES-Cre;Ai14-529272 Female 62 days WMB-AIBS
Gad2-IRES-Cre;Ai14-529273 Female 66 days WMB-AIBS
Gad2-IRES-Cre;Ai14-558836 Male 56 days WMB-AIBS

Now that we’ve loaded the additional metadata, we’ll join them into the cell metadata tables on the library and donor label.

macosko_cell_extended = macosko_cell_metadata.join(macosko_library, on='library_label')
macosko_cell_extended = macosko_cell_extended.join(macosko_donor, on='donor_label')
aibs_cell_extended = aibs_cell_metadata.join(aibs_library, on='library_label')
aibs_cell_extended = aibs_cell_extended.join(aibs_donor, on='donor_label')
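
The chained joins above follow a common pattern: each cell row carries a key column, and `DataFrame.join(..., on=key)` looks that key up in the index of the other table. The following sketch reproduces the pattern on toy tables (all labels are made up) and adds a simple check for rows that failed to match, which is not part of the original notebook.

```python
import pandas as pd

# Toy tables (hypothetical labels) mirroring cell -> library -> donor.
toy_cells = pd.DataFrame(
    {'library_label': ['lib1', 'lib1', 'lib2']},
    index=pd.Index(['c1', 'c2', 'c3'], name='cell_label'),
)
toy_library = pd.DataFrame(
    {'anatomical_division_label': ['HPF', 'TH'],
     'donor_label': ['d1', 'd2']},
    index=pd.Index(['lib1', 'lib2'], name='library_label'),
)
toy_donor = pd.DataFrame(
    {'donor_sex': ['Female', 'Male']},
    index=pd.Index(['d1', 'd2'], name='donor_label'),
)

# Left joins: every cell row is kept, and library/donor columns are
# looked up through the key columns.
toy_extended = (
    toy_cells
    .join(toy_library, on='library_label')
    .join(toy_donor, on='donor_label')
)

# A left join leaves NaN where a key has no match, so counting NaN
# values is a quick sanity check.
unmatched = int(toy_extended['donor_sex'].isna().sum())
print(toy_extended)
print("cells without donor information:", unmatched)
```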

Below we use pandas groupby functionality to count the number of cells in each of the two datasets, AIBS and Macosko. Then we show the breakdown of cell count by anatomical region.

pd.concat([aibs_cell_extended, macosko_cell_extended]).groupby('origin_dataset')[['cell_barcode']].count()
cell_barcode
origin_dataset
WMB-AIBS 3915432
WMB-Macosko 3736281
pd.concat([aibs_cell_extended, macosko_cell_extended]).groupby('anatomical_division_label')[['cell_barcode']].count()
cell_barcode
anatomical_division_label
CB 645731
CTXsp 119786
HPF 704031
HY 347613
Isocortex 2125008
MB 974404
MY-Pons-BS 981716
OLF 433855
PAL 311432
STR 421355
TH 586782

Adding color and feature order

In anticipation of plotting these cells and their metadata, we’ll load a lookup table that maps values in each of our loaded tables to color, ontological ordering, and (if available) external identifiers that represent these data.

value_sets = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-AIBS-10X',
    file_name='value_sets'
).set_index('label')
value_sets.head()
value_sets.csv: 100%|██████████| 2.47k/2.47k [00:00<00:00, 31.1kMB/s]
field table description order external_identifier parent_label color_hex_triplet comment
label
Female donor_sex donor Female 1 NaN NaN #565353 NaN
Male donor_sex donor Male 2 NaN NaN #ADC4C3 NaN
WMB-AIBS origin_dataset donor Allen Institute for Brain Science, Whole Mouse... 1 NaN NaN #1f77b4 NaN
WMB-Macosko origin_dataset donor Broad Institute, Macosko lab Whole Mouse Brain... 2 NaN NaN #ff7f0e NaN
Isocortex anatomical_division_label library Isocortex 1 MBA:315 NaN #70FF71 division, ID and parent from CCF-2020

The convenience function below extracts the color and order information and adds it to our DataFrames.

def extract_value_set(cell_metadata_df: pd.DataFrame, input_value_set: pd.DataFrame, input_value_set_label: str):
    """Add color and order columns to the cell metadata dataframe based on the input
    value set.

    Columns are added as {input_value_set_label}_color and {input_value_set_label}_order.

    Parameters
    ----------
    cell_metadata_df : pd.DataFrame
        DataFrame containing cell metadata.
    input_value_set : pd.DataFrame
        DataFrame containing the value set information.
    input_value_set_label : str
        The column name to extract color and order information for. The new columns are added to the cell metadata.
    """
    cell_metadata_df[f'{input_value_set_label}_color'] = input_value_set[
        input_value_set['field'] == input_value_set_label
    ].loc[cell_metadata_df[input_value_set_label]]['color_hex_triplet'].values
    cell_metadata_df[f'{input_value_set_label}_order'] = input_value_set[
        input_value_set['field'] == input_value_set_label
    ].loc[cell_metadata_df[input_value_set_label]]['order'].values
# Add origin dataset color and order
extract_value_set(macosko_cell_extended, value_sets, 'origin_dataset')
# Add donor sex color and order
extract_value_set(macosko_cell_extended, value_sets, 'donor_sex')
# Add anatomical division color and order
extract_value_set(macosko_cell_extended, value_sets, 'anatomical_division_label')
macosko_cell_extended.head()
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label region_of_interest_acronym anatomical_division_label donor_label donor_sex origin_dataset origin_dataset_color origin_dataset_order donor_sex_color donor_sex_order anatomical_division_label_color anatomical_division_label_order
cell_label
pBICCNsMMrBSL1aiM007d190529_ACTTCCGGTGGTCCCA ACTTCCGGTGGTCCCA NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS BS MY-Pons-BS M007 Male WMB-Macosko #ff7f0e 2 #ADC4C3 2 #FF9BCD 12
pBICCNsMMrBSL1aiM007d190529_AACCTTTGTTAAGTCC AACCTTTGTTAAGTCC NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS BS MY-Pons-BS M007 Male WMB-Macosko #ff7f0e 2 #ADC4C3 2 #FF9BCD 12
pBICCNsMMrBSL1aiM007d190529_TTTCCTCTCACCGGTG TTTCCTCTCACCGGTG NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS BS MY-Pons-BS M007 Male WMB-Macosko #ff7f0e 2 #ADC4C3 2 #FF9BCD 12
pBICCNsMMrBSL1aiM007d190529_CACACAACATCATCCC CACACAACATCATCCC NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS BS MY-Pons-BS M007 Male WMB-Macosko #ff7f0e 2 #ADC4C3 2 #FF9BCD 12
pBICCNsMMrBSL1aiM007d190529_ACTATCTCAGTTAAAG ACTATCTCAGTTAAAG NaN pBICCNsMMrBSL1aiM007d190529 Consensus-WMB-Macosko-10X Macosko-10X-MY-Pons-BS BS MY-Pons-BS M007 Male WMB-Macosko #ff7f0e 2 #ADC4C3 2 #FF9BCD 12
# Add origin dataset color and order
extract_value_set(aibs_cell_extended, value_sets, 'origin_dataset')
# Add donor sex color and order
extract_value_set(aibs_cell_extended, value_sets, 'donor_sex')
# Add anatomical division color and order
extract_value_set(aibs_cell_extended, value_sets, 'anatomical_division_label')
aibs_cell_extended.head()
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label library_method alignment_id region_of_interest_acronym anatomical_division_label donor_label donor_sex donor_age origin_dataset origin_dataset_color origin_dataset_order donor_sex_color donor_sex_order anatomical_division_label_color anatomical_division_label_order
cell_label
GCGAGAAGTTAAGGGC-410_B05 GCGAGAAGTTAAGGGC 410_B05 L8TX_201030_01_C12 WMB-10Xv3 WMB-10Xv3-HPF 10Xv3 1177903638 RHP HPF Snap25-IRES2-Cre;Ai14-550850 Female 53 days WMB-AIBS #1f77b4 1 #565353 1 #7ED04B 6
AATGGCTCAGCTCCTT-411_B06 AATGGCTCAGCTCCTT 411_B06 L8TX_201029_01_E10 WMB-10Xv3 WMB-10Xv3-HPF 10Xv3 1177903464 RHP HPF Snap25-IRES2-Cre;Ai14-550851 Female 53 days WMB-AIBS #1f77b4 1 #565353 1 #7ED04B 6
AACACACGTTGCTTGA-410_B05 AACACACGTTGCTTGA 410_B05 L8TX_201030_01_C12 WMB-10Xv3 WMB-10Xv3-HPF 10Xv3 1177903638 RHP HPF Snap25-IRES2-Cre;Ai14-550850 Female 53 days WMB-AIBS #1f77b4 1 #565353 1 #7ED04B 6
CACAGATAGAGGCGGA-410_A05 CACAGATAGAGGCGGA 410_A05 L8TX_201029_01_A10 WMB-10Xv3 WMB-10Xv3-HPF 10Xv3 1177903446 RHP HPF Snap25-IRES2-Cre;Ai14-550850 Female 53 days WMB-AIBS #1f77b4 1 #565353 1 #7ED04B 6
GATCGTATCGAATCCA-411_B06 GATCGTATCGAATCCA 411_B06 L8TX_201029_01_E10 WMB-10Xv3 WMB-10Xv3-HPF 10Xv3 1177903464 RHP HPF Snap25-IRES2-Cre;Ai14-550851 Female 53 days WMB-AIBS #1f77b4 1 #565353 1 #7ED04B 6
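
The lookup performed by extract_value_set can also be written with `Series.map`, shown here as a sketch on a toy value set. The three rows below copy the labels and colors displayed in the value_sets table above; the toy cell table is made up. Unlike the `.loc` lookup, `map` returns NaN for a value missing from the value set instead of raising a KeyError, so which form is preferable depends on whether a missing value should fail loudly.

```python
import pandas as pd

# Toy value set: rows taken from the value_sets preview shown above.
toy_value_sets = pd.DataFrame({
    'label': ['Female', 'Male', 'WMB-AIBS'],
    'field': ['donor_sex', 'donor_sex', 'origin_dataset'],
    'order': [1, 2, 1],
    'color_hex_triplet': ['#565353', '#ADC4C3', '#1f77b4'],
}).set_index('label')

# Toy cell table (hypothetical cells).
toy_cells = pd.DataFrame(
    {'donor_sex': ['Male', 'Female', 'Male']},
    index=['c1', 'c2', 'c3'],
)

# Restrict the value set to one field, then map each cell's value to
# its color and order through the label index.
field_values = toy_value_sets[toy_value_sets['field'] == 'donor_sex']
toy_cells['donor_sex_color'] = toy_cells['donor_sex'].map(
    field_values['color_hex_triplet']
)
toy_cells['donor_sex_order'] = toy_cells['donor_sex'].map(
    field_values['order']
)
print(toy_cells)
```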

UMAP spatial embedding

Now that we have metadata with color information, we can use the Uniform Manifold Approximation and Projection (UMAP) provided with this consensus mouse release to visualize the information.

Below we load the projection and join it into a combined set of WMB-AIBS and WMB-Macosko cell metadata.

cell_2d_embedding_coordinates = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-integrated-taxonomy',
    file_name='cell_2d_embedding_coordinates'
).set_index('cell_label')
cell_2d_embedding_coordinates.head()
cell_2d_embedding_coordinates.csv: 100%|██████████| 490M/490M [01:37<00:00, 5.01MMB/s]    
x y
cell_label
GCGAGAAGTTAAGGGC-410_B05 16.037980 3.101109
AATGGCTCAGCTCCTT-411_B06 15.951514 3.144049
AACACACGTTGCTTGA-410_B05 15.900673 3.124507
CACAGATAGAGGCGGA-410_A05 16.062553 3.185574
GATCGTATCGAATCCA-411_B06 15.971468 3.124298
cell_extended = pd.concat([aibs_cell_extended, macosko_cell_extended]).join(cell_2d_embedding_coordinates, how='inner')
cell_extended = cell_extended.sample(frac=1)

del cell_2d_embedding_coordinates
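
The two steps above can be tried on toy data. An inner join keeps only the cells that have embedding coordinates, and `sample(frac=1)` shuffles the row order; shuffling matters for scatter plots because points drawn later cover earlier ones, so an unshuffled concatenation would tend to draw one dataset on top of the other. The cells, coordinates, and `random_state` below are assumptions for illustration; the notebook itself shuffles without a fixed seed.

```python
import pandas as pd

# Toy cell table (hypothetical): AIBS rows first, then Macosko rows.
toy_cells = pd.DataFrame(
    {'origin_dataset': ['WMB-AIBS'] * 3 + ['WMB-Macosko'] * 3},
    index=pd.Index([f'c{i}' for i in range(6)], name='cell_label'),
)
# Toy coordinates: cell c5 has no embedding.
toy_coords = pd.DataFrame(
    {'x': [0.1, 0.2, 0.3, 0.4, 0.5], 'y': [1.0, 1.1, 1.2, 1.3, 1.4]},
    index=pd.Index(['c0', 'c1', 'c2', 'c3', 'c4'], name='cell_label'),
)

# Inner join drops cells that are missing from either table.
toy_joined = toy_cells.join(toy_coords, how='inner')

# Shuffle the rows; a fixed random_state makes the order reproducible.
toy_shuffled = toy_joined.sample(frac=1, random_state=0)
print(toy_shuffled)
```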

We define a small helper function plot_umap to visualize the cells on the UMAP. In the examples below we will plot the cells colored by origin dataset, donor sex, and anatomical division.

def plot_umap(
    xx: np.ndarray,
    yy: np.ndarray,
    cc: np.ndarray = None,
    val: np.ndarray = None,
    fig_width: float = 8,
    fig_height: float = 8,
    cmap: Optional[plt.Colormap] = None,
    labels: np.ndarray = None,
    term_orders: np.ndarray = None,
    colorbar: bool = False,
    sizes: np.ndarray = None,
    fig: plt.Figure = None,
    ax: plt.Axes = None,
 ) -> Tuple[plt.Figure, plt.Axes]:
    """
    Plot a scatter plot of the UMAP coordinates.

    Parameters
    ----------
    xx : array-like
        x-coordinates of the points to plot.
    yy : array-like
        y-coordinates of the points to plot.
    cc : array-like, optional
        colors of the points to plot. Used when `cmap` is None.
    val : array-like, optional
        values of the points to plot. Mapped to colors through `cmap` when `cmap` is given.
    fig_width : float, optional
        width of the figure in inches. Default is 8.
    fig_height : float, optional
        height of the figure in inches. Default is 8.
    cmap : matplotlib.colors.Colormap, optional
        colormap used to color the points by `val`. If None, the points will be colored by the values in `cc`.
    labels : array-like, optional
        labels for the points to plot, used to build the legend together with `cc`. If None, no legend will be added to the plot.
    term_orders : array-like, optional
        order of the labels for the legend. If None, the labels will be ordered by their appearance in `labels`.
    colorbar : bool, optional
        whether to add a colorbar to the plot. Default is False.
    sizes : array-like, optional
        passed to the `alpha` argument of the scatter call, so it controls point opacity; the marker size itself is fixed at 0.5. If None, all points use alpha 1.
    fig : matplotlib.figure.Figure, optional
        figure to plot on. If None, a new figure will be created with 1 subplot.
    ax : matplotlib.axes.Axes, optional
        axes to plot on. If None, a new figure will be created with 1 subplot.
    """
    if sizes is None:
        sizes = 1
    if ax is None or fig is None:
        fig, ax = plt.subplots()
        fig.set_size_inches(fig_width, fig_height)

    if cmap is not None:
        scatt = ax.scatter(xx, yy, c=val, s=0.5, marker='.', cmap=cmap, alpha=sizes)
    elif cc is not None:
        scatt = ax.scatter(xx, yy, c=cc, s=0.5, marker='.', alpha=sizes)

    if labels is not None:
        from matplotlib.patches import Rectangle
        unique_label_colors = (labels + ',' + cc).unique()
        unique_labels = np.array([label_color.split(',')[0] for label_color in unique_label_colors])
        unique_colors = np.array([label_color.split(',')[1] for label_color in unique_label_colors])

        if term_orders is not None:
            unique_order = term_orders.unique()
            term_order = np.argsort(unique_order)
            unique_labels = unique_labels[term_order]
            unique_colors = unique_colors[term_order]
            
        rects = []
        for color in unique_colors:
            rects.append(Rectangle((0, 0), 1, 1, fc=color))

        legend = ax.legend(rects, unique_labels, loc=0)
        # ax.add_artist(legend)
    
    ax.set_xticks([])
    ax.set_yticks([])

    if colorbar:
        fig.colorbar(scatt, ax=ax)
    
    return fig, ax
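
plot_umap builds its legend by joining each label and color with a comma and splitting the pair again, which assumes that no label contains a comma and that each label has exactly one color and one order. A possible alternative, sketched below with a hypothetical helper named `legend_entries`, derives the same unique (label, color) pairs with `drop_duplicates` and sorts them by the order column. It is an illustration only and is not used by the notebook.

```python
import pandas as pd


def legend_entries(labels: pd.Series,
                   colors: pd.Series,
                   orders: pd.Series) -> pd.DataFrame:
    """Return unique (label, color) pairs sorted by their term order."""
    table = pd.DataFrame({
        'label': labels.to_numpy(),
        'color': colors.to_numpy(),
        'order': orders.to_numpy(),
    })
    # One row per distinct combination, arranged in legend order.
    return (
        table.drop_duplicates()
        .sort_values('order')
        .reset_index(drop=True)
    )


# Toy rows using the origin_dataset colors and orders shown earlier.
toy = pd.DataFrame({
    'origin_dataset': ['WMB-Macosko', 'WMB-AIBS', 'WMB-Macosko'],
    'origin_dataset_color': ['#ff7f0e', '#1f77b4', '#ff7f0e'],
    'origin_dataset_order': [2, 1, 2],
})
entries = legend_entries(
    toy['origin_dataset'],
    toy['origin_dataset_color'],
    toy['origin_dataset_order'],
)
print(entries)
```

The resulting `label` and `color` columns could then be turned into legend handles in the same way plot_umap creates its colored rectangles.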

Below we visualize the location of cells colored by origin_dataset, donor_sex, and the anatomical region the cells belong to.

# Select every 10th cell for plotting
sub_selected = cell_extended[::10]
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['origin_dataset_color'],
    labels=sub_selected['origin_dataset'],
    term_orders=sub_selected['origin_dataset_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Origin Dataset")
plt.show()
[Figure: UMAP of cells colored by origin dataset]
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['donor_sex_color'],
    labels=sub_selected['donor_sex'],
    term_orders=sub_selected['donor_sex_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Donor Sex")
plt.show()
[Figure: UMAP of cells colored by donor sex]
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['anatomical_division_label_color'],
    labels=sub_selected['anatomical_division_label'],
    term_orders=sub_selected['anatomical_division_label_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Anatomical Region")
plt.show()
[Figure: UMAP of cells colored by anatomical region]

Taxonomy Information

The final set of metadata we load into our extended cell metadata file maps the cells into their assigned cluster in the taxonomy. We additionally load metadata for the clusters and compute useful information, such as the number of cells in each taxon at each level of the taxonomy.

First, we load information associated with each Cluster in the taxonomy. This includes a useful alias value for each cluster as well as the number of cells in each cluster.

cluster = abc_cache.get_metadata_dataframe('Consensus-WMB-integrated-taxonomy', 'cluster').set_index('label')
cluster.head()
cluster.csv: 100%|██████████| 210k/210k [00:00<00:00, 1.06MMB/s] 
cluster_alias number_of_cells
label
CS20251031_CLUS_02721 2721 16355
CS20251031_CLUS_16574 16574 1519
CS20251031_CLUS_01736 1736 307
CS20251031_CLUS_01737 1737 825
CS20251031_CLUS_01743 1743 4671

Next, we load the table that describes the levels in the taxonomy from Neighborhood at the highest to Cluster at the lowest level.

cluster_annotation_term_set = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-integrated-taxonomy',
    file_name='cluster_annotation_term_set',
    skip_hash_check=True
).set_index('label')
cluster_annotation_term_set
cluster_annotation_term_set.csv: 100%|██████████| 388/388 [00:00<00:00, 5.97kMB/s]
name description order parent_term_set_label
label
CCN20251031_NEUR neurotransmitter neurotransmitter 0 NaN
CCN20251031_LEVEL_0 neighborhood neighborhood 1 NaN
CCN20251031_LEVEL_1 class class 2 CCN20251031_LEVEL_0
CCN20251031_LEVEL_2 subclass subclass 3 CCN20251031_LEVEL_1
CCN20251031_LEVEL_3 supertype supertype 4 CCN20251031_LEVEL_2
CCN20251031_LEVEL_4 cluster cluster 5 CCN20251031_LEVEL_3

We load the annotation information defining all the taxa at all levels in the taxonomy. This also includes the term order and color associated with each taxon, which we will use for plotting later.

cluster_annotation_term = abc_cache.get_metadata_dataframe('Consensus-WMB-integrated-taxonomy', 'cluster_annotation_term').set_index('label')
cluster_annotation_term.head()
cluster_annotation_term.csv: 100%|██████████| 1.38M/1.38M [00:00<00:00, 3.73MMB/s]
name cluster_annotation_term_set_label cluster_annotation_term_set_name color_hex_triplet term_order term_set_order parent_term_label parent_term_name parent_term_set_label
label
CS20251031_NEUR_0004 Chol CCN20251031_NEUR neurotransmitter #73E785 4 0 NaN NaN NaN
CS20251031_NEUR_0012 Chol-Dopa CCN20251031_NEUR neurotransmitter #B8EC68 12 0 NaN NaN NaN
CS20251031_NEUR_0008 Dopa CCN20251031_NEUR neurotransmitter #fcf04b 8 0 NaN NaN NaN
CS20251031_NEUR_0002 GABA CCN20251031_NEUR neurotransmitter #FF3358 2 0 NaN NaN NaN
CS20251031_NEUR_0006 GABA-Chol CCN20251031_NEUR neurotransmitter #000080 6 0 NaN NaN NaN

Finally, we load the cluster to cluster annotation membership table. Each row in this table is a mapping between a cluster and a taxon in the taxonomy, including the clusters themselves. We’ll use this table in groupbys to count the number of clusters at each taxonomy level and to sum the number of cells in each taxon at all levels.

cluster_to_cluster_annotation_membership = abc_cache.get_metadata_dataframe(
    directory='Consensus-WMB-integrated-taxonomy',
    file_name='cluster_to_cluster_annotation_membership'
).set_index('cluster_annotation_term_label')
cluster_to_cluster_annotation_membership.head()
cluster_to_cluster_annotation_membership.csv: 100%|██████████| 3.07M/3.07M [00:00<00:00, 4.67MMB/s]
cluster_annotation_term_set_name cluster_annotation_term_name cluster_alias cluster_annotation_term_set_label
cluster_annotation_term_label
CS20251031_NEUR_0001 neurotransmitter Glut 2721 CCN20251031_NEUR
CS20251031_NEUR_0001 neurotransmitter Glut 16574 CCN20251031_NEUR
CS20251031_NEUR_0001 neurotransmitter Glut 1736 CCN20251031_NEUR
CS20251031_NEUR_0001 neurotransmitter Glut 1737 CCN20251031_NEUR
CS20251031_NEUR_0001 neurotransmitter Glut 1743 CCN20251031_NEUR
membership_with_cluster_info = cluster_to_cluster_annotation_membership.join(
    cluster.reset_index().set_index('cluster_alias')[['number_of_cells']],
    on='cluster_alias'
)
membership_with_cluster_info = membership_with_cluster_info.join(cluster_annotation_term, rsuffix='_anno_term').reset_index()
membership_groupby = membership_with_cluster_info.groupby(
    ['cluster_alias', 'cluster_annotation_term_set_name']
)
membership_with_cluster_info.head()
cluster_annotation_term_label cluster_annotation_term_set_name cluster_annotation_term_name cluster_alias cluster_annotation_term_set_label number_of_cells name cluster_annotation_term_set_label_anno_term cluster_annotation_term_set_name_anno_term color_hex_triplet term_order term_set_order parent_term_label parent_term_name parent_term_set_label
0 CS20251031_NEUR_0001 neurotransmitter Glut 2721 CCN20251031_NEUR 16355 Glut CCN20251031_NEUR neurotransmitter #2B93DF 1 0 NaN NaN NaN
1 CS20251031_NEUR_0001 neurotransmitter Glut 16574 CCN20251031_NEUR 1519 Glut CCN20251031_NEUR neurotransmitter #2B93DF 1 0 NaN NaN NaN
2 CS20251031_NEUR_0001 neurotransmitter Glut 1736 CCN20251031_NEUR 307 Glut CCN20251031_NEUR neurotransmitter #2B93DF 1 0 NaN NaN NaN
3 CS20251031_NEUR_0001 neurotransmitter Glut 1737 CCN20251031_NEUR 825 Glut CCN20251031_NEUR neurotransmitter #2B93DF 1 0 NaN NaN NaN
4 CS20251031_NEUR_0001 neurotransmitter Glut 1743 CCN20251031_NEUR 4671 Glut CCN20251031_NEUR neurotransmitter #2B93DF 1 0 NaN NaN NaN
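
The counting described above can be sketched with a single groupby aggregation. The toy table below mimics the relevant columns of membership_with_cluster_info, but its term names and cell counts are made up; with the real table the same calls would apply to membership_with_cluster_info directly.

```python
import pandas as pd

# Toy membership table (hypothetical): three clusters, two classes.
toy_membership = pd.DataFrame({
    'cluster_annotation_term_set_name': ['class', 'class', 'class',
                                         'cluster', 'cluster', 'cluster'],
    'cluster_annotation_term_name': ['class A', 'class A', 'class B',
                                     'clus 1', 'clus 2', 'clus 3'],
    'cluster_alias': [1, 2, 3, 1, 2, 3],
    'number_of_cells': [100, 50, 10, 100, 50, 10],
})

# Clusters and cells per taxon, at every level of the taxonomy.
taxon_summary = toy_membership.groupby(
    ['cluster_annotation_term_set_name', 'cluster_annotation_term_name']
).agg(
    number_of_clusters=('cluster_alias', 'nunique'),
    number_of_cells=('number_of_cells', 'sum'),
)
print(taxon_summary)

# Number of distinct taxa at each level.
terms_per_level = toy_membership.groupby(
    'cluster_annotation_term_set_name'
)['cluster_annotation_term_name'].nunique()
print(terms_per_level)
```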

From the membership table, we create three tables via a groupby. First, the name of each cluster and of its parent taxa at every level, indexed by cluster alias.

# Pivot so that each row is a cluster and each column is a taxonomy level.
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[cluster_annotation_term_set['name']] # order columns
cluster_details.sort_values(['neighborhood', 'class', 'subclass', 'supertype', 'cluster'], inplace=True)
cluster_details.head()
cluster_annotation_term_set_name neurotransmitter neighborhood class subclass supertype cluster
cluster_alias
6562 GABA HY-EA-Glut-GABA 011 CNU-HYa GABA 090 MEA-BST_Lhx6:Nfib_Gaba 0376 MEA-BST_Lhx6:Nfib_Gaba 1 1543 MEA-BST_Lhx6:Nfib_Gaba 1
6567 GABA HY-EA-Glut-GABA 011 CNU-HYa GABA 090 MEA-BST_Lhx6:Nfib_Gaba 0376 MEA-BST_Lhx6:Nfib_Gaba 1 1544 MEA-BST_Lhx6:Nfib_Gaba 1
6576 GABA HY-EA-Glut-GABA 011 CNU-HYa GABA 090 MEA-BST_Lhx6:Nfib_Gaba 0376 MEA-BST_Lhx6:Nfib_Gaba 1 1545 MEA-BST_Lhx6:Nfib_Gaba 1
6578 GABA HY-EA-Glut-GABA 011 CNU-HYa GABA 090 MEA-BST_Lhx6:Nfib_Gaba 0376 MEA-BST_Lhx6:Nfib_Gaba 1 1546 MEA-BST_Lhx6:Nfib_Gaba 1
6579 GABA HY-EA-Glut-GABA 011 CNU-HYa GABA 090 MEA-BST_Lhx6:Nfib_Gaba 0376 MEA-BST_Lhx6:Nfib_Gaba 1 1547 MEA-BST_Lhx6:Nfib_Gaba 1
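
The `groupby(...).first().unstack()` idiom used above turns a long table (one row per cluster and level) into a wide one (one row per cluster, one column per level). A minimal sketch on a made-up membership table:

```python
import pandas as pd

# Toy long-format table (hypothetical): two clusters in one class.
toy_membership = pd.DataFrame({
    'cluster_alias': [1, 1, 2, 2],
    'cluster_annotation_term_set_name': ['class', 'cluster',
                                         'class', 'cluster'],
    'cluster_annotation_term_name': ['class A', 'clus 1',
                                     'class A', 'clus 2'],
})

# Take the term name for each (cluster, level) pair, then move the
# level from the row index into the columns.
toy_details = (
    toy_membership
    .groupby(['cluster_alias', 'cluster_annotation_term_set_name'])
    ['cluster_annotation_term_name']
    .first()
    .unstack()
)

# unstack sorts the new columns alphabetically; select them explicitly
# to impose the hierarchy order.
toy_details = toy_details[['class', 'cluster']]
print(toy_details)
```

This is why the notebook reorders the columns with the names from cluster_annotation_term_set after unstacking.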

Next, the plotting order of each cluster and of its parents.

cluster_order = membership_groupby['term_order'].first().unstack()
cluster_order.sort_values(['neighborhood', 'class', 'subclass', 'supertype', 'cluster'], inplace=True)
cluster_order.head()
cluster_annotation_term_set_name class cluster neighborhood neurotransmitter subclass supertype
cluster_alias
2721 1 1 1 1 1 1
16574 1 2 1 1 1 1
1736 1 3 1 1 1 2
1737 1 4 1 1 1 2
1743 1 5 1 1 1 2

Finally, the colors we will use to plot each unique taxon at every level.

cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[cluster_annotation_term_set['name']]
cluster_colors.sort_values(
    ['neighborhood', 'class', 'subclass', 'supertype', 'cluster'],
    inplace=True
)
cluster_colors.head()
cluster_annotation_term_set_name neurotransmitter neighborhood class subclass supertype cluster
cluster_alias
8736 #2B93DF #006200 #006200 #002099 #663D41 #8C4599
8734 #2B93DF #006200 #006200 #002099 #99E0FF #410F66
8730 #2B93DF #006200 #006200 #002099 #99E0FF #811799
8732 #2B93DF #006200 #006200 #002099 #99E0FF #FFE4E2
8738 #2B93DF #006200 #006200 #002099 #99E0FF #FFF49B

Next, we bring it all together by loading the mapping of cells to clusters and joining it into our final metadata table.

cell_to_cluster_membership = abc_cache.get_metadata_dataframe('Consensus-WMB-integrated-taxonomy', 'cell_to_cluster_membership').set_index('cell_label')
cell_to_cluster_membership.head()
cell_to_cluster_membership.csv: 100%|██████████| 310M/310M [00:56<00:00, 5.54MMB/s]    
cluster_alias
cell_label
CAGGTGCAGGCTAGCA-040_C01 5491
CGGACGTGTGTGAATA-063_B01 6268
GATCCCTTCGTGCACG-107_B01 6659
TCACGAACAACTGCGC-026_A01 6672
ACACCAAGTCAAACTC-026_B01 7067

We merge this table with information from our clusters.

cell_extended = cell_extended.join(cell_to_cluster_membership, rsuffix='_cell_to_cluster_membership', how='inner')
cell_extended = cell_extended.join(cluster_details, on='cluster_alias')
cell_extended = cell_extended.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended = cell_extended.join(cluster_order, on='cluster_alias', rsuffix='_order')

del cell_to_cluster_membership

cell_extended.head()
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label library_method alignment_id region_of_interest_acronym anatomical_division_label donor_label ... class_color subclass_color supertype_color cluster_color class_order cluster_order neighborhood_order neurotransmitter_order subclass_order supertype_order
cell_label
pBICCNsMMrVISPiF019d210623B1_ATTCCCGCAACTCGAT ATTCCCGCAACTCGAT NaN pBICCNsMMrVISPiF019d210623B1 Consensus-WMB-Macosko-10X Macosko-10X-Isocortex NaN NaN VISp Isocortex F019 ... #FA0087 #E9530F #5C6899 #FFDFA0 1 26 1 1 2 6
pBICCNsMMrMBL5iM013d201007_CTGCCATTCTTACCAT CTGCCATTCTTACCAT NaN pBICCNsMMrMBL5iM013d201007 Consensus-WMB-Macosko-10X Macosko-10X-MB NaN NaN MB MB M013 ... #594a26 #AD5CCC #0F6635 #2C00FF 37 6508 9 25 392 1305
TGAGCCGGTGCGATAG-039_A01 TGAGCCGGTGCGATAG 039_A01 L8TX_180815_01_H08 WMB-10Xv2 WMB-10Xv2-TH 10Xv2 1178465217 TH TH Snap25-IRES2-Cre;Ai14-404124 ... #9EF01A #C2FF26 #2E995C #5C9985 21 3829 6 2 215 800
pBICCNsMMrOLFiM015d201201C2_GTTCGCTAGTAGGAAG GTTCGCTAGTAGGAAG NaN pBICCNsMMrOLFiM015d201201C2 Consensus-WMB-Macosko-10X Macosko-10X-OLF NaN NaN OLF OLF M015 ... #D00000 #7F1FCC #7ACCA8 #006635 3 623 1 1 40 149
TACTTACGTGGGTCAA-096_C01 TACTTACGTGGGTCAA 096_C01 L8TX_190228_01_A12 WMB-10Xv2 WMB-10Xv2-OLF 10Xv2 1178471871 OLF OLF Snap25-IRES2-Cre;Ai14-443636 ... #1b4332 #FF0073 #990001 #4D89FF 5 717 2 2 48 172

5 rows × 40 columns
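
The joins above rely on two pandas features: on='cluster_alias' matches a column of the left table against the index of the right table, and rsuffix renames right-hand columns whose names already exist on the left. A small sketch with invented values (the table and column contents here are illustrative only):

```python
import pandas as pd

# Toy cell table indexed by cell label, with a cluster_alias column.
cells = pd.DataFrame(
    {'cluster_alias': [10, 11, 10]},
    index=pd.Index(['cell_a', 'cell_b', 'cell_c'], name='cell_label'),
)

# Two per-cluster tables that share the same column name, 'class'.
names = pd.DataFrame(
    {'class': ['01 X', '02 Y']},
    index=pd.Index([10, 11], name='cluster_alias'),
)
colors = pd.DataFrame(
    {'class': ['#FF0000', '#00FF00']},
    index=pd.Index([10, 11], name='cluster_alias'),
)

# First join: no name clash, so 'class' is added unchanged.
cells = cells.join(names, on='cluster_alias')

# Second join: 'class' already exists, so the incoming column
# is renamed to 'class_color' by the rsuffix argument.
cells = cells.join(colors, on='cluster_alias', rsuffix='_color')
print(cells.columns.tolist())
```

Only clashing columns receive the suffix, which is how the final table ends up with columns such as class, class_color, and class_order side by side.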

print_column_info(cell_extended)
Number of unique cell_barcode = 3578991 
Number of unique barcoded_cell_sample_label = 781 
Number of unique library_label = 1434 
Number of unique dataset_label = 3 ['Consensus-WMB-Macosko-10X', 'WMB-10Xv2', 'WMB-10Xv3']
Number of unique feature_matrix_label = 33 
Number of unique library_method = 2 ['10Xv2', '10Xv3']
Number of unique alignment_id = 781 
Number of unique region_of_interest_acronym = 42 
Number of unique anatomical_division_label = 11 ['CB', 'CTXsp', 'HPF', 'HY', 'Isocortex', 'MB', 'MY-Pons-BS', 'OLF', 'PAL', 'STR', 'TH']
Number of unique donor_label = 373 
Number of unique donor_sex = 2 ['Female', 'Male']
Number of unique donor_age = 22 ['51 days', '52 days', '53 days', '54 days', '55 days', '56 days', '57 days', '58 days', '59 days', '60 days', '61 days', '62 days', '63 days', '64 days', '65 days', '66 days', '67 days', '68 days', '69 days', '70 days', '71 days', 'unknown']
Number of unique origin_dataset = 2 ['WMB-AIBS', 'WMB-Macosko']
Number of unique origin_dataset_color = 2 ['#1f77b4', '#ff7f0e']
Number of unique origin_dataset_order = 2 [1, 2]
Number of unique donor_sex_color = 2 ['#565353', '#ADC4C3']
Number of unique donor_sex_order = 2 [1, 2]
Number of unique anatomical_division_label_color = 11 ['#70FF71', '#7ED04B', '#8599CC', '#8ADA87', '#98D6F9', '#9AD2BD', '#E64438', '#F0F080', '#FF64FF', '#FF7080', '#FF9BCD']
Number of unique anatomical_division_label_order = 11 [1, 4, 6, 7, 8, 9, 12, 13, 14, 15, 18]
Number of unique x = 7505915 
Number of unique y = 7517910 
Number of unique cluster_alias = 6721 
Number of unique neurotransmitter = 22 ['Chol', 'Chol-Dopa', 'Dopa', 'GABA', 'GABA-Chol', 'GABA-Dopa', 'GABA-Glyc', 'GABA-Hist', 'GABA-Sero', 'Glut', 'Glut-Chol', 'Glut-Dopa', 'Glut-GABA', 'Glut-GABA-Chol', 'Glut-GABA-Dopa', 'Glut-GABA-Glyc', 'Glut-GABA-Sero', 'Glut-Glyc', 'Glut-Sero', 'Glyc', 'NN', 'Sero']
Number of unique neighborhood = 9 ['HY-EA-Glut-GABA', 'MB-GABA', 'MB-Glut-Dopa-Sero', 'NN-IMN', 'P-MY-CB-GABA', 'P-MY-CB-Glut', 'Pallium-Glut', 'Subpallium-GABA', 'TH-EPI-Glut']
Number of unique class = 43 
Number of unique subclass = 414 
Number of unique supertype = 1386 
Number of unique cluster = 6721 
Number of unique neurotransmitter_color = 22 ['#000080', '#0000FF', '#008080', '#0a9964', '#1B9E77', '#2B93DF', '#377EB8', '#533691', '#66636C', '#73E785', '#800000', '#800080', '#9189FF', '#A65628', '#B8EC68', '#F781BF', '#FF3358', '#FF4500', '#FF7080', '#fad502', '#fcf04b', '#ff7621']
Number of unique neighborhood_color = 9 ['#006200', '#0096C7', '#03045E', '#1283FF', '#9EF01A', '#B199FF', '#F0A0FF', '#FA0087', '#FF6600']
Number of unique class_color = 43 
Number of unique subclass_color = 414 
Number of unique supertype_color = 1385 
Number of unique cluster_color = 6126 
Number of unique class_order = 43 
Number of unique cluster_order = 6721 
Number of unique neighborhood_order = 9 [1, 2, 3, 4, 5, 6, 7, 8, 9]
Number of unique neurotransmitter_order = 22 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Number of unique subclass_order = 414 
Number of unique supertype_order = 1386 

Plotting the taxonomy#

Now that we have our cells with associated taxonomy information, we’ll plot them on the UMAP we showed previously.

Below we plot the taxonomy mapping of the cells for each level in the taxonomy. We use the labels and their orders to plot them in a legend. We omit the legends for the lower levels, as they become too busy.

sub_selected = cell_extended[::10]
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['neighborhood_color'],
    labels=sub_selected['neighborhood'],
    term_orders=sub_selected['neighborhood_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("neighborhood")
plt.show()
../_images/d4ce7df2348d6eae6b6ebfe34c1ddc3adfc7d30a785f7e8d08899de2eb22afea.png
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['class_color'],
    labels=sub_selected['class'],
    term_orders=sub_selected['class_order'],
    fig_width=20,
    fig_height=20
)
res = ax.set_title("class")
plt.show()
../_images/08cd60fc3db5b8dcc8ebfbbe787b34d31795108a7b427fe58873c7bfa6bae3ab.png
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['subclass_color'],
    fig_width=18,
    fig_height=18
)
res = ax.set_title("subclass")
plt.show()
../_images/c152b3f4581a57ab74c01ce19de0ffa1f168dc63ee39b18b15e8597fa3f9c7d1.png
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['supertype_color'],
    fig_width=18,
    fig_height=18
)
res = ax.set_title("supertype")
plt.show()
../_images/16b5fbe91e571a6b39b3643c20bee294c7683bc52ec80ea3909d715279c9d4a2.png
fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['cluster_color'],
    fig_width=18,
    fig_height=18
)
res = ax.set_title("cluster")
plt.show()
../_images/5019568491d93cbb586b570fb7a1da0e90149b022eea544e4a44f7bd4e1f2925.png

Additionally, we plot by neurotransmitter.

fig, ax = plot_umap(
    sub_selected['x'],
    sub_selected['y'],
    cc=sub_selected['neurotransmitter_color'],
    labels=sub_selected['neurotransmitter'],
    term_orders=sub_selected['neurotransmitter_order'],
    fig_width=16,
    fig_height=16
)
res = ax.set_title("neurotransmitter")
plt.show()
../_images/e086a58fe8b4462698a86ff7af6445a90d19fe362e38238d052ed73ceea960c3.png

Neighborhood UMAPs#

The release also provides individual UMAPs for cells in each of the 9 neighborhoods.

We first subselect one of these neighborhoods, ‘Pallium-Glut’. Note that similar masking can be done for any column/value pair. For instance, cell_extended[cell_extended['class'] == '001 L1-ET Glut'] will return a DataFrame with only the cells in the class 001 L1-ET Glut.

neighborhood_cells = cell_extended[cell_extended['neighborhood'] == 'Pallium-Glut']
neighborhood_cells.head()
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label library_method alignment_id region_of_interest_acronym anatomical_division_label donor_label ... class_color subclass_color supertype_color cluster_color class_order cluster_order neighborhood_order neurotransmitter_order subclass_order supertype_order
cell_label
pBICCNsMMrVISPiF019d210623B1_ATTCCCGCAACTCGAT ATTCCCGCAACTCGAT NaN pBICCNsMMrVISPiF019d210623B1 Consensus-WMB-Macosko-10X Macosko-10X-Isocortex NaN NaN VISp Isocortex F019 ... #FA0087 #E9530F #5C6899 #FFDFA0 1 26 1 1 2 6
pBICCNsMMrOLFiM015d201201C2_GTTCGCTAGTAGGAAG GTTCGCTAGTAGGAAG NaN pBICCNsMMrOLFiM015d201201C2 Consensus-WMB-Macosko-10X Macosko-10X-OLF NaN NaN OLF OLF M015 ... #D00000 #7F1FCC #7ACCA8 #006635 3 623 1 1 40 149
CTAGAGTGTCTACCTC-1.03_A01 CTAGAGTGTCTACCTC 1.03_A01 L8TX_171026_01_F03 WMB-10Xv2 WMB-10Xv2-Isocortex-2 10Xv2 1178483180 MOp Isocortex Snap25-IRES2-Cre;Ai14-352353 ... #FA0087 #E9530F #5C6899 #FFDFA0 1 26 1 1 2 6
CGTAGCGCAATGTAAG-006_C01 CGTAGCGCAATGTAAG 006_C01 L8TX_180221_01_F10 WMB-10Xv2 WMB-10Xv2-Isocortex-4 10Xv2 1186619312 PL-ILA-ORB Isocortex Snap25-IRES2-Cre;Ai14-372312 ... #FA0087 #C0FF4D #8D1F75 #289917 1 102 1 1 5 25
pBICCNsMMrACAiM016d210628A3_TAATCTCAGACTTCGT TAATCTCAGACTTCGT NaN pBICCNsMMrACAiM016d210628A3 Consensus-WMB-Macosko-10X Macosko-10X-Isocortex NaN NaN ACA Isocortex M016 ... #FA0087 #3DCCB7 #33FFB6 #99A145 1 36 1 1 3 9

5 rows × 40 columns
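
The boolean masking used above extends to several values or several columns at once. The sketch below uses a toy stand-in for cell_extended with invented rows; the column names match the real table.

```python
import pandas as pd

# Toy stand-in for cell_extended; the rows are invented for illustration.
toy_cells = pd.DataFrame(
    {
        'neighborhood': ['Pallium-Glut', 'MB-GABA', 'Pallium-Glut', 'NN-IMN'],
        'donor_sex': ['Female', 'Male', 'Male', 'Female'],
    },
    index=pd.Index(['c1', 'c2', 'c3', 'c4'], name='cell_label'),
)

# Keep cells whose neighborhood is any of several values.
in_two = toy_cells[toy_cells['neighborhood'].isin(['Pallium-Glut', 'MB-GABA'])]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
pallium_male = toy_cells[
    (toy_cells['neighborhood'] == 'Pallium-Glut')
    & (toy_cells['donor_sex'] == 'Male')
]
print(in_two.index.tolist(), pallium_male.index.tolist())
```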

Now we load and join in the coordinates of the neighborhood UMAP. Note the inner join and the suffix added to the joined DataFrame as we already have ‘x’ and ‘y’ columns.

neighborhood_cells = neighborhood_cells.join(
    abc_cache.get_metadata_dataframe(
        'Consensus-WMB-integrated-taxonomy',
        'Pallium-Glut_cell_2d_embedding_coordinates'
        ).set_index('cell_label'),
    how='inner',
    rsuffix='_pallium_glut'
)
neighborhood_cells.head()
Pallium-Glut_cell_2d_embedding_coordinates.csv: 100%|██████████| 138M/138M [00:25<00:00, 5.45MMB/s]   
cell_barcode barcoded_cell_sample_label library_label dataset_label feature_matrix_label library_method alignment_id region_of_interest_acronym anatomical_division_label donor_label ... supertype_color cluster_color class_order cluster_order neighborhood_order neurotransmitter_order subclass_order supertype_order x_pallium_glut y_pallium_glut
cell_label
pBICCNsMMrVISPiF019d210623B1_ATTCCCGCAACTCGAT ATTCCCGCAACTCGAT NaN pBICCNsMMrVISPiF019d210623B1 Consensus-WMB-Macosko-10X Macosko-10X-Isocortex NaN NaN VISp Isocortex F019 ... #5C6899 #FFDFA0 1 26 1 1 2 6 10.966314 10.968464
pBICCNsMMrOLFiM015d201201C2_GTTCGCTAGTAGGAAG GTTCGCTAGTAGGAAG NaN pBICCNsMMrOLFiM015d201201C2 Consensus-WMB-Macosko-10X Macosko-10X-OLF NaN NaN OLF OLF M015 ... #7ACCA8 #006635 3 623 1 1 40 149 -1.411870 16.865832
CTAGAGTGTCTACCTC-1.03_A01 CTAGAGTGTCTACCTC 1.03_A01 L8TX_171026_01_F03 WMB-10Xv2 WMB-10Xv2-Isocortex-2 10Xv2 1178483180 MOp Isocortex Snap25-IRES2-Cre;Ai14-352353 ... #5C6899 #FFDFA0 1 26 1 1 2 6 11.382113 11.546221
CGTAGCGCAATGTAAG-006_C01 CGTAGCGCAATGTAAG 006_C01 L8TX_180221_01_F10 WMB-10Xv2 WMB-10Xv2-Isocortex-4 10Xv2 1186619312 PL-ILA-ORB Isocortex Snap25-IRES2-Cre;Ai14-372312 ... #8D1F75 #289917 1 102 1 1 5 25 15.446065 10.078577
pBICCNsMMrACAiM016d210628A3_TAATCTCAGACTTCGT TAATCTCAGACTTCGT NaN pBICCNsMMrACAiM016d210628A3 Consensus-WMB-Macosko-10X Macosko-10X-Isocortex NaN NaN ACA Isocortex M016 ... #33FFB6 #99A145 1 36 1 1 3 9 11.877949 13.603071

5 rows × 42 columns

fig, ax = plot_umap(
    neighborhood_cells['x_pallium_glut'],
    neighborhood_cells['y_pallium_glut'],
    cc=neighborhood_cells['subclass_color'],
    labels=neighborhood_cells['subclass'],
    term_orders=neighborhood_cells['subclass_order'],
    fig_width=18,
    fig_height=18
)
res = ax.set_title("Subclass in Pallium-Glut Neighborhood")
plt.show()
../_images/9ca6c2960047e0d4f542a5d145e2b7c534ebeeda0ca736fff9c40b5050b26b94.png

Below is a block of code that plots a given term (e.g. taxonomy class, origin_dataset, donor_sex) in each of the 9 neighborhood UMAPs. Change the value of term_to_plot to any of the other columns we used above to visualize that data in each of the 9 neighborhoods.

term_to_plot = 'subclass' # Change to other term (e.g. taxonomy level, anatomical, donor_sex etc.)
# Loop through all neighborhoods (ordered by term_order) and plot the selected term on each neighborhood UMAP.
fig, ax = plt.subplots(3, 3)
fig.set_size_inches(18, 18)
ax = ax.flatten()

for idx, neighborhood in enumerate(cluster_annotation_term[
        cluster_annotation_term['cluster_annotation_term_set_name'] == 'neighborhood'
        ].sort_values('term_order')['name']):
    neighborhood_cells = cell_extended.join(
        abc_cache.get_metadata_dataframe(
            'Consensus-WMB-integrated-taxonomy',
            f'{neighborhood}_cell_2d_embedding_coordinates'
            ).set_index('cell_label'),
        how='inner',
        rsuffix=f'_{neighborhood}'
    )
    plot_umap(
        neighborhood_cells['x_' + neighborhood],
        neighborhood_cells['y_' + neighborhood],
        cc=neighborhood_cells[term_to_plot + '_color'],
        fig=fig,
        ax=ax[idx]
    )
    res = ax[idx].set_title(f"{neighborhood} Neighborhood")
plt.tight_layout()
plt.show()
Subpallium-GABA_cell_2d_embedding_coordinates.csv: 100%|██████████| 53.7M/53.7M [00:09<00:00, 5.46MMB/s]  
HY-EA-Glut-GABA_cell_2d_embedding_coordinates.csv: 100%|██████████| 16.5M/16.5M [00:02<00:00, 5.72MMB/s]
MB-Glut-Dopa-Sero_cell_2d_embedding_coordinates.csv: 100%|██████████| 15.2M/15.2M [00:02<00:00, 5.69MMB/s]
TH-EPI-Glut_cell_2d_embedding_coordinates.csv: 100%|██████████| 15.8M/15.8M [00:02<00:00, 5.83MMB/s]
MB-GABA_cell_2d_embedding_coordinates.csv: 100%|██████████| 8.94M/8.94M [00:01<00:00, 5.35MMB/s]
P-MY-CB-Glut_cell_2d_embedding_coordinates.csv: 100%|██████████| 55.5M/55.5M [00:09<00:00, 5.69MMB/s]  
P-MY-CB-GABA_cell_2d_embedding_coordinates.csv: 100%|██████████| 11.1M/11.1M [00:01<00:00, 5.62MMB/s]
NN-IMN_cell_2d_embedding_coordinates.csv: 100%|██████████| 184M/184M [00:34<00:00, 5.38MMB/s]    
../_images/07ffc671cdbdf9d9cd931b14f413bf354153b0c109939e9dbaa88e9ed91396a9.png

Aggregating cluster and cell counts#

Let’s investigate the taxonomy information a bit more. In this section, we’ll create bar plots showing the number of clusters and cells at each level in the taxonomy.

First, we need to compute the number of clusters that fall under each cell type taxon.

term_cluster_count = membership_with_cluster_info.reset_index().groupby(
        ['cluster_annotation_term_label']
    )[['cluster_alias']].count()
term_cluster_count.columns = ['number_of_clusters']
term_cluster_count.head()
number_of_clusters
cluster_annotation_term_label
CS20251031_CLAS_0001 518
CS20251031_CLAS_0002 98
CS20251031_CLAS_0003 24
CS20251031_CLAS_0004 25
CS20251031_CLAS_0005 167

Next, we sum the number of cells associated with each taxon at every level of the taxonomy.

term_cell_count = membership_with_cluster_info.reset_index().groupby(
    ['cluster_annotation_term_label']
)[['number_of_cells']].sum()
term_cell_count.head()
number_of_cells
cluster_annotation_term_label
CS20251031_CLAS_0001 1628174
CS20251031_CLAS_0002 386033
CS20251031_CLAS_0003 17251
CS20251031_CLAS_0004 142462
CS20251031_CLAS_0005 218309
# Join counts with the term dataframe
term_with_counts = cluster_annotation_term.join(term_cluster_count)
term_with_counts = term_with_counts.join(term_cell_count)
term_with_counts.head()
name cluster_annotation_term_set_label cluster_annotation_term_set_name color_hex_triplet term_order term_set_order parent_term_label parent_term_name parent_term_set_label number_of_clusters number_of_cells
label
CS20251031_NEUR_0004 Chol CCN20251031_NEUR neurotransmitter #73E785 4 0 NaN NaN NaN 25 3632
CS20251031_NEUR_0012 Chol-Dopa CCN20251031_NEUR neurotransmitter #B8EC68 12 0 NaN NaN NaN 1 261
CS20251031_NEUR_0008 Dopa CCN20251031_NEUR neurotransmitter #fcf04b 8 0 NaN NaN NaN 24 7207
CS20251031_NEUR_0002 GABA CCN20251031_NEUR neurotransmitter #FF3358 2 0 NaN NaN NaN 2332 1301636
CS20251031_NEUR_0006 GABA-Chol CCN20251031_NEUR neurotransmitter #000080 6 0 NaN NaN NaN 3 397
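
The two aggregations above (counting clusters and summing cells per term) can also be computed in one pass. This is an alternative sketch using pandas named aggregation on a toy table with invented values, not the notebook's own method:

```python
import pandas as pd

# Toy membership rows: one row per (term, cluster) pair with the cluster's cell count.
toy = pd.DataFrame({
    'cluster_annotation_term_label': ['CLAS_1', 'CLAS_1', 'CLAS_2'],
    'cluster_alias': [1, 2, 3],
    'number_of_cells': [100, 50, 7],
})

# Named aggregation: output_column=(input_column, aggregation function).
toy_counts = toy.groupby('cluster_annotation_term_label').agg(
    number_of_clusters=('cluster_alias', 'count'),
    number_of_cells=('number_of_cells', 'sum'),
)
print(toy_counts)
```

The result is indexed by term label, so it can be joined onto the term table in a single join instead of two.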

Below we create a function to plot the cluster and cell counts in a bar graph, coloring by the associated taxon level.

def bar_plot_by_level_and_type(df: pd.DataFrame, level: str, fig_width: float = 8.5, fig_height: float = 8.5):
    """Plot the number of clusters and cells for each taxon at the specified level.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing cluster annotation terms with counts.
    level : str
        The level of the taxonomy to plot. Must match a value in the
        cluster_annotation_term_set_name column (e.g., 'neighborhood', 'class',
        'subclass', 'supertype', 'cluster').
    fig_width : float, optional
        Width of the figure in inches. Default is 8.5.
    fig_height : float, optional
        Height of the figure in inches. Default is 8.5.

    Returns
    -------
    fig, ax
        The matplotlib figure and the array of two axes (clusters, cells).
    """

    fig, ax = plt.subplots(1, 2)
    fig.set_size_inches(fig_width, fig_height)

    for idx, ctype in enumerate(['clusters', 'cells']):

        pred = (df['cluster_annotation_term_set_name'] == level)
        sort_order = np.argsort(df[pred]['term_order'])
        names = df[pred]['name'].iloc[sort_order]
        counts = df[pred]['number_of_%s' % ctype].iloc[sort_order]
        colors = df[pred]['color_hex_triplet'].iloc[sort_order]
        
        ax[idx].barh(names[::-1], counts[::-1], color=colors[::-1])
        ax[idx].set_title('Number of %s by %s' % (ctype,level))
        ax[idx].set_xlabel('Number of %s' % ctype)
        if ctype == 'cells':
            ax[idx].set_xscale('log')
        
        if idx > 0:
            ax[idx].set_yticklabels([])

    return fig, ax

Now, we plot bar graphs of the number of clusters and cells by taxonomy level. Below we show neighborhood and class, but this comparison can be made for all levels in the taxonomy.

fig, ax = bar_plot_by_level_and_type(term_with_counts, 'neighborhood')
plt.show()
../_images/c663215a05fb19a0f2aebf3d87081c754515012c06a8db3a27670ae696962813.png
fig, ax = bar_plot_by_level_and_type(term_with_counts, 'class')
plt.show()
../_images/ca4d2d29a9029857ddfa56f6fcf528396bb7ca0e49714ba1d86bae06f43f6513.png

Visualizing the taxonomy#

Finally, we create a nested pie chart for Neighborhood, Class, Subclass, and Supertype. The rings are ordered so that each inner ring shows the children of the taxa in the ring outside it. The width of each wedge is proportional to the number of clusters in that taxon.

levels = ['neighborhood', 'class', 'subclass', 'supertype']
df = {}

# Copy the term order of the parent into each term at the level below it.
term_with_counts['parent_order'] = ""
for idx, row in term_with_counts.iterrows():
    if pd.isna(row['parent_term_label']):
        continue
    term_with_counts.loc[idx, 'parent_order'] = term_with_counts.loc[row['parent_term_label']]['term_order']

term_with_counts = term_with_counts.reset_index()
for lvl in levels:
    pred = term_with_counts['cluster_annotation_term_set_name'] == lvl
    df[lvl] = term_with_counts[pred]
    df[lvl] = df[lvl].sort_values(['parent_order', 'term_order'])

fig, ax = plt.subplots()
fig.set_size_inches(10, 10)
size = 0.15

for i, lvl in enumerate(levels):
    
    if lvl == 'neighborhood':
        ax.pie(df[lvl]['number_of_clusters'],
               colors=df[lvl]['color_hex_triplet'],
               labels = df[lvl]['name'],
               rotatelabels=True,
               labeldistance=1.025,
               radius=1,
               wedgeprops=dict(width=size, edgecolor=None),
               startangle=0)
    else :
        ax.pie(df[lvl]['number_of_clusters'],
               colors=df[lvl]['color_hex_triplet'],
               radius=1-i*size,
               wedgeprops=dict(width=size, edgecolor=None),
               startangle=0)
term_with_counts = term_with_counts.set_index('label')
plt.show()
../_images/b14bba18d5dedba0c4c301933ebb8d4038d4f2345404a32fe4885d03d28a9d33.png
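
The key to the nested pie chart is the sort order: children are sorted first by their parent's order and then by their own, so each child wedge sits directly inside its parent. The following sketch reproduces the idea on a two-level toy taxonomy. All labels, counts, and colors are invented, and the parent order is looked up with Series.map rather than the row-by-row loop used above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy taxonomy: two parents (P1, P2) and three children. Values are invented.
toy_terms = pd.DataFrame(
    {
        'level': ['parent', 'parent', 'child', 'child', 'child'],
        'term_order': [1, 2, 1, 2, 3],
        'parent_term_label': [None, None, 'P2', 'P1', 'P1'],
        'number_of_clusters': [5, 3, 3, 2, 3],
        'color_hex_triplet': ['#1f77b4', '#ff7f0e', '#ffbb78', '#aec7e8', '#6baed6'],
    },
    index=pd.Index(['P1', 'P2', 'C1', 'C2', 'C3'], name='label'),
)

# Look up each term's parent order by label; top-level terms get 0.
toy_terms['parent_order'] = (
    toy_terms['parent_term_label']
    .map(toy_terms['term_order'])
    .fillna(0)
    .astype(int)
)

fig, ax = plt.subplots(figsize=(5, 5))
size = 0.3
rings = {}
for i, lvl in enumerate(['parent', 'child']):
    # Sorting by parent order first keeps siblings adjacent and under their parent.
    ring = toy_terms[toy_terms['level'] == lvl].sort_values(
        ['parent_order', 'term_order']
    )
    rings[lvl] = ring
    ax.pie(
        ring['number_of_clusters'].tolist(),
        colors=ring['color_hex_triplet'].tolist(),
        radius=1 - i * size,          # each level is drawn one ring further in
        wedgeprops=dict(width=size),
        startangle=0,
    )
```

Because the children's cluster counts sum to their parent's count, the boundaries between wedges in adjacent rings line up. Call plt.show() to display the figure outside a notebook.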

In the next notebook, we’ll explore the gene expression data and combine it with the taxonomy and cell-level metadata. You can also explore the previously released Whole Mouse Brain (WMB-10X) data through the notebooks linked here.