Human-Mammalian Brain - Basal Ganglia 10X snRANSeq analysis: clustering and annotations#

The basal ganglia (BG) are a system of interconnected brain structures that play a crucial role in motor control, learning, behavior, and emotion. With approximately 200 million neurons in the human basal ganglia alone, these structures are involved in a wide range of neurological processes and are implicated in numerous disorders affecting human health, including Parkinson’s disease, Huntington’s disease, and substance abuse disorders. To further understand the complexity of the basal ganglia, researchers have historically classified its neurons into various types based on their cytoarchitecture, connectivity, molecular profile, and functional properties. However, recent advancements in high-throughput transcriptomic profiling have revolutionized our ability to systematically categorize these cell types within species, while the maturation of machine learning technologies have enabled the integration of these taxonomies across species.

Our consensus basal ganglia cell type taxonomy is the result of iterative clustering and cross-species integration of transcriptomic data. The taxonomy encompasses neurons from key structures within the basal ganglia, including the caudate (Ca), putamen (Pu), nucleus accumbens (NAc), the external and internal segments of the globus pallidus (GPe, GPi), subthalamic nucleus (STN), and substantia nigra (SN). By combining data from multiple primate and rodent species, we have developed a consensus taxonomy that highlights both conserved and species-specific cell types. We validate our taxonomy through marker gene expression analysis, comparison with previously published taxonomies, and self-projection, ensuring the accuracy and robustness of each level in the taxonomic hierarchy.

In a related tutorial, we also show how to access and use HMBA-BG gene expression data.

You need to be connected to the internet to run this notebook or connected to a cache that has the HMBA-BG data downloaded already.

The notebook presented here shows quick visualizations from precomputed metadata in the atlas. For examples on accessing the expression matrices, specifically selecting genes from expression matrices, see the general_acessing_10x_snRNASeq_tutorial.ipynb tutorial/example.

import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object downloads data requested by the user, tracks which files have already been downloaded to your local system, and serves the path to the requested data on disk. For metadata, the cache can also directly serve a up a Pandas DataFrame. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you would like to download the data to your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest
'releases/20250531/manifest.json'

Data overview#

We’ll quickly walk through the data we will be using in this notebook. The HMBA-BG 10X datasets are located across several directories listed in the ABCProjectCache. In this notebook, we will be looking at the Aligned HMBA-BG 10X dataset. These data combine the there species with an aligned set of ~16k across the species. We will be using data and metadata from the following directories:

  • HMBA-10xMultiome-BG-Aligned

  • HMBA-BG-taxonomy-CCN20250428

Note that data for each individual species is available in the directory HMBA-10xMultiome-BG-Aligned.

Below we list the data and metadata in the HMBA-10xMultiome-BG-Aligned dataset.

print("HMBA-10xMultiome-BG-Aligned: gene expression data (h5ad)\n\t", abc_cache.list_data_files(directory='HMBA-10xMultiome-BG-Aligned'))
print("HMBA-10xMultiome-BG-Aligned: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='HMBA-10xMultiome-BG-Aligned'))
HMBA-10xMultiome-BG-Aligned: gene expression data (h5ad)
	 ['HMBA-10xMultiome-BG-Aligned/log2', 'HMBA-10xMultiome-BG-Aligned/raw']
HMBA-10xMultiome-BG-Aligned: metadata (csv)
	 ['cell_metadata', 'donor', 'example_gene_expression', 'gene', 'library', 'value_sets']

We will also use metadata from the HMBA-BG taxonomy directory. Below is the list of available files:

print("HMBA-BG-taxonomy-CCN20250428: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='HMBA-BG-taxonomy-CCN20250428'))
HMBA-BG-taxonomy-CCN20250428: metadata (csv)
	 ['abbreviation_term', 'cell_2d_embedding_coordinates', 'cell_to_cluster_membership', 'cluster', 'cluster_annotation_term', 'cluster_annotation_term_set', 'cluster_annotation_to_abbreviation_map', 'cluster_to_cluster_annotation_membership']

Cell metadata#

Essential cell metadata is stored as a CSV file that we load as a Pandas DataFrame. Each row represents one cell indexed by a cell label. The cell label is the concatenation of barcode and name of the sample. In this context, the sample is the barcoded cell sample that represents a single load into one port of the 10x Chromium. Note that cell barcodes are only unique within a single barcoded cell sample and that the same barcode can be reused. This metadata file contains cells across all species in the HMBA-BG dataset.

Each cell is associated with a library label, donor label, alignment_job_id, feature_matrix_label and dataset_label identifying which data package this cell is part of. This metadata file will be combined with other metadata files that ship with this package to add information associated with the donor, UMAP coordinates, cluster assignments, and more.

Below, we load the first of the metadata used in this tutorial. This represents the cell metadata for the aligned dataset.

The command we use below both downloads the data if it is not already available in the local cache and loads the data as a Pandas DataFrame. This pattern of loading metadata is repeated throughout the tutorials.

cell = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='cell_metadata',
    dtype={'cell_label': str}
).set_index('cell_label')
print("Number of cells = ", len(cell))
cell.head()
/Users/chris.morrison/src/miniconda3/envs/abc_atlas_access/lib/python3.12/site-packages/abc_atlas_access/abc_atlas_cache/abc_project_cache.py:429: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  return pd.read_csv(path, **kwargs)
Number of cells =  2035898
cell_barcode donor_label barcoded_cell_sample_label library_label alignment_job_id doublet_score umi_count feature_matrix_label dataset_label abc_sample_id
cell_label
AAACAGCCAAATGCCC-2362_A05 AAACAGCCAAATGCCC H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.027027 15259.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 38447b19-a5dc-4eca-9021-56ae191e8809
AAACAGCCAATTGAGA-2362_A05 AAACAGCCAATTGAGA H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.054795 20645.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 3dc4dc1f-4b8a-4012-bb07-e3906ad70da0
AAACAGCCAGCATGTC-2362_A05 AAACAGCCAGCATGTC H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.000000 2551.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 291d6ebf-ca70-4d9d-8dde-10fa841dba93
AAACAGCCATTGACAT-2362_A05 AAACAGCCATTGACAT H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.000000 2341.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 12153419-9b49-4d5f-a10d-ddb7837a729d
AAACAGCCATTGTGGC-2362_A05 AAACAGCCATTGTGGC H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.027397 8326.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 5716499d-d69d-489e-b4ed-6841b6ba051d

We can use pandas groupby function to see how many unique items are associated for each field and list them out if the number of items is small.

def print_column_info(df):
    
    for c in df.columns:
        grouped = df[[c]].groupby(c).count()
        members = ''
        if len(grouped) < 30:
            members = str(list(grouped.index))
        print("Number of unique %s = %d %s" % (c, len(grouped), members))
print_column_info(cell)
Number of unique cell_barcode = 690037 
Number of unique donor_label = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique barcoded_cell_sample_label = 417 
Number of unique library_label = 417 
Number of unique alignment_job_id = 303 
Number of unique doublet_score = 1426 
Number of unique umi_count = 86532 
Number of unique feature_matrix_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique dataset_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique abc_sample_id = 1898081 

Donor and Library metadata#

The first two associated metadata we load are the donor and library tables. The donor table contains species, sex, and age information. The library table contains information on 10X methods and brain region of interest the tissue was extracted from.

donor = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='donor'
).set_index('donor_label')
donor.head()
donor_species species_scientific_name species_common_name donor_sex donor_age donor_age_value donor_age_unit
donor_label
QM23.50.003 NCBITaxon:9544 Macaca mulatta Macaque Male 6 yrs 6.0 years
QM21.26.001 NCBITaxon:9544 Macaca mulatta Macaque Male 6 yrs 6.0 years
Q19.26.010 NCBITaxon:9544 Macaca mulatta Macaque Female 10 yrs 10.0 years
H24.30.003 NCBITaxon:9606 Homo sapiens Human Female 19 yrs 19.0 years
H21.30.004 NCBITaxon:9606 Homo sapiens Human Male 57 yrs 57.0 years
library = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='library'
).set_index('library_label')
library.head()
library_method barcoded_cell_sample_label enrichment_population cell_specimen_type parcellation_term_identifier region_of_interest_name region_of_interest_label anatomical_division_label donor_label
library_label
L8XR_240222_21_H03 10xMultiome;GEX 2077_A02 50% NeuN+, 35% OLIG2-, 15% OLIG2+ Nuclei DHBA:11538 caudal putamen PuC Basal ganglia QM23.50.003
L8XR_220428_02_A05 10xMultiome;GEX 1224_A04 10% NeuN+, 49% OLIG2-, 25% OLIG2+, 16% Nurr1+ Nuclei DHBA:12261 ventral tegmental area VTA Basal ganglia QM21.26.001
L8XR_220428_02_C04 10xMultiome;GEX 1222_A02 60% NeuN+, 27% OLIG2-, 13% OLIG2+ Nuclei DHBA:10466 subthalamic nucleus STH Basal ganglia Q19.26.010
L8XR_240705_01_A06 10xMultiome;GEX 2305_E01 50% NeuN+, 35% OLIG2-, 15% OLIG2+ Nuclei DHBA:11537 rostral putamen PuR Basal ganglia H24.30.003
L8XR_240919_21_B05 10xMultiome;GEX 2453_A02 70% NeuN+, 20% OLIG2-, 10% OLIG2+ Nuclei DHBA:10345 Ventral pallidus VeP Basal ganglia H21.30.004

We combine these into an extended cell metadata table.

cell_extended = cell.join(donor, on='donor_label')
cell_extended = cell_extended.join(library, on='library_label', rsuffix='_library_table')

Below we use a pandas groupby to show the number of cells from species.

cell_extended.groupby('species_common_name')[['feature_matrix_label']].count()
feature_matrix_label
species_common_name
Human 1051428
Macaque 569895
Marmoset 414575

We can also use the groupby function to show the number of cells in each region of interest extracted from the BG.

cell_extended.groupby('region_of_interest_name')[['region_of_interest_label']].count()
region_of_interest_label
region_of_interest_name
Ventral pallidus 39593
body of caudate 188438
brain 414575
caudal putamen 166294
caudate nucleus 27964
core of nucleus accumbens 19283
external segment of globus pallidus 205322
globus pallidus 4900
head of caudate 132325
internal segment of globus pallidus 134560
nucleus accumbens 157981
peri-caudate ependymal and subependymal zone 10428
posteroventral putamen 74681
putamen 35092
rostral putamen 196452
shell of nucleus accumbens 9525
substantia nigra 40396
subthalamic nucleus 63348
tail of caudate 90271
ventral tegmental area 24470

Adding color and feature order#

Each feature in the donor and library table comes is associated with unique colors and an ordering with the set of values. Below we load the value_sets DataFrame which is a mapping from the various value in the donor and species tables to those colors and orderings. We incorporate these values into the cell metadata table.

value_sets = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='value_sets'
).set_index('label')
value_sets.head()
table field description order external_identifier parent_label color_hex_triplet comment Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
label
Female donor donor_sex Female 1 NaN NaN #565353 NaN NaN NaN NaN NaN
Male donor donor_sex Male 2 NaN NaN #ADC4C3 NaN NaN NaN NaN NaN
Human donor species_common_name Human 1 NaN NaN #377eb8 NaN NaN NaN NaN NaN
Macaque donor species_common_name Macaque 2 NaN NaN #4daf4a NaN NaN NaN NaN NaN
Marmoset donor species_common_name Marmoset 3 NaN NaN #FF5F5D NaN NaN NaN NaN NaN

We define a convince function to add colors for the various values in the data (e.g. unique region of interest or donor sex values).

def extract_value_set(cell_metadata_df: pd.DataFrame, input_value_set: pd.DataFrame, input_value_set_label: str):
    """Add color and order columns to the cell metadata dataframe based on the input
    value set.

    Columns are added as {input_value_set_label}_color and {input_value_set_label}_order.

    Parameters
    ----------
    cell_metadata_df : pd.DataFrame
        DataFrame containing cell metadata.
    input_value_set : pd.DataFrame
        DataFrame containing the value set information.
    input_value_set_label : str
        The the column name to extract color and order information for. will be added to the cell metadata.
    """
    cell_metadata_df[f'{input_value_set_label}_color'] = input_value_set[
        input_value_set['field'] == input_value_set_label
    ].loc[cell_metadata_df[input_value_set_label]]['color_hex_triplet'].values
    cell_metadata_df[f'{input_value_set_label}_order'] = input_value_set[
        input_value_set['field'] == input_value_set_label
    ].loc[cell_metadata_df[input_value_set_label]]['order'].values

Use our function to add the relevant color and order columns to our cell_metadata table.

# Add region of interest color and order
extract_value_set(cell_extended, value_sets, 'region_of_interest_label')
# Add species common name color and order
extract_value_set(cell_extended, value_sets, 'species_common_name')
# Add species scientific name color and order
extract_value_set(cell_extended, value_sets, 'species_scientific_name')
# Add donor sex color and order
extract_value_set(cell_extended, value_sets, 'donor_sex')
cell_extended.head()
cell_barcode donor_label barcoded_cell_sample_label library_label alignment_job_id doublet_score umi_count feature_matrix_label dataset_label abc_sample_id ... anatomical_division_label donor_label_library_table region_of_interest_label_color region_of_interest_label_order species_common_name_color species_common_name_order species_scientific_name_color species_scientific_name_order donor_sex_color donor_sex_order
cell_label
AAACAGCCAAATGCCC-2362_A05 AAACAGCCAAATGCCC H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.027027 15259.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 38447b19-a5dc-4eca-9021-56ae191e8809 ... Basal ganglia H24.30.001 #53DB33 15 #377eb8 1 #377eb8 1 #ADC4C3 2
AAACAGCCAATTGAGA-2362_A05 AAACAGCCAATTGAGA H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.054795 20645.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 3dc4dc1f-4b8a-4012-bb07-e3906ad70da0 ... Basal ganglia H24.30.001 #53DB33 15 #377eb8 1 #377eb8 1 #ADC4C3 2
AAACAGCCAGCATGTC-2362_A05 AAACAGCCAGCATGTC H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.000000 2551.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 291d6ebf-ca70-4d9d-8dde-10fa841dba93 ... Basal ganglia H24.30.001 #53DB33 15 #377eb8 1 #377eb8 1 #ADC4C3 2
AAACAGCCATTGACAT-2362_A05 AAACAGCCATTGACAT H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.000000 2341.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 12153419-9b49-4d5f-a10d-ddb7837a729d ... Basal ganglia H24.30.001 #53DB33 15 #377eb8 1 #377eb8 1 #ADC4C3 2
AAACAGCCATTGTGGC-2362_A05 AAACAGCCATTGTGGC H24.30.001 2362_A05 L8XR_240808_01_E02 8a4bf81821a0f425be8ba9c15dfad6b509020312 0.027397 8326.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 5716499d-d69d-489e-b4ed-6841b6ba051d ... Basal ganglia H24.30.001 #53DB33 15 #377eb8 1 #377eb8 1 #ADC4C3 2

5 rows × 34 columns

UMAP spatial embedding#

Now that we’ve merged our donor and library metadata into the main cells data, our next step is to put these values in a Uniform Manifold Approximation and Projection (UMAP) for cells in the dataset. The UMAP is a dimension reduction technique that can be used for visualizing and exploring large-dimension datasets.

Below we load this 2-D embedding for a sub selection of our cells and merge the x-y coordinates into the extended cell metadata we are creating.

cell_2d_embedding_coordinates = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cell_2d_embedding_coordinates'
).set_index('cell_label')
cell_2d_embedding_coordinates.head()
x y
cell_label
AAACAGCCAAATGCCC-2362_A05 0.452302 2.938630
AAACAGCCAATTGAGA-2362_A05 0.051495 16.282684
AAACAGCCAGCATGTC-2362_A05 -1.233702 8.666612
AAACAGCCATTGACAT-2362_A05 0.517126 15.368365
AAACAGCCATTGTGGC-2362_A05 -3.697715 -1.647361
cell_extended = cell_extended.join(cell_2d_embedding_coordinates)
cell_extended = cell_extended.sample(frac=1)

We define a small helper function plot umap to visualize the cells on the UMAP. In the examples below we will plot associated cell information colorized by dissection donor species, sex, and region of interest.

def plot_umap(
    xx,
    yy,
    cc=None,
    val=None,
    fig_width=8,
    fig_height=8,
    cmap=None,
    labels=None,
    term_orders=None,
    colorbar=False,
    sizes=None
):
    """
    Plot a scatter plot of the UMAP coordinates.

    Parameters
    ----------
    xx : array-like
        x-coordinates of the points to plot.
    yy : array-like
        y-coordinates of the points to plot.
    cc : array-like, optional
        colors of the points to plot. If None, the points will be colored by the values in `val`.
    val : array-like, optional
        values of the points to plot. If None, the points will be colored by the values in `cc`.
    fig_width : float, optional
        width of the figure in inches. Default is 8.
    fig_height : float, optional
        height of the figure in inches. Default is 8.
    cmap : str, optional
        colormap to use for coloring the points. If None, the points will be colored by the values in `cc`.
    labels : array-like, optional
        labels for the points to plot. If None, no labels will be added to the plot.
    term_orders : array-like, optional
        order of the labels for the legend. If None, the labels will be ordered by their appearance in `labels`.
    colorbar : bool, optional
        whether to add a colorbar to the plot. Default is False.
    sizes : array-like, optional
        sizes of the points to plot. If None, all points will have the same size.
    """
    if sizes is None:
        sizes = 1
    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    if cmap is not None:
        scatt = ax.scatter(xx, yy, c=val, s=0.5, marker='.', cmap=cmap, alpha=sizes)
    elif cc is not None:
        scatt = ax.scatter(xx, yy, c=cc, s=0.5, marker='.', alpha=sizes)

    if labels is not None:
        from matplotlib.patches import Rectangle
        unique_label_colors = (labels + ',' + cc).unique()
        unique_labels = np.array([label_color.split(',')[0] for label_color in unique_label_colors])
        unique_colors = np.array([label_color.split(',')[1] for label_color in unique_label_colors])

        if term_orders is not None:
            unique_order = term_orders.unique()
            term_order = np.argsort(unique_order)
            unique_labels = unique_labels[term_order]
            unique_colors = unique_colors[term_order]
            
        rects = []
        for color in unique_colors:
            rects.append(Rectangle((0, 0), 1, 1, fc=color))

        legend = ax.legend(rects, unique_labels, loc=1)
        # ax.add_artist(legend)

    if colorbar:
        fig.colorbar(scatt, ax=ax)
    
    return fig, ax
cell_with_umap = cell_extended[~pd.isna(cell_extended['x'])]
fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['species_common_name_color'],
    labels=cell_with_umap['species_common_name'],
    term_orders=cell_with_umap['species_common_name_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("species_common_name")
plt.show()
../_images/7cf0499e216150af565118cf480a2a264cd4e0b07e5172cdcadac4b748382e77.png
fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['species_scientific_name_color'],
    labels=cell_with_umap['species_scientific_name'],
    term_orders=cell_with_umap['species_scientific_name_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("species_scientific_name")
plt.show()
../_images/1ea9754b2e84e956c5fc92b6014a434da317a8bf14e2f26c97adc41a7a2d77e6.png
fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['donor_sex_color'],
    labels=cell_with_umap['donor_sex'],
    term_orders=cell_with_umap['donor_sex_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("donor_sex")
plt.show()
../_images/df3eead785d51177d4cdf498c89f55c6109ed8e4d3c5a86edbacd42bb332f730.png

Below we show the region of interest for the three species. Note, however, that Marmoset does not currently have fine grained ROIs available and is marked as Br - Brain.

fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['region_of_interest_label_color'],
    labels=cell_with_umap['region_of_interest_label'],
    term_orders=cell_with_umap['region_of_interest_label_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("region_of_interest_label")
plt.show()
../_images/42335bab99897d04843572b4d264782be1a96acb2e65eaf942c66bc93ede3dee.png

Taxonomy Information#

The final set of metadata we load into our extended cell metadata file maps the cells into their assigned cluster in the taxonomy. We additionally load metadata regarding the clusters and compute useful information, such as the number of cells in each taxon at each level of the taxonomy.

First, we load the table that describes the levels in the taxonomy from Neighborhood at the highest to Cluster at the lowest level.

cluster = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster',
    dtype={'number_of_cells': 'Int64'}
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster.head()
cluster_alias number_of_cells
cluster_annotation_term_label
CS20250428_CLUST_0161 Human-143 91
CS20250428_CLUST_0162 Human-145 1783
CS20250428_CLUST_0163 Human-146 172
CS20250428_CLUST_0164 Human-149 2649
CS20250428_CLUST_0165 Human-150 1359

Next, we load information associated with each Cluster in the taxonomy. This includes a useful alias value for each cluster as well as the number of cells in each cluster.

cluster_annotation_term_set = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster_annotation_term_set'
).rename(columns={'label': 'cluster_annotation_term_label'})
cluster_annotation_term_set
cluster_annotation_term_label name description order
0 CCN20250428_LEVEL_0 Neighborhood Neighborhood 0
1 CCN20250428_LEVEL_1 Class Class 1
2 CCN20250428_LEVEL_2 Subclass Subclass 2
3 CCN20250428_LEVEL_3 Group Group 3
4 CCN20250428_LEVEL_4 Cluster Cluster 4

Finally, for the clusters, we load information on the annotations for each cluster. This also includes the term order and color information which we will use to plot later.

cluster_annotation_term = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster_annotation_term',
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster_annotation_term.head()
name cluster_annotation_term_set_label cluster_annotation_term_set_name color_hex_triplet term_order term_set_order parent_term_label parent_term_name parent_term_set_label
cluster_annotation_term_label
CS20250428_NEIGH_0001 Nonneuron CCN20250428_LEVEL_0 Neighborhood #f2ca7d 1 0 NaN NaN NaN
CS20250428_NEIGH_0000 Glut Sero Dopa CCN20250428_LEVEL_0 Neighborhood #91f4bb 2 0 NaN NaN NaN
CS20250428_NEIGH_0002 Subpallium GABA CCN20250428_LEVEL_0 Neighborhood #19613b 3 0 NaN NaN NaN
CS20250428_NEIGH_0003 Subpallium GABA-Glut CCN20250428_LEVEL_0 Neighborhood #7e1d19 4 0 NaN NaN NaN
CS20250428_CLASS_0000 Astro-Epen CCN20250428_LEVEL_1 Class #6ec0da 1 1 CS20250428_NEIGH_0001 Nonneuron CCN20250428_LEVEL_0

We merge this table with information from our clusters. Note that not all clusters are associated into the taxonomy hence we use an left join here to use only those clusters with annotations.

cluster_annotation_term_with_cells = cluster_annotation_term.join(
    cluster, how='left'
)

While we have information on the number of cells in each cluster, we need to sum these to get the number of cells in each of the upper levels of the taxonomy. Below we iterate through each level from the lowest to highest, summing the number of cells as we go.

for cluster_annotation_term_label in reversed(sorted(cluster_annotation_term_with_cells['cluster_annotation_term_set_label'].unique())):
    sub_terms = cluster_annotation_term_with_cells[
        cluster_annotation_term_with_cells['cluster_annotation_term_set_label'] == cluster_annotation_term_label
    ]
    parents = sub_terms['parent_term_label'].unique()
    if np.any(pd.isna(sub_terms['number_of_cells'])):
            print("Warning nan values:", sub_terms[pd.isna(sub_terms['number_of_cells'])].index)
            sub_terms
    for parent in parents:
        if pd.isna(parent):
            continue
        n_cells = sub_terms[
            sub_terms['parent_term_label'] == parent
        ]['number_of_cells'].sum()
        cluster_annotation_term_with_cells.loc[parent, 'number_of_cells'] = n_cells

# show the new number of cells column.
cluster_annotation_term_with_cells[['number_of_cells']].head()
number_of_cells
cluster_annotation_term_label
CS20250428_NEIGH_0001 815521
CS20250428_NEIGH_0000 34315
CS20250428_NEIGH_0002 1045252
CS20250428_NEIGH_0003 1045
CS20250428_CLASS_0000 224824

Finally, we load the cluster to cluster annotation membership table to help us create associated metadata for each cluster to tell us the annotations for a given Cluster at all levels of the taxonomy. We’ll use this in groupbys to give each cluster the metadata for each of its parents. This will be useful when we merge into the cell metadata table.

cluster_to_cluster_annotation_membership = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster_to_cluster_annotation_membership',
).set_index('cluster_annotation_term_label')
membership_with_cluster_info = cluster_to_cluster_annotation_membership.join(cluster_annotation_term_with_cells, rsuffix='_anno_term').reset_index()
membership_groupby = membership_with_cluster_info.groupby(
    ['cluster_alias', 'cluster_annotation_term_set_name']
)
membership_with_cluster_info.head()
cluster_annotation_term_label cluster_annotation_term_set_label cluster_alias cluster_annotation_term_set_name cluster_annotation_term_name name cluster_annotation_term_set_label_anno_term cluster_annotation_term_set_name_anno_term color_hex_triplet term_order term_set_order parent_term_label parent_term_name parent_term_set_label cluster_alias_anno_term number_of_cells
0 CS20250428_CLUST_0161 CCN20250428_LEVEL_4 Human-143 Cluster Human-143 Human-143 CCN20250428_LEVEL_4 Cluster #4ac0ed 0 4 CS20250428_GROUP_0039 Astrocyte CCN20250428_LEVEL_3 Human-143 91
1 CS20250428_CLUST_0162 CCN20250428_LEVEL_4 Human-145 Cluster Human-145 Human-145 CCN20250428_LEVEL_4 Cluster #8af851 1 4 CS20250428_GROUP_0039 Astrocyte CCN20250428_LEVEL_3 Human-145 1783
2 CS20250428_CLUST_0163 CCN20250428_LEVEL_4 Human-146 Cluster Human-146 Human-146 CCN20250428_LEVEL_4 Cluster #d1dd68 2 4 CS20250428_GROUP_0039 Astrocyte CCN20250428_LEVEL_3 Human-146 172
3 CS20250428_CLUST_0164 CCN20250428_LEVEL_4 Human-149 Cluster Human-149 Human-149 CCN20250428_LEVEL_4 Cluster #95daf6 3 4 CS20250428_GROUP_0039 Astrocyte CCN20250428_LEVEL_3 Human-149 2649
4 CS20250428_CLUST_0165 CCN20250428_LEVEL_4 Human-150 Cluster Human-150 Human-150 CCN20250428_LEVEL_4 Cluster #26827e 4 4 CS20250428_GROUP_0039 Astrocyte CCN20250428_LEVEL_3 Human-150 1359

From the membership table, we create three tables via a groupby. First the name of each cluster and its parents.

# term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[cluster_annotation_term_set['name']] # order columns
cluster_details.fillna('Other', inplace=True)
cluster_details.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_details.head()
cluster_annotation_term_set_name Neighborhood Class Subclass Group Cluster
cluster_alias
Human-128 Glut Sero Dopa F M Glut F Glut BF SKOR1 Glut Human-128
Human-129 Glut Sero Dopa F M Glut F Glut BF SKOR1 Glut Human-129
Human-130 Glut Sero Dopa F M Glut F Glut BF SKOR1 Glut Human-130
Human-423 Glut Sero Dopa F M Glut F Glut BF SKOR1 Glut Human-423
Human-426 Glut Sero Dopa F M Glut F Glut BF SKOR1 Glut Human-426

Next the plotting order of each of the clusters and their parents.

cluster_order = membership_groupby['term_order'].first().unstack()
cluster_order.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_order.rename(
    columns={'Neighborhood': 'Neighborhood_order',
             'Class': 'Class_order',
             'Subclass': 'Subclass_order',
             'Group': 'Group_order',
             'Cluster': 'Cluster_order'},
    inplace=True
)
cluster_order.head()
cluster_annotation_term_set_name Class_order Cluster_order Group_order Neighborhood_order Subclass_order
cluster_alias
Human-143 1 0 1 1 1
Human-145 1 1 1 1 1
Human-146 1 2 1 1 1
Human-149 1 3 1 1 1
Human-150 1 4 1 1 1

Finally, the colors we will use to plot for each of the unique taxons at all levels.

cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[cluster_annotation_term_set['name']]
cluster_colors.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_colors.head()
cluster_annotation_term_set_name Neighborhood Class Subclass Group Cluster
cluster_alias
Marmoset-323 #19613b #26e8bb #1bc06a #fc2b80 #00d86e
Marmoset-307 #19613b #26e8bb #1bc06a #fc2b80 #0454b2
Marmoset-325 #19613b #26e8bb #1bc06a #fc2b80 #094a6f
Human-346 #19613b #26e8bb #1bc06a #fc2b80 #0f6331
Marmoset-301 #19613b #26e8bb #1bc06a #fc2b80 #1140be

Next, we bring it all together by loading the mapping of cells to cluster and join into our final metadata table. Note here that not every cell is currently associated into the taxonomy hence the NaN values for many of the taxonomy information columns.

cell_to_cluster_membership = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cell_to_cluster_membership',
).set_index('cell_label')
cell_to_cluster_membership.head()
cluster_alias cluster_label
cell_label
AAACAGCCAAATGCCC-2362_A05 Human-451 CS20250428_CLUST_0268
AAACAGCCAATTGAGA-2362_A05 Human-1 CS20250428_CLUST_0227
AAACAGCCAGCATGTC-2362_A05 Human-153 CS20250428_CLUST_0215
AAACAGCCATTGACAT-2362_A05 Human-1 CS20250428_CLUST_0227
AAACAGCCATTGTGGC-2362_A05 Human-14 CS20250428_CLUST_0249
cell_extended_with_tax = cell_extended.join(cell_to_cluster_membership, rsuffix='_cell_to_cluster_membership')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_details, on='cluster_alias')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_order, on='cluster_alias')
cell_extended_with_tax.head()
cell_barcode donor_label barcoded_cell_sample_label library_label alignment_job_id doublet_score umi_count feature_matrix_label dataset_label abc_sample_id ... Neighborhood_color Class_color Subclass_color Group_color Cluster_color Class_order Cluster_order Group_order Neighborhood_order Subclass_order
cell_label
GTCAGGCTCAGGCCTA-2305_C01 GTCAGGCTCAGGCCTA H24.30.003 2305_C01 L8XR_240705_01_H06 93c6ec45bd2a1ca18516918a7f1a448e84006372 0.110000 23061.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned d59b11a7-a50d-472e-bc89-9fb5358044fa ... #19613b #d0b83c #1655f2 #339933 #dafd5a 10.0 1078.0 51.0 3.0 31.0
TATTACCTCAAAGGCA-2295_D02 TATTACCTCAAAGGCA H24.30.003 2295_D02 L8XR_240705_01_G05 c9b036ef4c6ddb8545718e5ea3747fc20c51ebfb 0.026667 12546.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned 8e66d017-0e3a-462b-bb7a-fb4dc8f297ec ... #19613b #d0b83c #253c8c #aec7e8 #c33059 10.0 1176.0 53.0 3.0 32.0
TAACCAGGTTGGCGTG-2019_C02 TAACCAGGTTGGCGTG QM23.50.002 2019_C02 L8XR_240118_21_C03 8268a0d34c53fa6bdd8887d7a6b9b612732c8777 0.000000 4430.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned e4f0f167-cce3-4797-8e95-9abf424ae81e ... #f2ca7d #6ec0da #604f15 #e16c95 #83d8d6 1.0 26.0 1.0 1.0 1.0
GTGTGTTAGCGCATTG-1750_B01 GTGTGTTAGCGCATTG QM23.50.001 1750_B01 L8XR_230629_22_H11 50c29f65910dc6faa1990f1ff9601c25bb909822 0.000000 9746.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned f3ce9856-7e08-451b-afce-94f3923224d5 ... #19613b #ce4c27 #eea495 #708297 #ef6b52 7.0 549.0 29.0 3.0 19.0
CTCCGGACATAATCGT-1959_C02 CTCCGGACATAATCGT H23.30.001 1959_C02 L8XR_231116_02_B09 d7924909ac635d27a5aaf99c11ea0373b9c3bb07 0.010753 10972.0 HMBA-10xMultiome-BG-Aligned HMBA-10xMultiome-BG-Aligned efc4e716-8a28-4bdb-8ca1-771bdcc10c17 ... #f2ca7d #6ec0da #604f15 #e16c95 #238c7a 1.0 12.0 1.0 1.0 1.0

5 rows × 53 columns

print_column_info(cell_extended)
Number of unique cell_barcode = 690037 
Number of unique donor_label = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique barcoded_cell_sample_label = 417 
Number of unique library_label = 417 
Number of unique alignment_job_id = 303 
Number of unique doublet_score = 1426 
Number of unique umi_count = 86532 
Number of unique feature_matrix_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique dataset_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique abc_sample_id = 1898081 
Number of unique donor_species = 4 ['NCBITaxon:9483', 'NCBITaxon:9544', 'NCBITaxon:9545', 'NCBITaxon:9606']
Number of unique species_scientific_name = 4 ['Callithrix jacchus', 'Homo sapiens', 'Macaca mulatta', 'Macaca nemestrina']
Number of unique species_common_name = 3 ['Human', 'Macaque', 'Marmoset']
Number of unique donor_sex = 2 ['Female', 'Male']
Number of unique donor_age = 20 ['10 yrs', '11 yrs', '14 yrs', '18 yrs', '19 yrs', '2.0 yrs', '25 yrs', '38 yrs', '4.0 yrs', '44 yrs', '5 yrs', '5.4 yrs', '50 yrs', '57 yrs', '58 yrs', '6 yrs', '6.6 yrs', '60 yrs', '61 yrs', '7 yrs']
Number of unique donor_age_value = 20 [1.95890411, 3.956284153, 5.0, 5.391780822, 6.0, 6.555924845, 7.0, 10.0, 11.0, 14.0, 18.0, 19.0, 25.0, 38.0, 44.0, 50.0, 57.0, 58.0, 60.0, 61.0]
Number of unique donor_age_unit = 1 ['years']
Number of unique library_method = 1 ['10xMultiome;GEX']
Number of unique barcoded_cell_sample_label_library_table = 417 
Number of unique enrichment_population = 83 
Number of unique cell_specimen_type = 1 ['Nuclei']
Number of unique parcellation_term_identifier = 20 ['DHBA:10155', 'DHBA:10334', 'DHBA:10335', 'DHBA:10336', 'DHBA:10337', 'DHBA:10338', 'DHBA:10339', 'DHBA:10340', 'DHBA:10341', 'DHBA:10342', 'DHBA:10343', 'DHBA:10344', 'DHBA:10345', 'DHBA:10466', 'DHBA:11537', 'DHBA:11538', 'DHBA:12251', 'DHBA:12261', 'DHBA:146034742', 'DHBA:146034754']
Number of unique region_of_interest_name = 20 ['Ventral pallidus', 'body of caudate', 'brain', 'caudal putamen', 'caudate nucleus', 'core of nucleus accumbens', 'external segment of globus pallidus', 'globus pallidus', 'head of caudate', 'internal segment of globus pallidus', 'nucleus accumbens', 'peri-caudate ependymal and subependymal zone', 'posteroventral putamen', 'putamen', 'rostral putamen', 'shell of nucleus accumbens', 'substantia nigra', 'subthalamic nucleus', 'tail of caudate', 'ventral tegmental area']
Number of unique region_of_interest_label = 20 ['Br', 'Ca', 'CaB', 'CaH', 'CaT', 'Eca', 'GP', 'GPe', 'GPi', 'NAC', 'NACc', 'NACs', 'Pu', 'PuC', 'PuPV', 'PuR', 'SN', 'STH', 'VTA', 'VeP']
Number of unique anatomical_division_label = 2 ['Basal ganglia', 'Brain']
Number of unique donor_label_library_table = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique region_of_interest_label_color = 20 ['#0010d9', '#0060ff', '#00bfa0', '#00bfff', '#53DB33', '#7AE361', '#89522A', '#907991', '#9BA2FF', '#9b19f5', '#A7DB33', '#B8003A', '#C29515', '#E0E0E0', '#E60049', '#c941a7', '#e6d800', '#fe0ccf', '#ff6b9a', '#ffa300']
Number of unique region_of_interest_label_order = 20 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Number of unique species_common_name_color = 3 ['#377eb8', '#4daf4a', '#FF5F5D']
Number of unique species_common_name_order = 3 [1, 2, 3]
Number of unique species_scientific_name_color = 4 ['#377eb8', '#4daf4a', '#FF5F5D', '#b2df8a']
Number of unique species_scientific_name_order = 4 [1, 2, 3, 4]
Number of unique donor_sex_color = 2 ['#565353', '#ADC4C3']
Number of unique donor_sex_order = 2 [1, 2]
Number of unique x = 1844471 
Number of unique y = 1820193 

Plotting the taxonomy#

Now that we have our cells with associated taxonomy information, we’ll plot them into the UMAP we showed previously. First, we trim down the cells to those with taxonomy information by selecting those that where the taxonomy information is not NaN.

cell_with_taxonomy = cell_extended_with_tax[~pd.isna(cell_extended_with_tax['Class_color'])]

Below we plot the taxonomy mapping of the cells for each level in the taxonomy.

fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Neighborhood_color'],
    labels=cell_with_taxonomy['Neighborhood'],
    term_orders=cell_with_taxonomy['Neighborhood_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Neighborhood")
plt.show()
../_images/9ff05db31d6eb540fccd9ae3cee1056b7d717f793330feec111a5a8871dd1e3b.png
fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Class_color'],
    labels=cell_with_taxonomy['Class'],
    term_orders=cell_with_taxonomy['Class_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Class")
plt.show()
../_images/c3e9d881733ade24ab2c601c14a856d42f647a6475528c17231019d305a89ddf.png
fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Subclass_color'],
    labels=cell_with_taxonomy['Subclass'],
    term_orders=cell_with_taxonomy['Subclass_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Subclass")
plt.show()
../_images/89ba0422e4de62a2b9a583fc1b540afeef698eb866f46912cd09e4f150ad8911.png
fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Group_color'],
    labels=cell_with_taxonomy['Group'],
    term_orders=cell_with_taxonomy['Group_order'],
    fig_width=16,
    fig_height=16
)
res = ax.set_title("Group")
plt.show()
../_images/503261870c1f96caafd54495e82d34daa122da5190ce218be36bd5e60deb79fb.png
fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Cluster_color'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Cluster")
plt.show()
../_images/a4dba33f98cdc417792d15dbb62da8a6f7a9966236ad497f897f32562258d5ea.png

Aggregating cluster and cells counts per term#

Let’s investigate the taxonomy information a bit more. In this section, we’ll create bar plots showing the number of clusters and cells at each level in the taxonomy.

First, we need to compute the number of clusters that are in each of the cell type taxons above it.

# Count the number of clusters associated with each cluster annotation term
term_cluster_count = membership_with_cluster_info.reset_index().groupby(['cluster_annotation_term_label'])[['cluster_alias']].count()
term_cluster_count.columns = ['number_of_clusters']
term_cluster_count.head()
number_of_clusters
cluster_annotation_term_label
CS20250428_CLASS_0000 86
CS20250428_CLASS_0001 93
CS20250428_CLASS_0002 12
CS20250428_CLASS_0003 386
CS20250428_CLASS_0005 199

We already have our number of cells computed previously, so we’ll just use them here and cluster counts into the cluster_annotation table.

# Join counts with the term dataframe
term_with_counts = cluster_annotation_term_with_cells.join(term_cluster_count)
term_with_counts[['name', 'cluster_annotation_term_set_name', 'number_of_clusters', 'number_of_cells']].head()
name cluster_annotation_term_set_name number_of_clusters number_of_cells
cluster_annotation_term_label
CS20250428_NEIGH_0001 Nonneuron Neighborhood 295 815521
CS20250428_NEIGH_0000 Glut Sero Dopa Neighborhood 171 34315
CS20250428_NEIGH_0002 Subpallium GABA Neighborhood 957 1045252
CS20250428_NEIGH_0003 Subpallium GABA-Glut Neighborhood 12 1045
CS20250428_CLASS_0000 Astro-Epen Class 86 224824

Below we create a function to plot the cluster and cell counts in a bar graph, coloring by the associated taxon level.

def bar_plot_by_level_and_type(df, level, fig_width = 8.5, fig_height = 4):
    
    fig, ax = plt.subplots(1, 2)
    fig.set_size_inches(fig_width, fig_height)
    
    for idx, ctype in enumerate(['clusters', 'cells']):

        pred = (df['cluster_annotation_term_set_name'] == level)
        sort_order = np.argsort(df[pred]['term_order'])
        names = df[pred]['name'].iloc[sort_order]
        counts = df[pred]['number_of_%s' % ctype].iloc[sort_order]
        colors = df[pred]['color_hex_triplet'].iloc[sort_order]
        
        ax[idx].barh(names, counts, color=colors)
        ax[idx].set_title('Number of %s by %s' % (ctype,level)),
        ax[idx].set_xscale('log')
        
        if idx > 0:
            ax[idx].set_yticklabels([])

    plt.show()

Now let’s plot the counts for each of the taxonomy levels above Cluster.

bar_plot_by_level_and_type(term_with_counts, 'Neighborhood')
plt.show()
../_images/139446471f8135cd351e12f2bd778a7a1f796b94f8f660486ddf1112897d5460.png
bar_plot_by_level_and_type(term_with_counts, 'Class')
plt.show()
../_images/cb2d7213d30649b8c2210616ab55ef4b41bb72e0ea090296432fe80fb3f3b1c9.png
bar_plot_by_level_and_type(term_with_counts, 'Subclass', 8.5, 6)
plt.show()
../_images/e685ef3dd6a396329e54934f45ee8c2005cad89c3730d1d3d05dbe4a50cd5ccc.png
bar_plot_by_level_and_type(term_with_counts, 'Group', 8.5, 10)
plt.show()
../_images/b3888c1f7f18f6f4599da55bdb0c7ba1f10e61912d09dadf32c86b265837958e.png

Distribution of lower taxonomy levels in their parents.#

We can also use a similar visualization to show a given taxonomy level in another. Below we define functions to manipulate our data and plot a stacked bar plot.

def distribution(A, B):
    cluster_order_details = cluster_details.join(cluster_order)
    
    AxB = cluster_details.groupby([A, B])[['Cluster']].count()
    AxB.columns = ['number_of_clusters']
    AxB = AxB.unstack().fillna(0)
    AxB = AxB.loc[
        cluster_order_details[A].unique()[
            np.argsort(cluster_order_details[f'{A}_order'].unique())
        ]
    ]
    AxB = AxB[[
            ('number_of_clusters', name)
            for name in cluster_order_details[B].unique()[
                np.argsort(cluster_order_details[f'{B}_order'].unique())
            ]
    ]]

    B_names = [x[1] for x in list(AxB.columns)]
    pred = (cluster_annotation_term_with_cells['cluster_annotation_term_set_name'] == B)
    term_by_name = cluster_annotation_term_with_cells[pred].set_index('name')
    B_colors = term_by_name.loc[B_names, 'color_hex_triplet']
    
    return AxB, B_names, B_colors
def stacked_bar_distribution(AxB, B_names, B_colors, fig_width = 6, fig_height = 6):

    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    bottom = np.zeros(len(AxB))

    for i, col in enumerate(AxB.columns):
        ax.barh(AxB.index[::-1], AxB[col][::-1], left=bottom, label=col[1], color=B_colors[i])
        bottom += np.array(AxB[col][::-1])

    ax.set_title('Distribution of %s in each %s' % (AxB.columns.names[1], AxB.index.name))
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

    plt.show()
    
    return fig, ax

We can now visualize how each lower level in the taxonomy is distributed by cluster in the upper portions of the taxonomy.

AxB, B_names, B_colors = distribution('Class', 'Subclass')
fig, ax = stacked_bar_distribution(AxB, B_names, B_colors, 6, 8)
plt.show()
/var/folders/kc/7glrmt5n67x16yj_tg86t49c0000gp/T/ipykernel_24913/3364940061.py:9: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  ax.barh(AxB.index[::-1], AxB[col][::-1], left=bottom, label=col[1], color=B_colors[i])
../_images/c646832f8e544b3de121e8c3d2fdcaf32c607a5b364d17650e9b91e9678faa98.png

Visualizing the BG taxonomy#

Term sets: Class, Subclass, Group and Cluster forms a four level Basil Ganglia taxonomy. We can visualize the taxonomy as a sunburst diagram that shows the single inheritance hierarchy through a series of rings, that are sliced for each annotation term. Each ring corresponds to a level in the hierarchy. We have ordered the rings so that the Class level is the outer most ring so that we can add in labels. Rings are sliced up and divided based on their hierarchical relationship to the parent slice. The angle of each slice is proportional to the number of clusters belonging to the term. Note that we exclude Neighborhood here as it is a much less interesting plot.

levels = ['Class', 'Subclass', 'Group', 'Cluster']
df = {}

# Copy the term order of the parent into each of the level below it.
term_with_counts['parent_order'] = ""
for idx, row in term_with_counts.iterrows():
    if pd.isna(row['parent_term_label']):
        continue
    term_with_counts.loc[idx, 'parent_order'] = term_with_counts.loc[row['parent_term_label']]['term_order']

term_with_counts = term_with_counts.reset_index()
for lvl in levels :
    pred = term_with_counts['cluster_annotation_term_set_name'] == lvl
    df[lvl] = term_with_counts[pred]
    df[lvl] = df[lvl].sort_values(['parent_order', 'term_order'])

fig, ax = plt.subplots()
fig.set_size_inches(10, 10)
size = 0.15

for i, lvl in enumerate(levels):
    
    if lvl == 'Class':
        ax.pie(df[lvl]['number_of_clusters'],
               colors=df[lvl]['color_hex_triplet'],
               labels = df[lvl]['name'],
               rotatelabels=True,
               labeldistance=1.025,
               radius=1,
               wedgeprops=dict(width=size, edgecolor=None),
               startangle=0)
    else :
        ax.pie(df[lvl]['number_of_clusters'],
               colors=df[lvl]['color_hex_triplet'],
               radius=1-i*size,
               wedgeprops=dict(width=size, edgecolor=None),
               startangle=0)

plt.show()
../_images/9d6011414504c35c6e7017a9418e6a07b5b462a0c362e72ea641da52c969ce97.png

In the next tutorial, we show how to access and use HMBA-BG gene expression data.