Human-Mammalian Brain - Basal Ganglia 10X snRANSeq analysis: clustering and annotations

Human-Mammalian Brain - Basal Ganglia 10X snRANSeq analysis: clustering and annotations#

The basal ganglia (BG) are a system of interconnected brain structures that play a crucial role in motor control, learning, behavior, and emotion. With approximately 200 million neurons in the human basal ganglia alone, these structures are involved in a wide range of neurological processes and are implicated in numerous disorders affecting human health, including Parkinson’s disease, Huntington’s disease, and substance abuse disorders. To further understand the complexity of the basal ganglia, researchers have historically classified its neurons into various types based on their cytoarchitecture, connectivity, molecular profile, and functional properties. However, recent advancements in high-throughput transcriptomic profiling have revolutionized our ability to systematically categorize these cell types within species, while the maturation of machine learning technologies have enabled the integration of these taxonomies across species.

Our consensus basal ganglia cell type taxonomy is the result of iterative clustering and cross-species integration of transcriptomic data. The taxonomy encompasses neurons from key structures within the basal ganglia, including the caudate (Ca), putamen (Pu), nucleus accumbens (NAc), the external and internal segments of the globus pallidus (GPe, GPi), subthalamic nucleus (STN), and substantia nigra (SN). By combining data from multiple primate and rodent species, we have developed a consensus taxonomy that highlights both conserved and species-specific cell types. We validate our taxonomy through marker gene expression analysis, comparison with previously published taxonomies, and self-projection, ensuring the accuracy and robustness of each level in the taxonomic hierarchy.

In a related tutorial, we also show how to access and use HMBA-BG gene expression data.

You need to be connected to the internet to run this notebook or connected to a cache that has the HMBA-BG data downloaded already.

The notebook presented here shows quick visualizations from precomputed metadata in the atlas. For examples on accessing the expression matrices, specifically selecting genes from expression matrices, see the general_acessing_10x_snRNASeq_tutorial.ipynb tutorial/example.

import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object downloads data requested by the user, tracks which files have already been downloaded to your local system, and serves the path to the requested data on disk. For metadata, the cache can also directly serve a up a Pandas DataFrame. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you would like to download the data to your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20250531/manifest.json'

Data overview#

We’ll quickly walk through the data we will be using in this notebook. The HMBA-BG 10X datasets are located across several directories listed in the ABCProjectCache. In this notebook, we will be looking at the Aligned HMBA-BG 10X dataset. These data combine the there species with an aligned set of ~16k across the species. We will be using data and metadata from the following directories:

HMBA-10xMultiome-BG-Aligned
HMBA-BG-taxonomy-CCN20250428

Note that data for each individual species is available in the directory HMBA-10xMultiome-BG-Aligned.

Below we list the data and metadata in the HMBA-10xMultiome-BG-Aligned dataset.

print("HMBA-10xMultiome-BG-Aligned: gene expression data (h5ad)\n\t", abc_cache.list_data_files(directory='HMBA-10xMultiome-BG-Aligned'))
print("HMBA-10xMultiome-BG-Aligned: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='HMBA-10xMultiome-BG-Aligned'))

HMBA-10xMultiome-BG-Aligned: gene expression data (h5ad)
	 ['HMBA-10xMultiome-BG-Aligned/log2', 'HMBA-10xMultiome-BG-Aligned/raw']
HMBA-10xMultiome-BG-Aligned: metadata (csv)
	 ['cell_metadata', 'donor', 'example_gene_expression', 'gene', 'library', 'value_sets']

We will also use metadata from the HMBA-BG taxonomy directory. Below is the list of available files:

print("HMBA-BG-taxonomy-CCN20250428: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='HMBA-BG-taxonomy-CCN20250428'))

HMBA-BG-taxonomy-CCN20250428: metadata (csv)
	 ['abbreviation_term', 'cell_2d_embedding_coordinates', 'cell_to_cluster_membership', 'cluster', 'cluster_annotation_term', 'cluster_annotation_term_set', 'cluster_annotation_to_abbreviation_map', 'cluster_to_cluster_annotation_membership']

Cell metadata#

Essential cell metadata is stored as a CSV file that we load as a Pandas DataFrame. Each row represents one cell indexed by a cell label. The cell label is the concatenation of barcode and name of the sample. In this context, the sample is the barcoded cell sample that represents a single load into one port of the 10x Chromium. Note that cell barcodes are only unique within a single barcoded cell sample and that the same barcode can be reused. This metadata file contains cells across all species in the HMBA-BG dataset.

Each cell is associated with a library label, donor label, alignment_job_id, feature_matrix_label and dataset_label identifying which data package this cell is part of. This metadata file will be combined with other metadata files that ship with this package to add information associated with the donor, UMAP coordinates, cluster assignments, and more.

Below, we load the first of the metadata used in this tutorial. This represents the cell metadata for the aligned dataset.

The command we use below both downloads the data if it is not already available in the local cache and loads the data as a Pandas DataFrame. This pattern of loading metadata is repeated throughout the tutorials.

cell = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='cell_metadata',
    dtype={'cell_label': str}
).set_index('cell_label')
print("Number of cells = ", len(cell))
cell.head()

/Users/chris.morrison/src/miniconda3/envs/abc_atlas_access/lib/python3.12/site-packages/abc_atlas_access/abc_atlas_cache/abc_project_cache.py:429: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  return pd.read_csv(path, **kwargs)

Number of cells =  2035898

	cell_barcode	donor_label	barcoded_cell_sample_label	library_label	alignment_job_id	doublet_score	umi_count	feature_matrix_label	dataset_label	abc_sample_id
cell_label
AAACAGCCAAATGCCC-2362_A05	AAACAGCCAAATGCCC	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.027027	15259.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	38447b19-a5dc-4eca-9021-56ae191e8809
AAACAGCCAATTGAGA-2362_A05	AAACAGCCAATTGAGA	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.054795	20645.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	3dc4dc1f-4b8a-4012-bb07-e3906ad70da0
AAACAGCCAGCATGTC-2362_A05	AAACAGCCAGCATGTC	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.000000	2551.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	291d6ebf-ca70-4d9d-8dde-10fa841dba93
AAACAGCCATTGACAT-2362_A05	AAACAGCCATTGACAT	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.000000	2341.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	12153419-9b49-4d5f-a10d-ddb7837a729d
AAACAGCCATTGTGGC-2362_A05	AAACAGCCATTGTGGC	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.027397	8326.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	5716499d-d69d-489e-b4ed-6841b6ba051d

We can use pandas groupby function to see how many unique items are associated for each field and list them out if the number of items is small.

def print_column_info(df):
    
    for c in df.columns:
        grouped = df[[c]].groupby(c).count()
        members = ''
        if len(grouped) < 30:
            members = str(list(grouped.index))
        print("Number of unique %s = %d %s" % (c, len(grouped), members))

print_column_info(cell)

Number of unique cell_barcode = 690037 
Number of unique donor_label = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique barcoded_cell_sample_label = 417 
Number of unique library_label = 417 
Number of unique alignment_job_id = 303 
Number of unique doublet_score = 1426 
Number of unique umi_count = 86532 
Number of unique feature_matrix_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique dataset_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique abc_sample_id = 1898081 

Donor and Library metadata#

The first two associated metadata we load are the donor and library tables. The donor table contains species, sex, and age information. The library table contains information on 10X methods and brain region of interest the tissue was extracted from.

donor = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='donor'
).set_index('donor_label')
donor.head()

	donor_species	species_scientific_name	species_common_name	donor_sex	donor_age	donor_age_value	donor_age_unit
donor_label
QM23.50.003	NCBITaxon:9544	Macaca mulatta	Macaque	Male	6 yrs	6.0	years
QM21.26.001	NCBITaxon:9544	Macaca mulatta	Macaque	Male	6 yrs	6.0	years
Q19.26.010	NCBITaxon:9544	Macaca mulatta	Macaque	Female	10 yrs	10.0	years
H24.30.003	NCBITaxon:9606	Homo sapiens	Human	Female	19 yrs	19.0	years
H21.30.004	NCBITaxon:9606	Homo sapiens	Human	Male	57 yrs	57.0	years

library = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='library'
).set_index('library_label')
library.head()

	library_method	barcoded_cell_sample_label	enrichment_population	cell_specimen_type	parcellation_term_identifier	region_of_interest_name	region_of_interest_label	anatomical_division_label	donor_label
library_label
L8XR_240222_21_H03	10xMultiome;GEX	2077_A02	50% NeuN+, 35% OLIG2-, 15% OLIG2+	Nuclei	DHBA:11538	caudal putamen	PuC	Basal ganglia	QM23.50.003
L8XR_220428_02_A05	10xMultiome;GEX	1224_A04	10% NeuN+, 49% OLIG2-, 25% OLIG2+, 16% Nurr1+	Nuclei	DHBA:12261	ventral tegmental area	VTA	Basal ganglia	QM21.26.001
L8XR_220428_02_C04	10xMultiome;GEX	1222_A02	60% NeuN+, 27% OLIG2-, 13% OLIG2+	Nuclei	DHBA:10466	subthalamic nucleus	STH	Basal ganglia	Q19.26.010
L8XR_240705_01_A06	10xMultiome;GEX	2305_E01	50% NeuN+, 35% OLIG2-, 15% OLIG2+	Nuclei	DHBA:11537	rostral putamen	PuR	Basal ganglia	H24.30.003
L8XR_240919_21_B05	10xMultiome;GEX	2453_A02	70% NeuN+, 20% OLIG2-, 10% OLIG2+	Nuclei	DHBA:10345	Ventral pallidus	VeP	Basal ganglia	H21.30.004

We combine these into an extended cell metadata table.

cell_extended = cell.join(donor, on='donor_label')
cell_extended = cell_extended.join(library, on='library_label', rsuffix='_library_table')

Below we use a pandas groupby to show the number of cells from species.

cell_extended.groupby('species_common_name')[['feature_matrix_label']].count()

	feature_matrix_label
species_common_name
Human	1051428
Macaque	569895
Marmoset	414575

We can also use the groupby function to show the number of cells in each region of interest extracted from the BG.

cell_extended.groupby('region_of_interest_name')[['region_of_interest_label']].count()

	region_of_interest_label
region_of_interest_name
Ventral pallidus	39593
body of caudate	188438
brain	414575
caudal putamen	166294
caudate nucleus	27964
core of nucleus accumbens	19283
external segment of globus pallidus	205322
globus pallidus	4900
head of caudate	132325
internal segment of globus pallidus	134560
nucleus accumbens	157981
peri-caudate ependymal and subependymal zone	10428
posteroventral putamen	74681
putamen	35092
rostral putamen	196452
shell of nucleus accumbens	9525
substantia nigra	40396
subthalamic nucleus	63348
tail of caudate	90271
ventral tegmental area	24470

Adding color and feature order#

Each feature in the donor and library table comes is associated with unique colors and an ordering with the set of values. Below we load the value_sets DataFrame which is a mapping from the various value in the donor and species tables to those colors and orderings. We incorporate these values into the cell metadata table.

value_sets = abc_cache.get_metadata_dataframe(
    directory='HMBA-10xMultiome-BG-Aligned',
    file_name='value_sets'
).set_index('label')
value_sets.head()

	table	field	description	order	external_identifier	parent_label	color_hex_triplet	comment	Unnamed: 9	Unnamed: 10	Unnamed: 11	Unnamed: 12
label
Female	donor	donor_sex	Female	1	NaN	NaN	#565353	NaN	NaN	NaN	NaN	NaN
Male	donor	donor_sex	Male	2	NaN	NaN	#ADC4C3	NaN	NaN	NaN	NaN	NaN
Human	donor	species_common_name	Human	1	NaN	NaN	#377eb8	NaN	NaN	NaN	NaN	NaN
Macaque	donor	species_common_name	Macaque	2	NaN	NaN	#4daf4a	NaN	NaN	NaN	NaN	NaN
Marmoset	donor	species_common_name	Marmoset	3	NaN	NaN	#FF5F5D	NaN	NaN	NaN	NaN	NaN

We define a convince function to add colors for the various values in the data (e.g. unique region of interest or donor sex values).

def extract_value_set(cell_metadata_df: pd.DataFrame, input_value_set: pd.DataFrame, input_value_set_label: str):
    """Add color and order columns to the cell metadata dataframe based on the input
    value set.

    Columns are added as {input_value_set_label}_color and {input_value_set_label}_order.

    Parameters
    ----------
    cell_metadata_df : pd.DataFrame
        DataFrame containing cell metadata.
    input_value_set : pd.DataFrame
        DataFrame containing the value set information.
    input_value_set_label : str
        The the column name to extract color and order information for. will be added to the cell metadata.
    """
    cell_metadata_df[f'{input_value_set_label}_color'] = input_value_set[
        input_value_set['field'] == input_value_set_label
    ].loc[cell_metadata_df[input_value_set_label]]['color_hex_triplet'].values
    cell_metadata_df[f'{input_value_set_label}_order'] = input_value_set[
        input_value_set['field'] == input_value_set_label
    ].loc[cell_metadata_df[input_value_set_label]]['order'].values

Use our function to add the relevant color and order columns to our cell_metadata table.

# Add region of interest color and order
extract_value_set(cell_extended, value_sets, 'region_of_interest_label')
# Add species common name color and order
extract_value_set(cell_extended, value_sets, 'species_common_name')
# Add species scientific name color and order
extract_value_set(cell_extended, value_sets, 'species_scientific_name')
# Add donor sex color and order
extract_value_set(cell_extended, value_sets, 'donor_sex')
cell_extended.head()

	cell_barcode	donor_label	barcoded_cell_sample_label	library_label	alignment_job_id	doublet_score	umi_count	feature_matrix_label	dataset_label	abc_sample_id	...	anatomical_division_label	donor_label_library_table	region_of_interest_label_color	region_of_interest_label_order	species_common_name_color	species_common_name_order	species_scientific_name_color	species_scientific_name_order	donor_sex_color	donor_sex_order
cell_label
AAACAGCCAAATGCCC-2362_A05	AAACAGCCAAATGCCC	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.027027	15259.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	38447b19-a5dc-4eca-9021-56ae191e8809	...	Basal ganglia	H24.30.001	#53DB33	15	#377eb8	1	#377eb8	1	#ADC4C3	2
AAACAGCCAATTGAGA-2362_A05	AAACAGCCAATTGAGA	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.054795	20645.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	3dc4dc1f-4b8a-4012-bb07-e3906ad70da0	...	Basal ganglia	H24.30.001	#53DB33	15	#377eb8	1	#377eb8	1	#ADC4C3	2
AAACAGCCAGCATGTC-2362_A05	AAACAGCCAGCATGTC	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.000000	2551.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	291d6ebf-ca70-4d9d-8dde-10fa841dba93	...	Basal ganglia	H24.30.001	#53DB33	15	#377eb8	1	#377eb8	1	#ADC4C3	2
AAACAGCCATTGACAT-2362_A05	AAACAGCCATTGACAT	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.000000	2341.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	12153419-9b49-4d5f-a10d-ddb7837a729d	...	Basal ganglia	H24.30.001	#53DB33	15	#377eb8	1	#377eb8	1	#ADC4C3	2
AAACAGCCATTGTGGC-2362_A05	AAACAGCCATTGTGGC	H24.30.001	2362_A05	L8XR_240808_01_E02	8a4bf81821a0f425be8ba9c15dfad6b509020312	0.027397	8326.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	5716499d-d69d-489e-b4ed-6841b6ba051d	...	Basal ganglia	H24.30.001	#53DB33	15	#377eb8	1	#377eb8	1	#ADC4C3	2

5 rows × 34 columns

UMAP spatial embedding#

Now that we’ve merged our donor and library metadata into the main cells data, our next step is to put these values in a Uniform Manifold Approximation and Projection (UMAP) for cells in the dataset. The UMAP is a dimension reduction technique that can be used for visualizing and exploring large-dimension datasets.

Below we load this 2-D embedding for a sub selection of our cells and merge the x-y coordinates into the extended cell metadata we are creating.

cell_2d_embedding_coordinates = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cell_2d_embedding_coordinates'
).set_index('cell_label')
cell_2d_embedding_coordinates.head()

	x	y
cell_label
AAACAGCCAAATGCCC-2362_A05	0.452302	2.938630
AAACAGCCAATTGAGA-2362_A05	0.051495	16.282684
AAACAGCCAGCATGTC-2362_A05	-1.233702	8.666612
AAACAGCCATTGACAT-2362_A05	0.517126	15.368365
AAACAGCCATTGTGGC-2362_A05	-3.697715	-1.647361

cell_extended = cell_extended.join(cell_2d_embedding_coordinates)
cell_extended = cell_extended.sample(frac=1)

We define a small helper function plot umap to visualize the cells on the UMAP. In the examples below we will plot associated cell information colorized by dissection donor species, sex, and region of interest.

def plot_umap(
    xx,
    yy,
    cc=None,
    val=None,
    fig_width=8,
    fig_height=8,
    cmap=None,
    labels=None,
    term_orders=None,
    colorbar=False,
    sizes=None
):
    """
    Plot a scatter plot of the UMAP coordinates.

    Parameters
    ----------
    xx : array-like
        x-coordinates of the points to plot.
    yy : array-like
        y-coordinates of the points to plot.
    cc : array-like, optional
        colors of the points to plot. If None, the points will be colored by the values in `val`.
    val : array-like, optional
        values of the points to plot. If None, the points will be colored by the values in `cc`.
    fig_width : float, optional
        width of the figure in inches. Default is 8.
    fig_height : float, optional
        height of the figure in inches. Default is 8.
    cmap : str, optional
        colormap to use for coloring the points. If None, the points will be colored by the values in `cc`.
    labels : array-like, optional
        labels for the points to plot. If None, no labels will be added to the plot.
    term_orders : array-like, optional
        order of the labels for the legend. If None, the labels will be ordered by their appearance in `labels`.
    colorbar : bool, optional
        whether to add a colorbar to the plot. Default is False.
    sizes : array-like, optional
        sizes of the points to plot. If None, all points will have the same size.
    """
    if sizes is None:
        sizes = 1
    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    if cmap is not None:
        scatt = ax.scatter(xx, yy, c=val, s=0.5, marker='.', cmap=cmap, alpha=sizes)
    elif cc is not None:
        scatt = ax.scatter(xx, yy, c=cc, s=0.5, marker='.', alpha=sizes)

    if labels is not None:
        from matplotlib.patches import Rectangle
        unique_label_colors = (labels + ',' + cc).unique()
        unique_labels = np.array([label_color.split(',')[0] for label_color in unique_label_colors])
        unique_colors = np.array([label_color.split(',')[1] for label_color in unique_label_colors])

        if term_orders is not None:
            unique_order = term_orders.unique()
            term_order = np.argsort(unique_order)
            unique_labels = unique_labels[term_order]
            unique_colors = unique_colors[term_order]
            
        rects = []
        for color in unique_colors:
            rects.append(Rectangle((0, 0), 1, 1, fc=color))

        legend = ax.legend(rects, unique_labels, loc=1)
        # ax.add_artist(legend)

    if colorbar:
        fig.colorbar(scatt, ax=ax)
    
    return fig, ax

cell_with_umap = cell_extended[~pd.isna(cell_extended['x'])]
fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['species_common_name_color'],
    labels=cell_with_umap['species_common_name'],
    term_orders=cell_with_umap['species_common_name_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("species_common_name")
plt.show()

../_images/7cf0499e216150af565118cf480a2a264cd4e0b07e5172cdcadac4b748382e77.png

fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['species_scientific_name_color'],
    labels=cell_with_umap['species_scientific_name'],
    term_orders=cell_with_umap['species_scientific_name_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("species_scientific_name")
plt.show()

../_images/1ea9754b2e84e956c5fc92b6014a434da317a8bf14e2f26c97adc41a7a2d77e6.png

fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['donor_sex_color'],
    labels=cell_with_umap['donor_sex'],
    term_orders=cell_with_umap['donor_sex_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("donor_sex")
plt.show()

../_images/df3eead785d51177d4cdf498c89f55c6109ed8e4d3c5a86edbacd42bb332f730.png

Below we show the region of interest for the three species. Note, however, that Marmoset does not currently have fine grained ROIs available and is marked as Br - Brain.

fig, ax = plot_umap(
    cell_with_umap['x'],
    cell_with_umap['y'],
    cc=cell_with_umap['region_of_interest_label_color'],
    labels=cell_with_umap['region_of_interest_label'],
    term_orders=cell_with_umap['region_of_interest_label_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("region_of_interest_label")
plt.show()

../_images/42335bab99897d04843572b4d264782be1a96acb2e65eaf942c66bc93ede3dee.png

Taxonomy Information#

The final set of metadata we load into our extended cell metadata file maps the cells into their assigned cluster in the taxonomy. We additionally load metadata regarding the clusters and compute useful information, such as the number of cells in each taxon at each level of the taxonomy.

First, we load the table that describes the levels in the taxonomy from Neighborhood at the highest to Cluster at the lowest level.

cluster = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster',
    dtype={'number_of_cells': 'Int64'}
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster.head()

	cluster_alias	number_of_cells
cluster_annotation_term_label
CS20250428_CLUST_0161	Human-143	91
CS20250428_CLUST_0162	Human-145	1783
CS20250428_CLUST_0163	Human-146	172
CS20250428_CLUST_0164	Human-149	2649
CS20250428_CLUST_0165	Human-150	1359

Next, we load information associated with each Cluster in the taxonomy. This includes a useful alias value for each cluster as well as the number of cells in each cluster.

cluster_annotation_term_set = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster_annotation_term_set'
).rename(columns={'label': 'cluster_annotation_term_label'})
cluster_annotation_term_set

	cluster_annotation_term_label	name	description	order
0	CCN20250428_LEVEL_0	Neighborhood	Neighborhood	0
1	CCN20250428_LEVEL_1	Class	Class	1
2	CCN20250428_LEVEL_2	Subclass	Subclass	2
3	CCN20250428_LEVEL_3	Group	Group	3
4	CCN20250428_LEVEL_4	Cluster	Cluster	4

Finally, for the clusters, we load information on the annotations for each cluster. This also includes the term order and color information which we will use to plot later.

cluster_annotation_term = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster_annotation_term',
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster_annotation_term.head()

	name	cluster_annotation_term_set_label	cluster_annotation_term_set_name	color_hex_triplet	term_order	term_set_order	parent_term_label	parent_term_name	parent_term_set_label
cluster_annotation_term_label
CS20250428_NEIGH_0001	Nonneuron	CCN20250428_LEVEL_0	Neighborhood	#f2ca7d	1	0	NaN	NaN	NaN
CS20250428_NEIGH_0000	Glut Sero Dopa	CCN20250428_LEVEL_0	Neighborhood	#91f4bb	2	0	NaN	NaN	NaN
CS20250428_NEIGH_0002	Subpallium GABA	CCN20250428_LEVEL_0	Neighborhood	#19613b	3	0	NaN	NaN	NaN
CS20250428_NEIGH_0003	Subpallium GABA-Glut	CCN20250428_LEVEL_0	Neighborhood	#7e1d19	4	0	NaN	NaN	NaN
CS20250428_CLASS_0000	Astro-Epen	CCN20250428_LEVEL_1	Class	#6ec0da	1	1	CS20250428_NEIGH_0001	Nonneuron	CCN20250428_LEVEL_0

We merge this table with information from our clusters. Note that not all clusters are associated into the taxonomy hence we use an left join here to use only those clusters with annotations.

cluster_annotation_term_with_cells = cluster_annotation_term.join(
    cluster, how='left'
)

While we have information on the number of cells in each cluster, we need to sum these to get the number of cells in each of the upper levels of the taxonomy. Below we iterate through each level from the lowest to highest, summing the number of cells as we go.

for cluster_annotation_term_label in reversed(sorted(cluster_annotation_term_with_cells['cluster_annotation_term_set_label'].unique())):
    sub_terms = cluster_annotation_term_with_cells[
        cluster_annotation_term_with_cells['cluster_annotation_term_set_label'] == cluster_annotation_term_label
    ]
    parents = sub_terms['parent_term_label'].unique()
    if np.any(pd.isna(sub_terms['number_of_cells'])):
            print("Warning nan values:", sub_terms[pd.isna(sub_terms['number_of_cells'])].index)
            sub_terms
    for parent in parents:
        if pd.isna(parent):
            continue
        n_cells = sub_terms[
            sub_terms['parent_term_label'] == parent
        ]['number_of_cells'].sum()
        cluster_annotation_term_with_cells.loc[parent, 'number_of_cells'] = n_cells

# show the new number of cells column.
cluster_annotation_term_with_cells[['number_of_cells']].head()

	number_of_cells
cluster_annotation_term_label
CS20250428_NEIGH_0001	815521
CS20250428_NEIGH_0000	34315
CS20250428_NEIGH_0002	1045252
CS20250428_NEIGH_0003	1045
CS20250428_CLASS_0000	224824

Finally, we load the cluster to cluster annotation membership table to help us create associated metadata for each cluster to tell us the annotations for a given Cluster at all levels of the taxonomy. We’ll use this in groupbys to give each cluster the metadata for each of its parents. This will be useful when we merge into the cell metadata table.

cluster_to_cluster_annotation_membership = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cluster_to_cluster_annotation_membership',
).set_index('cluster_annotation_term_label')
membership_with_cluster_info = cluster_to_cluster_annotation_membership.join(cluster_annotation_term_with_cells, rsuffix='_anno_term').reset_index()
membership_groupby = membership_with_cluster_info.groupby(
    ['cluster_alias', 'cluster_annotation_term_set_name']
)
membership_with_cluster_info.head()

	cluster_annotation_term_label	cluster_annotation_term_set_label	cluster_alias	cluster_annotation_term_set_name	cluster_annotation_term_name	name	cluster_annotation_term_set_label_anno_term	cluster_annotation_term_set_name_anno_term	color_hex_triplet	term_order	term_set_order	parent_term_label	parent_term_name	parent_term_set_label	cluster_alias_anno_term	number_of_cells
0	CS20250428_CLUST_0161	CCN20250428_LEVEL_4	Human-143	Cluster	Human-143	Human-143	CCN20250428_LEVEL_4	Cluster	#4ac0ed	0	4	CS20250428_GROUP_0039	Astrocyte	CCN20250428_LEVEL_3	Human-143	91
1	CS20250428_CLUST_0162	CCN20250428_LEVEL_4	Human-145	Cluster	Human-145	Human-145	CCN20250428_LEVEL_4	Cluster	#8af851	1	4	CS20250428_GROUP_0039	Astrocyte	CCN20250428_LEVEL_3	Human-145	1783
2	CS20250428_CLUST_0163	CCN20250428_LEVEL_4	Human-146	Cluster	Human-146	Human-146	CCN20250428_LEVEL_4	Cluster	#d1dd68	2	4	CS20250428_GROUP_0039	Astrocyte	CCN20250428_LEVEL_3	Human-146	172
3	CS20250428_CLUST_0164	CCN20250428_LEVEL_4	Human-149	Cluster	Human-149	Human-149	CCN20250428_LEVEL_4	Cluster	#95daf6	3	4	CS20250428_GROUP_0039	Astrocyte	CCN20250428_LEVEL_3	Human-149	2649
4	CS20250428_CLUST_0165	CCN20250428_LEVEL_4	Human-150	Cluster	Human-150	Human-150	CCN20250428_LEVEL_4	Cluster	#26827e	4	4	CS20250428_GROUP_0039	Astrocyte	CCN20250428_LEVEL_3	Human-150	1359

From the membership table, we create three tables via a groupby. First the name of each cluster and its parents.

# term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[cluster_annotation_term_set['name']] # order columns
cluster_details.fillna('Other', inplace=True)
cluster_details.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_details.head()

cluster_annotation_term_set_name	Neighborhood	Class	Subclass	Group	Cluster
cluster_alias
Human-128	Glut Sero Dopa	F M Glut	F Glut	BF SKOR1 Glut	Human-128
Human-129	Glut Sero Dopa	F M Glut	F Glut	BF SKOR1 Glut	Human-129
Human-130	Glut Sero Dopa	F M Glut	F Glut	BF SKOR1 Glut	Human-130
Human-423	Glut Sero Dopa	F M Glut	F Glut	BF SKOR1 Glut	Human-423
Human-426	Glut Sero Dopa	F M Glut	F Glut	BF SKOR1 Glut	Human-426

Next the plotting order of each of the clusters and their parents.

cluster_order = membership_groupby['term_order'].first().unstack()
cluster_order.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_order.rename(
    columns={'Neighborhood': 'Neighborhood_order',
             'Class': 'Class_order',
             'Subclass': 'Subclass_order',
             'Group': 'Group_order',
             'Cluster': 'Cluster_order'},
    inplace=True
)
cluster_order.head()

cluster_annotation_term_set_name	Class_order	Cluster_order	Group_order	Neighborhood_order	Subclass_order
cluster_alias
Human-143	1	0	1	1	1
Human-145	1	1	1	1	1
Human-146	1	2	1	1	1
Human-149	1	3	1	1	1
Human-150	1	4	1	1	1

Finally, the colors we will use to plot for each of the unique taxons at all levels.

cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[cluster_annotation_term_set['name']]
cluster_colors.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_colors.head()

cluster_annotation_term_set_name	Neighborhood	Class	Subclass	Group	Cluster
cluster_alias
Marmoset-323	#19613b	#26e8bb	#1bc06a	#fc2b80	#00d86e
Marmoset-307	#19613b	#26e8bb	#1bc06a	#fc2b80	#0454b2
Marmoset-325	#19613b	#26e8bb	#1bc06a	#fc2b80	#094a6f
Human-346	#19613b	#26e8bb	#1bc06a	#fc2b80	#0f6331
Marmoset-301	#19613b	#26e8bb	#1bc06a	#fc2b80	#1140be

Next, we bring it all together by loading the mapping of cells to cluster and join into our final metadata table. Note here that not every cell is currently associated into the taxonomy hence the NaN values for many of the taxonomy information columns.

cell_to_cluster_membership = abc_cache.get_metadata_dataframe(
    directory='HMBA-BG-taxonomy-CCN20250428',
    file_name='cell_to_cluster_membership',
).set_index('cell_label')
cell_to_cluster_membership.head()

	cluster_alias	cluster_label
cell_label
AAACAGCCAAATGCCC-2362_A05	Human-451	CS20250428_CLUST_0268
AAACAGCCAATTGAGA-2362_A05	Human-1	CS20250428_CLUST_0227
AAACAGCCAGCATGTC-2362_A05	Human-153	CS20250428_CLUST_0215
AAACAGCCATTGACAT-2362_A05	Human-1	CS20250428_CLUST_0227
AAACAGCCATTGTGGC-2362_A05	Human-14	CS20250428_CLUST_0249

cell_extended_with_tax = cell_extended.join(cell_to_cluster_membership, rsuffix='_cell_to_cluster_membership')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_details, on='cluster_alias')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_order, on='cluster_alias')
cell_extended_with_tax.head()

	cell_barcode	donor_label	barcoded_cell_sample_label	library_label	alignment_job_id	doublet_score	umi_count	feature_matrix_label	dataset_label	abc_sample_id	...	Neighborhood_color	Class_color	Subclass_color	Group_color	Cluster_color	Class_order	Cluster_order	Group_order	Neighborhood_order	Subclass_order
cell_label
GTCAGGCTCAGGCCTA-2305_C01	GTCAGGCTCAGGCCTA	H24.30.003	2305_C01	L8XR_240705_01_H06	93c6ec45bd2a1ca18516918a7f1a448e84006372	0.110000	23061.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	d59b11a7-a50d-472e-bc89-9fb5358044fa	...	#19613b	#d0b83c	#1655f2	#339933	#dafd5a	10.0	1078.0	51.0	3.0	31.0
TATTACCTCAAAGGCA-2295_D02	TATTACCTCAAAGGCA	H24.30.003	2295_D02	L8XR_240705_01_G05	c9b036ef4c6ddb8545718e5ea3747fc20c51ebfb	0.026667	12546.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	8e66d017-0e3a-462b-bb7a-fb4dc8f297ec	...	#19613b	#d0b83c	#253c8c	#aec7e8	#c33059	10.0	1176.0	53.0	3.0	32.0
TAACCAGGTTGGCGTG-2019_C02	TAACCAGGTTGGCGTG	QM23.50.002	2019_C02	L8XR_240118_21_C03	8268a0d34c53fa6bdd8887d7a6b9b612732c8777	0.000000	4430.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	e4f0f167-cce3-4797-8e95-9abf424ae81e	...	#f2ca7d	#6ec0da	#604f15	#e16c95	#83d8d6	1.0	26.0	1.0	1.0	1.0
GTGTGTTAGCGCATTG-1750_B01	GTGTGTTAGCGCATTG	QM23.50.001	1750_B01	L8XR_230629_22_H11	50c29f65910dc6faa1990f1ff9601c25bb909822	0.000000	9746.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	f3ce9856-7e08-451b-afce-94f3923224d5	...	#19613b	#ce4c27	#eea495	#708297	#ef6b52	7.0	549.0	29.0	3.0	19.0
CTCCGGACATAATCGT-1959_C02	CTCCGGACATAATCGT	H23.30.001	1959_C02	L8XR_231116_02_B09	d7924909ac635d27a5aaf99c11ea0373b9c3bb07	0.010753	10972.0	HMBA-10xMultiome-BG-Aligned	HMBA-10xMultiome-BG-Aligned	efc4e716-8a28-4bdb-8ca1-771bdcc10c17	...	#f2ca7d	#6ec0da	#604f15	#e16c95	#238c7a	1.0	12.0	1.0	1.0	1.0

5 rows × 53 columns

print_column_info(cell_extended)

Number of unique cell_barcode = 690037 
Number of unique donor_label = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique barcoded_cell_sample_label = 417 
Number of unique library_label = 417 
Number of unique alignment_job_id = 303 
Number of unique doublet_score = 1426 
Number of unique umi_count = 86532 
Number of unique feature_matrix_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique dataset_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique abc_sample_id = 1898081 
Number of unique donor_species = 4 ['NCBITaxon:9483', 'NCBITaxon:9544', 'NCBITaxon:9545', 'NCBITaxon:9606']
Number of unique species_scientific_name = 4 ['Callithrix jacchus', 'Homo sapiens', 'Macaca mulatta', 'Macaca nemestrina']
Number of unique species_common_name = 3 ['Human', 'Macaque', 'Marmoset']
Number of unique donor_sex = 2 ['Female', 'Male']
Number of unique donor_age = 20 ['10 yrs', '11 yrs', '14 yrs', '18 yrs', '19 yrs', '2.0 yrs', '25 yrs', '38 yrs', '4.0 yrs', '44 yrs', '5 yrs', '5.4 yrs', '50 yrs', '57 yrs', '58 yrs', '6 yrs', '6.6 yrs', '60 yrs', '61 yrs', '7 yrs']
Number of unique donor_age_value = 20 [1.95890411, 3.956284153, 5.0, 5.391780822, 6.0, 6.555924845, 7.0, 10.0, 11.0, 14.0, 18.0, 19.0, 25.0, 38.0, 44.0, 50.0, 57.0, 58.0, 60.0, 61.0]
Number of unique donor_age_unit = 1 ['years']
Number of unique library_method = 1 ['10xMultiome;GEX']
Number of unique barcoded_cell_sample_label_library_table = 417 
Number of unique enrichment_population = 83 
Number of unique cell_specimen_type = 1 ['Nuclei']
Number of unique parcellation_term_identifier = 20 ['DHBA:10155', 'DHBA:10334', 'DHBA:10335', 'DHBA:10336', 'DHBA:10337', 'DHBA:10338', 'DHBA:10339', 'DHBA:10340', 'DHBA:10341', 'DHBA:10342', 'DHBA:10343', 'DHBA:10344', 'DHBA:10345', 'DHBA:10466', 'DHBA:11537', 'DHBA:11538', 'DHBA:12251', 'DHBA:12261', 'DHBA:146034742', 'DHBA:146034754']
Number of unique region_of_interest_name = 20 ['Ventral pallidus', 'body of caudate', 'brain', 'caudal putamen', 'caudate nucleus', 'core of nucleus accumbens', 'external segment of globus pallidus', 'globus pallidus', 'head of caudate', 'internal segment of globus pallidus', 'nucleus accumbens', 'peri-caudate ependymal and subependymal zone', 'posteroventral putamen', 'putamen', 'rostral putamen', 'shell of nucleus accumbens', 'substantia nigra', 'subthalamic nucleus', 'tail of caudate', 'ventral tegmental area']
Number of unique region_of_interest_label = 20 ['Br', 'Ca', 'CaB', 'CaH', 'CaT', 'Eca', 'GP', 'GPe', 'GPi', 'NAC', 'NACc', 'NACs', 'Pu', 'PuC', 'PuPV', 'PuR', 'SN', 'STH', 'VTA', 'VeP']
Number of unique anatomical_division_label = 2 ['Basal ganglia', 'Brain']
Number of unique donor_label_library_table = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique region_of_interest_label_color = 20 ['#0010d9', '#0060ff', '#00bfa0', '#00bfff', '#53DB33', '#7AE361', '#89522A', '#907991', '#9BA2FF', '#9b19f5', '#A7DB33', '#B8003A', '#C29515', '#E0E0E0', '#E60049', '#c941a7', '#e6d800', '#fe0ccf', '#ff6b9a', '#ffa300']
Number of unique region_of_interest_label_order = 20 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Number of unique species_common_name_color = 3 ['#377eb8', '#4daf4a', '#FF5F5D']
Number of unique species_common_name_order = 3 [1, 2, 3]
Number of unique species_scientific_name_color = 4 ['#377eb8', '#4daf4a', '#FF5F5D', '#b2df8a']
Number of unique species_scientific_name_order = 4 [1, 2, 3, 4]
Number of unique donor_sex_color = 2 ['#565353', '#ADC4C3']
Number of unique donor_sex_order = 2 [1, 2]
Number of unique x = 1844471 
Number of unique y = 1820193 

Plotting the taxonomy#

Now that we have our cells with associated taxonomy information, we’ll plot them into the UMAP we showed previously. First, we trim down the cells to those with taxonomy information by selecting those that where the taxonomy information is not NaN.

cell_with_taxonomy = cell_extended_with_tax[~pd.isna(cell_extended_with_tax['Class_color'])]

Below we plot the taxonomy mapping of the cells for each level in the taxonomy.

fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Neighborhood_color'],
    labels=cell_with_taxonomy['Neighborhood'],
    term_orders=cell_with_taxonomy['Neighborhood_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Neighborhood")
plt.show()

../_images/9ff05db31d6eb540fccd9ae3cee1056b7d717f793330feec111a5a8871dd1e3b.png

fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Class_color'],
    labels=cell_with_taxonomy['Class'],
    term_orders=cell_with_taxonomy['Class_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Class")
plt.show()

../_images/c3e9d881733ade24ab2c601c14a856d42f647a6475528c17231019d305a89ddf.png

fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Subclass_color'],
    labels=cell_with_taxonomy['Subclass'],
    term_orders=cell_with_taxonomy['Subclass_order'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Subclass")
plt.show()

../_images/89ba0422e4de62a2b9a583fc1b540afeef698eb866f46912cd09e4f150ad8911.png

fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Group_color'],
    labels=cell_with_taxonomy['Group'],
    term_orders=cell_with_taxonomy['Group_order'],
    fig_width=16,
    fig_height=16
)
res = ax.set_title("Group")
plt.show()

../_images/503261870c1f96caafd54495e82d34daa122da5190ce218be36bd5e60deb79fb.png

fig, ax = plot_umap(
    cell_with_taxonomy['x'],
    cell_with_taxonomy['y'],
    cc=cell_with_taxonomy['Cluster_color'],
    fig_width=12,
    fig_height=12
)
res = ax.set_title("Cluster")
plt.show()

../_images/a4dba33f98cdc417792d15dbb62da8a6f7a9966236ad497f897f32562258d5ea.png

Aggregating cluster and cells counts per term#

Let’s investigate the taxonomy information a bit more. In this section, we’ll create bar plots showing the number of clusters and cells at each level in the taxonomy.

First, we need to compute the number of clusters that are in each of the cell type taxons above it.

# Count the number of clusters associated with each cluster annotation term
term_cluster_count = membership_with_cluster_info.reset_index().groupby(['cluster_annotation_term_label'])[['cluster_alias']].count()
term_cluster_count.columns = ['number_of_clusters']
term_cluster_count.head()

	number_of_clusters
cluster_annotation_term_label
CS20250428_CLASS_0000	86
CS20250428_CLASS_0001	93
CS20250428_CLASS_0002	12
CS20250428_CLASS_0003	386
CS20250428_CLASS_0005	199

We already have our number of cells computed previously, so we’ll just use them here and cluster counts into the cluster_annotation table.

# Join counts with the term dataframe
term_with_counts = cluster_annotation_term_with_cells.join(term_cluster_count)
term_with_counts[['name', 'cluster_annotation_term_set_name', 'number_of_clusters', 'number_of_cells']].head()

	name	cluster_annotation_term_set_name	number_of_clusters	number_of_cells
cluster_annotation_term_label
CS20250428_NEIGH_0001	Nonneuron	Neighborhood	295	815521
CS20250428_NEIGH_0000	Glut Sero Dopa	Neighborhood	171	34315
CS20250428_NEIGH_0002	Subpallium GABA	Neighborhood	957	1045252
CS20250428_NEIGH_0003	Subpallium GABA-Glut	Neighborhood	12	1045
CS20250428_CLASS_0000	Astro-Epen	Class	86	224824

Below we create a function to plot the cluster and cell counts in a bar graph, coloring by the associated taxon level.

def bar_plot_by_level_and_type(df, level, fig_width = 8.5, fig_height = 4):
    
    fig, ax = plt.subplots(1, 2)
    fig.set_size_inches(fig_width, fig_height)
    
    for idx, ctype in enumerate(['clusters', 'cells']):

        pred = (df['cluster_annotation_term_set_name'] == level)
        sort_order = np.argsort(df[pred]['term_order'])
        names = df[pred]['name'].iloc[sort_order]
        counts = df[pred]['number_of_%s' % ctype].iloc[sort_order]
        colors = df[pred]['color_hex_triplet'].iloc[sort_order]
        
        ax[idx].barh(names, counts, color=colors)
        ax[idx].set_title('Number of %s by %s' % (ctype,level)),
        ax[idx].set_xscale('log')
        
        if idx > 0:
            ax[idx].set_yticklabels([])

    plt.show()

Now let’s plot the counts for each of the taxonomy levels above Cluster.

bar_plot_by_level_and_type(term_with_counts, 'Neighborhood')
plt.show()

../_images/139446471f8135cd351e12f2bd778a7a1f796b94f8f660486ddf1112897d5460.png

bar_plot_by_level_and_type(term_with_counts, 'Class')
plt.show()

../_images/cb2d7213d30649b8c2210616ab55ef4b41bb72e0ea090296432fe80fb3f3b1c9.png

bar_plot_by_level_and_type(term_with_counts, 'Subclass', 8.5, 6)
plt.show()

../_images/e685ef3dd6a396329e54934f45ee8c2005cad89c3730d1d3d05dbe4a50cd5ccc.png

bar_plot_by_level_and_type(term_with_counts, 'Group', 8.5, 10)
plt.show()

../_images/b3888c1f7f18f6f4599da55bdb0c7ba1f10e61912d09dadf32c86b265837958e.png

Distribution of lower taxonomy levels in their parents.#

We can also use a similar visualization to show a given taxonomy level in another. Below we define functions to manipulate our data and plot a stacked bar plot.

def distribution(A, B):
    cluster_order_details = cluster_details.join(cluster_order)
    
    AxB = cluster_details.groupby([A, B])[['Cluster']].count()
    AxB.columns = ['number_of_clusters']
    AxB = AxB.unstack().fillna(0)
    AxB = AxB.loc[
        cluster_order_details[A].unique()[
            np.argsort(cluster_order_details[f'{A}_order'].unique())
        ]
    ]
    AxB = AxB[[
            ('number_of_clusters', name)
            for name in cluster_order_details[B].unique()[
                np.argsort(cluster_order_details[f'{B}_order'].unique())
            ]
    ]]

    B_names = [x[1] for x in list(AxB.columns)]
    pred = (cluster_annotation_term_with_cells['cluster_annotation_term_set_name'] == B)
    term_by_name = cluster_annotation_term_with_cells[pred].set_index('name')
    B_colors = term_by_name.loc[B_names, 'color_hex_triplet']
    
    return AxB, B_names, B_colors

def stacked_bar_distribution(AxB, B_names, B_colors, fig_width = 6, fig_height = 6):

    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    bottom = np.zeros(len(AxB))

    for i, col in enumerate(AxB.columns):
        ax.barh(AxB.index[::-1], AxB[col][::-1], left=bottom, label=col[1], color=B_colors[i])
        bottom += np.array(AxB[col][::-1])

    ax.set_title('Distribution of %s in each %s' % (AxB.columns.names[1], AxB.index.name))
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

    plt.show()
    
    return fig, ax

We can now visualize how each lower level in the taxonomy is distributed by cluster in the upper portions of the taxonomy.

AxB, B_names, B_colors = distribution('Class', 'Subclass')
fig, ax = stacked_bar_distribution(AxB, B_names, B_colors, 6, 8)
plt.show()

/var/folders/kc/7glrmt5n67x16yj_tg86t49c0000gp/T/ipykernel_24913/3364940061.py:9: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  ax.barh(AxB.index[::-1], AxB[col][::-1], left=bottom, label=col[1], color=B_colors[i])

../_images/c646832f8e544b3de121e8c3d2fdcaf32c607a5b364d17650e9b91e9678faa98.png

Visualizing the BG taxonomy#

Term sets: Class, Subclass, Group and Cluster forms a four level Basil Ganglia taxonomy. We can visualize the taxonomy as a sunburst diagram that shows the single inheritance hierarchy through a series of rings, that are sliced for each annotation term. Each ring corresponds to a level in the hierarchy. We have ordered the rings so that the Class level is the outer most ring so that we can add in labels. Rings are sliced up and divided based on their hierarchical relationship to the parent slice. The angle of each slice is proportional to the number of clusters belonging to the term. Note that we exclude Neighborhood here as it is a much less interesting plot.

levels = ['Class', 'Subclass', 'Group', 'Cluster']
df = {}

# Copy the term order of the parent into each of the level below it.
term_with_counts['parent_order'] = ""
for idx, row in term_with_counts.iterrows():
    if pd.isna(row['parent_term_label']):
        continue
    term_with_counts.loc[idx, 'parent_order'] = term_with_counts.loc[row['parent_term_label']]['term_order']

term_with_counts = term_with_counts.reset_index()
for lvl in levels :
    pred = term_with_counts['cluster_annotation_term_set_name'] == lvl
    df[lvl] = term_with_counts[pred]
    df[lvl] = df[lvl].sort_values(['parent_order', 'term_order'])

fig, ax = plt.subplots()
fig.set_size_inches(10, 10)
size = 0.15

for i, lvl in enumerate(levels):
    
    if lvl == 'Class':
        ax.pie(df[lvl]['number_of_clusters'],
               colors=df[lvl]['color_hex_triplet'],
               labels = df[lvl]['name'],
               rotatelabels=True,
               labeldistance=1.025,
               radius=1,
               wedgeprops=dict(width=size, edgecolor=None),
               startangle=0)
    else :
        ax.pie(df[lvl]['number_of_clusters'],
               colors=df[lvl]['color_hex_triplet'],
               radius=1-i*size,
               wedgeprops=dict(width=size, edgecolor=None),
               startangle=0)

plt.show()

../_images/9d6011414504c35c6e7017a9418e6a07b5b462a0c362e72ea641da52c969ce97.png

In the next tutorial, we show how to access and use HMBA-BG gene expression data.