Human-Mammalian Brain - Basal Ganglia 10X snRANSeq analysis: clustering and annotations#
The basal ganglia (BG) are a system of interconnected brain structures that play a crucial role in motor control, learning, behavior, and emotion. With approximately 200 million neurons in the human basal ganglia alone, these structures are involved in a wide range of neurological processes and are implicated in numerous disorders affecting human health, including Parkinson’s disease, Huntington’s disease, and substance abuse disorders. To further understand the complexity of the basal ganglia, researchers have historically classified its neurons into various types based on their cytoarchitecture, connectivity, molecular profile, and functional properties. However, recent advancements in high-throughput transcriptomic profiling have revolutionized our ability to systematically categorize these cell types within species, while the maturation of machine learning technologies have enabled the integration of these taxonomies across species.
Our consensus basal ganglia cell type taxonomy is the result of iterative clustering and cross-species integration of transcriptomic data. The taxonomy encompasses neurons from key structures within the basal ganglia, including the caudate (Ca), putamen (Pu), nucleus accumbens (NAc), the external and internal segments of the globus pallidus (GPe, GPi), subthalamic nucleus (STN), and substantia nigra (SN). By combining data from multiple primate and rodent species, we have developed a consensus taxonomy that highlights both conserved and species-specific cell types. We validate our taxonomy through marker gene expression analysis, comparison with previously published taxonomies, and self-projection, ensuring the accuracy and robustness of each level in the taxonomic hierarchy.
In a related tutorial, we also show how to access and use HMBA-BG gene expression data.
You need to be connected to the internet to run this notebook or connected to a cache that has the HMBA-BG data downloaded already.
The notebook presented here shows quick visualizations from precomputed metadata in the atlas. For examples on accessing the expression matrices, specifically selecting genes from expression matrices, see the general_acessing_10x_snRNASeq_tutorial.ipynb
tutorial/example.
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
We will interact with the data using the AbcProjectCache. This cache object downloads data requested by the user, tracks which files have already been downloaded to your local system, and serves the path to the requested data on disk. For metadata, the cache can also directly serve a up a Pandas DataFrame. See the getting_started
notebook for more details on using the cache including installing it if it has not already been.
Change the download_base variable to where you would like to download the data to your system.
download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)
abc_cache.current_manifest
'releases/20250531/manifest.json'
Data overview#
We’ll quickly walk through the data we will be using in this notebook. The HMBA-BG 10X datasets are located across several directories listed in the ABCProjectCache
. In this notebook, we will be looking at the Aligned HMBA-BG 10X dataset. These data combine the there species with an aligned set of ~16k across the species. We will be using data and metadata from the following directories:
HMBA-10xMultiome-BG-Aligned
HMBA-BG-taxonomy-CCN20250428
Note that data for each individual species is available in the directory HMBA-10xMultiome-BG-Aligned
.
Below we list the data and metadata in the HMBA-10xMultiome-BG-Aligned dataset.
print("HMBA-10xMultiome-BG-Aligned: gene expression data (h5ad)\n\t", abc_cache.list_data_files(directory='HMBA-10xMultiome-BG-Aligned'))
print("HMBA-10xMultiome-BG-Aligned: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='HMBA-10xMultiome-BG-Aligned'))
HMBA-10xMultiome-BG-Aligned: gene expression data (h5ad)
['HMBA-10xMultiome-BG-Aligned/log2', 'HMBA-10xMultiome-BG-Aligned/raw']
HMBA-10xMultiome-BG-Aligned: metadata (csv)
['cell_metadata', 'donor', 'example_gene_expression', 'gene', 'library', 'value_sets']
We will also use metadata from the HMBA-BG taxonomy directory. Below is the list of available files:
print("HMBA-BG-taxonomy-CCN20250428: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='HMBA-BG-taxonomy-CCN20250428'))
HMBA-BG-taxonomy-CCN20250428: metadata (csv)
['abbreviation_term', 'cell_2d_embedding_coordinates', 'cell_to_cluster_membership', 'cluster', 'cluster_annotation_term', 'cluster_annotation_term_set', 'cluster_annotation_to_abbreviation_map', 'cluster_to_cluster_annotation_membership']
Cell metadata#
Essential cell metadata is stored as a CSV file that we load as a Pandas DataFrame. Each row represents one cell indexed by a cell label. The cell label is the concatenation of barcode and name of the sample. In this context, the sample is the barcoded cell sample that represents a single load into one port of the 10x Chromium. Note that cell barcodes are only unique within a single barcoded cell sample and that the same barcode can be reused. This metadata file contains cells across all species in the HMBA-BG dataset.
Each cell is associated with a library label, donor label, alignment_job_id, feature_matrix_label and dataset_label identifying which data package this cell is part of. This metadata file will be combined with other metadata files that ship with this package to add information associated with the donor, UMAP coordinates, cluster assignments, and more.
Below, we load the first of the metadata used in this tutorial. This represents the cell metadata for the aligned dataset.
The command we use below both downloads the data if it is not already available in the local cache and loads the data as a Pandas DataFrame. This pattern of loading metadata is repeated throughout the tutorials.
cell = abc_cache.get_metadata_dataframe(
directory='HMBA-10xMultiome-BG-Aligned',
file_name='cell_metadata',
dtype={'cell_label': str}
).set_index('cell_label')
print("Number of cells = ", len(cell))
cell.head()
/Users/chris.morrison/src/miniconda3/envs/abc_atlas_access/lib/python3.12/site-packages/abc_atlas_access/abc_atlas_cache/abc_project_cache.py:429: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
return pd.read_csv(path, **kwargs)
Number of cells = 2035898
cell_barcode | donor_label | barcoded_cell_sample_label | library_label | alignment_job_id | doublet_score | umi_count | feature_matrix_label | dataset_label | abc_sample_id | |
---|---|---|---|---|---|---|---|---|---|---|
cell_label | ||||||||||
AAACAGCCAAATGCCC-2362_A05 | AAACAGCCAAATGCCC | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.027027 | 15259.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 38447b19-a5dc-4eca-9021-56ae191e8809 |
AAACAGCCAATTGAGA-2362_A05 | AAACAGCCAATTGAGA | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.054795 | 20645.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 3dc4dc1f-4b8a-4012-bb07-e3906ad70da0 |
AAACAGCCAGCATGTC-2362_A05 | AAACAGCCAGCATGTC | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.000000 | 2551.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 291d6ebf-ca70-4d9d-8dde-10fa841dba93 |
AAACAGCCATTGACAT-2362_A05 | AAACAGCCATTGACAT | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.000000 | 2341.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 12153419-9b49-4d5f-a10d-ddb7837a729d |
AAACAGCCATTGTGGC-2362_A05 | AAACAGCCATTGTGGC | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.027397 | 8326.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 5716499d-d69d-489e-b4ed-6841b6ba051d |
We can use pandas groupby
function to see how many unique items are associated for each field and list them out if the number of items is small.
def print_column_info(df):
for c in df.columns:
grouped = df[[c]].groupby(c).count()
members = ''
if len(grouped) < 30:
members = str(list(grouped.index))
print("Number of unique %s = %d %s" % (c, len(grouped), members))
print_column_info(cell)
Number of unique cell_barcode = 690037
Number of unique donor_label = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique barcoded_cell_sample_label = 417
Number of unique library_label = 417
Number of unique alignment_job_id = 303
Number of unique doublet_score = 1426
Number of unique umi_count = 86532
Number of unique feature_matrix_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique dataset_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique abc_sample_id = 1898081
Donor and Library metadata#
The first two associated metadata we load are the donor
and library
tables. The donor table contains species, sex, and age information. The library table contains information on 10X methods and brain region of interest the tissue was extracted from.
donor = abc_cache.get_metadata_dataframe(
directory='HMBA-10xMultiome-BG-Aligned',
file_name='donor'
).set_index('donor_label')
donor.head()
donor_species | species_scientific_name | species_common_name | donor_sex | donor_age | donor_age_value | donor_age_unit | |
---|---|---|---|---|---|---|---|
donor_label | |||||||
QM23.50.003 | NCBITaxon:9544 | Macaca mulatta | Macaque | Male | 6 yrs | 6.0 | years |
QM21.26.001 | NCBITaxon:9544 | Macaca mulatta | Macaque | Male | 6 yrs | 6.0 | years |
Q19.26.010 | NCBITaxon:9544 | Macaca mulatta | Macaque | Female | 10 yrs | 10.0 | years |
H24.30.003 | NCBITaxon:9606 | Homo sapiens | Human | Female | 19 yrs | 19.0 | years |
H21.30.004 | NCBITaxon:9606 | Homo sapiens | Human | Male | 57 yrs | 57.0 | years |
library = abc_cache.get_metadata_dataframe(
directory='HMBA-10xMultiome-BG-Aligned',
file_name='library'
).set_index('library_label')
library.head()
library_method | barcoded_cell_sample_label | enrichment_population | cell_specimen_type | parcellation_term_identifier | region_of_interest_name | region_of_interest_label | anatomical_division_label | donor_label | |
---|---|---|---|---|---|---|---|---|---|
library_label | |||||||||
L8XR_240222_21_H03 | 10xMultiome;GEX | 2077_A02 | 50% NeuN+, 35% OLIG2-, 15% OLIG2+ | Nuclei | DHBA:11538 | caudal putamen | PuC | Basal ganglia | QM23.50.003 |
L8XR_220428_02_A05 | 10xMultiome;GEX | 1224_A04 | 10% NeuN+, 49% OLIG2-, 25% OLIG2+, 16% Nurr1+ | Nuclei | DHBA:12261 | ventral tegmental area | VTA | Basal ganglia | QM21.26.001 |
L8XR_220428_02_C04 | 10xMultiome;GEX | 1222_A02 | 60% NeuN+, 27% OLIG2-, 13% OLIG2+ | Nuclei | DHBA:10466 | subthalamic nucleus | STH | Basal ganglia | Q19.26.010 |
L8XR_240705_01_A06 | 10xMultiome;GEX | 2305_E01 | 50% NeuN+, 35% OLIG2-, 15% OLIG2+ | Nuclei | DHBA:11537 | rostral putamen | PuR | Basal ganglia | H24.30.003 |
L8XR_240919_21_B05 | 10xMultiome;GEX | 2453_A02 | 70% NeuN+, 20% OLIG2-, 10% OLIG2+ | Nuclei | DHBA:10345 | Ventral pallidus | VeP | Basal ganglia | H21.30.004 |
We combine these into an extended cell metadata table.
cell_extended = cell.join(donor, on='donor_label')
cell_extended = cell_extended.join(library, on='library_label', rsuffix='_library_table')
Below we use a pandas groupby
to show the number of cells from species.
cell_extended.groupby('species_common_name')[['feature_matrix_label']].count()
feature_matrix_label | |
---|---|
species_common_name | |
Human | 1051428 |
Macaque | 569895 |
Marmoset | 414575 |
We can also use the groupby
function to show the number of cells in each region of interest extracted from the BG.
cell_extended.groupby('region_of_interest_name')[['region_of_interest_label']].count()
region_of_interest_label | |
---|---|
region_of_interest_name | |
Ventral pallidus | 39593 |
body of caudate | 188438 |
brain | 414575 |
caudal putamen | 166294 |
caudate nucleus | 27964 |
core of nucleus accumbens | 19283 |
external segment of globus pallidus | 205322 |
globus pallidus | 4900 |
head of caudate | 132325 |
internal segment of globus pallidus | 134560 |
nucleus accumbens | 157981 |
peri-caudate ependymal and subependymal zone | 10428 |
posteroventral putamen | 74681 |
putamen | 35092 |
rostral putamen | 196452 |
shell of nucleus accumbens | 9525 |
substantia nigra | 40396 |
subthalamic nucleus | 63348 |
tail of caudate | 90271 |
ventral tegmental area | 24470 |
Adding color and feature order#
Each feature in the donor and library table comes is associated with unique colors and an ordering with the set of values. Below we load the value_sets
DataFrame which is a mapping from the various value in the donor and species tables to those colors and orderings. We incorporate these values into the cell metadata table.
value_sets = abc_cache.get_metadata_dataframe(
directory='HMBA-10xMultiome-BG-Aligned',
file_name='value_sets'
).set_index('label')
value_sets.head()
table | field | description | order | external_identifier | parent_label | color_hex_triplet | comment | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
label | ||||||||||||
Female | donor | donor_sex | Female | 1 | NaN | NaN | #565353 | NaN | NaN | NaN | NaN | NaN |
Male | donor | donor_sex | Male | 2 | NaN | NaN | #ADC4C3 | NaN | NaN | NaN | NaN | NaN |
Human | donor | species_common_name | Human | 1 | NaN | NaN | #377eb8 | NaN | NaN | NaN | NaN | NaN |
Macaque | donor | species_common_name | Macaque | 2 | NaN | NaN | #4daf4a | NaN | NaN | NaN | NaN | NaN |
Marmoset | donor | species_common_name | Marmoset | 3 | NaN | NaN | #FF5F5D | NaN | NaN | NaN | NaN | NaN |
We define a convince function to add colors for the various values in the data (e.g. unique region of interest or donor sex values).
def extract_value_set(cell_metadata_df: pd.DataFrame, input_value_set: pd.DataFrame, input_value_set_label: str):
"""Add color and order columns to the cell metadata dataframe based on the input
value set.
Columns are added as {input_value_set_label}_color and {input_value_set_label}_order.
Parameters
----------
cell_metadata_df : pd.DataFrame
DataFrame containing cell metadata.
input_value_set : pd.DataFrame
DataFrame containing the value set information.
input_value_set_label : str
The the column name to extract color and order information for. will be added to the cell metadata.
"""
cell_metadata_df[f'{input_value_set_label}_color'] = input_value_set[
input_value_set['field'] == input_value_set_label
].loc[cell_metadata_df[input_value_set_label]]['color_hex_triplet'].values
cell_metadata_df[f'{input_value_set_label}_order'] = input_value_set[
input_value_set['field'] == input_value_set_label
].loc[cell_metadata_df[input_value_set_label]]['order'].values
Use our function to add the relevant color and order columns to our cell_metadata table.
# Add region of interest color and order
extract_value_set(cell_extended, value_sets, 'region_of_interest_label')
# Add species common name color and order
extract_value_set(cell_extended, value_sets, 'species_common_name')
# Add species scientific name color and order
extract_value_set(cell_extended, value_sets, 'species_scientific_name')
# Add donor sex color and order
extract_value_set(cell_extended, value_sets, 'donor_sex')
cell_extended.head()
cell_barcode | donor_label | barcoded_cell_sample_label | library_label | alignment_job_id | doublet_score | umi_count | feature_matrix_label | dataset_label | abc_sample_id | ... | anatomical_division_label | donor_label_library_table | region_of_interest_label_color | region_of_interest_label_order | species_common_name_color | species_common_name_order | species_scientific_name_color | species_scientific_name_order | donor_sex_color | donor_sex_order | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cell_label | |||||||||||||||||||||
AAACAGCCAAATGCCC-2362_A05 | AAACAGCCAAATGCCC | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.027027 | 15259.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 38447b19-a5dc-4eca-9021-56ae191e8809 | ... | Basal ganglia | H24.30.001 | #53DB33 | 15 | #377eb8 | 1 | #377eb8 | 1 | #ADC4C3 | 2 |
AAACAGCCAATTGAGA-2362_A05 | AAACAGCCAATTGAGA | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.054795 | 20645.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 3dc4dc1f-4b8a-4012-bb07-e3906ad70da0 | ... | Basal ganglia | H24.30.001 | #53DB33 | 15 | #377eb8 | 1 | #377eb8 | 1 | #ADC4C3 | 2 |
AAACAGCCAGCATGTC-2362_A05 | AAACAGCCAGCATGTC | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.000000 | 2551.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 291d6ebf-ca70-4d9d-8dde-10fa841dba93 | ... | Basal ganglia | H24.30.001 | #53DB33 | 15 | #377eb8 | 1 | #377eb8 | 1 | #ADC4C3 | 2 |
AAACAGCCATTGACAT-2362_A05 | AAACAGCCATTGACAT | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.000000 | 2341.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 12153419-9b49-4d5f-a10d-ddb7837a729d | ... | Basal ganglia | H24.30.001 | #53DB33 | 15 | #377eb8 | 1 | #377eb8 | 1 | #ADC4C3 | 2 |
AAACAGCCATTGTGGC-2362_A05 | AAACAGCCATTGTGGC | H24.30.001 | 2362_A05 | L8XR_240808_01_E02 | 8a4bf81821a0f425be8ba9c15dfad6b509020312 | 0.027397 | 8326.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 5716499d-d69d-489e-b4ed-6841b6ba051d | ... | Basal ganglia | H24.30.001 | #53DB33 | 15 | #377eb8 | 1 | #377eb8 | 1 | #ADC4C3 | 2 |
5 rows × 34 columns
UMAP spatial embedding#
Now that we’ve merged our donor and library metadata into the main cells data, our next step is to put these values in a Uniform Manifold Approximation and Projection (UMAP) for cells in the dataset. The UMAP is a dimension reduction technique that can be used for visualizing and exploring large-dimension datasets.
Below we load this 2-D embedding for a sub selection of our cells and merge the x-y coordinates into the extended cell metadata we are creating.
cell_2d_embedding_coordinates = abc_cache.get_metadata_dataframe(
directory='HMBA-BG-taxonomy-CCN20250428',
file_name='cell_2d_embedding_coordinates'
).set_index('cell_label')
cell_2d_embedding_coordinates.head()
x | y | |
---|---|---|
cell_label | ||
AAACAGCCAAATGCCC-2362_A05 | 0.452302 | 2.938630 |
AAACAGCCAATTGAGA-2362_A05 | 0.051495 | 16.282684 |
AAACAGCCAGCATGTC-2362_A05 | -1.233702 | 8.666612 |
AAACAGCCATTGACAT-2362_A05 | 0.517126 | 15.368365 |
AAACAGCCATTGTGGC-2362_A05 | -3.697715 | -1.647361 |
cell_extended = cell_extended.join(cell_2d_embedding_coordinates)
cell_extended = cell_extended.sample(frac=1)
We define a small helper function plot umap to visualize the cells on the UMAP. In the examples below we will plot associated cell information colorized by dissection donor species, sex, and region of interest.
def plot_umap(
xx,
yy,
cc=None,
val=None,
fig_width=8,
fig_height=8,
cmap=None,
labels=None,
term_orders=None,
colorbar=False,
sizes=None
):
"""
Plot a scatter plot of the UMAP coordinates.
Parameters
----------
xx : array-like
x-coordinates of the points to plot.
yy : array-like
y-coordinates of the points to plot.
cc : array-like, optional
colors of the points to plot. If None, the points will be colored by the values in `val`.
val : array-like, optional
values of the points to plot. If None, the points will be colored by the values in `cc`.
fig_width : float, optional
width of the figure in inches. Default is 8.
fig_height : float, optional
height of the figure in inches. Default is 8.
cmap : str, optional
colormap to use for coloring the points. If None, the points will be colored by the values in `cc`.
labels : array-like, optional
labels for the points to plot. If None, no labels will be added to the plot.
term_orders : array-like, optional
order of the labels for the legend. If None, the labels will be ordered by their appearance in `labels`.
colorbar : bool, optional
whether to add a colorbar to the plot. Default is False.
sizes : array-like, optional
sizes of the points to plot. If None, all points will have the same size.
"""
if sizes is None:
sizes = 1
fig, ax = plt.subplots()
fig.set_size_inches(fig_width, fig_height)
if cmap is not None:
scatt = ax.scatter(xx, yy, c=val, s=0.5, marker='.', cmap=cmap, alpha=sizes)
elif cc is not None:
scatt = ax.scatter(xx, yy, c=cc, s=0.5, marker='.', alpha=sizes)
if labels is not None:
from matplotlib.patches import Rectangle
unique_label_colors = (labels + ',' + cc).unique()
unique_labels = np.array([label_color.split(',')[0] for label_color in unique_label_colors])
unique_colors = np.array([label_color.split(',')[1] for label_color in unique_label_colors])
if term_orders is not None:
unique_order = term_orders.unique()
term_order = np.argsort(unique_order)
unique_labels = unique_labels[term_order]
unique_colors = unique_colors[term_order]
rects = []
for color in unique_colors:
rects.append(Rectangle((0, 0), 1, 1, fc=color))
legend = ax.legend(rects, unique_labels, loc=1)
# ax.add_artist(legend)
if colorbar:
fig.colorbar(scatt, ax=ax)
return fig, ax
cell_with_umap = cell_extended[~pd.isna(cell_extended['x'])]
fig, ax = plot_umap(
cell_with_umap['x'],
cell_with_umap['y'],
cc=cell_with_umap['species_common_name_color'],
labels=cell_with_umap['species_common_name'],
term_orders=cell_with_umap['species_common_name_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("species_common_name")
plt.show()

fig, ax = plot_umap(
cell_with_umap['x'],
cell_with_umap['y'],
cc=cell_with_umap['species_scientific_name_color'],
labels=cell_with_umap['species_scientific_name'],
term_orders=cell_with_umap['species_scientific_name_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("species_scientific_name")
plt.show()

fig, ax = plot_umap(
cell_with_umap['x'],
cell_with_umap['y'],
cc=cell_with_umap['donor_sex_color'],
labels=cell_with_umap['donor_sex'],
term_orders=cell_with_umap['donor_sex_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("donor_sex")
plt.show()

Below we show the region of interest for the three species. Note, however, that Marmoset does not currently have fine grained ROIs available and is marked as Br - Brain
.
fig, ax = plot_umap(
cell_with_umap['x'],
cell_with_umap['y'],
cc=cell_with_umap['region_of_interest_label_color'],
labels=cell_with_umap['region_of_interest_label'],
term_orders=cell_with_umap['region_of_interest_label_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("region_of_interest_label")
plt.show()

Taxonomy Information#
The final set of metadata we load into our extended cell metadata file maps the cells into their assigned cluster in the taxonomy. We additionally load metadata regarding the clusters and compute useful information, such as the number of cells in each taxon at each level of the taxonomy.
First, we load the table that describes the levels in the taxonomy from Neighborhood at the highest to Cluster at the lowest level.
cluster = abc_cache.get_metadata_dataframe(
directory='HMBA-BG-taxonomy-CCN20250428',
file_name='cluster',
dtype={'number_of_cells': 'Int64'}
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster.head()
cluster_alias | number_of_cells | |
---|---|---|
cluster_annotation_term_label | ||
CS20250428_CLUST_0161 | Human-143 | 91 |
CS20250428_CLUST_0162 | Human-145 | 1783 |
CS20250428_CLUST_0163 | Human-146 | 172 |
CS20250428_CLUST_0164 | Human-149 | 2649 |
CS20250428_CLUST_0165 | Human-150 | 1359 |
Next, we load information associated with each Cluster in the taxonomy. This includes a useful alias value for each cluster as well as the number of cells in each cluster.
cluster_annotation_term_set = abc_cache.get_metadata_dataframe(
directory='HMBA-BG-taxonomy-CCN20250428',
file_name='cluster_annotation_term_set'
).rename(columns={'label': 'cluster_annotation_term_label'})
cluster_annotation_term_set
cluster_annotation_term_label | name | description | order | |
---|---|---|---|---|
0 | CCN20250428_LEVEL_0 | Neighborhood | Neighborhood | 0 |
1 | CCN20250428_LEVEL_1 | Class | Class | 1 |
2 | CCN20250428_LEVEL_2 | Subclass | Subclass | 2 |
3 | CCN20250428_LEVEL_3 | Group | Group | 3 |
4 | CCN20250428_LEVEL_4 | Cluster | Cluster | 4 |
Finally, for the clusters, we load information on the annotations for each cluster. This also includes the term order and color information which we will use to plot later.
cluster_annotation_term = abc_cache.get_metadata_dataframe(
directory='HMBA-BG-taxonomy-CCN20250428',
file_name='cluster_annotation_term',
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster_annotation_term.head()
name | cluster_annotation_term_set_label | cluster_annotation_term_set_name | color_hex_triplet | term_order | term_set_order | parent_term_label | parent_term_name | parent_term_set_label | |
---|---|---|---|---|---|---|---|---|---|
cluster_annotation_term_label | |||||||||
CS20250428_NEIGH_0001 | Nonneuron | CCN20250428_LEVEL_0 | Neighborhood | #f2ca7d | 1 | 0 | NaN | NaN | NaN |
CS20250428_NEIGH_0000 | Glut Sero Dopa | CCN20250428_LEVEL_0 | Neighborhood | #91f4bb | 2 | 0 | NaN | NaN | NaN |
CS20250428_NEIGH_0002 | Subpallium GABA | CCN20250428_LEVEL_0 | Neighborhood | #19613b | 3 | 0 | NaN | NaN | NaN |
CS20250428_NEIGH_0003 | Subpallium GABA-Glut | CCN20250428_LEVEL_0 | Neighborhood | #7e1d19 | 4 | 0 | NaN | NaN | NaN |
CS20250428_CLASS_0000 | Astro-Epen | CCN20250428_LEVEL_1 | Class | #6ec0da | 1 | 1 | CS20250428_NEIGH_0001 | Nonneuron | CCN20250428_LEVEL_0 |
We merge this table with information from our clusters. Note that not all clusters are associated into the taxonomy hence we use an left join here to use only those clusters with annotations.
cluster_annotation_term_with_cells = cluster_annotation_term.join(
cluster, how='left'
)
While we have information on the number of cells in each cluster, we need to sum these to get the number of cells in each of the upper levels of the taxonomy. Below we iterate through each level from the lowest to highest, summing the number of cells as we go.
for cluster_annotation_term_label in reversed(sorted(cluster_annotation_term_with_cells['cluster_annotation_term_set_label'].unique())):
sub_terms = cluster_annotation_term_with_cells[
cluster_annotation_term_with_cells['cluster_annotation_term_set_label'] == cluster_annotation_term_label
]
parents = sub_terms['parent_term_label'].unique()
if np.any(pd.isna(sub_terms['number_of_cells'])):
print("Warning nan values:", sub_terms[pd.isna(sub_terms['number_of_cells'])].index)
sub_terms
for parent in parents:
if pd.isna(parent):
continue
n_cells = sub_terms[
sub_terms['parent_term_label'] == parent
]['number_of_cells'].sum()
cluster_annotation_term_with_cells.loc[parent, 'number_of_cells'] = n_cells
# show the new number of cells column.
cluster_annotation_term_with_cells[['number_of_cells']].head()
number_of_cells | |
---|---|
cluster_annotation_term_label | |
CS20250428_NEIGH_0001 | 815521 |
CS20250428_NEIGH_0000 | 34315 |
CS20250428_NEIGH_0002 | 1045252 |
CS20250428_NEIGH_0003 | 1045 |
CS20250428_CLASS_0000 | 224824 |
Finally, we load the cluster to cluster annotation membership table to help us create associated metadata for each cluster to tell us the annotations for a given Cluster at all levels of the taxonomy. We’ll use this in groupby
s to give each cluster the metadata for each of its parents. This will be useful when we merge into the cell metadata table.
cluster_to_cluster_annotation_membership = abc_cache.get_metadata_dataframe(
directory='HMBA-BG-taxonomy-CCN20250428',
file_name='cluster_to_cluster_annotation_membership',
).set_index('cluster_annotation_term_label')
membership_with_cluster_info = cluster_to_cluster_annotation_membership.join(cluster_annotation_term_with_cells, rsuffix='_anno_term').reset_index()
membership_groupby = membership_with_cluster_info.groupby(
['cluster_alias', 'cluster_annotation_term_set_name']
)
membership_with_cluster_info.head()
cluster_annotation_term_label | cluster_annotation_term_set_label | cluster_alias | cluster_annotation_term_set_name | cluster_annotation_term_name | name | cluster_annotation_term_set_label_anno_term | cluster_annotation_term_set_name_anno_term | color_hex_triplet | term_order | term_set_order | parent_term_label | parent_term_name | parent_term_set_label | cluster_alias_anno_term | number_of_cells | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CS20250428_CLUST_0161 | CCN20250428_LEVEL_4 | Human-143 | Cluster | Human-143 | Human-143 | CCN20250428_LEVEL_4 | Cluster | #4ac0ed | 0 | 4 | CS20250428_GROUP_0039 | Astrocyte | CCN20250428_LEVEL_3 | Human-143 | 91 |
1 | CS20250428_CLUST_0162 | CCN20250428_LEVEL_4 | Human-145 | Cluster | Human-145 | Human-145 | CCN20250428_LEVEL_4 | Cluster | #8af851 | 1 | 4 | CS20250428_GROUP_0039 | Astrocyte | CCN20250428_LEVEL_3 | Human-145 | 1783 |
2 | CS20250428_CLUST_0163 | CCN20250428_LEVEL_4 | Human-146 | Cluster | Human-146 | Human-146 | CCN20250428_LEVEL_4 | Cluster | #d1dd68 | 2 | 4 | CS20250428_GROUP_0039 | Astrocyte | CCN20250428_LEVEL_3 | Human-146 | 172 |
3 | CS20250428_CLUST_0164 | CCN20250428_LEVEL_4 | Human-149 | Cluster | Human-149 | Human-149 | CCN20250428_LEVEL_4 | Cluster | #95daf6 | 3 | 4 | CS20250428_GROUP_0039 | Astrocyte | CCN20250428_LEVEL_3 | Human-149 | 2649 |
4 | CS20250428_CLUST_0165 | CCN20250428_LEVEL_4 | Human-150 | Cluster | Human-150 | Human-150 | CCN20250428_LEVEL_4 | Cluster | #26827e | 4 | 4 | CS20250428_GROUP_0039 | Astrocyte | CCN20250428_LEVEL_3 | Human-150 | 1359 |
From the membership table, we create three tables via a groupby
. First the name of each cluster and its parents.
# term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[cluster_annotation_term_set['name']] # order columns
cluster_details.fillna('Other', inplace=True)
cluster_details.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_details.head()
cluster_annotation_term_set_name | Neighborhood | Class | Subclass | Group | Cluster |
---|---|---|---|---|---|
cluster_alias | |||||
Human-128 | Glut Sero Dopa | F M Glut | F Glut | BF SKOR1 Glut | Human-128 |
Human-129 | Glut Sero Dopa | F M Glut | F Glut | BF SKOR1 Glut | Human-129 |
Human-130 | Glut Sero Dopa | F M Glut | F Glut | BF SKOR1 Glut | Human-130 |
Human-423 | Glut Sero Dopa | F M Glut | F Glut | BF SKOR1 Glut | Human-423 |
Human-426 | Glut Sero Dopa | F M Glut | F Glut | BF SKOR1 Glut | Human-426 |
Next the plotting order of each of the clusters and their parents.
cluster_order = membership_groupby['term_order'].first().unstack()
cluster_order.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_order.rename(
columns={'Neighborhood': 'Neighborhood_order',
'Class': 'Class_order',
'Subclass': 'Subclass_order',
'Group': 'Group_order',
'Cluster': 'Cluster_order'},
inplace=True
)
cluster_order.head()
cluster_annotation_term_set_name | Class_order | Cluster_order | Group_order | Neighborhood_order | Subclass_order |
---|---|---|---|---|---|
cluster_alias | |||||
Human-143 | 1 | 0 | 1 | 1 | 1 |
Human-145 | 1 | 1 | 1 | 1 | 1 |
Human-146 | 1 | 2 | 1 | 1 | 1 |
Human-149 | 1 | 3 | 1 | 1 | 1 |
Human-150 | 1 | 4 | 1 | 1 | 1 |
Finally, the colors we will use to plot for each of the unique taxons at all levels.
cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[cluster_annotation_term_set['name']]
cluster_colors.sort_values(['Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster'], inplace=True)
cluster_colors.head()
cluster_annotation_term_set_name | Neighborhood | Class | Subclass | Group | Cluster |
---|---|---|---|---|---|
cluster_alias | |||||
Marmoset-323 | #19613b | #26e8bb | #1bc06a | #fc2b80 | #00d86e |
Marmoset-307 | #19613b | #26e8bb | #1bc06a | #fc2b80 | #0454b2 |
Marmoset-325 | #19613b | #26e8bb | #1bc06a | #fc2b80 | #094a6f |
Human-346 | #19613b | #26e8bb | #1bc06a | #fc2b80 | #0f6331 |
Marmoset-301 | #19613b | #26e8bb | #1bc06a | #fc2b80 | #1140be |
Next, we bring it all together by loading the mapping of cells to cluster and join into our final metadata table. Note here that not every cell is currently associated into the taxonomy hence the NaN values for many of the taxonomy information columns.
cell_to_cluster_membership = abc_cache.get_metadata_dataframe(
directory='HMBA-BG-taxonomy-CCN20250428',
file_name='cell_to_cluster_membership',
).set_index('cell_label')
cell_to_cluster_membership.head()
cluster_alias | cluster_label | |
---|---|---|
cell_label | ||
AAACAGCCAAATGCCC-2362_A05 | Human-451 | CS20250428_CLUST_0268 |
AAACAGCCAATTGAGA-2362_A05 | Human-1 | CS20250428_CLUST_0227 |
AAACAGCCAGCATGTC-2362_A05 | Human-153 | CS20250428_CLUST_0215 |
AAACAGCCATTGACAT-2362_A05 | Human-1 | CS20250428_CLUST_0227 |
AAACAGCCATTGTGGC-2362_A05 | Human-14 | CS20250428_CLUST_0249 |
cell_extended_with_tax = cell_extended.join(cell_to_cluster_membership, rsuffix='_cell_to_cluster_membership')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_details, on='cluster_alias')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended_with_tax = cell_extended_with_tax.join(cluster_order, on='cluster_alias')
cell_extended_with_tax.head()
cell_barcode | donor_label | barcoded_cell_sample_label | library_label | alignment_job_id | doublet_score | umi_count | feature_matrix_label | dataset_label | abc_sample_id | ... | Neighborhood_color | Class_color | Subclass_color | Group_color | Cluster_color | Class_order | Cluster_order | Group_order | Neighborhood_order | Subclass_order | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cell_label | |||||||||||||||||||||
GTCAGGCTCAGGCCTA-2305_C01 | GTCAGGCTCAGGCCTA | H24.30.003 | 2305_C01 | L8XR_240705_01_H06 | 93c6ec45bd2a1ca18516918a7f1a448e84006372 | 0.110000 | 23061.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | d59b11a7-a50d-472e-bc89-9fb5358044fa | ... | #19613b | #d0b83c | #1655f2 | #339933 | #dafd5a | 10.0 | 1078.0 | 51.0 | 3.0 | 31.0 |
TATTACCTCAAAGGCA-2295_D02 | TATTACCTCAAAGGCA | H24.30.003 | 2295_D02 | L8XR_240705_01_G05 | c9b036ef4c6ddb8545718e5ea3747fc20c51ebfb | 0.026667 | 12546.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | 8e66d017-0e3a-462b-bb7a-fb4dc8f297ec | ... | #19613b | #d0b83c | #253c8c | #aec7e8 | #c33059 | 10.0 | 1176.0 | 53.0 | 3.0 | 32.0 |
TAACCAGGTTGGCGTG-2019_C02 | TAACCAGGTTGGCGTG | QM23.50.002 | 2019_C02 | L8XR_240118_21_C03 | 8268a0d34c53fa6bdd8887d7a6b9b612732c8777 | 0.000000 | 4430.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | e4f0f167-cce3-4797-8e95-9abf424ae81e | ... | #f2ca7d | #6ec0da | #604f15 | #e16c95 | #83d8d6 | 1.0 | 26.0 | 1.0 | 1.0 | 1.0 |
GTGTGTTAGCGCATTG-1750_B01 | GTGTGTTAGCGCATTG | QM23.50.001 | 1750_B01 | L8XR_230629_22_H11 | 50c29f65910dc6faa1990f1ff9601c25bb909822 | 0.000000 | 9746.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | f3ce9856-7e08-451b-afce-94f3923224d5 | ... | #19613b | #ce4c27 | #eea495 | #708297 | #ef6b52 | 7.0 | 549.0 | 29.0 | 3.0 | 19.0 |
CTCCGGACATAATCGT-1959_C02 | CTCCGGACATAATCGT | H23.30.001 | 1959_C02 | L8XR_231116_02_B09 | d7924909ac635d27a5aaf99c11ea0373b9c3bb07 | 0.010753 | 10972.0 | HMBA-10xMultiome-BG-Aligned | HMBA-10xMultiome-BG-Aligned | efc4e716-8a28-4bdb-8ca1-771bdcc10c17 | ... | #f2ca7d | #6ec0da | #604f15 | #e16c95 | #238c7a | 1.0 | 12.0 | 1.0 | 1.0 | 1.0 |
5 rows × 53 columns
print_column_info(cell_extended)
Number of unique cell_barcode = 690037
Number of unique donor_label = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique barcoded_cell_sample_label = 417
Number of unique library_label = 417
Number of unique alignment_job_id = 303
Number of unique doublet_score = 1426
Number of unique umi_count = 86532
Number of unique feature_matrix_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique dataset_label = 1 ['HMBA-10xMultiome-BG-Aligned']
Number of unique abc_sample_id = 1898081
Number of unique donor_species = 4 ['NCBITaxon:9483', 'NCBITaxon:9544', 'NCBITaxon:9545', 'NCBITaxon:9606']
Number of unique species_scientific_name = 4 ['Callithrix jacchus', 'Homo sapiens', 'Macaca mulatta', 'Macaca nemestrina']
Number of unique species_common_name = 3 ['Human', 'Macaque', 'Marmoset']
Number of unique donor_sex = 2 ['Female', 'Male']
Number of unique donor_age = 20 ['10 yrs', '11 yrs', '14 yrs', '18 yrs', '19 yrs', '2.0 yrs', '25 yrs', '38 yrs', '4.0 yrs', '44 yrs', '5 yrs', '5.4 yrs', '50 yrs', '57 yrs', '58 yrs', '6 yrs', '6.6 yrs', '60 yrs', '61 yrs', '7 yrs']
Number of unique donor_age_value = 20 [1.95890411, 3.956284153, 5.0, 5.391780822, 6.0, 6.555924845, 7.0, 10.0, 11.0, 14.0, 18.0, 19.0, 25.0, 38.0, 44.0, 50.0, 57.0, 58.0, 60.0, 61.0]
Number of unique donor_age_unit = 1 ['years']
Number of unique library_method = 1 ['10xMultiome;GEX']
Number of unique barcoded_cell_sample_label_library_table = 417
Number of unique enrichment_population = 83
Number of unique cell_specimen_type = 1 ['Nuclei']
Number of unique parcellation_term_identifier = 20 ['DHBA:10155', 'DHBA:10334', 'DHBA:10335', 'DHBA:10336', 'DHBA:10337', 'DHBA:10338', 'DHBA:10339', 'DHBA:10340', 'DHBA:10341', 'DHBA:10342', 'DHBA:10343', 'DHBA:10344', 'DHBA:10345', 'DHBA:10466', 'DHBA:11537', 'DHBA:11538', 'DHBA:12251', 'DHBA:12261', 'DHBA:146034742', 'DHBA:146034754']
Number of unique region_of_interest_name = 20 ['Ventral pallidus', 'body of caudate', 'brain', 'caudal putamen', 'caudate nucleus', 'core of nucleus accumbens', 'external segment of globus pallidus', 'globus pallidus', 'head of caudate', 'internal segment of globus pallidus', 'nucleus accumbens', 'peri-caudate ependymal and subependymal zone', 'posteroventral putamen', 'putamen', 'rostral putamen', 'shell of nucleus accumbens', 'substantia nigra', 'subthalamic nucleus', 'tail of caudate', 'ventral tegmental area']
Number of unique region_of_interest_label = 20 ['Br', 'Ca', 'CaB', 'CaH', 'CaT', 'Eca', 'GP', 'GPe', 'GPi', 'NAC', 'NACc', 'NACs', 'Pu', 'PuC', 'PuPV', 'PuR', 'SN', 'STH', 'VTA', 'VeP']
Number of unique anatomical_division_label = 2 ['Basal ganglia', 'Brain']
Number of unique donor_label_library_table = 22 ['CJ23.56.002', 'CJ23.56.003', 'CJ24.56.001', 'CJ24.56.004', 'H18.30.001', 'H19.30.004', 'H20.30.001', 'H20.30.002', 'H21.30.004', 'H23.30.001', 'H24.30.001', 'H24.30.003', 'H24.30.004', 'H24.30.007', 'Q19.26.010', 'Q21.26.002', 'Q21.26.010', 'QM21.26.001', 'QM21.26.003', 'QM23.50.001', 'QM23.50.002', 'QM23.50.003']
Number of unique region_of_interest_label_color = 20 ['#0010d9', '#0060ff', '#00bfa0', '#00bfff', '#53DB33', '#7AE361', '#89522A', '#907991', '#9BA2FF', '#9b19f5', '#A7DB33', '#B8003A', '#C29515', '#E0E0E0', '#E60049', '#c941a7', '#e6d800', '#fe0ccf', '#ff6b9a', '#ffa300']
Number of unique region_of_interest_label_order = 20 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Number of unique species_common_name_color = 3 ['#377eb8', '#4daf4a', '#FF5F5D']
Number of unique species_common_name_order = 3 [1, 2, 3]
Number of unique species_scientific_name_color = 4 ['#377eb8', '#4daf4a', '#FF5F5D', '#b2df8a']
Number of unique species_scientific_name_order = 4 [1, 2, 3, 4]
Number of unique donor_sex_color = 2 ['#565353', '#ADC4C3']
Number of unique donor_sex_order = 2 [1, 2]
Number of unique x = 1844471
Number of unique y = 1820193
Plotting the taxonomy#
Now that we have our cells with associated taxonomy information, we’ll plot them into the UMAP we showed previously. First, we trim down the cells to those with taxonomy information by selecting those that where the taxonomy information is not NaN.
cell_with_taxonomy = cell_extended_with_tax[~pd.isna(cell_extended_with_tax['Class_color'])]
Below we plot the taxonomy mapping of the cells for each level in the taxonomy.
fig, ax = plot_umap(
cell_with_taxonomy['x'],
cell_with_taxonomy['y'],
cc=cell_with_taxonomy['Neighborhood_color'],
labels=cell_with_taxonomy['Neighborhood'],
term_orders=cell_with_taxonomy['Neighborhood_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("Neighborhood")
plt.show()

fig, ax = plot_umap(
cell_with_taxonomy['x'],
cell_with_taxonomy['y'],
cc=cell_with_taxonomy['Class_color'],
labels=cell_with_taxonomy['Class'],
term_orders=cell_with_taxonomy['Class_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("Class")
plt.show()

fig, ax = plot_umap(
cell_with_taxonomy['x'],
cell_with_taxonomy['y'],
cc=cell_with_taxonomy['Subclass_color'],
labels=cell_with_taxonomy['Subclass'],
term_orders=cell_with_taxonomy['Subclass_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("Subclass")
plt.show()

fig, ax = plot_umap(
cell_with_taxonomy['x'],
cell_with_taxonomy['y'],
cc=cell_with_taxonomy['Group_color'],
labels=cell_with_taxonomy['Group'],
term_orders=cell_with_taxonomy['Group_order'],
fig_width=16,
fig_height=16
)
res = ax.set_title("Group")
plt.show()

fig, ax = plot_umap(
cell_with_taxonomy['x'],
cell_with_taxonomy['y'],
cc=cell_with_taxonomy['Cluster_color'],
fig_width=12,
fig_height=12
)
res = ax.set_title("Cluster")
plt.show()

Aggregating cluster and cells counts per term#
Let’s investigate the taxonomy information a bit more. In this section, we’ll create bar plots showing the number of clusters and cells at each level in the taxonomy.
First, we need to compute the number of clusters that are in each of the cell type taxons above it.
# Count the number of clusters associated with each cluster annotation term
term_cluster_count = membership_with_cluster_info.reset_index().groupby(['cluster_annotation_term_label'])[['cluster_alias']].count()
term_cluster_count.columns = ['number_of_clusters']
term_cluster_count.head()
number_of_clusters | |
---|---|
cluster_annotation_term_label | |
CS20250428_CLASS_0000 | 86 |
CS20250428_CLASS_0001 | 93 |
CS20250428_CLASS_0002 | 12 |
CS20250428_CLASS_0003 | 386 |
CS20250428_CLASS_0005 | 199 |
We already have our number of cells computed previously, so we’ll just use them here and cluster counts into the cluster_annotation table.
# Join counts with the term dataframe
term_with_counts = cluster_annotation_term_with_cells.join(term_cluster_count)
term_with_counts[['name', 'cluster_annotation_term_set_name', 'number_of_clusters', 'number_of_cells']].head()
name | cluster_annotation_term_set_name | number_of_clusters | number_of_cells | |
---|---|---|---|---|
cluster_annotation_term_label | ||||
CS20250428_NEIGH_0001 | Nonneuron | Neighborhood | 295 | 815521 |
CS20250428_NEIGH_0000 | Glut Sero Dopa | Neighborhood | 171 | 34315 |
CS20250428_NEIGH_0002 | Subpallium GABA | Neighborhood | 957 | 1045252 |
CS20250428_NEIGH_0003 | Subpallium GABA-Glut | Neighborhood | 12 | 1045 |
CS20250428_CLASS_0000 | Astro-Epen | Class | 86 | 224824 |
Below we create a function to plot the cluster and cell counts in a bar graph, coloring by the associated taxon level.
def bar_plot_by_level_and_type(df, level, fig_width = 8.5, fig_height = 4):
fig, ax = plt.subplots(1, 2)
fig.set_size_inches(fig_width, fig_height)
for idx, ctype in enumerate(['clusters', 'cells']):
pred = (df['cluster_annotation_term_set_name'] == level)
sort_order = np.argsort(df[pred]['term_order'])
names = df[pred]['name'].iloc[sort_order]
counts = df[pred]['number_of_%s' % ctype].iloc[sort_order]
colors = df[pred]['color_hex_triplet'].iloc[sort_order]
ax[idx].barh(names, counts, color=colors)
ax[idx].set_title('Number of %s by %s' % (ctype,level)),
ax[idx].set_xscale('log')
if idx > 0:
ax[idx].set_yticklabels([])
plt.show()
Now let’s plot the counts for each of the taxonomy levels above Cluster.
bar_plot_by_level_and_type(term_with_counts, 'Neighborhood')
plt.show()

bar_plot_by_level_and_type(term_with_counts, 'Class')
plt.show()

bar_plot_by_level_and_type(term_with_counts, 'Subclass', 8.5, 6)
plt.show()

bar_plot_by_level_and_type(term_with_counts, 'Group', 8.5, 10)
plt.show()

Distribution of lower taxonomy levels in their parents.#
We can also use a similar visualization to show a given taxonomy level in another. Below we define functions to manipulate our data and plot a stacked bar plot.
def distribution(A, B):
cluster_order_details = cluster_details.join(cluster_order)
AxB = cluster_details.groupby([A, B])[['Cluster']].count()
AxB.columns = ['number_of_clusters']
AxB = AxB.unstack().fillna(0)
AxB = AxB.loc[
cluster_order_details[A].unique()[
np.argsort(cluster_order_details[f'{A}_order'].unique())
]
]
AxB = AxB[[
('number_of_clusters', name)
for name in cluster_order_details[B].unique()[
np.argsort(cluster_order_details[f'{B}_order'].unique())
]
]]
B_names = [x[1] for x in list(AxB.columns)]
pred = (cluster_annotation_term_with_cells['cluster_annotation_term_set_name'] == B)
term_by_name = cluster_annotation_term_with_cells[pred].set_index('name')
B_colors = term_by_name.loc[B_names, 'color_hex_triplet']
return AxB, B_names, B_colors
def stacked_bar_distribution(AxB, B_names, B_colors, fig_width = 6, fig_height = 6):
fig, ax = plt.subplots()
fig.set_size_inches(fig_width, fig_height)
bottom = np.zeros(len(AxB))
for i, col in enumerate(AxB.columns):
ax.barh(AxB.index[::-1], AxB[col][::-1], left=bottom, label=col[1], color=B_colors[i])
bottom += np.array(AxB[col][::-1])
ax.set_title('Distribution of %s in each %s' % (AxB.columns.names[1], AxB.index.name))
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
return fig, ax
We can now visualize how each lower level in the taxonomy is distributed by cluster in the upper portions of the taxonomy.
AxB, B_names, B_colors = distribution('Class', 'Subclass')
fig, ax = stacked_bar_distribution(AxB, B_names, B_colors, 6, 8)
plt.show()
/var/folders/kc/7glrmt5n67x16yj_tg86t49c0000gp/T/ipykernel_24913/3364940061.py:9: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
ax.barh(AxB.index[::-1], AxB[col][::-1], left=bottom, label=col[1], color=B_colors[i])

Visualizing the BG taxonomy#
Term sets: Class, Subclass, Group and Cluster forms a four level Basil Ganglia taxonomy. We can visualize the taxonomy as a sunburst diagram that shows the single inheritance hierarchy through a series of rings, that are sliced for each annotation term. Each ring corresponds to a level in the hierarchy. We have ordered the rings so that the Class level is the outer most ring so that we can add in labels. Rings are sliced up and divided based on their hierarchical relationship to the parent slice. The angle of each slice is proportional to the number of clusters belonging to the term. Note that we exclude Neighborhood here as it is a much less interesting plot.
levels = ['Class', 'Subclass', 'Group', 'Cluster']
df = {}
# Copy the term order of the parent into each of the level below it.
term_with_counts['parent_order'] = ""
for idx, row in term_with_counts.iterrows():
if pd.isna(row['parent_term_label']):
continue
term_with_counts.loc[idx, 'parent_order'] = term_with_counts.loc[row['parent_term_label']]['term_order']
term_with_counts = term_with_counts.reset_index()
for lvl in levels :
pred = term_with_counts['cluster_annotation_term_set_name'] == lvl
df[lvl] = term_with_counts[pred]
df[lvl] = df[lvl].sort_values(['parent_order', 'term_order'])
fig, ax = plt.subplots()
fig.set_size_inches(10, 10)
size = 0.15
for i, lvl in enumerate(levels):
if lvl == 'Class':
ax.pie(df[lvl]['number_of_clusters'],
colors=df[lvl]['color_hex_triplet'],
labels = df[lvl]['name'],
rotatelabels=True,
labeldistance=1.025,
radius=1,
wedgeprops=dict(width=size, edgecolor=None),
startangle=0)
else :
ax.pie(df[lvl]['number_of_clusters'],
colors=df[lvl]['color_hex_triplet'],
radius=1-i*size,
wedgeprops=dict(width=size, edgecolor=None),
startangle=0)
plt.show()

In the next tutorial, we show how to access and use HMBA-BG gene expression data.