ASAP Human Postmortem-Derived Brain Sequencing Collection (PMDBS): Mapping to the Whole Human Brain Taxonomy

ASAP Human Postmortem-Derived Brain Sequencing Collection (PMDBS): Mapping to the Whole Human Brain Taxonomy#

In the previous notebook, we provided an overview of the ASAP PMDBS dataset and associated cell, sample and donor metadata. In this notebook we demonstrated the use of Allen Institute’s MapMyCells tool to provide cell type insight to the PMDBS dataset.

MapMyCells(RRID: SCR_024672) transforms cell types from a concept in publications to a tool for public research. Scientists worldwide can discover what cell types their transcriptomics and spatial data corresponds with by comparing their data to massive, high-quality reference datasets using multiple mapping methods. Currently available taxonomies include 10X whole mouse brain taxonomy, 10X whole human brain taxonomy and 10X Human MTG SEA-AD taxonomy. Available algorithms include correlation mapping (flatmap), hierarchical correlation mapping and a deep generative model-based algorithm.

In this notebook, we will map the approximately 3 million cells of PMDBS to the 10X whole human brain taxonomy (Siletti et. al.) using the hierarchical correlation mapping method where assignments are first made at the top level of the taxonomy using mean gene expression profiles of the reference dataset and pre-selected set of marker genes to distinguish cluster from its sibling. For robustness, the process is repeated 100 times, using 50% of the marker genes each time. A cell is assigned the class that receives the plurality of votes, a probability and correlation score. This process is repeated down the taxonomy until the last finest level. In this notebook we walkthrough the output of the MapMyCells results.

You need to be connected to the internet to run this notebook and should have run through the getting started notebook.

import anndata
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a Pandas DataFrame. See the getting started notebook notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you have downloaded the data in your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20250331/manifest.json'

Data Overview#

ASAP Data#

We simplify the ASAP PMDB metadata into a single cell_metadata, csv file and there companion tables describing the donor (including sex, age, race, disease state etc.), sample (including data processing metadata, team information, and dissection or region of interest the sample is from), and value_sets which provides a mapping from unique term name (cell type, region of interest etc.) to a plotting color and order.

cell_metadata = abc_cache.get_metadata_dataframe(
    directory='ASAP-PMDBS-10X',
    file_name='cell_metadata'
).set_index('cell_label')
donor = abc_cache.get_metadata_dataframe(
    directory='ASAP-PMDBS-10X',
    file_name='donor'
).set_index('donor_label')
sample = abc_cache.get_metadata_dataframe(
    directory='ASAP-PMDBS-10X',
    file_name='sample'
).set_index('sample_label')
value_sets = abc_cache.get_metadata_dataframe(
    directory='ASAP-PMDBS-10X',
    file_name='value_sets'
).set_index('label')

As with our previous tutorial, we’ll first use a Pandas join to merge the sample data into the cell metadata and then merge the donor information. We’ll use the value_sets as a lookup.

cell_metadata = cell_metadata.join(sample, on='sample_label')
cell_metadata = cell_metadata.join(donor, on='donor_label')
cell_metadata.head()

	cell_barcode	sample_label	x	y	cluster_label	cluster_label_order	cluster_label_color	dataset_label	feature_matrix_label	abc_sample_id	...	cerad_score_color	cognitive_status	cognitive_status_order	cognitive_status_color	lewy_body_disease_pathology	lewy_body_disease_pathology_order	lewy_body_disease_pathology_color	thal_phase	thal_phase_order	thal_phase_color
cell_label
AAACCCAAGAAACCAT-1_ASAP_PMBDS_000060_s002_Rep1	AAACCCAAGAAACCAT-1	ASAP_PMBDS_000060_s002	0.016373	1.403025	cluster_000	1	#171c97	ASAP-PMDBS-10X	ASAP-PMDBS-10X	cc633260-614c-4e5c-aa18-d3e3a0ef81bb	...	#4daf4a	Unknown	4	#e8e8e8	Absent	1	#ffffd9	Thal 3	4	#df65b0
AAACCCAAGAAACTCA-1_ASAP_PMBDS_000088_s001_Rep1	AAACCCAAGAAACTCA-1	ASAP_PMBDS_000088_s001	10.029368	1.342043	cluster_013	14	#aa5ed0	ASAP-PMDBS-10X	ASAP-PMDBS-10X	4b4f4355-6463-4bf2-9544-3f6d493de741	...	#4daf4a	Unknown	4	#e8e8e8	Absent	1	#ffffd9	Thal 2	3	#c994c7
AAACCCAAGAAAGCGA-1_ASAP_PMBDS_000177_s001_rep1	AAACCCAAGAAAGCGA-1	ASAP_PMBDS_000177_s001	-9.057206	-1.032092	cluster_006	7	#f591a5	ASAP-PMDBS-10X	ASAP-PMDBS-10X	0ea675f3-4b39-4c16-9518-66ddd9c409a1	...	#e41a1c	Mild Cognitive Impairment	2	#377eb8	Brainstem/Limbic	5	#41b6c4	Thal 3	4	#df65b0
AAACCCAAGAAAGCGA-1_ASAP_PMBDS_000185_s001_rep1	AAACCCAAGAAAGCGA-1	ASAP_PMBDS_000185_s001	-0.685645	-4.943793	cluster_001	2	#e4a9ba	ASAP-PMDBS-10X	ASAP-PMDBS-10X	0eb0506b-e76b-4f9a-a3b6-bc6bbccfdd32	...	#e41a1c	Mild Cognitive Impairment	2	#377eb8	Brainstem predominant	4	#7fcdbb	Thal 3	4	#df65b0
AAACCCAAGAAATCCA-1_ASAP_PMBDS_000122_s002_1	AAACCCAAGAAATCCA-1	ASAP_PMBDS_000122_s002	-3.507261	13.855699	cluster_002	3	#a02226	ASAP-PMDBS-10X	ASAP-PMDBS-10X	1f67fc2a-7e81-4448-8437-e2c6ef8b4b1b	...	#e8e8e8	Unknown	4	#e8e8e8	Unknown	10	#e8e8e8	Unknown	8	#e8e8e8

5 rows × 50 columns

Loading MapMyCells Results#

We load the MapMyCells hierarchical correlation mapping to 10X whole human brain taxonomy results. These are calculated for each cell and can be merged into our cell_metadata table on cell_label index. For more information on MapMyCells see here.

siletti_mmc_results = abc_cache.get_metadata_dataframe(
    directory='ASAP-PMDBS-taxonomy',
    file_name='mmc_results_siletti_whb',
    comment='#'
).set_index('cell_label')
siletti_mmc_results.head()

mmc_results_siletti_whb.csv: 100%|███████████████████████████████████████████████████████████| 1.11G/1.11G [03:35<00:00, 5.17MMB/s]

	supercluster_label	supercluster_name	supercluster_bootstrapping_probability	supercluster_correlation_coefficient	supercluster_aggregate_probability	cluster_label	cluster_name	cluster_bootstrapping_probability	cluster_correlation_coefficient	cluster_aggregate_probability	subcluster_label	subcluster_name	cluster_alias	subcluster_bootstrapping_probability	subcluster_correlation_coefficient	subcluster_aggregate_probability
cell_label
AGCGATTCAGAATGTA-1_ASAP_PMBDS_000105_s001_1	CS202210140_467	Oligodendrocyte precursor	1.00	0.8196	1.00	CS202210140_34	OPC_33	1.00	0.5419	1.00	CS202210140_3538	OPC_33_3044	3044	0.75	0.3870	0.7500
GGTGAAGCAGACAATA-1_ASAP_PMBDS_000105_s001_1	CS202210140_469	Oligodendrocyte	1.00	0.8066	1.00	CS202210140_50	Oligo_49	0.97	0.6461	0.97	CS202210140_3494	Oligo_49_3000	3000	0.60	0.3379	0.5820
ATTTACCGTTCTAACG-1_ASAP_PMBDS_000105_s001_1	CS202210140_469	Oligodendrocyte	1.00	0.8252	1.00	CS202210140_45	Oligo_44	0.97	0.6821	0.97	CS202210140_3496	Oligo_44_3002	3002	1.00	0.5014	0.9700
CTCAAGACACGCTTAA-1_ASAP_PMBDS_000105_s001_1	CS202210140_470	Astrocyte	1.00	0.7359	1.00	CS202210140_63	Astro_62	0.94	0.5680	0.94	CS202210140_3618	Astro_62_3124	3124	0.94	0.3598	0.8836
TCGCTTGGTGCGACAA-1_ASAP_PMBDS_000105_s001_1	CS202210140_467	Oligodendrocyte precursor	0.94	0.7004	0.94	CS202210140_34	OPC_33	1.00	0.3904	0.94	CS202210140_3539	OPC_33_3045	3045	0.54	0.2774	0.5076

10X Whole Human Brain Taxonomy (Siletti et al.)#

The 10X Whole Human Brain (WHB) Taxonomy is a three-level, cell types hierarchical tree (31 superclusters, 461 clusters and 3313 subclusters) derived from a > 3 million cell reference dataset derived from samples across the whole human brain. Additionally, categorical neurotransmitter type annotations were assigned to each cluster based on expression (or lack thereof) of one or more neurotransmitter-associated marker genes. For more information on these data see here. Here we load information about the taxonomy (terms, colors, hierarchy, levels and order) from the  abc_atlas_access cache object.

membership = abc_cache.get_metadata_dataframe(
    directory='WHB-taxonomy',
    file_name='cluster_to_cluster_annotation_membership'
)
membership_groupby = membership.groupby(['cluster_alias', 'cluster_annotation_term_set_name'])
membership.head(5)

cluster_to_cluster_annotation_membership.csv: 100%|██████████████████████████████████████████| 1.02M/1.02M [00:00<00:00, 3.29MMB/s]

	cluster_annotation_term_label	cluster_annotation_term_set_label	cluster_alias	cluster_annotation_term_name	cluster_annotation_term_set_name	number_of_cells	color_hex_triplet
0	CS202210140_494	CCN202210140_SUBC	0	URL_297_0	subcluster	34	#7E807A
1	CS202210140_495	CCN202210140_SUBC	1	URL_308_1	subcluster	220	#C54945
2	CS202210140_496	CCN202210140_SUBC	2	URL_308_2	subcluster	187	#5232B7
3	CS202210140_497	CCN202210140_SUBC	3	URL_308_3	subcluster	246	#31BEBA
4	CS202210140_498	CCN202210140_SUBC	4	URL_308_4	subcluster	188	#C8A9BC

term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[term_sets['name']] # order columns
cluster_details.fillna('Other', inplace=True)
cluster_details.sort_values(['supercluster', 'cluster', 'subcluster'], inplace=True)
cluster_details.head()

cluster_annotation_term_set.csv: 100%|███████████████████████████████████████████████████████| 1.33k/1.33k [00:00<00:00, 14.9kMB/s]

cluster_annotation_term_set_name	subcluster	cluster	supercluster	neurotransmitter
cluster_alias
2461	Amex_153_2461	Amex_153	Amygdala excitatory	VGLUT1 VGLUT2
2462	Amex_153_2462	Amex_153	Amygdala excitatory	VGLUT1 VGLUT2
2463	Amex_153_2463	Amex_153	Amygdala excitatory	VGLUT1 VGLUT2
2464	Amex_153_2464	Amex_153	Amygdala excitatory	VGLUT1 VGLUT2
2465	Amex_153_2465	Amex_153	Amygdala excitatory	VGLUT1 VGLUT2

cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[term_sets['name']]
cluster_colors.sort_values(['supercluster', 'cluster', 'subcluster'], inplace=True)
cluster_colors.head()

cluster_annotation_term_set_name	subcluster	cluster	supercluster	neurotransmitter
cluster_alias
2218	#374B8A	#062463	#003380	#2BDFD1
2216	#4BCBC6	#062463	#003380	#2BDFD1
2217	#83C943	#062463	#003380	#2BDFD1
2219	#BC4440	#062463	#003380	#2BDFD1
2220	#CB7ABA	#062463	#003380	#2BDFD1

cluster_annotation_term = abc_cache.get_metadata_dataframe('WHB-taxonomy', 'cluster_annotation_term')
cluster_annotation_term['name'] = cluster_annotation_term['name'].fillna('Other')
cluster_annotation_term.set_index('name', inplace=True)
cluster_annotation_term.head()

cluster_annotation_term.csv: 100%|█████████████████████████████████████████████████████████████| 547k/547k [00:00<00:00, 1.94MMB/s]

	label	cluster_annotation_term_set_label	parent_term_label	parent_term_set_label	term_set_order	term_order	cluster_annotation_term_set_name	color_hex_triplet	number_of_cells	description
name
Upper-layer intratelencephalic	CS202210140_476	CCN202210140_SUPC	NaN	NaN	0	0	supercluster	#FEA7BA	455006	Upper-layer intratelencephalic
Deep-layer intratelencephalic	CS202210140_477	CCN202210140_SUPC	NaN	NaN	0	1	supercluster	#426600	228467	Deep-layer intratelencephalic
Deep-layer near-projecting	CS202210140_473	CCN202210140_SUPC	NaN	NaN	0	2	supercluster	#EE1010	18856	Deep-layer near-projecting
Deep-layer corticothalamic and 6b	CS202210140_474	CCN202210140_SUPC	NaN	NaN	0	3	supercluster	#5EF0F1	78396	Deep-layer corticothalamic and 6b
MGE interneuron	CS202210140_484	CCN202210140_SUPC	NaN	NaN	0	4	supercluster	#DFFE66	222434	MGE interneuron

Finally, we can merge the cell_metadata with the taxonomy information. We only need to add the Neurotransmitter name/information from our cluster_details. We’ll join all colors.

cell_siletti = cell_metadata.join(siletti_mmc_results, rsuffix='_siletti')
cell_siletti = cell_siletti.join(
    cluster_details[['neurotransmitter']].rename(columns={'neurotransmitter': 'neurotransmitter_name'}),
    on='cluster_alias'
)
cell_siletti = cell_siletti.join(cluster_colors, on='cluster_alias')
cell_siletti.head()

	cell_barcode	sample_label	x	y	cluster_label	cluster_label_order	cluster_label_color	dataset_label	feature_matrix_label	abc_sample_id	...	subcluster_name	cluster_alias	subcluster_bootstrapping_probability	subcluster_correlation_coefficient	subcluster_aggregate_probability	neurotransmitter_name	subcluster	cluster	supercluster	neurotransmitter
cell_label
AAACCCAAGAAACCAT-1_ASAP_PMBDS_000060_s002_Rep1	AAACCCAAGAAACCAT-1	ASAP_PMBDS_000060_s002	0.016373	1.403025	cluster_000	1	#171c97	ASAP-PMDBS-10X	ASAP-PMDBS-10X	cc633260-614c-4e5c-aa18-d3e3a0ef81bb	...	Misc_404_2701	2701	1.00	0.4805	0.990	VGLUT1	#767F80	#B7E52E	#FEFE00	#2BDFD1
AAACCCAAGAAACTCA-1_ASAP_PMBDS_000088_s001_Rep1	AAACCCAAGAAACTCA-1	ASAP_PMBDS_000088_s001	10.029368	1.342043	cluster_013	14	#aa5ed0	ASAP-PMDBS-10X	ASAP-PMDBS-10X	4b4f4355-6463-4bf2-9544-3f6d493de741	...	DLCT6b_109_2894	2894	0.96	0.4384	0.768	VGLUT1	#C446B4	#68E5B9	#5EF0F1	#2BDFD1
AAACCCAAGAAAGCGA-1_ASAP_PMBDS_000177_s001_rep1	AAACCCAAGAAAGCGA-1	ASAP_PMBDS_000177_s001	-9.057206	-1.032092	cluster_006	7	#f591a5	ASAP-PMDBS-10X	ASAP-PMDBS-10X	0ea675f3-4b39-4c16-9518-66ddd9c409a1	...	ULIT_138_2778	2778	0.82	0.4664	0.820	VGLUT1 VGLUT2	#474AC5	#F5DC97	#FEA7BA	#22A5BB
AAACCCAAGAAAGCGA-1_ASAP_PMBDS_000185_s001_rep1	AAACCCAAGAAAGCGA-1	ASAP_PMBDS_000185_s001	-0.685645	-4.943793	cluster_001	2	#e4a9ba	ASAP-PMDBS-10X	ASAP-PMDBS-10X	0eb0506b-e76b-4f9a-a3b6-bc6bbccfdd32	...	ULIT_124_2626	2626	0.69	0.4889	0.276	VGLUT1 VGLUT2	#73CDB8	#F0A4E8	#FEA7BA	#22A5BB
AAACCCAAGAAATCCA-1_ASAP_PMBDS_000122_s002_1	AAACCCAAGAAATCCA-1	ASAP_PMBDS_000122_s002	-3.507261	13.855699	cluster_002	3	#a02226	ASAP-PMDBS-10X	ASAP-PMDBS-10X	1f67fc2a-7e81-4448-8437-e2c6ef8b4b1b	...	Oligo_47_3023	3023	1.00	0.3952	0.810	Other	#85BE78	#93B620	#9CBA19	#666666

5 rows × 71 columns

UMAP Plotting#

We define convenience function to plot the Uniform Manifold Approximation and Projection (UMAP) of the ASAP PMDB data.

def plot_umap(
    xx,
    yy,
    cc=None,
    val=None,
    fig_width=8,
    fig_height=8,
    cmap=None,
    labels=None,
    term_order_lookup=None,
    colorbar=False,
    sizes=None
):
    """
    """
    if sizes is None:
        sizes = 1
    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    if cmap is not None:
        scatt = ax.scatter(xx, yy, c=val, s=0.5, marker='.', cmap=cmap, alpha=sizes)
    elif cc is not None:
        scatt = ax.scatter(xx, yy, c=cc, s=0.5, marker='.', alpha=sizes)

    if labels is not None:
        from matplotlib.patches import Rectangle
        unique_labels = labels.unique()
        unique_colors = cc.unique()

        if term_order_lookup is not None:
            term_order = np.argsort(term_order_lookup.loc[unique_labels, 'term_order'])
            unique_labels = unique_labels[term_order]
            unique_colors = unique_colors[term_order]
            
        rects = []
        for color in unique_colors:
            rects.append(Rectangle((0, 0), 1, 1, fc=color))

        legend = ax.legend(rects, unique_labels, loc=2)
        # ax.add_artist(legend)

    if colorbar:
        fig.colorbar(scatt, ax=ax)
    
    return fig, ax

Transferred Neurotransmitter Types#

Below are the PMDBS cells colorized by their neurotransmitter types that were inferred from mapping to the WHB taxonomy. Neurotransmitter assignment provides a high-level gestalt of the dataset by dividing the UMAP into glutamergic (shades of blue), GABAergic (reds), their combination (purples) and non-neuronal cells (gray) domains.

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    cc=cell_siletti['neurotransmitter'],
    labels=cell_siletti['neurotransmitter_name'],
    term_order_lookup=cluster_annotation_term,
    fig_width=16,
    fig_height=16
)
res = ax.set_title("Neurotransmitter")
plt.show()

../_images/6d75e258e4cf30e79a58157a00cddccd14d2e46f836cb7049e399d9dacf7c5cf.png

Transferred Whole Human Brain Taxonomy Types#

The next set of plots colorized the PMDBS cells by WHB taxonomy at the supercluster, cluster and subcluster levels. At the supercluster level, we see large swathe of the same colors shows that clusters derived from the scVI integration corresponds well to cell types defined by the WHB taxonomy.

Major glutamergic cortical supercluster annotation from WHB includes intratelencephalic (IT), corticothalamic (CT), and near-projecting (NP) cortical cells, hippocampal types (dente gyrus, CA1-3 and CA4), amygdala excitatory (AMEX) and upper rhombic lip (URL) cells. GABAergic superclusters include CGE (caudal ganglionic eminence) derived, MGE (medial ganglionic eminence) derived, LAMP5-LHX6 and Chandelier (LLC) cells, medium spiny neurons (MSN) and eccentric MSNs. Non-neuronal superclusters include astrocytes, ependymal, microglia, oligodendrocyte, OPCs (oligodendrocyte precursor), vascular and fibroblast cells.

The center of the UMAP shows a region of highly mixed assignments, with lower mapping probabilities (shown later) that could potentially indicate issues with data quality and/or batch effect.

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    cc=cell_siletti['supercluster'],
    labels=cell_siletti['supercluster_name'],
    term_order_lookup=cluster_annotation_term,
    fig_width=16,
    fig_height=16
)
res = ax.set_title("Supercluster")
plt.show()

../_images/269831789bb9ea22f475cf1f01b82f0a9a42b0aba6eeb110790bbce8d43cddfd.png

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    cc=cell_siletti['cluster'],
    fig_width=16,
    fig_height=16
)
res = ax.set_title("Cluster")
plt.show()

../_images/814d48172720caaa3709ab56cbd2234b8e20fd9f09fbe1f9e7d9af5c03e4d4a1.png

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    cc=cell_siletti['subcluster'],
    fig_width=16,
    fig_height=16
)
res = ax.set_title("subcluster")
plt.show()

../_images/1cede9a1a6fedcd60260e38b175a14377df0e2a63e77a5a060fb8af5e280a699.png

Assignment Correlation and Probabilities#

For each cell, at each level of the taxonomy, the MapMyCells results provide the following mapping quality metrics: (1) bootstrapping probability: the fraction of bootstrap iterations that chose the assigned node at that level of the taxonomy, (2) aggregate probability: multiplication of the bootstrapping probability at the current level with its parent levels to provide a total probability and (3) correlation coefficient: Pearson’s correlation coefficient between the gene expression profile of the cell and the gene expression profile of the assigned node across the set of pre-defined marker genes appropriate to that taxonomic node.

The next set of plots colorized PDMBS cells by the boostrapping probability at each level. Here we see that cells in the center part of the UMAP have the lowest probability mapping at every level and corresponding to where we also see highly mixed supercluster assignments.

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    val=cell_siletti['supercluster_bootstrapping_probability'],
    cmap=plt.cm.magma_r,
    fig_width=16,
    fig_height=12,
    colorbar=True
)
res = ax.set_title("Supercluster Bootstrap Prob")
plt.show()

../_images/211e621b051eadbe8f46a1a8377cf9a1edb82ed8891901ce93a342333847c578.png

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    val=cell_siletti['cluster_bootstrapping_probability'],
    cmap=plt.cm.magma_r,
    fig_width=16,
    fig_height=12,
    colorbar=True
)
res = ax.set_title("Cluster Bootstrap Prob")
plt.show()

../_images/71cf72b00ae1298d8bbcf3c1ee83fdf34d37cd3309e5d9ce2062c537ba94d07b.png

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    val=cell_siletti['subcluster_bootstrapping_probability'],
    cmap=plt.cm.magma_r,
    fig_width=16,
    fig_height=12,
    colorbar=True
)
res = ax.set_title("Subcluster Bootstrap Prob")
plt.show()

../_images/309dbe292c38ce0cac53ce6bb9345b35806de5aa4af0116192ca47c1b3f82877.png

Each of the above plots shows the probability of being assigned to a given celltype at a level in the taxonomy. This assumes that the cell was correctly assigned to the celltype level above. To calculate the confidence that a given cell is assigned to a type at, say, the leaf level, we need the aggregated probability (the multiplication of the bootstrap probability from parent levels).

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    val=cell_siletti['subcluster_aggregate_probability'],
    cmap=plt.cm.magma_r,
    fig_width=16,
    fig_height=12,
    colorbar=True
)
res = ax.set_title("Joint Bootstrap Prob")
plt.show()

../_images/9f41769eea84ff9fc4599e58fff3f11926a3ab218bd543004b6a824f7a9ab64e.png

As well as the bootstrapping probability, the correlation between the gene expression of a cell and its assigned cell type also provides vital mapping quality information. Here we colorized the PDMBS cell by the correlation coefficient of gene expression of the cell and its assigned supercluster over the set of pre-defined marker genes for the specific supercluster.

fig, ax = plot_umap(
    cell_siletti['x'],
    cell_siletti['y'],
    val=cell_siletti['supercluster_correlation_coefficient'],
    cmap=plt.cm.magma_r,
    fig_width=16,
    fig_height=12,
    colorbar=True
)
res = ax.set_title("Supercluster Correlation Coefficient")
plt.show()

../_images/7f72273c91c47c41fd4d5bb1f1bc4c24c442a006a99b46af5efa341b322f86df.png

Removing cells with low quality mapping#

Now that we’ve seen the probabilities and correlations, one can make cuts on this value to only keep cells that MapMyCells is confident of. Here we make a simple cut based on supercluster probability and correlation both of 50%. This removes roughly 1/3 of the cells. These cuts are not definitive, and users should play with them to achieve a cell mapping quality to their liking.

supercluster_prob05_mask = np.logical_and(
    cell_siletti['supercluster_bootstrapping_probability'] > 0.5,
    cell_siletti['supercluster_correlation_coefficient'] > 0.5
)
cell_supercluster_high = cell_siletti[supercluster_prob05_mask]
cell_supercluster_low = cell_siletti[~supercluster_prob05_mask]

After making this cut, we removed 758,985 cells with low quality assignments. As seen in the plots below, most of these cells are removed from the central region of the UMAP. The majority of Siletti clusters, the larger fraction of cells were removed from are those that were assigned Splatter cell types.

fig, ax = plot_umap(
    cell_supercluster_high['x'],
    cell_supercluster_high['y'],
    cc=cell_supercluster_high['supercluster'],
    labels=cell_supercluster_high['supercluster_name'],
    term_order_lookup=cluster_annotation_term,
    fig_width=16,
    fig_height=16
)
res = ax.set_title("High quality mappings")
plt.show()

../_images/8f1bd6d3e0b0c916b1a9b68bd1248be4dc71ca138f51fbc1c899f988ca9da461.png

fig, ax = plot_umap(
    cell_supercluster_low['x'],
    cell_supercluster_low['y'],
    cc=cell_supercluster_low['supercluster'],
    labels=cell_supercluster_low['supercluster_name'],
    term_order_lookup=cluster_annotation_term,
    fig_width=16,
    fig_height=16)
res = ax.set_title("Low quality mappings")
plt.show()

../_images/c1dc1a0152d7cecdba4c9fc8d3317edbd6e5efae0467f5bac98b905a89cc2c58.png

Mapping the Siletti superclusters on to the Leiden Clusters#

Finally, we’ll use the mapping of the cells into the Siletti taxonomy to infer mappings of the Leiden clusters that the ASAP PMDBS data provide. To do this, we’ll compute the fraction of cells that are in a given Leiden cluster that were mapped into a give Siletti supercluster.

# Compute the number of cells assigned in a given leiden cluster, Siletti supercluster pair.
group = cell_supercluster_high.groupby(['cluster_label', 'supercluster_name'])[['supercluster']].count()

# Create an empty dataframe to store our results for the fraction of a cluster assigned to a supercluster.
assignment_df = pd.DataFrame(
    index=cell_supercluster_high['supercluster_name'].unique(),
    columns=cell_supercluster_high['cluster_label'].unique()
).fillna(0.0)

# Compute the fraction for each leiden cluster, supercluster pair the fraction is the relative to the
# counts in the leiden supercluster.
for clust_lab in cell_supercluster_high['cluster_label'].unique():
    subgroup = group.loc[clust_lab]
    assignment_df.loc[subgroup.index, clust_lab] = (subgroup / subgroup.sum()).to_numpy()

# Make sure our dataframe rows are in ondological order.
assignment_df = assignment_df.reindex(
    sorted(assignment_df.columns), axis=1
).loc[cluster_annotation_term[
    cluster_annotation_term['cluster_annotation_term_set_name'] == 'supercluster'
].sort_values('term_order').index]

# For each supertype, retrieve the leiden clusters where that supertype has largest 
# fraction of cells from the leiden cluster assigned to it. Order the columns
# by supertype decending those maxes.
column_order = []
for row_idx, supertype in enumerate(assignment_df.index):
    saved_cols = []
    col_maxes = []
    for col in assignment_df.columns:
        max_arg = np.argmax(assignment_df[col])
        if max_arg == row_idx:
            saved_cols.append(col)
            col_maxes.append(assignment_df[col][max_arg])
    saved_cols = np.array(saved_cols)
    column_order.extend(saved_cols[np.argsort(col_maxes)[::-1]])

assignment_df = assignment_df[column_order]
assignment_df.head()

/var/folders/kc/7glrmt5n67x16yj_tg86t49c0000gp/T/ipykernel_7564/1207378633.py:8: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  ).fillna(0.0)
/var/folders/kc/7glrmt5n67x16yj_tg86t49c0000gp/T/ipykernel_7564/1207378633.py:34: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  col_maxes.append(assignment_df[col][max_arg])

	cluster_028	cluster_001	cluster_021	cluster_006	cluster_015	cluster_016	cluster_005	cluster_011	cluster_019	cluster_013	...	cluster_025	cluster_023	cluster_002	cluster_022	cluster_008	cluster_024	cluster_003	cluster_027	cluster_007	cluster_010
name
Upper-layer intratelencephalic	0.991699	0.986873	0.981704	0.841231	0.358504	0.013095	0.014055	0.046075	0.006404	0.003158	...	0.000339	0.000882	0.000044	0.000507	0.002780	0.0	0.001021	0.005602	0.001725	0.007558
Deep-layer intratelencephalic	0.007324	0.004965	0.009295	0.154667	0.178152	0.970663	0.962783	0.943954	0.003571	0.002002	...	0.000000	0.000294	0.000080	0.000254	0.001510	0.0	0.000911	0.000000	0.000145	0.001393
Deep-layer near-projecting	0.000000	0.000022	0.000098	0.000009	0.003666	0.000000	0.000000	0.000000	0.957153	0.001601	...	0.000000	0.000000	0.000000	0.000000	0.000040	0.0	0.000000	0.000000	0.000000	0.000000
Deep-layer corticothalamic and 6b	0.000977	0.000891	0.000587	0.001016	0.100440	0.002098	0.008950	0.005556	0.021084	0.987768	...	0.000339	0.000294	0.000009	0.000000	0.000200	0.0	0.000066	0.000000	0.000158	0.000697
MGE interneuron	0.000000	0.000093	0.000391	0.000027	0.032258	0.000039	0.000037	0.000000	0.001360	0.000044	...	0.000000	0.000000	0.000030	0.000507	0.000428	0.0	0.000104	0.000000	0.000290	0.001541

5 rows × 30 columns

def heatmap(data, row_labels, col_labels, ax=None,
            cbar_kw=None, cbarlabel="", xtick_colors=None, **kwargs):
    """
    from: https://matplotlib.org/stable/gallery/images_contours_and_fields/image_annotated_heatmap.html

    Create a heatmap from a numpy array and two lists of labels.

    Parameters
    ----------
    data
        A 2D numpy array of shape (M, N).
    row_labels
        A list or array of length M with the labels for the rows.
    col_labels
        A list or array of length N with the labels for the columns.
    ax
        A `matplotlib.axes.Axes` instance to which the heatmap is plotted.  If
        not provided, use current Axes or create a new one.  Optional.
    cbar_kw
        A dictionary with arguments to `matplotlib.Figure.colorbar`.  Optional.
    cbarlabel
        The label for the colorbar.  Optional.
    xtick_colors
        Set of colors to label the x ticklabels by. Must be the same length as
        col_labels. Optional.
    **kwargs
        All other arguments are forwarded to `imshow`.
    """

    if ax is None:
        ax = plt.gca()

    if cbar_kw is None:
        cbar_kw = {}

    # Plot the heatmap
    im = ax.imshow(data, **kwargs)

    # Create colorbar
    cbar = ax.figure.colorbar(im, ax=ax, **cbar_kw)
    cbar.ax.set_ylabel(cbarlabel, rotation=-90, va="bottom")

    # Show all ticks and label them with the respective list entries.
    ax.set_xticks(range(data.shape[1]), labels=col_labels,
                  rotation=-30, ha="right", rotation_mode="anchor")
    ax.set_yticks(range(data.shape[0]), labels=row_labels)


    # Let the horizontal axes labeling appear on top.
    ax.tick_params(top=True, bottom=False,
                   labeltop=True, labelbottom=False)

    # Turn spines off and create white grid.
    ax.spines[:].set_visible(False)

    ax.set_xticks(np.arange(data.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(data.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.tick_params(which="minor", bottom=False, left=False)

    if xtick_colors is not None:
        xticks = ax.get_xticklabels()

        for idx, tick in enumerate(xticks):
            tick.set_color(xtick_colors[idx])

        ax.set_xticklabels(xticks)

    return im, cbar

Below we plot the heatmap showing the fraction of cells in a Leiden Cluster assigned to a given Siletti supertype. Note the Leiden cluster names are colored with the colors they were assigned with and shown in the ASAP Data Overview tutorial.

We see that some clusters have clear assignments to Siletti. The ordering we produced above shows a step like structure in the mapping. Horizontal sets show multiple Leiden clusters being assigned to a single Siletti supertype while vertical sets show Leiden clusters that have been split across multiple supertypes that are similar ontologically. We can see that there are also a few Leiden clusters that have ambiguous mappings as well, even after our cut on MapMyCells probability.

fig, ax = plt.subplots()
fig.set_size_inches(16, 12)

heatmap(
    data=assignment_df,
    row_labels=assignment_df.index.to_numpy(),
    col_labels=assignment_df.columns,
    cmap=plt.cm.magma_r,
    vmin=0,
    vmax=1,
    cbarlabel="Fraction assigned to supertype",
    ax=ax,
    xtick_colors=value_sets.loc[assignment_df.columns]['color_hex_triplet'].to_numpy()
)
fig.tight_layout()

../_images/d3e0ac26c4409e5011b6462d8a5e485f3b24a7affed000a2d23acde2f74b7a75.png

In the next notebooks, we explore a similar MapMyCells mapping with the SEA-AD taxonomy as well as look at gene expression data in the UMAP.