Developing Mouse - Visual Cortex 10X scRNASeq analysis: clustering and annotations#
The Developing Mouse - Visual Cortex taxonomy is derived from a single cell transcriptomic dataset containing 568,654 cells from donors ranging in age from embryonic to adult. The cell-type assignment is a four level taxonomy defining 15 classes, 40 subclasses, 148 clusters and 714 subclusters. This cell-type assignment is broadly consistent with those from the previous studies from the Allen Institute for Brain Science, Whole Mouse Brain (WMB) and the Broad Institute whole mouse brain taxonomy at the subclass level while providing finer cell-type and temporal resolutions with additional subcluster annotations.
To generate this developing taxonomy, we applied the Quality Control (QC) and post-integration QC pipelines. Integration and label transfer of scRNA-seq between ages was performed using Seurat/scVI. A detailed cell type annotation table accompanies the taxonomy, including hierarchical membership and anatomical localization.
You need to be connected to the internet to run this notebook or connected to a cache that has the Developing Mouse data downloaded already.
The notebook presented here shows quick visualizations from precomputed metadata in the atlas, exploring both cell, donor, and library metadata as well as the taxonomy. We also plot visualizations for the various data and metadata in a Uniform Manifold Approximation and Projection (UMAP). For examples on accessing the expression matrices, specifically selecting genes from expression matrices, see the general_accessing_10x_snRNASeq_tutorial.ipynb tutorial/example. In a related tutorial, we also show how to access and use Developing Mouse - Visual Cortex gene expression data.
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, Optional
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
We will interact with the data using the AbcProjectCache. This cache object downloads data requested by the user, tracks which files have already been downloaded to your local system, and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a Pandas DataFrame. See the getting_started notebook for more details on using the cache including installing it if it has not already been.
Change the download_base variable to where you would like to download the data in your system or a location where a cache is already available.
download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(
download_base,
)
abc_cache.current_manifest
'releases/20260131/manifest.json'
Data overview#
We’ll quickly walk through the data we will be using in this notebook. The Developing Mouse 10X dataset is located across two directories listed in the ABCProjectCache. These data are the metadata and data for the cells and the metadata defining the taxonomy. We will be using data and metadata from the following directories:
Developing-Mouse-Vis-Cortex-10X
Developing-Mouse-Vis-Cortex-taxonomy
Below we list the data and metadata in the Developing Mouse dataset.
print("Developing-Mouse-Vis-Cortex-10X: gene expression data (h5ad)\n\t", abc_cache.list_expression_matrix_files(directory='Developing-Mouse-Vis-Cortex-10X'))
print("Developing-Mouse-Vis-Cortex-10X: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='Developing-Mouse-Vis-Cortex-10X'))
Developing-Mouse-Vis-Cortex-10X: gene expression data (h5ad)
['Developing-Mouse-Vis-Cortex-10X/log2', 'Developing-Mouse-Vis-Cortex-10X/raw']
Developing-Mouse-Vis-Cortex-10X: metadata (csv)
['cell_metadata', 'donor', 'example_gene_expression', 'gene', 'library', 'value_sets']
We will also use metadata from the Developing-Mouse-Vis-Cortex-taxonomy directory. Below is the list of available files:
print("Developing-Mouse-Vis-Cortex-taxonomy: metadata (csv)\n\t", abc_cache.list_metadata_files(directory='Developing-Mouse-Vis-Cortex-taxonomy'))
Developing-Mouse-Vis-Cortex-taxonomy: metadata (csv)
['cell_2d_embedding_coordinates', 'cell_to_cluster_membership', 'cluster', 'cluster_annotation_term', 'cluster_annotation_term_set', 'cluster_to_cluster_annotation_membership', 'de_genes']
Cell metadata#
Essential cell metadata is stored as a CSV file that we load as a Pandas DataFrame. Each row represents one cell indexed by a cell label. The cell label is the concatenation of barcode and name of the sample. In this context, the sample is the barcoded cell sample that represents a single load into one port of the 10x Chromium. Note that cell barcodes are only unique within a single barcoded cell sample and that the same barcode can be reused.
Each cell is associated with a library label, donor label, alignment_job_id, feature_matrix_label and dataset_label identifying which data package this cell is part of. This metadata file will be combined with other metadata files that ship with this package to add information associated with the donor, UMAP coordinates, taxonomy assignments, and more.
Below, we load the first of the metadata used in this tutorial. This represents the cell metadata for the aligned dataset.
The command we use below both downloads the data if it is not already available in the local cache and loads the data as a Pandas DataFrame. This pattern of loading metadata is repeated throughout the tutorials.
cell = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-10X',
file_name='cell_metadata'
).set_index('cell_label')
print("Number of cells = ", len(cell))
cell.head()
cell_metadata.csv: 100%|██████████| 105M/105M [00:03<00:00, 30.0MMB/s]
Number of cells = 568654
| cell_barcode | barcoded_cell_sample_label | library_label | alignment_id | log.qc.score.cr6 | synchronized_age | donor_label | dataset_label | feature_matrix_label | |
|---|---|---|---|---|---|---|---|---|---|
| cell_label | |||||||||
| AAACCCAAGGATTTAG-898_A02 | AAACCCAAGGATTTAG | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 338.986629 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X |
| AAACCCACACATTCTT-898_A02 | AAACCCACACATTCTT | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 425.642457 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X |
| AAACCCACACTTACAG-898_A02 | AAACCCACACTTACAG | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 342.366675 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X |
| AAACGAATCACTCTTA-898_A02 | AAACGAATCACTCTTA | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 353.204604 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X |
| AAACGAATCGACCAAT-898_A02 | AAACGAATCGACCAAT | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 355.052147 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X |
We can use pandas groupby function to see how many unique items are associated for each field and list them out if the number of unique items is small.
def print_column_info(df):
for c in df.columns:
grouped = df[[c]].groupby(c).count()
members = ''
if len(grouped) < 30:
members = str(list(grouped.index))
print("Number of unique %s = %d %s" % (c, len(grouped), members))
print_column_info(cell)
Number of unique cell_barcode = 527176
Number of unique barcoded_cell_sample_label = 91
Number of unique library_label = 91
Number of unique alignment_id = 91
Number of unique log.qc.score.cr6 = 496566
Number of unique synchronized_age = 35
Number of unique donor_label = 53
Number of unique dataset_label = 1 ['Developing-Mouse-Vis-Cortex-10X']
Number of unique feature_matrix_label = 1 ['Developing-Mouse-Vis-Cortex-10X']
Donor and Library metadata#
The first two associated metadata we load are the donor and library tables. The donor table contains species, sex, and age information. The library table contains information on 10X methods and brain region of interest the tissue was extracted from.
Below we load the donor metadata. Note that we flatten the values of P54 and above to P56 or “Adult” to match the published analysis. The original age values are available in the donor table.
donor = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-10X',
file_name='donor'
).set_index('donor_label')
for idx, row in donor.iterrows():
if row['donor_age'] in ['P54', 'P59', 'P60', 'P61', 'P68']:
donor.loc[idx, 'donor_age'] = 'P56'
donor.head()
donor.csv: 0%| | 0.00/5.08k [00:00<?, ?MB/s]
donor.csv: 100%|██████████| 5.08k/5.08k [00:00<00:00, 96.1kMB/s]
| donor_species | species_scientific_name | species_genus | donor_sex | donor_age_value | donor_age_unit | donor_age_reference_point | donor_age | age_bin | |
|---|---|---|---|---|---|---|---|---|---|
| donor_label | |||||||||
| C57BL6J-603767 | NCBITaxon:10090 | Mus musculus | Mouse | Male | 18.5 | days | conception | E18.5 | E17_E18.5 |
| C57BL6J-620146 | NCBITaxon:10090 | Mus musculus | Mouse | Male | 9.0 | days | birth | P9 | 9 |
| C57BL6J-628080 | NCBITaxon:10090 | Mus musculus | Mouse | Male | 13.5 | days | conception | E13.5 | E13.5_E16.5 |
| C57BL6J-628083 | NCBITaxon:10090 | Mus musculus | Mouse | Male | 13.5 | days | conception | E13.5 | E13.5_E16.5 |
| C57BL6J-628825 | NCBITaxon:10090 | Mus musculus | Mouse | Female | 18.0 | days | conception | E18 | E17_E18.5 |
Next we load the library metadata. The information we will primarily use from this table are the region of interest that each library is associated with
library = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-10X',
file_name='library'
).set_index('library_label')
library.head()
library.csv: 100%|██████████| 11.2k/11.2k [00:00<00:00, 138kMB/s]
| library_technique | barcoded_cell_sample_label | enrichment_population | cell_specimen_type | parcellation_term_identifier | region_of_interest_name | region_of_interest_label | donor_label | |
|---|---|---|---|---|---|---|---|---|
| library_label | ||||||||
| L8TX_200618_01_D07 | 10xV3.1;GEXOnly | 265_B02 | No FACS | Cells | MBA:385 | Primary visual area | VISp | Snap25-IRES2-Cre;Ai14-536764 |
| L8TX_200618_01_F06 | 10xV3.1;GEXOnly | 264_A01 | RFP-positive, Hoechst-positive | Cells | MBA:385 | Primary visual area | VISp | Snap25-IRES2-Cre;Ai14-536763 |
| L8TX_200618_01_G05 | 10xV3.1;GEXOnly | 264_B01 | RFP-positive, Hoechst-positive | Cells | MBA:385 | Primary visual area | VISp | Snap25-IRES2-Cre;Ai14-536763 |
| L8TX_200618_01_H05 | 10xV3.1;GEXOnly | 265_A02 | No FACS | Cells | MBA:385 | Primary visual area | VISp | Snap25-IRES2-Cre;Ai14-536764 |
| L8TX_200709_01_B03 | 10xV3.1;GEXOnly | 282_B02 | No FACS | Cells | MBA:385 | Primary visual area | VISp | Snap25-IRES2-Cre;Ai14-538743 |
We combine the donor and library tables into an extended cell metadata table.
cell_extended = cell.join(donor, on='donor_label')
cell_extended = cell_extended.join(library, on='library_label', rsuffix='_library_table')
del cell
We use the groupby function to show the number of cells in each region of interest.
cell_extended.groupby('region_of_interest_name')[['region_of_interest_label']].count()
| region_of_interest_label | |
|---|---|
| region_of_interest_name | |
| Cerebrum - Brain stem | 12813 |
| Primary visual area | 128550 |
| Visual areas | 387480 |
| Visual areas - Posterior parietal association areas | 25539 |
| brain | 14272 |
We can use the group by functionality to group the cells by each age.
cell_extended.groupby('age_bin')[['library_label']].count().rename(columns={'library_label': 'number_of_cells'}).loc[
['E11.5_E12.5', 'E13.5_E16.5', 'E17_E18.5',
'0_1', '2', '3', '4', '5_6', '7_8', '9', '10',
'11', '12_13', '14_15', '16', '17_19', '20_28', '54_68']]
| number_of_cells | |
|---|---|
| age_bin | |
| E11.5_E12.5 | 17222 |
| E13.5_E16.5 | 19762 |
| E17_E18.5 | 22970 |
| 0_1 | 28315 |
| 2 | 23329 |
| 3 | 17953 |
| 4 | 12744 |
| 5_6 | 13356 |
| 7_8 | 31637 |
| 9 | 31973 |
| 10 | 12146 |
| 11 | 11885 |
| 12_13 | 54048 |
| 14_15 | 76385 |
| 16 | 24198 |
| 17_19 | 28877 |
| 20_28 | 68990 |
| 54_68 | 72864 |
Adding color and feature order#
Each major feature in the donor and library table is associated with unique colors and an ordering with the set of values. Below we load the value_sets DataFrame which is a mapping from the various value in the donor and species tables to those colors and orderings. We incorporate these values into the cell metadata table.
value_sets = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-10X',
file_name='value_sets'
).set_index('label')
value_sets
value_sets.csv: 100%|██████████| 4.33k/4.33k [00:00<00:00, 66.6kMB/s]
| table | field | description | order | external_identifier | parent_label | color_hex_triplet | |
|---|---|---|---|---|---|---|---|
| label | |||||||
| Female | donor | donor_sex | Female | 1 | NaN | NaN | #565353 |
| Male | donor | donor_sex | Male | 2 | NaN | NaN | #ADC4C3 |
| Mouse | donor | species_genus | Mouse | 1 | NCBITaxon:10088 | NaN | #c941a7 |
| Mus musculus | donor | species_scientific_name | Mus musculus | 1 | NCBITaxon:10090 | NaN | #ffa300 |
| WholeBrain | library | region_of_interest_label | brain | 1 | MBA:997 | NaN | #bebebe |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 14_15 | donor | age_bin | age bin in days | 14 | NaN | NaN | #2C49AD |
| 16 | donor | age_bin | age bin in days | 15 | NaN | NaN | #0F0F99 |
| 17_19 | donor | age_bin | age bin in days | 16 | NaN | NaN | #6B49F9 |
| 20_28 | donor | age_bin | age bin in days | 17 | NaN | NaN | #AE43C6 |
| 54_68 | donor | age_bin | age bin in days | 18 | NaN | NaN | #C30000 |
68 rows × 7 columns
We define a convenience function to add colors for the various values in the data (e.g. unique region of interest or donor sex values).
def extract_value_set(
cell_metadata_df: pd.DataFrame,
input_value_set: pd.DataFrame,
input_value_set_label: str,
dataframe_column: Optional[str] = None
):
"""Add color and order columns to the cell metadata dataframe based on the input
value set.
Columns are added as {input_value_set_label}_color and {input_value_set_label}_order.
Parameters
----------
cell_metadata_df : pd.DataFrame
DataFrame containing cell metadata.
input_value_set : pd.DataFrame
DataFrame containing the value set information.
input_value_set_label : str
The the column name to extract color and order information for. will be added to the cell metadata.
"""
if dataframe_column is None:
dataframe_column = input_value_set_label
cell_metadata_df[f'{dataframe_column}_color'] = input_value_set[
input_value_set['field'] == input_value_set_label
].loc[cell_metadata_df[dataframe_column]]['color_hex_triplet'].values
cell_metadata_df[f'{dataframe_column}_order'] = input_value_set[
input_value_set['field'] == input_value_set_label
].loc[cell_metadata_df[dataframe_column]]['order'].values
Use our function to add the relevant color and order columns to our cell_metadata table.
# Add region of interest color and order
extract_value_set(cell_extended, value_sets, 'region_of_interest_label')
# Add region of interest color and order
extract_value_set(cell_extended, value_sets, 'synchronized_age', 'donor_age')
extract_value_set(cell_extended, value_sets, 'synchronized_age')
# Add region of interest color and order
extract_value_set(cell_extended, value_sets, 'age_bin')
# Add species common name color and order
extract_value_set(cell_extended, value_sets, 'species_genus')
# Add species scientific name color and order
extract_value_set(cell_extended, value_sets, 'species_scientific_name')
# Add donor sex color and order
extract_value_set(cell_extended, value_sets, 'donor_sex')
cell_extended.head()
| cell_barcode | barcoded_cell_sample_label | library_label | alignment_id | log.qc.score.cr6 | synchronized_age | donor_label | dataset_label | feature_matrix_label | donor_species | ... | synchronized_age_color | synchronized_age_order | age_bin_color | age_bin_order | species_genus_color | species_genus_order | species_scientific_name_color | species_scientific_name_order | donor_sex_color | donor_sex_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cell_label | |||||||||||||||||||||
| AAACCCAAGGATTTAG-898_A02 | AAACCCAAGGATTTAG | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 338.986629 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | #E6865C | 3 | #D38F38 | 2 | #c941a7 | 1 | #ffa300 | 1 | #ADC4C3 | 2 |
| AAACCCACACATTCTT-898_A02 | AAACCCACACATTCTT | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 425.642457 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | #E6865C | 3 | #D38F38 | 2 | #c941a7 | 1 | #ffa300 | 1 | #ADC4C3 | 2 |
| AAACCCACACTTACAG-898_A02 | AAACCCACACTTACAG | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 342.366675 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | #E6865C | 3 | #D38F38 | 2 | #c941a7 | 1 | #ffa300 | 1 | #ADC4C3 | 2 |
| AAACGAATCACTCTTA-898_A02 | AAACGAATCACTCTTA | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 353.204604 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | #E6865C | 3 | #D38F38 | 2 | #c941a7 | 1 | #ffa300 | 1 | #ADC4C3 | 2 |
| AAACGAATCGACCAAT-898_A02 | AAACGAATCGACCAAT | 898_A02 | L8TX_211028_01_G04 | 1157582505 | 355.052147 | E13.5 | C67BL6J-606627 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | #E6865C | 3 | #D38F38 | 2 | #c941a7 | 1 | #ffa300 | 1 | #ADC4C3 | 2 |
5 rows × 40 columns
UMAP spatial embedding#
Now that we’ve merged our donor and library metadata into the main cells data, our next step is to plot these values in the Uniform Manifold Approximation and Projection (UMAP) for cells in the dataset. The UMAP is a dimension reduction technique that can be used for visualizing and exploring large-dimension datasets.
Below we load this 2-D embedding for a sub selection of our cells and merge the x-y coordinates into the extended cell metadata we are creating.
cell_2d_embedding_coordinates = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-taxonomy',
file_name='cell_2d_embedding_coordinates'
).set_index('cell_label')
cell_2d_embedding_coordinates.head()
cell_2d_embedding_coordinates.csv: 100%|██████████| 25.7M/25.7M [00:00<00:00, 31.2MMB/s]
| x | y | |
|---|---|---|
| cell_label | ||
| AAACCCAAGGATTTAG-898_A02 | 17.504883 | 4.327579 |
| AAACCCACACATTCTT-898_A02 | 9.755900 | 5.404129 |
| AAACCCACACTTACAG-898_A02 | 10.902691 | 4.540480 |
| AAACGAATCACTCTTA-898_A02 | 13.066967 | 4.577918 |
| AAACGAATCGACCAAT-898_A02 | 18.801212 | -2.231872 |
cell_extended = cell_extended.join(cell_2d_embedding_coordinates)
cell_extended = cell_extended.sample(frac=1) # shuffle the rows for plotting purposes
del cell_2d_embedding_coordinates
We define a small helper function plot_umap to visualize the cells on the UMAP. In the examples below we will plot associated cell information colorized by donor age, sex, and region of interest.
def plot_umap(
xx: np.ndarray,
yy: np.ndarray,
cc: np.ndarray = None,
val: np.ndarray = None,
fig_width: float = 8,
fig_height: float = 8,
cmap: Optional[plt.Colormap] = None,
labels: np.ndarray = None,
term_orders: np.ndarray = None,
colorbar: bool = False,
sizes: np.ndarray = None
) -> Tuple[plt.Figure, plt.Axes]:
"""
Plot a scatter plot of the UMAP coordinates.
Parameters
----------
xx : array-like
x-coordinates of the points to plot.
yy : array-like
y-coordinates of the points to plot.
cc : array-like, optional
colors of the points to plot. If None, the points will be colored by the values in `val`.
val : array-like, optional
values of the points to plot. If None, the points will be colored by the values in `cc`.
fig_width : float, optional
width of the figure in inches. Default is 8.
fig_height : float, optional
height of the figure in inches. Default is 8.
cmap : str, optional
colormap to use for coloring the points. If None, the points will be colored by the values in `cc`.
labels : array-like, optional
labels for the points to plot. If None, no labels will be added to the plot.
term_orders : array-like, optional
order of the labels for the legend. If None, the labels will be ordered by their appearance in `labels`.
colorbar : bool, optional
whether to add a colorbar to the plot. Default is False.
sizes : array-like, optional
sizes of the points to plot. If None, all points will have the same size.
"""
if sizes is None:
sizes = 1
fig, ax = plt.subplots()
fig.set_size_inches(fig_width, fig_height)
if cmap is not None:
scatt = ax.scatter(xx, yy, c=val, s=0.5, marker='.', cmap=cmap, alpha=sizes)
elif cc is not None:
scatt = ax.scatter(xx, yy, c=cc, s=0.5, marker='.', alpha=sizes)
if labels is not None:
from matplotlib.patches import Rectangle
unique_label_colors = (labels + ',' + cc).unique()
unique_labels = np.array([label_color.split(',')[0] for label_color in unique_label_colors])
unique_colors = np.array([label_color.split(',')[1] for label_color in unique_label_colors])
if term_orders is not None:
unique_order = term_orders.unique()
term_order = np.argsort(unique_order)
unique_labels = unique_labels[term_order]
unique_colors = unique_colors[term_order]
rects = []
for color in unique_colors:
rects.append(Rectangle((0, 0), 1, 1, fc=color))
legend = ax.legend(rects, unique_labels, loc=1)
# ax.add_artist(legend)
if colorbar:
fig.colorbar(scatt, ax=ax)
return fig, ax
Plot the various donor and library metadata available.
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['donor_sex_color'],
labels=cell_extended['donor_sex'],
term_orders=cell_extended['donor_sex_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("donor_sex")
plt.show()
Below we show the region of interest for the cells in the dataset.
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['region_of_interest_label_color'],
labels=cell_extended['region_of_interest_label'],
term_orders=cell_extended['region_of_interest_label_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("region_of_interest_label")
plt.show()
The UMAP with donor age.
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['donor_age_color'],
labels=cell_extended['donor_age'],
term_orders=cell_extended['donor_age_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("donor_age")
plt.show()
Taxonomy Information#
The final set of metadata we load into our extended cell metadata file maps the cells into their assigned cluster in the taxonomy. We additionally load metadata for the clusters and compute useful information, such as the number of cells in each taxon at each level of the taxonomy.
First, we load information associated with the lowest level in the taxonomy in the taxonomy. This includes a useful alias value for each cluster as well as the number of cells in each subcluster.
cluster = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-taxonomy',
file_name='cluster',
dtype={'number_of_cells': 'Int64'}
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster.head()
cluster.csv: 100%|██████████| 20.5k/20.5k [00:00<00:00, 210kMB/s]
| cluster_alias | number_of_cells | |
|---|---|---|
| cluster_annotation_term_label | ||
| CS20260131_SCLU_0001 | 1 | 77 |
| CS20260131_SCLU_0002 | 2 | 126 |
| CS20260131_SCLU_0003 | 3 | 184 |
| CS20260131_SCLU_0004 | 4 | 186 |
| CS20260131_SCLU_0005 | 5 | 25 |
Next, we load the table that describes the levels in the taxonomy from class at the highest to subcluster at the lowest level.
cluster_annotation_term_set = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-taxonomy',
file_name='cluster_annotation_term_set'
).rename(columns={'label': 'cluster_annotation_term_label'})
cluster_annotation_term_set
cluster_annotation_term_set.csv: 100%|██████████| 268/268 [00:00<00:00, 2.76kMB/s]
| name | cluster_annotation_term_label | description | order | parent_term_set_label | |
|---|---|---|---|---|---|
| 0 | class | CCN20260131_LEVEL_0 | class | 0 | NaN |
| 1 | subclass | CCN20260131_LEVEL_1 | subclass | 1 | CCN20260131_LEVEL_0 |
| 2 | cluster | CCN20260131_LEVEL_2 | cluster | 2 | CCN20260131_LEVEL_1 |
| 3 | subcluster | CCN20260131_LEVEL_3 | subcluster | 3 | CCN20260131_LEVEL_2 |
For the subclusters, we load information on the annotations for each subcluster. This also includes the term order and color information which we will use to plot later. Note the inclusion of the CCN20230722_label column. This value points to the equivalent taxon in the Whole Mouse Brain taxonomy if it exists.
cluster_annotation_term = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-taxonomy',
file_name='cluster_annotation_term',
).rename(columns={'label': 'cluster_annotation_term_label'}).set_index('cluster_annotation_term_label')
cluster_annotation_term
cluster_annotation_term.csv: 100%|██████████| 129k/129k [00:00<00:00, 1.10MMB/s]
| name | cluster_annotation_term_set_label | cluster_annotation_term_set_name | color_hex_triplet | term_order | term_set_order | parent_term_label | parent_term_name | parent_term_set_label | CCN20230722_label | |
|---|---|---|---|---|---|---|---|---|---|---|
| cluster_annotation_term_label | ||||||||||
| CS20260131_CLAS_009 | Astro-Epen | CCN20260131_LEVEL_0 | class | #594a26 | 9 | 0 | NaN | NaN | NaN | CS20230722_CLAS_30 |
| CS20260131_CLAS_013 | CNU-MGE GABA | CCN20260131_LEVEL_0 | class | #450099 | 13 | 0 | NaN | NaN | NaN | CS20230722_CLAS_08 |
| CS20260131_CLAS_002 | CR Glut | CCN20260131_LEVEL_0 | class | #919900 | 2 | 0 | NaN | NaN | NaN | NaN |
| CS20260131_CLAS_011 | CTX-CGE GABA | CCN20260131_LEVEL_0 | class | #CCFF33 | 11 | 0 | NaN | NaN | NaN | CS20230722_CLAS_06 |
| CS20260131_CLAS_012 | CTX-MGE GABA | CCN20260131_LEVEL_0 | class | #f954ee | 12 | 0 | NaN | NaN | NaN | CS20230722_CLAS_07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CS20260131_SCLU_051 | RG_5 | CCN20260131_LEVEL_3 | subcluster | #9e9ac8 | 51 | 3 | CS20260131_CLUS_004 | RG | CCN20260131_LEVEL_2 | NaN |
| CS20260131_SCLU_052 | RG_6 | CCN20260131_LEVEL_3 | subcluster | #bcbddc | 52 | 3 | CS20260131_CLUS_004 | RG | CCN20260131_LEVEL_2 | NaN |
| CS20260131_SCLU_053 | RG_7 | CCN20260131_LEVEL_3 | subcluster | #dadaeb | 53 | 3 | CS20260131_CLUS_004 | RG | CCN20260131_LEVEL_2 | NaN |
| CS20260131_SCLU_054 | RG_8 | CCN20260131_LEVEL_3 | subcluster | #636363 | 54 | 3 | CS20260131_CLUS_004 | RG | CCN20260131_LEVEL_2 | NaN |
| CS20260131_SCLU_055 | RG_9 | CCN20260131_LEVEL_3 | subcluster | #969696 | 55 | 3 | CS20260131_CLUS_004 | RG | CCN20260131_LEVEL_2 | NaN |
917 rows × 10 columns
Finally, we load the cluster to cluster annotation membership table. Each row in this table is a mapping between the subclusters and every level of the taxonomy it belongs to, including itself. We’ll use this table in a groupby to allow us to count up the number of clusters at each taxonomy level and sum the number of cells in each taxon in the taxonomy a all levels.
cluster_to_cluster_annotation_membership = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-taxonomy',
file_name='cluster_to_cluster_annotation_membership'
).set_index('cluster_annotation_term_label')
membership_with_cluster_info = cluster_to_cluster_annotation_membership.join(
cluster.reset_index().set_index('cluster_alias')[['number_of_cells']],
on='cluster_alias'
)
membership_with_cluster_info = membership_with_cluster_info.join(cluster_annotation_term, rsuffix='_anno_term').reset_index()
membership_groupby = membership_with_cluster_info.groupby(
['cluster_alias', 'cluster_annotation_term_set_name']
)
membership_with_cluster_info.head()
cluster_to_cluster_annotation_membership.csv: 0%| | 0.00/189k [00:00<?, ?MB/s]
cluster_to_cluster_annotation_membership.csv: 100%|██████████| 189k/189k [00:00<00:00, 1.45MMB/s]
| cluster_annotation_term_label | cluster_annotation_term_set_name | cluster_annotation_term_name | cluster_alias | cluster_annotation_term_set_label | number_of_cells | name | cluster_annotation_term_set_label_anno_term | cluster_annotation_term_set_name_anno_term | color_hex_triplet | term_order | term_set_order | parent_term_label | parent_term_name | parent_term_set_label | CCN20230722_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CS20260131_CLAS_001 | class | NEC | 1 | CCN20260131_LEVEL_0 | 77 | NEC | CCN20260131_LEVEL_0 | class | #6BAED6 | 1 | 0 | NaN | NaN | NaN | NaN |
| 1 | CS20260131_CLAS_001 | class | NEC | 2 | CCN20260131_LEVEL_0 | 126 | NEC | CCN20260131_LEVEL_0 | class | #6BAED6 | 1 | 0 | NaN | NaN | NaN | NaN |
| 2 | CS20260131_CLAS_001 | class | NEC | 3 | CCN20260131_LEVEL_0 | 184 | NEC | CCN20260131_LEVEL_0 | class | #6BAED6 | 1 | 0 | NaN | NaN | NaN | NaN |
| 3 | CS20260131_CLAS_001 | class | NEC | 4 | CCN20260131_LEVEL_0 | 186 | NEC | CCN20260131_LEVEL_0 | class | #6BAED6 | 1 | 0 | NaN | NaN | NaN | NaN |
| 4 | CS20260131_CLAS_001 | class | NEC | 5 | CCN20260131_LEVEL_0 | 25 | NEC | CCN20260131_LEVEL_0 | class | #6BAED6 | 1 | 0 | NaN | NaN | NaN | NaN |
From the membership table, we create three tables via a groupby. First the name of each cluster and its parents.
# term_sets = abc_cache.get_metadata_dataframe(directory='WHB-taxonomy', file_name='cluster_annotation_term_set').set_index('label')
cluster_details = membership_groupby['cluster_annotation_term_name'].first().unstack()
cluster_details = cluster_details[cluster_annotation_term_set['name']] # order columns
cluster_details.fillna('Other', inplace=True)
cluster_details.head()
| cluster_annotation_term_set_name | class | subclass | cluster | subcluster |
|---|---|---|---|---|
| cluster_alias | ||||
| 1 | NEC | NEC | NEC | NEC_1 |
| 2 | NEC | NEC | NEC | NEC_2 |
| 3 | NEC | NEC | NEC | NEC_3 |
| 4 | NEC | NEC | NEC | NEC_4 |
| 5 | NEC | NEC | NEC | NEC_5 |
Next the plotting order of each of the taxons and their parents.
cluster_order = membership_groupby['term_order'].first().unstack()
cluster_order.rename(
columns={'class': 'class_order',
'subclass': 'subclass_order',
'cluster': 'cluster_order',
'subcluster': 'subcluster_order'},
inplace=True
)
cluster_order.head()
| cluster_annotation_term_set_name | class_order | cluster_order | subclass_order | subcluster_order |
|---|---|---|---|---|
| cluster_alias | ||||
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 2 |
| 3 | 1 | 1 | 1 | 3 |
| 4 | 1 | 1 | 1 | 4 |
| 5 | 1 | 1 | 1 | 5 |
Finally, the colors we will use to plot for each of the unique taxons at all levels.
cluster_colors = membership_groupby['color_hex_triplet'].first().unstack()
cluster_colors = cluster_colors[cluster_annotation_term_set['name']]
cluster_colors.head()
| cluster_annotation_term_set_name | class | subclass | cluster | subcluster |
|---|---|---|---|---|
| cluster_alias | ||||
| 1 | #6BAED6 | #50B4F0 | #9e9ac8 | #6baed6 |
| 2 | #6BAED6 | #50B4F0 | #9e9ac8 | #9ecae1 |
| 3 | #6BAED6 | #50B4F0 | #9e9ac8 | #c6dbef |
| 4 | #6BAED6 | #50B4F0 | #9e9ac8 | #e6550d |
| 5 | #6BAED6 | #50B4F0 | #9e9ac8 | #fd8d3c |
Next, we bring it all together by loading the mapping of cells to subcluster and join into our final metadata table.
cell_to_cluster_membership = abc_cache.get_metadata_dataframe(
directory='Developing-Mouse-Vis-Cortex-taxonomy',
file_name='cell_to_cluster_membership',
).set_index('cell_label')
cell_to_cluster_membership.head()
cell_to_cluster_membership.csv: 0%| | 0.00/28.5M [00:00<?, ?MB/s]
cell_to_cluster_membership.csv: 100%|██████████| 28.5M/28.5M [00:01<00:00, 26.1MMB/s]
| cluster_alias | label | |
|---|---|---|
| cell_label | ||
| AAACCCAAGGATTTAG-898_A02 | 50 | CS20260131_SCLU_0050 |
| AAACCCACACATTCTT-898_A02 | 122 | CS20260131_SCLU_0122 |
| AAACCCACACTTACAG-898_A02 | 118 | CS20260131_SCLU_0118 |
| AAACGAATCACTCTTA-898_A02 | 108 | CS20260131_SCLU_0108 |
| AAACGAATCGACCAAT-898_A02 | 690 | CS20260131_SCLU_0690 |
We merge this table with information from our taxons.
cell_extended = cell_extended.join(cell_to_cluster_membership, rsuffix='_cell_to_cluster_membership')
cell_extended = cell_extended.join(cluster_details, on='cluster_alias')
cell_extended = cell_extended.join(cluster_colors, on='cluster_alias', rsuffix='_color')
cell_extended = cell_extended.join(cluster_order, on='cluster_alias')
del cell_to_cluster_membership
cell_extended.head()
| cell_barcode | barcoded_cell_sample_label | library_label | alignment_id | log.qc.score.cr6 | synchronized_age | donor_label | dataset_label | feature_matrix_label | donor_species | ... | cluster | subcluster | class_color | subclass_color | cluster_color | subcluster_color | class_order | cluster_order | subclass_order | subcluster_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cell_label | |||||||||||||||||||||
| CCTCCAAAGTAAGAGG-482_A06 | CCTCCAAAGTAAGAGG | 482_A06 | L8TX_210107_01_D09 | 1157582376 | 241.8102 | P21 | Snap25-IRES2-Cre;Ai14-562459 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | 5312_Microglia NN_1 | 5312_Microglia NN_1_3 | #825f45 | #CC1F4E | #a55194 | #dadaeb | 15 | 146 | 39 | 712 |
| GAGGCAACAGCTTTCC-479_A03 | GAGGCAACAGCTTTCC | 479_A03 | L8TX_210107_01_A09 | 1157582337 | 431.3020 | P56 | Snap25-IRES2-Cre;Ai14-562811 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | 741_Pvalb Gaba_3 | 741_Pvalb Gaba_3_8 | #f954ee | #b13667 | #c7e9c0 | #969696 | 12 | 106 | 31 | 617 |
| ATCGGATCACCCATAA-441_B01 | ATCGGATCACCCATAA | 441_B01 | L8TX_201120_01_H09 | 1178468076 | 457.6961 | P56 | Snap25-IRES2-Cre;Ai14-553679 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | 437_L6 CT CTX Glut_1 | 437_L6 CT CTX Glut_1_15 | #61e2a4 | #34661F | #c7e9c0 | #cedb9c | 6 | 13 | 11 | 204 |
| AAGAACACATTGCCGG-418_C04 | AAGAACACATTGCCGG | 418_C04 | L8TX_201105_01_C02 | 1157582427 | 416.2293 | P28 | Snap25-IRES2-Cre;Ai14-555355 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | 118_L2/3 IT CTX Glut_4 | 118_L2/3 IT CTX Glut_4_2 | #FA0087 | #E9530F | #de9ed6 | #c7e9c0 | 7 | 45 | 18 | 453 |
| GTGAGCCGTCTACACA-634_A03 | GTGAGCCGTCTACACA | 634_A03 | L8TX_210506_01_D07 | 1157582451 | 370.9672 | P5 | Snap25-IRES2-Cre;Ai14-582040 | Developing-Mouse-Vis-Cortex-10X | Developing-Mouse-Vis-Cortex-10X | NCBITaxon:10090 | ... | 439_L6 CT CTX Glut_1 | 439_L6 CT CTX Glut_1_6 | #61e2a4 | #34661F | #756bb1 | #d6616b | 6 | 14 | 11 | 211 |
5 rows × 56 columns
print_column_info(cell_extended)
Number of unique cell_barcode = 527176
Number of unique barcoded_cell_sample_label = 91
Number of unique library_label = 91
Number of unique alignment_id = 91
Number of unique log.qc.score.cr6 = 496566
Number of unique synchronized_age = 35
Number of unique donor_label = 53
Number of unique dataset_label = 1 ['Developing-Mouse-Vis-Cortex-10X']
Number of unique feature_matrix_label = 1 ['Developing-Mouse-Vis-Cortex-10X']
Number of unique donor_species = 1 ['NCBITaxon:10090']
Number of unique species_scientific_name = 1 ['Mus musculus']
Number of unique species_genus = 1 ['Mouse']
Number of unique donor_sex = 2 ['Female', 'Male']
Number of unique donor_age_value = 38
Number of unique donor_age_unit = 1 ['days']
Number of unique donor_age_reference_point = 2 ['birth', 'conception']
Number of unique donor_age = 35
Number of unique age_bin = 18 ['0_1', '10', '11', '12_13', '14_15', '16', '17_19', '2', '20_28', '3', '4', '54_68', '5_6', '7_8', '9', 'E11.5_E12.5', 'E13.5_E16.5', 'E17_E18.5']
Number of unique library_technique = 2 ['10xV3.1;GEXOnly', '10xV3;GEXOnly']
Number of unique barcoded_cell_sample_label_library_table = 91
Number of unique enrichment_population = 5 ['Calcein-positive, Hoechst-positive', 'Hoechst-positive', 'No FACS', 'RFP-positive, DAPI-negative', 'RFP-positive, Hoechst-positive']
Number of unique cell_specimen_type = 1 ['Cells']
Number of unique parcellation_term_identifier = 5 ['MBA:22|MBA:669', 'MBA:343|MBA:567', 'MBA:385', 'MBA:669', 'MBA:997']
Number of unique region_of_interest_name = 5 ['Cerebrum - Brain stem', 'Primary visual area', 'Visual areas', 'Visual areas - Posterior parietal association areas', 'brain']
Number of unique region_of_interest_label = 5 ['CH-BS', 'VIS', 'VIS-PTLp', 'VISp', 'WholeBrain']
Number of unique donor_label_library_table = 53
Number of unique region_of_interest_label_color = 5 ['#0059CC', '#08858C', '#26AEFF', '#B0F0FF', '#bebebe']
Number of unique region_of_interest_label_order = 5 [1, 3, 4, 5, 6]
Number of unique donor_age_color = 35
Number of unique donor_age_order = 35
Number of unique synchronized_age_color = 35
Number of unique synchronized_age_order = 35
Number of unique age_bin_color = 18 ['#00CC00', '#00D4E6', '#0F0F99', '#18F2E1', '#2C49AD', '#31CCA0', '#34D916', '#4EAD73', '#5A93C6', '#6B49F9', '#73BF7F', '#7FEC3C', '#82C5D9', '#AE43C6', '#BCD233', '#C30000', '#D38F38', '#F99389']
Number of unique age_bin_order = 18 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
Number of unique species_genus_color = 1 ['#c941a7']
Number of unique species_genus_order = 1 [1]
Number of unique species_scientific_name_color = 1 ['#ffa300']
Number of unique species_scientific_name_order = 1 [1]
Number of unique donor_sex_color = 2 ['#565353', '#ADC4C3']
Number of unique donor_sex_order = 2 [1, 2]
Number of unique x = 564967
Number of unique y = 565338
Number of unique cluster_alias = 714
Number of unique label = 714
Number of unique class = 15 ['Astro-Epen', 'CNU-MGE GABA', 'CR Glut', 'CTX-CGE GABA', 'CTX-MGE GABA', 'Glioblast', 'IMN', 'IP', 'IT Glut', 'Immune', 'NEC', 'OPC-Oligo', 'RG', 'Vascular', 'nonIT Glut']
Number of unique subclass = 40
Number of unique cluster = 148
Number of unique subcluster = 714
Number of unique class_color = 15 ['#03045E', '#16f2f2', '#450099', '#594a26', '#61e2a4', '#637939', '#6BAED6', '#825f45', '#858881', '#919900', '#CCFF33', '#CEDB9C', '#FA0087', '#FDAE6B', '#f954ee']
Number of unique subclass_color = 40
Number of unique cluster_color = 49
Number of unique subcluster_color = 40
Number of unique class_order = 15 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Number of unique cluster_order = 148
Number of unique subclass_order = 40
Number of unique subcluster_order = 714
Plotting the taxonomy#
Now that we have our cells with associated taxonomy information, we’ll plot them into the UMAP we showed previously.
Below we plot the taxonomy mapping of the cells for each level in the taxonomy.
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['class_color'],
labels=cell_extended['class'],
term_orders=cell_extended['class_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("class")
plt.show()
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['subclass_color'],
labels=cell_extended['subclass'],
term_orders=cell_extended['subclass_order'],
fig_width=12,
fig_height=12
)
res = ax.set_title("subclass")
plt.show()
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['cluster_color'],
fig_width=12,
fig_height=12
)
res = ax.set_title("cluster")
plt.show()
fig, ax = plot_umap(
cell_extended['x'],
cell_extended['y'],
cc=cell_extended['subcluster_color'],
fig_width=18,
fig_height=18
)
res = ax.set_title("subcluster")
plt.show()
Aggregating cluster and cells counts per term per species.#
Let’s investigate the taxonomy information a bit more. In this section, we’ll create bar plots showing the number of subclusters and cells at each level in the taxonomy.
First, we need to compute the number of subclusters that are in each of the cell type taxons above it. This is accomplished by a simple groupby in Pandas.
term_cluster_count = membership_with_cluster_info.reset_index().groupby(
['cluster_annotation_term_label']
)[['cluster_alias']].count()
term_cluster_count.columns = ['number_of_subclusters']
term_cluster_count.head()
| number_of_subclusters | |
|---|---|
| cluster_annotation_term_label | |
| CS20260131_CLAS_001 | 26 |
| CS20260131_CLAS_002 | 20 |
| CS20260131_CLAS_003 | 30 |
| CS20260131_CLAS_004 | 29 |
| CS20260131_CLAS_005 | 56 |
term_cell_count = membership_with_cluster_info.reset_index().groupby(
['cluster_annotation_term_label']
)[['number_of_cells']].sum()
term_cell_count.head()
| number_of_cells | |
|---|---|
| cluster_annotation_term_label | |
| CS20260131_CLAS_001 | 4605 |
| CS20260131_CLAS_002 | 1611 |
| CS20260131_CLAS_003 | 5656 |
| CS20260131_CLAS_004 | 4957 |
| CS20260131_CLAS_005 | 31718 |
# Join counts with the term dataframe
term_with_counts = cluster_annotation_term.join(term_cluster_count)
term_with_counts = term_with_counts.join(term_cell_count)
term_with_counts.head()
| name | cluster_annotation_term_set_label | cluster_annotation_term_set_name | color_hex_triplet | term_order | term_set_order | parent_term_label | parent_term_name | parent_term_set_label | CCN20230722_label | number_of_subclusters | number_of_cells | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cluster_annotation_term_label | ||||||||||||
| CS20260131_CLAS_009 | Astro-Epen | CCN20260131_LEVEL_0 | class | #594a26 | 9 | 0 | NaN | NaN | NaN | CS20230722_CLAS_30 | 25 | 32116 |
| CS20260131_CLAS_013 | CNU-MGE GABA | CCN20260131_LEVEL_0 | class | #450099 | 13 | 0 | NaN | NaN | NaN | CS20230722_CLAS_08 | 3 | 208 |
| CS20260131_CLAS_002 | CR Glut | CCN20260131_LEVEL_0 | class | #919900 | 2 | 0 | NaN | NaN | NaN | NaN | 20 | 1611 |
| CS20260131_CLAS_011 | CTX-CGE GABA | CCN20260131_LEVEL_0 | class | #CCFF33 | 11 | 0 | NaN | NaN | NaN | CS20230722_CLAS_06 | 52 | 17567 |
| CS20260131_CLAS_012 | CTX-MGE GABA | CCN20260131_LEVEL_0 | class | #f954ee | 12 | 0 | NaN | NaN | NaN | CS20230722_CLAS_07 | 91 | 38296 |
Below we create a function to plot the subcluster and cell counts in a bar graph, coloring by the associated taxon level.
def bar_plot_by_level_and_type(df: pd.DataFrame, level: str, fig_width: float = 8.5, fig_height: float = 4):
"""Plot the number of cells by the specified level.
Parameters
----------
df : pd.DataFrame
DataFrame containing cluster annotation terms with counts.
level : str
The level of the taxonomy to plot (e.g., 'Neighborhood', 'Class', 'Subclass', 'Group', 'Cluster').
fig_width : float, optional
Width of the figure in inches. Default is 8.5.
fig_height : float, optional
Height of the figure in inches. Default is 4.
"""
fig, ax = plt.subplots(1, 2)
fig.set_size_inches(fig_width, fig_height)
for idx, ctype in enumerate(['subclusters', 'cells']):
pred = (df['cluster_annotation_term_set_name'] == level)
sort_order = np.argsort(df[pred]['term_order'])
names = df[pred]['name'].iloc[sort_order]
counts = df[pred]['number_of_%s' % ctype].iloc[sort_order]
colors = df[pred]['color_hex_triplet'].iloc[sort_order]
ax[idx].barh(names, counts, color=colors)
ax[idx].set_title('Number of %s by %s' % (ctype,level))
ax[idx].set_xlabel('Number of %s' % ctype)
if ctype == 'cells':
ax[idx].set_xscale('log')
if idx > 0:
ax[idx].set_yticklabels([])
return fig, ax
Now let’s plot the counts the taxonomy levels class and subclass.
fig, ax = bar_plot_by_level_and_type(term_with_counts, 'class')
plt.show()
fig, ax = bar_plot_by_level_and_type(
term_with_counts,
'subclass',
fig_height=8
)
plt.show()
Cluster and subcluster count vs donor age.#
Below we compute the number of subclusters and clusters for each of the unique values of age in the dataset. We again compute this by Pandas groupbys.
age_per_subcluster = cell_extended.reset_index().groupby(
['donor_age_order']
)[['cluster_alias']].nunique()
age_per_subcluster.columns = ['number_of_subclusters']
age_per_subcluster.head()
| number_of_subclusters | |
|---|---|
| donor_age_order | |
| 1 | 65 |
| 2 | 68 |
| 3 | 150 |
| 4 | 132 |
| 6 | 144 |
age_per_cluster = cell_extended.reset_index().groupby(
['donor_age_order']
)[['cluster']].nunique()
age_per_cluster.columns = ['number_of_clusters']
age_per_cluster.head()
| number_of_clusters | |
|---|---|
| donor_age_order | |
| 1 | 12 |
| 2 | 13 |
| 3 | 28 |
| 4 | 26 |
| 6 | 26 |
We finally plot the number of (sub)clusters for each unique age, showing that both are increasing as a function of age.
age_values = value_sets[
value_sets['field'] == 'synchronized_age'
]
figure, ax = plt.subplots(1, 1)
fig.set_size_inches(10, 8)
plt.plot(
range(35), age_per_subcluster['number_of_subclusters'], linestyle='-', marker='o', label='number of subclusters'
)
plt.plot(
range(35), age_per_cluster['number_of_clusters'], linestyle='-', marker='o', label='number of clusters'
)
_ = ax.set_xticks(range(35))
_ = ax.set_xticklabels(age_values.reset_index().set_index('order').loc[age_per_cluster.index, 'label'], rotation=90)
ax.set_xlabel('synchronized_age')
ax.set_ylabel('count')
plt.title("Number of clusters and subclusters by synchronized_age")
plt.legend(loc=0)
<matplotlib.legend.Legend at 0x169890590>
Visualizing the developmental visual cortex taxonomy#
Term sets: class, subclass, cluster, and subcluster define the Developing Mouse, Visual Cortex taxonomy. We can visualize the taxonomy as a sunburst diagram that shows the single inheritance hierarchy through a series of rings, that are sliced for each annotation term. Each ring corresponds to a level in the hierarchy. We have ordered the rings so that the class level. Rings are divided based on their hierarchical relationship to the parent slice.
levels = ['class', 'subclass', 'cluster', 'subcluster']
df = {}
# Copy the term order of the parent into each of the level below it.
if term_with_counts.index.name != 'cluster_annotation_term_label':
term_with_counts = term_with_counts.set_index('cluster_annotation_term_label')
term_with_counts['parent_order'] = ""
for idx, row in term_with_counts.iterrows():
if pd.isna(row['parent_term_label']):
continue
term_with_counts.loc[idx, 'parent_order'] = term_with_counts.loc[row['parent_term_label']]['term_order']
term_with_counts = term_with_counts.reset_index()
for lvl in levels:
pred = term_with_counts['cluster_annotation_term_set_name'] == lvl
df[lvl] = term_with_counts[pred]
df[lvl] = df[lvl].sort_values(['parent_order', 'term_order'])
fig, ax = plt.subplots()
fig.set_size_inches(10, 10)
size = 0.15
for i, lvl in enumerate(levels):
if lvl == 'class':
ax.pie(df[lvl]['number_of_subclusters'],
colors=df[lvl]['color_hex_triplet'],
labels = df[lvl]['name'],
rotatelabels=True,
labeldistance=1.025,
radius=1,
wedgeprops=dict(width=size, edgecolor=None),
startangle=0)
else :
ax.pie(df[lvl]['number_of_subclusters'],
colors=df[lvl]['color_hex_triplet'],
radius=1-i*size,
wedgeprops=dict(width=size, edgecolor=None),
startangle=0)
term_with_counts = term_with_counts.set_index('cluster_annotation_term_label')
plt.show()
In the next tutorial, we show how to access and use Deveolping Mouse gene expression data.