Using cells selected in the ABC Atlas with Gene Expression data.

The Allen Brain Cell (ABC) Atlas is a powerful tool for visualizing the data that are used throughout these notebooks. One particular way users can interact with the ABC Atlas is by selecting and downloading specific cells from either a part of the taxonomy UMAP or a brain region they are interested in. This notebook will show how to combine cells selected from the ABC Atlas with the gene expression data and metadata used in these notebooks.

Note that the examples in this notebook can only be used with the Whole Mouse Brain (WMB) MERFISH and 10X data, the Zhuang et al. MERFISH data (1-4), and the Whole Human Brain (WHB) 10X data. The Seattle Alzheimer’s Disease (SEA-AD) dataset is not currently available through this repository (Data and documentation for SEA-AD are available here).

For this notebook, the user should:

Have some familiarity with using the ABC Atlas. If you are unsure of any steps used to manipulate data in the atlas visualization, you can refer to the user guide here.
Be using manifest versions equal to or greater than 20241115. Instructions on listing and changing versions can be found in the Getting Started notebook.
Be connected to the internet.
Have run through the Getting Started and other notebooks in this repo.

Initializing required modules and instantiating the AbcProjectCache.

import anndata
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from PIL import Image

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a Pandas DataFrame. See the getting_started notebook for more details on using the cache, including how to install it if you have not already done so.

Change the download_base variable to the location where you have downloaded the data on your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20241130/manifest.json'
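The version requirement listed at the top of this notebook (20241115 or later) can be checked against the string returned by current_manifest. The sketch below assumes the manifest path always has the form releases/<YYYYMMDD>/manifest.json, as in the output above. The helper manifest_is_at_least is hypothetical and is not part of abc_atlas_access.

```python
def manifest_is_at_least(manifest_path, minimum="20241115"):
    """Return True if the release date in manifest_path is >= minimum.

    Assumes a path of the form 'releases/<YYYYMMDD>/manifest.json'.
    """
    parts = manifest_path.split("/")
    if len(parts) != 3 or not (parts[1].isdigit() and len(parts[1]) == 8):
        raise ValueError(f"Unexpected manifest path: {manifest_path!r}")
    # Zero-padded YYYYMMDD strings sort the same way as the dates they encode.
    return parts[1] >= minimum


print(manifest_is_at_least("releases/20241130/manifest.json"))
```

In this notebook the call would be manifest_is_at_least(abc_cache.current_manifest).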

Selecting Data in the ABC Atlas.

For this example we’ll be looking at the WMB MERFISH dataset within the ABC Atlas. We’ll filter the data by a specific subclass and region and then select cells from a specific MERFISH slice. You can refer to this linked view of the atlas to see the cells we will be selecting.

The cells we select come from the L5 ET CTX Glut subclass. We’ll also filter to keep only cells in the VISpm sub-region of the Isocortex. This initial selection leaves 325 cells that match our filters. We’ll use the select tool (dashed box in the upper corner of the linked ABC Atlas view) to select the cells in section 31 of our MERFISH data. Refer to the image below to see our selection and where these selected cells reside (in addition to the above link).

image = Image.open('data/abc_atlas_selection_merfish.png')
image

[Figure: screenshot of the ABC Atlas showing the filtered cells and our selection in MERFISH section 31.]

This gives us 94 cells in our final selection. We download the file by clicking the download arrow on the far right just above our listed, selected cells. This specific selection is provided in the file ABC_Atlas_Class_01 IT-ET Glut_cells_2024_10_23_16_42.csv that is packaged in this repo with the notebooks.

Combining with the cell metadata.

Now that we have our subset of cells selected from the ABC Atlas, we can combine them with the cell metadata provided by the AbcProjectCache object.

The cell_metadata files in the project directories MERFISH-C57BL6J-638850, WHB-10Xv3, WMB-10X, and Zhuang-ABCA-[1-4] contain a column, abc_sample_id, that allows data downloaded from an ABC Atlas visualization to be merged in. Note that only the files named cell_metadata contain this id. Other derived metadata tables, such as cell_metadata_with_cluster_annotation, do not contain it, though it can easily be merged in from cell_metadata.
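As a minimal sketch of what merging the id into a derived table could look like, the toy example below joins abc_sample_id onto a second table. It assumes both tables can be indexed by cell_label; the DataFrames and their values are made-up stand-ins, not real atlas data.

```python
import pandas as pd

# Toy stand-in for the cell_metadata table (the one carrying abc_sample_id).
cell_metadata = pd.DataFrame(
    {"abc_sample_id": ["id-a", "id-b", "id-c"]},
    index=pd.Index(["cell-1", "cell-2", "cell-3"], name="cell_label"),
)

# Toy stand-in for a derived table that shares the cell_label index
# but lists the cells in a different order.
cell_extended = pd.DataFrame(
    {"subclass": ["022 L5 ET CTX Glut", "022 L5 ET CTX Glut", "some other subclass"]},
    index=pd.Index(["cell-3", "cell-1", "cell-2"], name="cell_label"),
)

# Index-on-index left join: rows are aligned by cell_label, not by position.
cell_extended = cell_extended.join(cell_metadata[["abc_sample_id"]], how="left")
print(cell_extended)
```

A left join keeps every row of the derived table and preserves its order, so cells without a matching id would simply get a missing value.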

Here we’ll just display a few columns of the cell_metadata table. For specifics on the MERFISH metadata and gene expression data used in this example, see the notebooks related to this MERFISH dataset.

cell_metadata = abc_cache.get_metadata_dataframe(
    directory='MERFISH-C57BL6J-638850',
    file_name='cell_metadata',
    dtype={"cell_label": str},
)
cell_metadata.set_index('cell_label', inplace=True)
cell_metadata.head()[['brain_section_label', 'abc_sample_id']]

cell_metadata.csv: 100%|██████████████████████████████████████████████████████████████████████████████████| 710M/710M [00:27<00:00, 26.0MMB/s]

	brain_section_label	abc_sample_id
cell_label
1019171907102340387-1	C57BL6J-638850.37	c9881423-76a7-4835-ba8b-7942fd384b6b
1104095349101460194-1	C57BL6J-638850.26	aa815488-6487-4e47-8a5e-d82ac9933bc6
1017092617101450577	C57BL6J-638850.25	91ef7a85-8e3e-4410-8ee2-785788df3ebe
1018093344101130233	C57BL6J-638850.13	18991e17-fbd3-4ba0-9c60-1281f56ac520
1019171912201610094	C57BL6J-638850.27	5e155936-e40d-4c6b-8971-e7fb0079274b

Now we load our selected cells. Note that we rename the Sample Id column to match the one in our cell metadata.

# Change path location below to either your own downloaded cells
# or the location of the csv from this repo (abc_atlas_access/notebooks/ABC_Atlas_Class_01 IT-ET Glut_cells_2024_10_23_16_42.csv)
downloaded_cells_path = Path('data/ABC_Atlas_Class_01 IT-ET Glut_cells_2024_10_23_16_42.csv')

abc_atlas_selection = pd.read_csv(downloaded_cells_path)
abc_atlas_selection.rename(
    columns={'Sample Id': 'abc_sample_id'},
    inplace=True
)
abc_atlas_selection.set_index('abc_sample_id', inplace=True)
abc_atlas_selection.head()

	Supertype	Section Label	Neurotransmitter Type	Anatomical Division	Cluster	Class	Anatomical Structure	Subclass	Anatomical Substructure
abc_sample_id
e23109f2-b70b-48a0-b29f-f4cfd24025ac	0090 L5 ET CTX Glut_1	C57BL6J-638850.31	Glut	Isocortex	0352 L5 ET CTX Glut_1	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
d42c7e04-95f7-4134-93e8-66acc1a52aa3	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0359 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
cfc13cd0-adf7-41c4-a849-f1472ea31567	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0359 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
e98716b6-15c2-4f54-8717-589b0da6d246	0090 L5 ET CTX Glut_1	C57BL6J-638850.31	Glut	Isocortex	0352 L5 ET CTX Glut_1	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
a28ebb39-2849-49ae-80e4-ee135fce1890	0092 L5 ET CTX Glut_3	C57BL6J-638850.31	Glut	Isocortex	0369 L5 ET CTX Glut_3	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5

Now we can use pandas to inner join the two tables, leaving only the subset of 94 cells that we selected in the ABC Atlas visualization.

selected_cells = cell_metadata.join(abc_atlas_selection, on='abc_sample_id', how='inner')
print(len(selected_cells))
selected_cells.head()

	brain_section_label	cluster_alias	average_correlation_score	feature_matrix_label	donor_label	donor_genotype	donor_sex	x	y	z	abc_sample_id	Supertype	Section Label	Neurotransmitter Type	Anatomical Division	Cluster	Class	Anatomical Structure	Subclass	Anatomical Substructure
cell_label
1018093345102260085	C57BL6J-638850.31	384	0.651028	C57BL6J-638850	C57BL6J-638850	wt/wt	M	4.087788	3.208372	5.4	21c31354-1979-4b93-b6f2-acaefaa22842	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0360 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
1018093345101290596	C57BL6J-638850.31	384	0.737842	C57BL6J-638850	C57BL6J-638850	wt/wt	M	6.753508	3.045020	5.4	484abbf0-56d4-4622-add2-24920cd3cd05	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0360 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
1018093345102270314	C57BL6J-638850.31	393	0.685347	C57BL6J-638850	C57BL6J-638850	wt/wt	M	3.936859	3.295320	5.4	147ac3d0-b2f2-4e1c-bd00-e1b740b242bc	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0364 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
1018093345102260156	C57BL6J-638850.31	383	0.808770	C57BL6J-638850	C57BL6J-638850	wt/wt	M	4.042362	3.188116	5.4	eaf8a6aa-2ccd-474b-ac40-5bcbc705e2eb	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0359 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
1018093345101290474	C57BL6J-638850.31	383	0.536743	C57BL6J-638850	C57BL6J-638850	wt/wt	M	6.833381	2.998880	5.4	e4874d7b-6754-40ff-8792-c0b7627f2bce	0091 L5 ET CTX Glut_2	C57BL6J-638850.31	Glut	Isocortex	0359 L5 ET CTX Glut_2	01 IT-ET Glut	VISpm	022 L5 ET CTX Glut	VISpm5
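After a join like this one, it can be worth confirming that every downloaded id found a match, since an inner join silently drops ids that are missing from the metadata. The following is a small sketch on toy data; find_unmatched_ids is a hypothetical helper, and the ids are invented for illustration.

```python
import pandas as pd


def find_unmatched_ids(selection, metadata, id_column="abc_sample_id"):
    """Return ids in selection's index that never appear in metadata[id_column]."""
    matched = selection.index.isin(metadata[id_column])
    return selection.index[~matched].tolist()


# Toy metadata indexed by cell_label, with the id stored as a column.
metadata = pd.DataFrame(
    {"abc_sample_id": ["id-a", "id-b", "id-c"]},
    index=pd.Index(["cell-1", "cell-2", "cell-3"], name="cell_label"),
)
# Toy selection indexed by abc_sample_id; "id-z" has no counterpart.
selection = pd.DataFrame(
    {"Subclass": ["x", "x"]},
    index=pd.Index(["id-b", "id-z"], name="abc_sample_id"),
)

joined = metadata.join(selection, on="abc_sample_id", how="inner")
print(len(joined))
print(find_unmatched_ids(selection, metadata))
```

An empty list from the helper means the number of joined rows should equal the number of downloaded cells.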

Let’s do a quick plot to show that we can roughly reproduce the view of MERFISH section 31.

def plot_section(xx, yy, cc=None, val=None, fig_width=8, fig_height=8, cmap=None, fig=None, ax=None):
    # Create a new figure only if the caller did not pass one in.
    if fig is None or ax is None:
        fig, ax = plt.subplots()
        fig.set_size_inches(fig_width, fig_height)
    # Draw on the supplied axes rather than on whichever axes is current.
    if cmap is not None:
        ax.scatter(xx, yy, s=0.5, c=val, marker='.', cmap=cmap)
    elif cc is not None:
        ax.scatter(xx, yy, s=0.5, color=cc, marker='.')
    ax.set_ylim(11, 0)
    ax.set_xlim(0, 11)
    ax.axis('equal')
    ax.set_xticks([])
    ax.set_yticks([])
    return fig, ax

pred = (cell_metadata['brain_section_label'] == 'C57BL6J-638850.31')
section = cell_metadata[pred]
fig, ax = plt.subplots()
fig.set_size_inches(8, 8)
# Plot all cells as grey.
fig_all, ax_all = plot_section(section['x'], section['y'], cc='#D3D3D3', fig=fig, ax=ax)
# Plot our selected cells as red
fig_sel, ax_sel = plot_section(selected_cells['x'], selected_cells['y'], cc='red', fig=fig, ax=ax)
plt.show()

[Figure: MERFISH section 31 with all cells plotted in grey and our selected cells in red.]

Using the matched data in a cell by gene matrix.

Now that we have our matched metadata, we can simply use the index of our final combined Pandas DataFrame to select rows from our cell by gene data. Since we are working with data from an individual MERFISH slice, we can download that slice directly. If you are doing this with the 10X data, you should use the get_gene_data function described in the general_accessing_10x_snRNASeq_tutorial notebook.

First we need to load our gene list giving a mapping from identifier to symbol.

gene = abc_cache.get_metadata_dataframe(directory='MERFISH-C57BL6J-638850', file_name='gene').set_index('gene_identifier')
gene.head()

gene.csv: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 48.4k/48.4k [00:00<00:00, 482kMB/s]

	gene_symbol	transcript_identifier	name	mapped_ncbi_identifier
gene_identifier
ENSMUSG00000026778	Prkcq	ENSMUST00000028118	protein kinase C, theta	NCBIGene:18761
ENSMUSG00000026837	Col5a1	ENSMUST00000028280	collagen, type V, alpha 1	NCBIGene:12831
ENSMUSG00000001985	Grik3	ENSMUST00000030676	glutamate receptor, ionotropic, kainate 3	NCBIGene:14807
ENSMUSG00000039323	Igfbp2	ENSMUST00000047328	insulin-like growth factor binding protein 2	NCBIGene:16008
ENSMUSG00000048387	Osr1	ENSMUST00000057021	odd-skipped related transcription factor 1	NCBIGene:23967

Now we’ll load and process the data. Selecting by cell id in our data files (even large ones) is mostly safe on machines with limited memory, provided one uses the backed option and doesn’t attempt to load too many cells (fewer than roughly 1000, say). For larger files, the user will want to be careful, as most of the anndata files we release cannot be loaded into memory on a smaller computer. For an example of doing this, or of loading 10X data that has been divided across multiple files, see the General Accessing 10X data notebook.

# Load the anndata, cell by gene matrix.
file_path = abc_cache.get_data_path(
    directory='MERFISH-C57BL6J-638850-sections',
    file_name='C57BL6J-638850.31/log2'
)
adata = anndata.read_h5ad(file_path, backed='r')
# Create a landing Pandas DataFrame for the data.
selected_cell_gene_data = pd.DataFrame(index=selected_cells.index,
                                       columns=adata.var.index)

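# Find the subset of cells we selected in the full matrix.
mask = adata.obs.index.isin(selected_cells.index)
# Fill our DataFrame with the expression values of the matched cells.
# adata.X[mask] returns rows in adata.obs order, so label them with the
# matching obs index instead of assuming the order of selected_cells.
selected_cell_gene_data.loc[adata.obs.index[mask], adata.var.index] = adata.X[mask].toarray()
selected_cell_gene_data.columns = gene.gene_symbol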

C57BL6J-638850.31-log2.h5ad: 100%|████████████████████████████████████████████████████████████████████████| 190M/190M [00:05<00:00, 32.6MMB/s]
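One detail worth keeping in mind with mask-based selection: a boolean mask returns rows in the order they are stored in the matrix, which need not be the order of the DataFrame index used to build the mask. The toy sketch below uses a small NumPy array in place of the expression matrix to show a way of keeping labels and values aligned; all names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy matrix: 4 cells (rows) x 2 genes (columns), stored in obs order.
obs_index = pd.Index(["cell-1", "cell-2", "cell-3", "cell-4"])
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# The selection lists the cells in a different order than the matrix.
selected_index = pd.Index(["cell-4", "cell-2"])

mask = obs_index.isin(selected_index)
# Label the masked rows with the obs labels they actually came from ...
values = pd.DataFrame(X[mask], index=obs_index[mask], columns=["gene-a", "gene-b"])
# ... then reorder the rows to match the selection.
values = values.loc[selected_index]
print(values)
```

Labelling the rows before reordering means the result is correct whether or not the two orders happen to agree.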

selected_cell_gene_data

gene_symbol	Prkcq	Col5a1	Grik3	Igfbp2	Osr1	Syt6	Cntnap3	Lmo3	Ntn1	Otp	...	Blank-16	Blank-32	Blank-18	Blank-4	Blank-28	Blank-33	Blank-34	Blank-45	Blank-23	Blank-48
cell_label
1018093345102260085	0.0	0.0	0.900271	1.847696	0.0	0.900271	0.0	0.900271	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1018093345101290596	0.0	0.0	1.498822	0.685901	0.0	0.0	0.0	1.148877	1.780236	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1018093345102270314	0.0	0.0	1.429679	1.429679	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1018093345102260156	0.0	0.0	2.150413	1.907521	0.0	0.0	0.0	0.755238	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.755238	0.0
1018093345101290474	0.0	0.0	1.591793	0.0	0.0	0.0	1.591793	1.591793	0.0	1.591793	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1018093345102270545	0.0	0.0	0.663865	0.0	0.0	0.663865	0.0	1.116793	0.0	0.0	...	0.0	0.663865	0.0	1.116793	0.0	0.0	0.0	0.0	0.0	0.0
1018093345102270307	0.0	0.0	1.264144	2.094565	0.0	0.553146	0.0	0.952008	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1018093345102270242	0.0	0.0	1.782628	0.0	0.0	0.0	0.0	1.782628	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1018093345101290541	0.0	0.0	0.0	0.0	0.866036	0.0	0.0	1.403429	0.0	0.0	...	0.0	0.0	0.0	0.0	0.866036	0.0	0.0	0.0	0.0	0.0
1018093345101290626	0.0	0.0	2.860291	0.0	1.626302	0.0	0.0	1.626302	1.257898	0.0	...	0.0	0.0	0.0	0.0	0.761911	0.0	0.0	0.0	0.0	0.0

94 rows × 550 columns
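With the expression values in a DataFrame, simple per-gene summaries are one line each. The sketch below uses a toy table with invented values. It also converts the values to float first, on the assumption that a DataFrame created empty and filled afterwards may still hold object dtype.

```python
import pandas as pd

# Toy stand-in for selected_cell_gene_data; object dtype mimics a DataFrame
# that was created empty and filled afterwards.
expression = pd.DataFrame(
    {"Grik3": [0.9, 1.5, 1.4], "Igfbp2": [1.8, 0.6, 1.0], "Prkcq": [0.0, 0.0, 0.0]},
    index=pd.Index(["cell-1", "cell-2", "cell-3"], name="cell_label"),
    dtype=object,
)

# Numeric dtype makes the reductions below fast and well defined.
expression = expression.astype(float)

# Mean expression per gene, highest first.
mean_expression = expression.mean(axis=0).sort_values(ascending=False)
# Fraction of selected cells with any detected expression of each gene.
fraction_expressing = (expression > 0).mean(axis=0)

print(mean_expression.index[0])
print(fraction_expressing)
```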

This process can be repeated for all other ABC Atlas visualizations (except SEA-AD, which is not currently released through this package). Here is a list of datasets in the AbcProjectCache that contain the abc_sample_id, with links to their ABC Atlas visualizations: