MERFISH whole brain spatial transcriptomics (part 1)#

The MERFISH data consists of 59 coronal sections (10 micron thick) covering the whole anterior to posterior extent of the brain (~200 micron apart) from a single adult male mouse. Each section is assayed using the MERSCOPE platform with a 500 gene panel designed to enable us to map cells to the whole mouse brain taxonomy.

The MERSCOPE platform uses combinatorial labeling, sequential imaging and error robust barcodes to obtain quantitative (counts) RNA-expression of 100s-1000s of genes. Each probe binds to a specific transcripts resulting in “dots” in the image. Cells are segmented using CellPose software based on DAPI/PolyT stains.

Gene expression data is stored either as individual sections or all sections in a single file. The puprose of this set of notebooks is to provide and overview of the data, the file organization and how to combine the data and metadata through example use cases.

You need to be connected to the internet to run this notebook and have run through the getting started notebook.

import pandas as pd
from pathlib import Path
import numpy as np
import anndata
import time
import matplotlib.pyplot as plt

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you have downloaded the data in your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest
'releases/20241115/manifest.json'

Data overview#

Cell metadata#

Essential cell metadata is stored as a dataframe. Each row represents one cell indexed by a cell label.

Each cell is associated with a brain section label, donor label, donor genotype and donor sex. Each cell also has a set of x, y, z coordinates generated by rotating each section so that it is upright with midline approximatedly in the middle of the frame. The z coordinate takes into account the phyiscal spacing between each section.

Each cell is mapped to the whole mouse brain taxonomy using a hierachical correlation method, resulting in the assignment of a *cluster alias” and average correlation score.

cell = abc_cache.get_metadata_dataframe(
    directory='MERFISH-C57BL6J-638850',
    file_name='cell_metadata',
    dtype={"cell_label": str}
)
cell.set_index('cell_label', inplace=True)
print("Number of cells = ", len(cell))
cell.head(5)
cell_metadata.csv: 100%|███████████████████████████████████████████████████████████████████████| 710M/710M [00:37<00:00, 18.8MMB/s]
Number of cells =  3938808
brain_section_label cluster_alias average_correlation_score feature_matrix_label donor_label donor_genotype donor_sex x y z abc_sample_id
cell_label
1019171907102340387-1 C57BL6J-638850.37 1408 0.596276 C57BL6J-638850 C57BL6J-638850 wt/wt M 7.226245 4.148963 6.6 c9881423-76a7-4835-ba8b-7942fd384b6b
1104095349101460194-1 C57BL6J-638850.26 4218 0.641180 C57BL6J-638850 C57BL6J-638850 wt/wt M 5.064889 7.309543 4.2 aa815488-6487-4e47-8a5e-d82ac9933bc6
1017092617101450577 C57BL6J-638850.25 4218 0.763531 C57BL6J-638850 C57BL6J-638850 wt/wt M 5.792921 8.189973 4.0 91ef7a85-8e3e-4410-8ee2-785788df3ebe
1018093344101130233 C57BL6J-638850.13 4218 0.558073 C57BL6J-638850 C57BL6J-638850 wt/wt M 3.195950 5.868655 2.4 18991e17-fbd3-4ba0-9c60-1281f56ac520
1019171912201610094 C57BL6J-638850.27 4218 0.591009 C57BL6J-638850 C57BL6J-638850 wt/wt M 5.635732 7.995842 4.4 5e155936-e40d-4c6b-8971-e7fb0079274b

We define a small helper fuction that uses pandas groupby function to see how many unique items are associated for each field and list them out if the number of items is small.

def print_column_info(df):
    
    for c in df.columns:
        grouped = df[[c]].groupby(c).count()
        members = ''
        if len(grouped) < 30:
            members = str(list(grouped.index))
        print("Number of unique %s = %d %s" % (c,len(grouped), members))
print_column_info(cell)
Number of unique brain_section_label = 59 
Number of unique cluster_alias = 5320 
Number of unique average_correlation_score = 3932069 
Number of unique feature_matrix_label = 1 ['C57BL6J-638850']
Number of unique donor_label = 1 ['C57BL6J-638850']
Number of unique donor_genotype = 1 ['wt/wt']
Number of unique donor_sex = 1 ['M']
Number of unique x = 3938808 
Number of unique y = 3938808 
Number of unique z = 59 
Number of unique abc_sample_id = 3938808 

We can also bring in the pivot table from the “cluster annotation tutorial” to associate each cell with terms at each cell type classification level and the corresponding color.

cluster_details = abc_cache.get_metadata_dataframe(
    directory='WMB-taxonomy',
    file_name='cluster_to_cluster_annotation_membership_pivoted',
    keep_default_na=False
)
cluster_details.set_index('cluster_alias', inplace=True)
cluster_details.head(5)
cluster_to_cluster_annotation_membership_pivoted.csv: 100%|████████████████████████████████████| 531k/531k [00:00<00:00, 3.56MMB/s]
neurotransmitter class subclass supertype cluster
cluster_alias
1 Glut 01 IT-ET Glut 018 L2 IT PPP-APr Glut 0082 L2 IT PPP-APr Glut_3 0326 L2 IT PPP-APr Glut_3
2 Glut 01 IT-ET Glut 018 L2 IT PPP-APr Glut 0082 L2 IT PPP-APr Glut_3 0327 L2 IT PPP-APr Glut_3
3 Glut 01 IT-ET Glut 018 L2 IT PPP-APr Glut 0081 L2 IT PPP-APr Glut_2 0322 L2 IT PPP-APr Glut_2
4 Glut 01 IT-ET Glut 018 L2 IT PPP-APr Glut 0081 L2 IT PPP-APr Glut_2 0323 L2 IT PPP-APr Glut_2
5 Glut 01 IT-ET Glut 018 L2 IT PPP-APr Glut 0081 L2 IT PPP-APr Glut_2 0325 L2 IT PPP-APr Glut_2
cluster_colors = abc_cache.get_metadata_dataframe(directory='WMB-taxonomy', file_name='cluster_to_cluster_annotation_membership_color')
cluster_colors.set_index('cluster_alias', inplace=True)
cluster_colors.head(5)
cluster_to_cluster_annotation_membership_color.csv: 100%|██████████████████████████████████████| 239k/239k [00:00<00:00, 1.93MMB/s]
neurotransmitter_color class_color subclass_color supertype_color cluster_color
cluster_alias
1 #2B93DF #FA0087 #0F6632 #266DFF #64661F
2 #2B93DF #FA0087 #0F6632 #266DFF #CCA73D
3 #2B93DF #FA0087 #0F6632 #002BCC #99000D
4 #2B93DF #FA0087 #0F6632 #002BCC #5C8899
5 #2B93DF #FA0087 #0F6632 #002BCC #473D66
cell_extended = cell.join(cluster_details, on='cluster_alias')
cell_extended = cell_extended.join(cluster_colors, on='cluster_alias')
cell_extended.head(5)
brain_section_label cluster_alias average_correlation_score feature_matrix_label donor_label donor_genotype donor_sex x y z ... neurotransmitter class subclass supertype cluster neurotransmitter_color class_color subclass_color supertype_color cluster_color
cell_label
1019171907102340387-1 C57BL6J-638850.37 1408 0.596276 C57BL6J-638850 C57BL6J-638850 wt/wt M 7.226245 4.148963 6.6 ... 04 DG-IMN Glut 038 DG-PIR Ex IMN 0141 DG-PIR Ex IMN_2 0515 DG-PIR Ex IMN_2 #666666 #16f2f2 #3D53CC #CC7A3D #73FFBF
1104095349101460194-1 C57BL6J-638850.26 4218 0.641180 C57BL6J-638850 C57BL6J-638850 wt/wt M 5.064889 7.309543 4.2 ... Glut 23 P Glut 235 PG-TRN-LRN Fat2 Glut 0953 PG-TRN-LRN Fat2 Glut_1 4199 PG-TRN-LRN Fat2 Glut_1 #2B93DF #6b5ca5 #9B7ACC #990041 #663D63
1017092617101450577 C57BL6J-638850.25 4218 0.763531 C57BL6J-638850 C57BL6J-638850 wt/wt M 5.792921 8.189973 4.0 ... Glut 23 P Glut 235 PG-TRN-LRN Fat2 Glut 0953 PG-TRN-LRN Fat2 Glut_1 4199 PG-TRN-LRN Fat2 Glut_1 #2B93DF #6b5ca5 #9B7ACC #990041 #663D63
1018093344101130233 C57BL6J-638850.13 4218 0.558073 C57BL6J-638850 C57BL6J-638850 wt/wt M 3.195950 5.868655 2.4 ... Glut 23 P Glut 235 PG-TRN-LRN Fat2 Glut 0953 PG-TRN-LRN Fat2 Glut_1 4199 PG-TRN-LRN Fat2 Glut_1 #2B93DF #6b5ca5 #9B7ACC #990041 #663D63
1019171912201610094 C57BL6J-638850.27 4218 0.591009 C57BL6J-638850 C57BL6J-638850 wt/wt M 5.635732 7.995842 4.4 ... Glut 23 P Glut 235 PG-TRN-LRN Fat2 Glut 0953 PG-TRN-LRN Fat2 Glut_1 4199 PG-TRN-LRN Fat2 Glut_1 #2B93DF #6b5ca5 #9B7ACC #990041 #663D63

5 rows × 21 columns

print_column_info(cell_extended)
Number of unique brain_section_label = 59 
Number of unique cluster_alias = 5320 
Number of unique average_correlation_score = 3932069 
Number of unique feature_matrix_label = 1 ['C57BL6J-638850']
Number of unique donor_label = 1 ['C57BL6J-638850']
Number of unique donor_genotype = 1 ['wt/wt']
Number of unique donor_sex = 1 ['M']
Number of unique x = 3938808 
Number of unique y = 3938808 
Number of unique z = 59 
Number of unique abc_sample_id = 3938808 
Number of unique neurotransmitter = 10 ['', 'Chol', 'Dopa', 'GABA', 'GABA-Glyc', 'Glut', 'Glut-GABA', 'Hist', 'Nora', 'Sero']
Number of unique class = 34 
Number of unique subclass = 338 
Number of unique supertype = 1201 
Number of unique cluster = 5320 
Number of unique neurotransmitter_color = 10 ['#03EDFF', '#0a9964', '#2B93DF', '#533691', '#666666', '#73E785', '#820e57', '#FF3358', '#fcf04b', '#ff7621']
Number of unique class_color = 34 
Number of unique subclass_color = 338 
Number of unique supertype_color = 1201 
Number of unique cluster_color = 5296 

The cell_extended dataframe created above is available in the cache as cell_metadata_with_cluster_annotation.

Gene panel#

A 500 gene panel was selected to enable faciliate the mapping to the whole mouse brain taxonomy. Each gene is uniquely identifier by an Ensembl ID. It is best practice to gene identifier to for tracking and data interchange as gene symbols are not unique and can change over time. The expression matrices also includes data for 50 “blanks” codewords in the spatial transcriptomic codebook that are not associated with any gene. They are included to measure the false discovery rate of gene detection.

Each row of the gene dataframe has Ensembl gene identifier, the identifier of the transcript the probe targets, a gene symbol and name.

gene = abc_cache.get_metadata_dataframe(directory='MERFISH-C57BL6J-638850', file_name='gene')
gene.set_index('gene_identifier', inplace=True)
print("Number of genes = ", len(gene))
gene.head(5)
gene.csv: 100%|██████████████████████████████████████████████████████████████████████████████| 48.4k/48.4k [00:00<00:00, 1.10MMB/s]
Number of genes =  550

gene_symbol transcript_identifier name mapped_ncbi_identifier
gene_identifier
ENSMUSG00000026778 Prkcq ENSMUST00000028118 protein kinase C, theta NCBIGene:18761
ENSMUSG00000026837 Col5a1 ENSMUST00000028280 collagen, type V, alpha 1 NCBIGene:12831
ENSMUSG00000001985 Grik3 ENSMUST00000030676 glutamate receptor, ionotropic, kainate 3 NCBIGene:14807
ENSMUSG00000039323 Igfbp2 ENSMUST00000047328 insulin-like growth factor binding protein 2 NCBIGene:16008
ENSMUSG00000048387 Osr1 ENSMUST00000057021 odd-skipped related transcription factor 1 NCBIGene:23967

Gene expression matrix#

Expression values for all sections are stored in an anndata h5ad format with minimal metadata. There are two h5ad files one storing the raw counts and the other log normalization of it. In this next section, we provide example code on how to open the file and connect with the rich cell level metadata discussed above.

abc_cache.list_data_files('MERFISH-C57BL6J-638850')
['C57BL6J-638850/log2', 'C57BL6J-638850/raw']
file = abc_cache.get_data_path(directory='MERFISH-C57BL6J-638850', file_name='C57BL6J-638850/log2')
print(file)
C57BL6J-638850-log2.h5ad: 100%|██████████████████████████████████████████████████████████████| 7.63G/7.63G [05:52<00:00, 21.6MMB/s]
/allen/scratch/aibstemp/chris.morrison/src/data/abc_atlas/expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad

We use the anndata’s read_h5ad function to open the package for the log2 normalization file. The “backed=’r’” makes use of the lazy loading functionality to only load required data. By default, anndata will load the entire expression matrix in memory.

adata = anndata.read_h5ad(file, backed='r')
print(adata)
AnnData object with n_obs × n_vars = 4334174 × 550 backed at '/allen/scratch/aibstemp/chris.morrison/src/data/abc_atlas/expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad'
    obs: 'brain_section_label'
    var: 'gene_symbol', 'transcript_identifier'
    uns: 'accessed_on', 'src'

Genes are represented as “variables”. For this data, the var dataframe is indexed by the Ensembl gene identifier with one metadata column gene symbol

print("Number of genes = ", len(adata.var))
adata.var.index[0:5]
Number of genes =  550
Index(['ENSMUSG00000026778', 'ENSMUSG00000026837', 'ENSMUSG00000001985',
       'ENSMUSG00000039323', 'ENSMUSG00000048387'],
      dtype='object', name='gene_identifier')

We define a small helper function plot section to visualize the cells in anatomical context colorized by: neurotransmitter identity, cell types division, class and subclass.

def plot_section(xx, yy, cc = None, val = None, fig_width = 8, fig_height = 8, cmap = None):
    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)
    if cmap is not None:
        plt.scatter(xx, yy, s=0.5, c=val, marker='.', cmap=cmap)
    elif cc is not None:
        plt.scatter(xx, yy, s=0.5, color=cc, marker='.')
    ax.set_ylim(11, 0)
    ax.set_xlim(0, 11)
    ax.axis('equal')
    ax.set_xticks([])
    ax.set_yticks([])
    return fig, ax

Let pick one section “C57BL6J-638850.38” for visualization.

pred = (cell_extended['brain_section_label'] == 'C57BL6J-638850.38')
section = cell_extended[pred]
print("Number of cells in section:",len(section))
Number of cells in section: 120186
fig, ax = plot_section(section['x'], section['y'], section['neurotransmitter_color'])
res = ax.set_title("Neuortransmitter Identity")
plt.show()
../_images/3da5aabb6738c1cde4aa306e923d669a7e242cb367a43dbd6cb07fab3d3d8c52.png
fig, ax = plot_section(section['x'], section['y'], section['class_color'])
res = ax.set_title("Cell Types: Class")
plt.show()
../_images/8efe7ee24ef9c2c60928a4401e7a2d888fe7a775a34fdc9143ecaf2a70fd21b3.png
fig, ax = plot_section(section['x'], section['y'], section['subclass_color'])
res = ax.set_title("Cell Types: Subclass")
plt.show()
../_images/38b2880d3e5e702338f6f81e341f631966828d3514792bb61cc33a13473f41e7.png

Example use cases#

In this section, we explore the same two use cases as with the 10x data. The first example looks at the expression of nine canonical neurotransmitter transporter genes and the second the expression of gene Tac2. To support these use cases, we will create a smaller submatrix (all cells and 10 genes) that read it into memory. Note this operation takes around 10 seconds.

ntgenes = ['Slc17a7', 'Slc17a6', 'Slc17a8', 'Slc32a1', 'Slc6a5', 'Slc6a3', 'Slc6a4']
exgenes = ['Tac2']
gnames = ntgenes + exgenes
pred = [x in gnames for x in adata.var.gene_symbol]
gene_filtered = adata.var[pred]
gene_filtered
gene_symbol transcript_identifier
gene_identifier
ENSMUSG00000030500 Slc17a6 ENSMUST00000032710
ENSMUSG00000037771 Slc32a1 ENSMUST00000045738
ENSMUSG00000025400 Tac2 ENSMUST00000026466
ENSMUSG00000039728 Slc6a5 ENSMUST00000056442
ENSMUSG00000070570 Slc17a7 ENSMUST00000085374
ENSMUSG00000019935 Slc17a8 ENSMUST00000020102
ENSMUSG00000021609 Slc6a3 ENSMUST00000022100
ENSMUSG00000020838 Slc6a4 ENSMUST00000021195
start = time.process_time()
asubset = adata[:, gene_filtered.index].to_memory()
print("time taken: ", time.process_time() - start)
print(asubset)
time taken:  7.094561588000005
AnnData object with n_obs × n_vars = 4334174 × 8
    obs: 'brain_section_label'
    var: 'gene_symbol', 'transcript_identifier'
    uns: 'accessed_on', 'src'

We define two helper functions to (1) create_expression_dataframe: create joined gene expression and cell metadata dataframe based for a set of input genes and (2) aggregate_by_metadata which computes the average expression for each term in a given category.

def create_expression_dataframe(ad, gf):
    gdata = ad[:, gf.index].to_df()
    gdata.columns = gf.gene_symbol
    joined = section.join(gdata)
    return joined
def aggregate_by_metadata(df, gnames, value, sort = False):
    grouped = df.groupby(value)[gnames].mean()
    if sort:
        grouped = grouped.sort_values(by=gnames[0], ascending=False)
    return grouped

Expression of canonical neurotransmitter transporter genes#

During analysis, clusters were assigned neurotransmitter identities based on the expression of of canonical neurotransmitter transporter genes. In this example, we create a dataframe comprising of cells in the ‘C57BL6J-638850.40’ section and expression of the 9 solute carrier family genes. We then group the cells by the assigned neurotransmitter class and compute the mean expression for each group and visualized as a colorized table. The results are similar of the 10x data restricted to cells from the thalamus.

def plot_heatmap(df, fig_width = 8, fig_height = 4, cmap = plt.cm.magma_r):

    arr = df.to_numpy()

    fig, ax = plt.subplots()
    fig.set_size_inches(fig_width, fig_height)

    im = ax.imshow(arr, cmap=cmap, aspect='auto', vmin=0, vmax=5)
    xlabs = df.columns.values
    ylabs = df.index.values

    ax.set_xticks(range(len(xlabs)))
    ax.set_xticklabels(xlabs)

    ax.set_yticks(range(len(ylabs)))
    res = ax.set_yticklabels(ylabs)
    
    return im
pred = [x in ntgenes for x in asubset.var.gene_symbol]
gf = asubset.var[pred]

ntexp = create_expression_dataframe(asubset, gf)
agg = aggregate_by_metadata(ntexp, gf.gene_symbol, 'neurotransmitter')
agg = agg[ntgenes]
im = plot_heatmap(agg, 8, 3)
plt.show()
../_images/531de3d08a1af1da6456049092c86346381a9635fc1e2d110eca237ffec10a3e.png

The expression data can also be visualized as scatter point in anatomical context. Gene Slc17a7 is highly enriched in the cortex , hippocampus, medial habenula with some expression in the thalamus. On the other hand gene Slc32a1 is enriched in the striatum with sparse expression on the cortical plate.

fig, ax = plot_section(ntexp['x'], ntexp['y'], val=ntexp['Slc17a7'], cmap=plt.cm.magma_r)
res = ax.set_title('Gene Slc17a7')
plt.show()
../_images/71350c28618ad44b88ba1bf9b83c24026419e84f49b483935139749d57936cab.png
fig, ax = plot_section(ntexp['x'], ntexp['y'], val=ntexp['Slc32a1'], cmap=plt.cm.magma_r)
res = ax.set_title('Gene Slc32a1')
plt.show()
../_images/067c7662c6047f85cc8fdc6467c52d4b8ee660d076014dfc6201f579ff714e67.png

Expression of Tachykinin 2 (Tac2) in the whole brain#

In mice, the tachykinin 2 (Tac2) gene encodes neuropeptide called neurokinin B (NkB). Tac2 is produced by neurons in specific regions of the brain know to be invovled in emotion and social behavior. Based on ISH data from the Allen Mouse Brain Atlas, Tac 2 is sparsely expressed in the mouse isocortex and densely enriched is specific subcortical regions such the medial habenula (MH), the amygdala and hypothalamus.

In this example, we create a dataframe comprising of cells in section ‘C57BL6J-638850.40’ and expression values of Tac2 in those cells.

gf = asubset.var[asubset.var.gene_symbol == 'Tac2']
tac2 = create_expression_dataframe(asubset, gf)

Grouping cells by neurotransmitter identites and computing the mean expression in each group, we can observed that Tac2 gene is highly enriched in cholinergic cell types with some expression in Glut-GABA types.

agg = aggregate_by_metadata(tac2, gf.gene_symbol, 'neurotransmitter', True).head(8)
im = plot_heatmap(agg, 1, 3)
plt.show()
/tmp/ipykernel_28021/2280752953.py:4: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  grouped = grouped.sort_values(by=gnames[0], ascending=False)
../_images/6c1cd1b070cbf0af348fc43d02468efe494f218d0b7686f3ec5c6d07b04f71a6.png

Grouping cells by cell types class, shows that Tac2 is enriched in class “16 MH-LH Glut” with cells restricted to the medial (MH) and lateral (LH) habenula and a mixture of glutamatergic and cholinergic types.

agg = aggregate_by_metadata(tac2, gf.gene_symbol, 'class', True).head(8)
im = plot_heatmap(agg, 1, 3)
plt.show()
/tmp/ipykernel_28021/2280752953.py:4: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  grouped = grouped.sort_values(by=gnames[0], ascending=False)
../_images/f48e64de9d6c2306cf0426ed6f8bff89546e07585366fa4285f145126339d2b6.png

At the next level, grouping by subclass, shows Tac2 is enriched in sublcasses with specificity to the medial habenula (MH), bed muclei of the stria terminalis (BST) and main olfactory blub (MOB).

agg = aggregate_by_metadata(tac2, gf.gene_symbol, 'subclass', True).head(8)
im = plot_heatmap(agg, 1, 3)
plt.show()
/tmp/ipykernel_28021/2280752953.py:4: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  grouped = grouped.sort_values(by=gnames[0], ascending=False)
../_images/b306c5506ddf5e430afd5d5c209cf0f611a427b63e6c20c19c134894f37e5359.png

Plotting Tac2 expression in anatomical coordinates results in a image that is comparable to the ISH image data showing enriched expression in the medial habenula, amydala and hypothalamus and sparse expression in the cortex.

fig,ax = plot_section(tac2['x'], tac2['y'], val=tac2['Tac2'], cmap=plt.cm.magma_r)
res = ax.set_title('Gene Tac2')
plt.show()
../_images/526a304a7d0a2d74ca9d60fb0693e93665cb56a84b0919aee259bb4ef7452703.png

Close h5ad file and clean up

adata.file.close()
del adata