10x RNA-seq gene expression data (part 2a)#

In part 1, we explore two examples looking at the expression of canonical neurotransmitter transporter genes and gene Tac2 in the thalamus. In this notebook, we will prepare data so that we can repeat the examples for all cells spanning the whole brain. This notebook takes ~ 5 minutes to run.

The results from this notebook has already been cached and saved. As such, you can skip this notebook and continue with part 2b.

You need to be connected to the internet to run this notebook and that you have downloaded the all the log2 expression matrices associated with both the WMB-10Xv2, WMB-10Xv3 and WMB-10XMulti datasets.

import os
import pandas as pd
import numpy as np
import anndata
import time
import json
import requests

The prerequisite for running this notebook is that the data have been downloaded to local directory maintaining the organization from the manifest.json. Change the download_base variable to where you have downloaded the data in your system.

version = '20231215'
download_base = '../../abc_download_root'

use_local_cache = False
manifest_path = 'releases/%s/manifest.json' % version

if not use_local_cache :
    url = 'https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/' + manifest_path
    manifest = json.loads(requests.get(url).text)
else :
    file = os.path.join(download_base,manifest_path)
    with open(file,'rb') as f:
        manifest = json.load(f)
    
metadata = manifest['file_listing']['WMB-10X']['metadata']
view_directory = os.path.join( download_base, 
                               manifest['directory_listing']['WMB-10X']['directories']['metadata']['relative_path'], 
                              'views')
cache_views = False
if cache_views :
    os.makedirs( view_directory, exist_ok=True )
rpath = metadata['cell_metadata']['files']['csv']['relative_path']
file = os.path.join( download_base, rpath)
cell = pd.read_csv(file)
cell.set_index('cell_label',inplace=True)

Gene expression matrices#

The large 4 million cell dataset has been divided into 23 packages to make data transfer and download more efficient. Each package is formatted as annadata h5ad file with minimal metadata. In this next section, we provide example code on how to open the file and connect with the rich cell level metadata discussed above.

For each subset, there are two h5ad files one storing the raw counts and the other log normalization of it.

matrices = cell.groupby(['dataset_label','feature_matrix_label'])[['library_label']].count()
matrices.columns  = ['cell_count']
matrices
cell_count
dataset_label feature_matrix_label
WMB-10XMulti WMB-10XMulti 1687
WMB-10Xv2 WMB-10Xv2-CTXsp 43985
WMB-10Xv2-HPF 207281
WMB-10Xv2-HY 99879
WMB-10Xv2-Isocortex-1 248776
WMB-10Xv2-Isocortex-2 249360
WMB-10Xv2-Isocortex-3 249356
WMB-10Xv2-Isocortex-4 248784
WMB-10Xv2-MB 29781
WMB-10Xv2-OLF 192182
WMB-10Xv2-TH 130555
WMB-10Xv3 WMB-10Xv3-CB 181723
WMB-10Xv3-CTXsp 78223
WMB-10Xv3-HPF 181055
WMB-10Xv3-HY 162296
WMB-10Xv3-Isocortex-1 227670
WMB-10Xv3-Isocortex-2 227537
WMB-10Xv3-MB 337101
WMB-10Xv3-MY 191746
WMB-10Xv3-OLF 88560
WMB-10Xv3-P 143157
WMB-10Xv3-PAL 108046
WMB-10Xv3-STR 283782
WMB-10Xv3-TH 130454

Example use cases#

In this section, we explore two use cases. The first example looks at the expression of nine canonical neurotransmitter transporter genes and the second the expression of gene Tac2.

To support these use cases, we will create a smaller submatrix (all cells and 10 genes) and write to file for resue in part 2b. Note this operation takes around 5 minutes.

matrix_label = matrices.index[0][1]
dataset_label = matrices.index[0][0]

expression_matrices = manifest['file_listing'][dataset_label]['expression_matrices']
print(expression_matrices[matrix_label])
{'log2': {'files': {'h5ad': {'version': '20230830', 'relative_path': 'expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad', 'size': 89318511}}}, 'raw': {'files': {'h5ad': {'version': '20230830', 'relative_path': 'expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-raw.h5ad', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-raw.h5ad', 'size': 132220015}}}}
rpath = expression_matrices[matrix_label]['log2']['files']['h5ad']['relative_path']
file = os.path.join( download_base, rpath)
print(file)
../../abc_download_root/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad
ad = anndata.read_h5ad(file,backed='r')
gene = ad.var
ntgenes = ['Slc17a7','Slc17a6','Slc17a8','Slc32a1','Slc6a5','Slc18a3','Slc6a3','Slc6a4','Slc6a2']
exgenes = ['Tac2']
gnames = ntgenes + exgenes
pred = [x in gnames for x in gene.gene_symbol]
gene_filtered = gene[pred]
gene_filtered
gene_symbol
gene_identifier
ENSMUSG00000037771 Slc32a1
ENSMUSG00000070570 Slc17a7
ENSMUSG00000039728 Slc6a5
ENSMUSG00000030500 Slc17a6
ENSMUSG00000055368 Slc6a2
ENSMUSG00000019935 Slc17a8
ENSMUSG00000025400 Tac2
ENSMUSG00000020838 Slc6a4
ENSMUSG00000021609 Slc6a3
ENSMUSG00000100241 Slc18a3
# create empty gene expression dataframe
gdata = pd.DataFrame(index=cell.index,columns=gene_filtered.index)
count = 0
total_start = time.process_time()

for matindex in matrices.index :
    
    ds = matindex[0]
    mp = matindex[1]
    
    print(mp)
    
    expression_matrices = manifest['file_listing'][ds]['expression_matrices']
    rpath = expression_matrices[mp]['log2']['files']['h5ad']['relative_path']
    file = os.path.join( download_base, rpath)
    
    pred = (cell['feature_matrix_label'] == mp)
    cell_filtered = cell[pred]
    
    start = time.process_time()
    ad = anndata.read_h5ad(file,backed='r')
    exp = ad[cell_filtered.index, gene_filtered.index].to_df()
    gdata.loc[ exp.index, gene_filtered.index ] = exp
    print(" - time taken: ", time.process_time() - start)
    
    ad.file.close()
    del ad
    
    count += 1
    
    #if count > 2 :
    #    break
        
print("total time taken: ", time.process_time() - total_start)
    
WMB-10XMulti
 - time taken:  1.9875723679999986
WMB-10Xv2-CTXsp
 - time taken:  1.736294100000002
WMB-10Xv2-HPF
 - time taken:  5.911806116000001
WMB-10Xv2-HY
 - time taken:  2.839109353999998
WMB-10Xv2-Isocortex-1
 - time taken:  8.228330416999999
WMB-10Xv2-Isocortex-2
 - time taken:  9.044668807999997
WMB-10Xv2-Isocortex-3
 - time taken:  9.986195581999993
WMB-10Xv2-Isocortex-4
 - time taken:  11.247832531
WMB-10Xv2-MB
 - time taken:  0.8962304950000117
WMB-10Xv2-OLF
 - time taken:  5.684649880000009
WMB-10Xv2-TH
 - time taken:  4.081716135999997
WMB-10Xv3-CB
 - time taken:  7.552754073000003
WMB-10Xv3-CTXsp
 - time taken:  3.717788982000002
WMB-10Xv3-HPF
 - time taken:  9.632554069000008
WMB-10Xv3-HY
 - time taken:  8.718123004000006
WMB-10Xv3-Isocortex-1
 - time taken:  15.792058513
WMB-10Xv3-Isocortex-2
 - time taken:  9.76483662599999
WMB-10Xv3-MB
 - time taken:  18.491481762000006
WMB-10Xv3-MY
 - time taken:  6.938539325000022
WMB-10Xv3-OLF
 - time taken:  2.9817742030000147
WMB-10Xv3-P
 - time taken:  4.906559042999987
WMB-10Xv3-PAL
 - time taken:  4.105939183000004
WMB-10Xv3-STR
 - time taken:  17.826632974000006
WMB-10Xv3-TH
 - time taken:  7.132106335000003
total time taken:  184.041671802
# change columns from index to gene symbol
gdata.columns = gene_filtered.gene_symbol
pred = pd.notna(gdata[gdata.columns[0]])
gdata = gdata[pred].copy(deep=True)
print(len(gdata))
4042976
if cache_views :
    file = os.path.join( view_directory, 'example_genes_all_cells_expression.csv')
    gdata.to_csv( file )