10x RNA-seq gene expression data (part 2a)

10x RNA-seq gene expression data (part 2a)#

In part 1, we explore two examples looking at the expression of canonical neurotransmitter transporter genes and gene Tac2 in the thalamus. In this notebook, we will prepare data so that we can repeat the examples for all cells spanning the whole brain. This notebook takes ~ 5 minutes to run.

The results from this notebook has already been cached and saved. As such, you can skip this notebook and continue with part 2b. A cleaner example of accessing genes from the expression matricies can also be found in the general_accessing_10x_snRNASeq_tutorial.ipynb notebook.

You need to be connected to the internet to run this notebook and have run through the getting started notebook.

import pandas as pd
from pathlib import Path
import numpy as np
import anndata
import time

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you have downloaded the data in your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20241130/manifest.json'

cell = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='cell_metadata')
cell.set_index('cell_label', inplace=True)

Gene expression matrices#

The large 4 million cell dataset has been divided into 23 packages to make data transfer and download more efficient. Each package is formatted as annadata h5ad file with minimal metadata. In this next section, we provide example code on how to open the file and connect with the rich cell level metadata discussed above.

For each subset, there are two h5ad files one storing the raw counts and the other log normalization of it.

matrices = cell.groupby(['dataset_label', 'feature_matrix_label'])[['library_label']].count()
matrices.columns  = ['cell_count']
matrices

		cell_count
dataset_label	feature_matrix_label
WMB-10XMulti	WMB-10XMulti	1687
WMB-10Xv2	WMB-10Xv2-CTXsp	43985
	WMB-10Xv2-HPF	207281
	WMB-10Xv2-HY	99879
	WMB-10Xv2-Isocortex-1	248776
	WMB-10Xv2-Isocortex-2	249360
	WMB-10Xv2-Isocortex-3	249356
	WMB-10Xv2-Isocortex-4	248784
	WMB-10Xv2-MB	29781
	WMB-10Xv2-OLF	192182
	WMB-10Xv2-TH	130555
WMB-10Xv3	WMB-10Xv3-CB	181723
	WMB-10Xv3-CTXsp	78223
	WMB-10Xv3-HPF	181055
	WMB-10Xv3-HY	162296
	WMB-10Xv3-Isocortex-1	227670
	WMB-10Xv3-Isocortex-2	227537
	WMB-10Xv3-MB	337101
	WMB-10Xv3-MY	191746
	WMB-10Xv3-OLF	88560
	WMB-10Xv3-P	143157
	WMB-10Xv3-PAL	108046
	WMB-10Xv3-STR	283782
	WMB-10Xv3-TH	130454

Example use cases#

In this section, we explore two use cases. The first example looks at the expression of nine canonical neurotransmitter transporter genes and the second the expression of gene Tac2.

To support these use cases, we will create a smaller submatrix (all cells and 10 genes) and write to file for resue in part 2b. Note this operation takes around 5 minutes.

A cleaner example of loading data from the expression matricies can be found in the general_accessing_10x_snRNASeq_tutorial.ipynb notebook.

abc_cache.list_data_files('WMB-10XMulti')

['WMB-10XMulti/log2', 'WMB-10XMulti/raw']

file = abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti/log2')
print(file)

/Users/chris.morrison/src/data/abc_atlas/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad

ad = anndata.read_h5ad(file,backed='r')
gene = ad.var

ntgenes = ['Slc17a7', 'Slc17a6', 'Slc17a8', 'Slc32a1', 'Slc6a5', 'Slc18a3', 'Slc6a3', 'Slc6a4', 'Slc6a2']
exgenes = ['Tac2']
gnames = ntgenes + exgenes
pred = [x in gnames for x in gene.gene_symbol]
gene_filtered = gene[pred]
gene_filtered

	gene_symbol
gene_identifier
ENSMUSG00000037771	Slc32a1
ENSMUSG00000070570	Slc17a7
ENSMUSG00000039728	Slc6a5
ENSMUSG00000030500	Slc17a6
ENSMUSG00000055368	Slc6a2
ENSMUSG00000019935	Slc17a8
ENSMUSG00000025400	Tac2
ENSMUSG00000020838	Slc6a4
ENSMUSG00000021609	Slc6a3
ENSMUSG00000100241	Slc18a3

# create empty gene expression dataframe
gdata = pd.DataFrame(index=cell.index, columns=gene_filtered.index)
count = 0
total_start = time.process_time()

for matindex in matrices.index:
    
    ds = matindex[0]
    mp = matindex[1]
    
    print(mp)
    
    file = abc_cache.get_data_path(directory=ds, file_name=mp + '/log2')
    
    pred = (cell['feature_matrix_label'] == mp)
    cell_filtered = cell[pred]
    
    start = time.process_time()
    ad = anndata.read_h5ad(file, backed='r')
    exp = ad[cell_filtered.index, gene_filtered.index].to_df()
    gdata.loc[ exp.index, gene_filtered.index ] = exp
    print(" - time taken: ", time.process_time() - start)
    
    ad.file.close()
    del ad
    
    count += 1
    
    if count > 2 :
       break
        
print("total time taken: ", time.process_time() - total_start)
    

WMB-10XMulti
 - time taken:  0.14998999999999896
WMB-10Xv2-CTXsp

WMB-10Xv2-CTXsp-log2.h5ad: 100%|████████████████████████████████████████████████████████████████████████| 1.74G/1.74G [01:01<00:00, 28.3MMB/s]

 - time taken:  2.0309760000000026
WMB-10Xv2-HPF

WMB-10Xv2-HPF-log2.h5ad: 100%|██████████████████████████████████████████████████████████████████████████| 6.10G/6.10G [03:46<00:00, 26.9MMB/s]

 - time taken:  8.495839999999987
total time taken:  129.784312

# change columns from index to gene symbol
gdata.columns = gene_filtered.gene_symbol
pred = pd.notna(gdata[gdata.columns[0]])
gdata = gdata[pred].copy(deep=True)
print(len(gdata))

10x RNA-seq gene expression data (part 2a)

Contents

10x RNA-seq gene expression data (part 2a)#

Gene expression matrices#

Example use cases#