Accessing 10x RNA-seq gene expression data#

This notebook provides examples and functions for accessing the 10X expression matrix data stored in the ABC Atlas. These files require a large amount of memory to be available to load and analyize them if care is not taken. In this notebook, we present an example of how to access specific gene expressions from the data. The functions used below could be simplily parallized when processing data at scale, however, we leave them simple here.

Care should still be taken not to attempt too load to many genes from the expression matrices.

from typing import List
import os
import pandas as pd
from pathlib import Path
import numpy as np
import anndata
import time

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you have downloaded the data in your system.

download_base = Path('../../abc_download_root')
abc_cache = AbcProjectCache.from_s3_cache(download_base)
abc_cache.current_manifest
'releases/20240330/manifest.json'

Gene expression matrices#

The Whole Mouse Brain (WMB) and Whole Human Brain (WHB) datasets are formatted similarly. Each package is formatted as annadata h5ad files with minimal metadata. For each dataset, there are two h5ad files: one storing the raw counts and the other log2 normalization of the counts.

To load the data by gene for either mouse or human dataset, we need to load two pieces of metadata, the cells table and the genes table in addition to our instantiated AbcProjectCache object. These metadata can be found in the directories WMB-10X and WHB-10Xv3 for mouse and human respectively. Below we use the human brain data in our example.

cell = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='cell_metadata').set_index('cell_label')
cell
cell_metadata.csv: 100%|██████████| 705M/705M [00:31<00:00, 22.7MMB/s]    
cell_barcode barcoded_cell_sample_label library_label feature_matrix_label entity brain_section_label library_method donor_label donor_sex dataset_label x y cluster_alias region_of_interest_label anatomical_division_label
cell_label
10X386_2:CATGGATTCTCGACGG CATGGATTCTCGACGG 10X386_2 LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei H19.30.001.CX.51 10Xv3 H19.30.001 M WHB-10Xv3 7.533584 -15.230048 20 Human MoAN Myelencephalon
10X383_5:TCTTGCGGTGAATTGA TCTTGCGGTGAATTGA 10X383_5 LKTX_210818_02_E01 WHB-10Xv3-Neurons nuclei H19.30.002.BS.94 10Xv3 H19.30.002 M WHB-10Xv3 2.307856 -15.542040 20 Human MoSR Myelencephalon
10X386_2:CTCATCGGTCGAGCAA CTCATCGGTCGAGCAA 10X386_2 LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei H19.30.001.CX.51 10Xv3 H19.30.001 M WHB-10Xv3 6.740066 -16.186017 17 Human MoAN Myelencephalon
10X378_8:TTGGATGAGACAAGCC TTGGATGAGACAAGCC 10X378_8 LKTX_210809_01_H01 WHB-10Xv3-Neurons nuclei H19.30.002.BS.93 10Xv3 H19.30.002 M WHB-10Xv3 5.926133 -20.015151 18 Human PnAN Pons
10X387_7:TGAACGTAGTATTCCG TGAACGTAGTATTCCG 10X387_7 LKTX_210825_02_G01 WHB-10Xv3-Neurons nuclei H19.30.001.CX.51 10Xv3 H19.30.001 M WHB-10Xv3 5.622083 -13.561958 16 Human MoAN Myelencephalon
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10X194_8:GAAATGAGTTCGGCTG GAAATGAGTTCGGCTG 10X194_8 LKTX_190529_02_H01 WHB-10Xv3-Nonneurons nuclei H18.30.002.CX.50 10Xv3 H18.30.002 M WHB-10Xv3 -32.154318 21.585480 3264 Human SN Midbrain
10X350_4:TTTACCATCGCACGAC TTTACCATCGCACGAC 10X350_4 LKTX_210421_03_D01 WHB-10Xv3-Nonneurons nuclei H19.30.002.CB.62 10Xv3 H19.30.002 M WHB-10Xv3 -29.906175 23.914979 3265 Human CbDN Cerebellum
10X225_1:AGAAGCGTCCATATGG AGAAGCGTCCATATGG 10X225_1 LKTX_190913_02_A01 WHB-10Xv3-Nonneurons nuclei H18.30.002.CX.51 10Xv3 H18.30.002 M WHB-10Xv3 -32.091915 21.212210 3264 Human PAG Midbrain
10X221_5:TTGAACGCAGCCTTCT TTGAACGCAGCCTTCT 10X221_5 LKTX_190830_01_E01 WHB-10Xv3-Nonneurons nuclei H18.30.002.CX.49 10Xv3 H18.30.002 M WHB-10Xv3 -30.665837 22.993120 3265 Human STG Cerebral cortex
10X385_3:CTACCCAGTGGCGCTT CTACCCAGTGGCGCTT 10X385_3 LKTX_210818_04_C01 WHB-10Xv3-Nonneurons nuclei H19.30.002.CX.49 10Xv3 H19.30.002 M WHB-10Xv3 -30.579284 22.853419 3264 Human GPi Basal nuclei

3369219 rows × 15 columns

gene = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='gene').set_index('gene_identifier')
gene
gene.csv: 100%|██████████| 4.23M/4.23M [00:00<00:00, 9.17MMB/s]
gene_symbol biotype name
gene_identifier
ENSG00000000003 TSPAN6 protein_coding tetraspanin 6
ENSG00000000005 TNMD protein_coding tenomodulin
ENSG00000000419 DPM1 protein_coding dolichyl-phosphate mannosyltransferase subunit...
ENSG00000000457 SCYL3 protein_coding SCY1 like pseudokinase 3
ENSG00000000460 C1orf112 protein_coding chromosome 1 open reading frame 112
... ... ... ...
ENSG00000288638 AL627443.4 lncRNA novel transcript
ENSG00000288639 AC093246.1 protein_coding novel protein
ENSG00000288642 CDR1 protein_coding cerebellar degeneration related protein 1
ENSG00000288643 AC114982.3 protein_coding novel transcript
ENSG00000288645 AC084756.2 protein_coding novel protein

59357 rows × 3 columns

Loading specific genes from the data#

The Whole Human Brain dataset (Siletti et al. 2023) consists of two sub-datasets: Neuron cells and Non-neuron cells. The neuron files are 30 GB in size and if we were to attempt to slice the dataset by gene we would have to load >30 GB once the data is uncompressed into memory. To avoid this, we load the data in chunks and recombine them into a pandas dataframe with all cells and the requested genes loaded.

Below, we specify a select set of genes from the data that we’ll load into our output data frame. These are the same genes that are available in the example file example_genes_all_cells_expression in the WHB-10Xv3 metadata directory.

gene_names = ['SLC17A6', 'SLC17A7', 'SLC32A1', 'PTPRC', 'PLP1', 'AQP4', 'TTR']

Below is a function that can be used to load specific genes from the from the full set of data in the WHB dataset. Adjust the chunk size down if you find your system still running out of memory. Additionally, reduce the number of genes if you still have issues. Note that the funciton can load either the raw expression data or the log2 data. The full code of this function can be found here

?get_gene_data
Signature:
get_gene_data(
    abc_atlas_cache: abc_atlas_access.abc_atlas_cache.abc_project_cache.AbcProjectCache,
    all_cells: pandas.core.frame.DataFrame,
    all_genes: pandas.core.frame.DataFrame,
    selected_genes: List[str],
    data_type: str = 'log2',
    chunk_size: int = 8192,
)
Docstring:
Load expression matrix data from the ABC Atlas and extract data for
specific genes.

Method will load all expression data required to process across multiple
files to extract the full set of genes. This may result in downloading
potentially ~100 GB of data.

Parameters
----------
abc_atlas_cache: AbcProjectCache
    An AbcProjectCache instance object to handle downloading and serving
    the path to the expression matrix data.
all_cells: pandas.DataFrame
    cells metadata loaded as a pandas Dataframe from the AbcProjectCache
    indexed on cell_label.
all_genes: pandas.DataFrame
    genes metadata loaded as a pandas Dataframe from the AbcProjectCache
    indexed on gene_identifier.
selected_genes: list of strings
    List of gene_symbols that are a subset of those in the full genes
    DataFrame.
data_type: str (Default: "log2")
    Kind of expression matrix to load either "log2" or "raw". Defaults to
    "log2".
chunk_size: int (Default: 8192)
    Size of the chunk to load from the anndata files. Adjust this size if
    needed based on memory/file io. Default: 8192.

Returns
-------
output_gene_data: pandas.DataFrame
    Subset of gene data indexed by cell.
File:      ~/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/anndata_utils.py
Type:      function

The code commented out below will create a gene expression DataFrame over the full WHB dataset. Running this full process takes around 10 minutes to processes, however downloading the full data can take up to an hour depending on your download speed.

"""
gene_data = get_gene_data(
    abc_atlas_cache=abc_cache,
    all_cells=cell,
    all_genes=gene,
    selected_genes=gene_names
)"""
'\ngene_data = get_gene_data(\n    abc_atlas_cache=abc_cache,\n    all_cells=cell,\n    all_genes=gene,\n    selected_genes=gene_names\n)'

For a quicker example, we’ll load the Non-neurons from the WHB dataset only. This should take roughly a handful of minutes to complete and around 10 minutes download depending on your connection speed.

nonneuron_cells = cell[cell['feature_matrix_label'] == 'WHB-10Xv3-Nonneurons']
gene_data = get_gene_data(
    abc_atlas_cache=abc_cache,
    all_cells=nonneuron_cells,
    all_genes=gene,
    selected_genes=gene_names
)
gene_data[pd.notna(gene_data[gene_data.columns[0]])]
loading file: WHB-10Xv3-Nonneurons
WHB-10Xv3-Nonneurons-log2.h5ad: 100%|██████████| 4.78G/4.78G [04:12<00:00, 19.0MMB/s]    
 - time taken:  133.182634
total time taken: 219.19582499999999
gene_symbol PTPRC SLC17A6 SLC32A1 SLC17A7 TTR PLP1 AQP4
cell_label
10X362_3:TCAGTGAGTATTGACC 0.0 0.0 0.0 0.0 0.0 10.177927 0.0
10X362_5:TCCGTGTGTGAAAGTT 0.0 0.0 0.0 0.0 0.0 9.262379 0.0
10X362_5:CACGGGTAGAGCAGAA 0.0 0.0 0.0 0.0 0.0 11.240114 0.0
10X362_5:GATTCTTGTATGTCAC 0.0 0.0 0.0 0.0 0.0 8.314513 0.0
10X362_6:AGGACTTGTATCCTTT 0.0 0.0 0.0 0.0 0.0 9.736156 0.0
... ... ... ... ... ... ... ...
10X194_8:GAAATGAGTTCGGCTG 0.0 0.0 0.0 0.0 0.0 10.210833 0.0
10X350_4:TTTACCATCGCACGAC 9.587301 0.0 0.0 0.0 0.0 8.006084 0.0
10X225_1:AGAAGCGTCCATATGG 8.567961 0.0 0.0 0.0 0.0 0.0 0.0
10X221_5:TTGAACGCAGCCTTCT 10.119619 0.0 0.0 0.0 0.0 0.0 0.0
10X385_3:CTACCCAGTGGCGCTT 0.0 0.0 0.0 0.0 0.0 0.0 0.0

888263 rows × 7 columns

The returned DataFrame is indexed by cell_label and can thus be joined with the cell DataFrame for further analysis.