Accessing 10x RNA-seq gene expression data

Accessing 10x RNA-seq gene expression data#

This notebook provides examples and functions for accessing the 10X expression matrix data stored in the ABC Atlas. These files require a large amount of memory to be available to load and analyize them if care is not taken. In this notebook, we present an example of how to access specific gene expressions from the data. The functions used below could be simplily parallized when processing data at scale, however, we leave them simple here.

Care should still be taken not to attempt too load to many genes from the expression matrices.

import pandas as pd
from pathlib import Path
import numpy as np
import anndata

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data

We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started notebook for more details on using the cache including installing it if it has not already been.

Change the download_base variable to where you have downloaded the data in your system.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest

'releases/20250131/manifest.json'

Gene expression matrices#

The Whole Mouse Brain (WMB) and Whole Human Brain (WHB) datasets are formatted similarly. Each package is formatted as annadata h5ad files with minimal metadata. For each dataset, there are two h5ad files: one storing the raw counts and the other log2 normalization of the counts.

To load the data by gene for either mouse or human dataset, we need to load two pieces of metadata, the cells table and the genes table in addition to our instantiated AbcProjectCache object. These metadata can be found in the directories WMB-10X and WHB-10Xv3 for mouse and human respectively. Below we use the human brain data in our example.

cell = abc_cache.get_metadata_dataframe(
    directory='WHB-10Xv3',
    file_name='cell_metadata'
).set_index('cell_label')
cell.head()

cell_metadata.csv: 100%|██████████| 830M/830M [00:24<00:00, 34.2MMB/s]    

	cell_barcode	barcoded_cell_sample_label	library_label	feature_matrix_label	entity	brain_section_label	library_method	donor_label	donor_sex	dataset_label	x	y	cluster_alias	region_of_interest_label	anatomical_division_label	abc_sample_id
cell_label
10X386_2:CATGGATTCTCGACGG	CATGGATTCTCGACGG	10X386_2	LKTX_210825_01_B01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	7.533584	-15.230048	20	Human MoAN	Myelencephalon	b3d3a6c0-3dc5-4738-af4c-96cc5ff6033f
10X383_5:TCTTGCGGTGAATTGA	TCTTGCGGTGAATTGA	10X383_5	LKTX_210818_02_E01	WHB-10Xv3-Neurons	nuclei	H19.30.002.BS.94	10Xv3	H19.30.002	M	WHB-10Xv3	2.307856	-15.542040	20	Human MoSR	Myelencephalon	0a7ed99b-d5eb-4c50-b95e-eb31b448d2c7
10X386_2:CTCATCGGTCGAGCAA	CTCATCGGTCGAGCAA	10X386_2	LKTX_210825_01_B01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	6.740066	-16.186017	17	Human MoAN	Myelencephalon	458f6409-af8d-4c7c-ac9b-0b0d9b73b36d
10X378_8:TTGGATGAGACAAGCC	TTGGATGAGACAAGCC	10X378_8	LKTX_210809_01_H01	WHB-10Xv3-Neurons	nuclei	H19.30.002.BS.93	10Xv3	H19.30.002	M	WHB-10Xv3	5.926133	-20.015151	18	Human PnAN	Pons	ff6b2d95-3752-49bf-9421-f17107538868
10X387_7:TGAACGTAGTATTCCG	TGAACGTAGTATTCCG	10X387_7	LKTX_210825_02_G01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	5.622083	-13.561958	16	Human MoAN	Myelencephalon	589a1c5c-839b-480d-babc-4098c467211b

gene = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='gene').set_index('gene_identifier')
gene.head()

gene.csv: 100%|██████████| 4.23M/4.23M [00:00<00:00, 21.1MMB/s]

	gene_symbol	biotype	name
gene_identifier
ENSG00000000003	TSPAN6	protein_coding	tetraspanin 6
ENSG00000000005	TNMD	protein_coding	tenomodulin
ENSG00000000419	DPM1	protein_coding	dolichyl-phosphate mannosyltransferase subunit...
ENSG00000000457	SCYL3	protein_coding	SCY1 like pseudokinase 3
ENSG00000000460	C1orf112	protein_coding	chromosome 1 open reading frame 112

Loading specific genes from the data#

The Whole Human Brain dataset (Siletti et al. 2023) consists of two sub-datasets: Neuron cells and Non-neuron cells. The neuron files are 30 GB in size and if we were to attempt to slice the dataset by gene we would have to load >30 GB once the data is uncompressed into memory. To avoid this, we load the data in chunks and recombine them into a pandas dataframe with all cells and the requested genes loaded.

Below, we specify a select set of genes from the data that we’ll load into our output data frame. These are the same genes that are available in the example file example_genes_all_cells_expression in the WHB-10Xv3 metadata directory.

gene_names = ['SLC17A6', 'SLC17A7', 'SLC32A1', 'PTPRC', 'PLP1', 'AQP4', 'TTR']

Below is a function that can be used to load specific genes from the from the full set of data in the WHB dataset. Adjust the chunk size down if you find your system still running out of memory. Additionally, reduce the number of genes if you still have issues. Note that the funciton can load either the raw expression data or the log2 data. The full code of this function can be found here

?get_gene_data

Signature:
get_gene_data(
    abc_atlas_cache: abc_atlas_access.abc_atlas_cache.abc_project_cache.AbcProjectCache,
    all_cells: pandas.core.frame.DataFrame,
    all_genes: pandas.core.frame.DataFrame,
    selected_genes: List[str],
    data_type: str = 'log2',
    chunk_size: int = 8192,
)
Docstring:
Load expression matrix data from the ABC Atlas and extract data for
specific genes.

Method will load all expression data required to process across multiple
files to extract the full set of genes. This may result in downloading
potentially ~100 GB of data.

Parameters
----------
abc_atlas_cache: AbcProjectCache
    An AbcProjectCache instance object to handle downloading and serving
    the path to the expression matrix data.
all_cells: pandas.DataFrame
    cells metadata loaded as a pandas Dataframe from the AbcProjectCache
    indexed on cell_label.
all_genes: pandas.DataFrame
    genes metadata loaded as a pandas Dataframe from the AbcProjectCache
    indexed on gene_identifier.
selected_genes: list of strings
    List of gene_symbols that are a subset of those in the full genes
    DataFrame.
data_type: str (Default: "log2")
    Kind of expression matrix to load either "log2" or "raw". Defaults to
    "log2".
chunk_size: int (Default: 8192)
    Size of the chunk to load from the anndata files. Adjust this size if
    needed based on memory/file io. Default: 8192.

Returns
-------
output_gene_data: pandas.DataFrame
    Subset of gene data indexed by cell.
File:      /allen/scratch/aibstemp/chris.morrison/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/anndata_utils.py
Type:      function

The code commented out below will create a gene expression DataFrame over the full WHB dataset. Running this full process takes around 10 minutes to processes, however downloading the full data can take up to an hour depending on your download speed.

"""
gene_data = get_gene_data(
    abc_atlas_cache=abc_cache,
    all_cells=cell,
    all_genes=gene,
    selected_genes=gene_names
)"""

'\ngene_data = get_gene_data(\n    abc_atlas_cache=abc_cache,\n    all_cells=cell,\n    all_genes=gene,\n    selected_genes=gene_names\n)'

For a quicker example, we’ll load the Non-neurons from the WHB dataset only. This should take roughly a handful of minutes to complete and around 10 minutes download depending on your connection speed.

nonneuron_cells = cell[cell['feature_matrix_label'] == 'WHB-10Xv3-Nonneurons']
gene_data = get_gene_data(
    abc_atlas_cache=abc_cache,
    all_cells=nonneuron_cells,
    all_genes=gene,
    selected_genes=gene_names
)
gene_data[pd.notna(gene_data[gene_data.columns[0]])]

loading file: WHB-10Xv3-Nonneurons

WHB-10Xv3-Nonneurons-log2.h5ad: 100%|██████████| 4.78G/4.78G [02:25<00:00, 33.0MMB/s]    

 - time taken:  251.78768398399998
total time taken: 297.553464587
	total cells: 888263 processed cells: 888263

gene_symbol	PTPRC	SLC17A6	SLC32A1	SLC17A7	TTR	PLP1	AQP4
cell_label
10X362_3:TCAGTGAGTATTGACC	0.0	0.0	0.0	0.0	0.0	10.177927	0.0
10X362_5:TCCGTGTGTGAAAGTT	0.0	0.0	0.0	0.0	0.0	9.262379	0.0
10X362_5:CACGGGTAGAGCAGAA	0.0	0.0	0.0	0.0	0.0	11.240114	0.0
10X362_5:GATTCTTGTATGTCAC	0.0	0.0	0.0	0.0	0.0	8.314513	0.0
10X362_6:AGGACTTGTATCCTTT	0.0	0.0	0.0	0.0	0.0	9.736156	0.0
...	...	...	...	...	...	...	...
10X194_8:GAAATGAGTTCGGCTG	0.0	0.0	0.0	0.0	0.0	10.210833	0.0
10X350_4:TTTACCATCGCACGAC	9.587301	0.0	0.0	0.0	0.0	8.006084	0.0
10X225_1:AGAAGCGTCCATATGG	8.567961	0.0	0.0	0.0	0.0	0.0	0.0
10X221_5:TTGAACGCAGCCTTCT	10.119619	0.0	0.0	0.0	0.0	0.0	0.0
10X385_3:CTACCCAGTGGCGCTT	0.0	0.0	0.0	0.0	0.0	0.0	0.0

888263 rows × 7 columns

The returned DataFrame is indexed by cell_label and can thus be joined with the cell DataFrame for further analysis.

cells_with_genes = cell.join(gene_data)
cells_with_genes

	cell_barcode	barcoded_cell_sample_label	library_label	feature_matrix_label	entity	brain_section_label	library_method	donor_label	donor_sex	dataset_label	...	region_of_interest_label	anatomical_division_label	abc_sample_id	PTPRC	SLC17A6	SLC32A1	SLC17A7	TTR	PLP1	AQP4
cell_label
10X386_2:CATGGATTCTCGACGG	CATGGATTCTCGACGG	10X386_2	LKTX_210825_01_B01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	...	Human MoAN	Myelencephalon	b3d3a6c0-3dc5-4738-af4c-96cc5ff6033f	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10X383_5:TCTTGCGGTGAATTGA	TCTTGCGGTGAATTGA	10X383_5	LKTX_210818_02_E01	WHB-10Xv3-Neurons	nuclei	H19.30.002.BS.94	10Xv3	H19.30.002	M	WHB-10Xv3	...	Human MoSR	Myelencephalon	0a7ed99b-d5eb-4c50-b95e-eb31b448d2c7	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10X386_2:CTCATCGGTCGAGCAA	CTCATCGGTCGAGCAA	10X386_2	LKTX_210825_01_B01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	...	Human MoAN	Myelencephalon	458f6409-af8d-4c7c-ac9b-0b0d9b73b36d	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10X378_8:TTGGATGAGACAAGCC	TTGGATGAGACAAGCC	10X378_8	LKTX_210809_01_H01	WHB-10Xv3-Neurons	nuclei	H19.30.002.BS.93	10Xv3	H19.30.002	M	WHB-10Xv3	...	Human PnAN	Pons	ff6b2d95-3752-49bf-9421-f17107538868	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10X387_7:TGAACGTAGTATTCCG	TGAACGTAGTATTCCG	10X387_7	LKTX_210825_02_G01	WHB-10Xv3-Neurons	nuclei	H19.30.001.CX.51	10Xv3	H19.30.001	M	WHB-10Xv3	...	Human MoAN	Myelencephalon	589a1c5c-839b-480d-babc-4098c467211b	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10X194_8:GAAATGAGTTCGGCTG	GAAATGAGTTCGGCTG	10X194_8	LKTX_190529_02_H01	WHB-10Xv3-Nonneurons	nuclei	H18.30.002.CX.50	10Xv3	H18.30.002	M	WHB-10Xv3	...	Human SN	Midbrain	721e951b-b734-46ed-83e3-89f803f74f61	0.0	0.0	0.0	0.0	0.0	10.210833	0.0
10X350_4:TTTACCATCGCACGAC	TTTACCATCGCACGAC	10X350_4	LKTX_210421_03_D01	WHB-10Xv3-Nonneurons	nuclei	H19.30.002.CB.62	10Xv3	H19.30.002	M	WHB-10Xv3	...	Human CbDN	Cerebellum	8752a645-853b-49a5-8c3a-2cf9f8557b94	9.587301	0.0	0.0	0.0	0.0	8.006084	0.0
10X225_1:AGAAGCGTCCATATGG	AGAAGCGTCCATATGG	10X225_1	LKTX_190913_02_A01	WHB-10Xv3-Nonneurons	nuclei	H18.30.002.CX.51	10Xv3	H18.30.002	M	WHB-10Xv3	...	Human PAG	Midbrain	1a1738c0-1efe-4124-ae62-809cd88cb168	8.567961	0.0	0.0	0.0	0.0	0.0	0.0
10X221_5:TTGAACGCAGCCTTCT	TTGAACGCAGCCTTCT	10X221_5	LKTX_190830_01_E01	WHB-10Xv3-Nonneurons	nuclei	H18.30.002.CX.49	10Xv3	H18.30.002	M	WHB-10Xv3	...	Human STG	Cerebral cortex	9606ce80-ab0a-4c31-a7bb-65054c0d6e01	10.119619	0.0	0.0	0.0	0.0	0.0	0.0
10X385_3:CTACCCAGTGGCGCTT	CTACCCAGTGGCGCTT	10X385_3	LKTX_210818_04_C01	WHB-10Xv3-Nonneurons	nuclei	H19.30.002.CX.49	10Xv3	H19.30.002	M	WHB-10Xv3	...	Human GPi	Basal nuclei	a69fef25-9149-41e6-8056-030e283434ea	0.0	0.0	0.0	0.0	0.0	0.0	0.0

3369219 rows × 23 columns

Accessing 10x RNA-seq gene expression data

Contents

Accessing 10x RNA-seq gene expression data#

Gene expression matrices#

Loading specific genes from the data#