IPython magic command to render matplotlib plots.
%matplotlib inline
Accessing 10x RNA-seq gene expression data#
This notebook provides examples and functions for accessing the 10X expression matrix data stored in the ABC Atlas. These files require a large amount of memory to be available to load and analyize them if care is not taken. In this notebook, we present an example of how to access specific gene expressions from the data. The functions used below could be simplily parallized when processing data at scale, however, we leave them simple here.
Care should still be taken not to attempt too load to many genes from the expression matrices.
import pandas as pd
from pathlib import Path
import numpy as np
import anndata
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data
We will interact with the data using the AbcProjectCache. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the getting_started
notebook for more details on using the cache including installing it if it has not already been.
Change the download_base variable to where you have downloaded the data in your system.
download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)
abc_cache.current_manifest
'releases/20240831/manifest.json'
Gene expression matrices#
The Whole Mouse Brain (WMB) and Whole Human Brain (WHB) datasets are formatted similarly. Each package is formatted as annadata h5ad files with minimal metadata. For each dataset, there are two h5ad files: one storing the raw counts and the other log2 normalization of the counts.
To load the data by gene for either mouse or human dataset, we need to load two pieces of metadata, the cell
s table and the gene
s table in addition to our instantiated AbcProjectCache object. These metadata can be found in the directories WMB-10X and WHB-10Xv3 for mouse and human respectively. Below we use the human brain data in our example.
cell = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='cell_metadata').set_index('cell_label')
cell
cell_metadata.csv: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 705M/705M [00:33<00:00, 21.0MMB/s]
cell_barcode | barcoded_cell_sample_label | library_label | feature_matrix_label | entity | brain_section_label | library_method | donor_label | donor_sex | dataset_label | x | y | cluster_alias | region_of_interest_label | anatomical_division_label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cell_label | |||||||||||||||
10X386_2:CATGGATTCTCGACGG | CATGGATTCTCGACGG | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | 7.533584 | -15.230048 | 20 | Human MoAN | Myelencephalon |
10X383_5:TCTTGCGGTGAATTGA | TCTTGCGGTGAATTGA | 10X383_5 | LKTX_210818_02_E01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.94 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | 2.307856 | -15.542040 | 20 | Human MoSR | Myelencephalon |
10X386_2:CTCATCGGTCGAGCAA | CTCATCGGTCGAGCAA | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | 6.740066 | -16.186017 | 17 | Human MoAN | Myelencephalon |
10X378_8:TTGGATGAGACAAGCC | TTGGATGAGACAAGCC | 10X378_8 | LKTX_210809_01_H01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.93 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | 5.926133 | -20.015151 | 18 | Human PnAN | Pons |
10X387_7:TGAACGTAGTATTCCG | TGAACGTAGTATTCCG | 10X387_7 | LKTX_210825_02_G01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | 5.622083 | -13.561958 | 16 | Human MoAN | Myelencephalon |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10X194_8:GAAATGAGTTCGGCTG | GAAATGAGTTCGGCTG | 10X194_8 | LKTX_190529_02_H01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.50 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | -32.154318 | 21.585480 | 3264 | Human SN | Midbrain |
10X350_4:TTTACCATCGCACGAC | TTTACCATCGCACGAC | 10X350_4 | LKTX_210421_03_D01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CB.62 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | -29.906175 | 23.914979 | 3265 | Human CbDN | Cerebellum |
10X225_1:AGAAGCGTCCATATGG | AGAAGCGTCCATATGG | 10X225_1 | LKTX_190913_02_A01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.51 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | -32.091915 | 21.212210 | 3264 | Human PAG | Midbrain |
10X221_5:TTGAACGCAGCCTTCT | TTGAACGCAGCCTTCT | 10X221_5 | LKTX_190830_01_E01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.49 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | -30.665837 | 22.993120 | 3265 | Human STG | Cerebral cortex |
10X385_3:CTACCCAGTGGCGCTT | CTACCCAGTGGCGCTT | 10X385_3 | LKTX_210818_04_C01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CX.49 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | -30.579284 | 22.853419 | 3264 | Human GPi | Basal nuclei |
3369219 rows × 15 columns
gene = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='gene').set_index('gene_identifier')
gene
gene.csv: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.23M/4.23M [00:00<00:00, 7.51MMB/s]
gene_symbol | biotype | name | |
---|---|---|---|
gene_identifier | |||
ENSG00000000003 | TSPAN6 | protein_coding | tetraspanin 6 |
ENSG00000000005 | TNMD | protein_coding | tenomodulin |
ENSG00000000419 | DPM1 | protein_coding | dolichyl-phosphate mannosyltransferase subunit... |
ENSG00000000457 | SCYL3 | protein_coding | SCY1 like pseudokinase 3 |
ENSG00000000460 | C1orf112 | protein_coding | chromosome 1 open reading frame 112 |
... | ... | ... | ... |
ENSG00000288638 | AL627443.4 | lncRNA | novel transcript |
ENSG00000288639 | AC093246.1 | protein_coding | novel protein |
ENSG00000288642 | CDR1 | protein_coding | cerebellar degeneration related protein 1 |
ENSG00000288643 | AC114982.3 | protein_coding | novel transcript |
ENSG00000288645 | AC084756.2 | protein_coding | novel protein |
59357 rows × 3 columns
Loading specific genes from the data#
The Whole Human Brain dataset (Siletti et al. 2023) consists of two sub-datasets: Neuron cells and Non-neuron cells. The neuron files are 30 GB in size and if we were to attempt to slice the dataset by gene we would have to load >30 GB once the data is uncompressed into memory. To avoid this, we load the data in chunks and recombine them into a pandas dataframe with all cells and the requested genes loaded.
Below, we specify a select set of genes from the data that we’ll load into our output data frame. These are the same genes that are available in the example file example_genes_all_cells_expression
in the WHB-10Xv3 metadata directory.
gene_names = ['SLC17A6', 'SLC17A7', 'SLC32A1', 'PTPRC', 'PLP1', 'AQP4', 'TTR']
Below is a function that can be used to load specific genes from the from the full set of data in the WHB dataset. Adjust the chunk size down if you find your system still running out of memory. Additionally, reduce the number of genes if you still have issues. Note that the funciton can load either the raw expression data or the log2 data. The full code of this function can be found here
?get_gene_data
Signature:
get_gene_data(
abc_atlas_cache: abc_atlas_access.abc_atlas_cache.abc_project_cache.AbcProjectCache,
all_cells: pandas.core.frame.DataFrame,
all_genes: pandas.core.frame.DataFrame,
selected_genes: List[str],
data_type: str = 'log2',
chunk_size: int = 8192,
)
Docstring:
Load expression matrix data from the ABC Atlas and extract data for
specific genes.
Method will load all expression data required to process across multiple
files to extract the full set of genes. This may result in downloading
potentially ~100 GB of data.
Parameters
----------
abc_atlas_cache: AbcProjectCache
An AbcProjectCache instance object to handle downloading and serving
the path to the expression matrix data.
all_cells: pandas.DataFrame
cells metadata loaded as a pandas Dataframe from the AbcProjectCache
indexed on cell_label.
all_genes: pandas.DataFrame
genes metadata loaded as a pandas Dataframe from the AbcProjectCache
indexed on gene_identifier.
selected_genes: list of strings
List of gene_symbols that are a subset of those in the full genes
DataFrame.
data_type: str (Default: "log2")
Kind of expression matrix to load either "log2" or "raw". Defaults to
"log2".
chunk_size: int (Default: 8192)
Size of the chunk to load from the anndata files. Adjust this size if
needed based on memory/file io. Default: 8192.
Returns
-------
output_gene_data: pandas.DataFrame
Subset of gene data indexed by cell.
File: ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/anndata_utils.py
Type: function
The code commented out below will create a gene expression DataFrame over the full WHB dataset. Running this full process takes around 10 minutes to processes, however downloading the full data can take up to an hour depending on your download speed.
"""
gene_data = get_gene_data(
abc_atlas_cache=abc_cache,
all_cells=cell,
all_genes=gene,
selected_genes=gene_names
)"""
'\ngene_data = get_gene_data(\n abc_atlas_cache=abc_cache,\n all_cells=cell,\n all_genes=gene,\n selected_genes=gene_names\n)'
For a quicker example, we’ll load the Non-neurons from the WHB dataset only. This should take roughly a handful of minutes to complete and around 10 minutes download depending on your connection speed.
nonneuron_cells = cell[cell['feature_matrix_label'] == 'WHB-10Xv3-Nonneurons']
gene_data = get_gene_data(
abc_atlas_cache=abc_cache,
all_cells=nonneuron_cells,
all_genes=gene,
selected_genes=gene_names
)
gene_data[pd.notna(gene_data[gene_data.columns[0]])]
loading file: WHB-10Xv3-Nonneurons
WHB-10Xv3-Nonneurons-log2.h5ad: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [17:16<00:00, 4.62MMB/s]
- time taken: 237.755479
total time taken: 381.04928600000005
total cells: 888263 processed cells: 888263
gene_symbol | PTPRC | SLC17A6 | SLC32A1 | SLC17A7 | TTR | PLP1 | AQP4 |
---|---|---|---|---|---|---|---|
cell_label | |||||||
10X362_3:TCAGTGAGTATTGACC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.177927 | 0.0 |
10X362_5:TCCGTGTGTGAAAGTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.262379 | 0.0 |
10X362_5:CACGGGTAGAGCAGAA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.240114 | 0.0 |
10X362_5:GATTCTTGTATGTCAC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.314513 | 0.0 |
10X362_6:AGGACTTGTATCCTTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.736156 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
10X194_8:GAAATGAGTTCGGCTG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.210833 | 0.0 |
10X350_4:TTTACCATCGCACGAC | 9.587301 | 0.0 | 0.0 | 0.0 | 0.0 | 8.006084 | 0.0 |
10X225_1:AGAAGCGTCCATATGG | 8.567961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10X221_5:TTGAACGCAGCCTTCT | 10.119619 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10X385_3:CTACCCAGTGGCGCTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
888263 rows × 7 columns
The returned DataFrame is indexed by cell_label
and can thus be joined with the cell
DataFrame for further analysis.
cells_with_genes = cell.join(gene_data)
cells_with_genes
cell_barcode | barcoded_cell_sample_label | library_label | feature_matrix_label | entity | brain_section_label | library_method | donor_label | donor_sex | dataset_label | ... | cluster_alias | region_of_interest_label | anatomical_division_label | PTPRC | SLC17A6 | SLC32A1 | SLC17A7 | TTR | PLP1 | AQP4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cell_label | |||||||||||||||||||||
10X386_2:CATGGATTCTCGACGG | CATGGATTCTCGACGG | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | ... | 20 | Human MoAN | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10X383_5:TCTTGCGGTGAATTGA | TCTTGCGGTGAATTGA | 10X383_5 | LKTX_210818_02_E01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.94 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 20 | Human MoSR | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10X386_2:CTCATCGGTCGAGCAA | CTCATCGGTCGAGCAA | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | ... | 17 | Human MoAN | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10X378_8:TTGGATGAGACAAGCC | TTGGATGAGACAAGCC | 10X378_8 | LKTX_210809_01_H01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.93 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 18 | Human PnAN | Pons | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10X387_7:TGAACGTAGTATTCCG | TGAACGTAGTATTCCG | 10X387_7 | LKTX_210825_02_G01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | ... | 16 | Human MoAN | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10X194_8:GAAATGAGTTCGGCTG | GAAATGAGTTCGGCTG | 10X194_8 | LKTX_190529_02_H01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.50 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | ... | 3264 | Human SN | Midbrain | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.210833 | 0.0 |
10X350_4:TTTACCATCGCACGAC | TTTACCATCGCACGAC | 10X350_4 | LKTX_210421_03_D01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CB.62 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 3265 | Human CbDN | Cerebellum | 9.587301 | 0.0 | 0.0 | 0.0 | 0.0 | 8.006084 | 0.0 |
10X225_1:AGAAGCGTCCATATGG | AGAAGCGTCCATATGG | 10X225_1 | LKTX_190913_02_A01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.51 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | ... | 3264 | Human PAG | Midbrain | 8.567961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10X221_5:TTGAACGCAGCCTTCT | TTGAACGCAGCCTTCT | 10X221_5 | LKTX_190830_01_E01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.49 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | ... | 3265 | Human STG | Cerebral cortex | 10.119619 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10X385_3:CTACCCAGTGGCGCTT | CTACCCAGTGGCGCTT | 10X385_3 | LKTX_210818_04_C01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CX.49 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 3264 | Human GPi | Basal nuclei | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3369219 rows × 22 columns