{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "IPython magic command to render matplotlib plots." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing 10x RNA-seq gene expression data\n", "\n", "This notebook provides examples and functions for accessing the 10X expression matrix data stored in the ABC Atlas. Loading and analyzing these files can require a large amount of memory if care is not taken. In this notebook, we present an example of how to access the expression of specific genes from the data. The functions used below could easily be parallelized when processing data at scale; however, we keep them simple here.\n", "\n", "Care should still be taken not to attempt to load too many genes from the expression matrices." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pathlib import Path\n", "import numpy as np\n", "import anndata\n", "\n", "from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache\n", "from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will interact with the data using the **AbcProjectCache**. This cache object tracks which data has been downloaded and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a pandas DataFrame. 
See the ``getting_started`` notebook for more details on using the cache, including how to install it if you have not already done so.\n", "\n", "**Change the download_base variable to the location where you have downloaded the data on your system.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'releases/20240831/manifest.json'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "download_base = Path('../../data/abc_atlas')\n", "abc_cache = AbcProjectCache.from_cache_dir(download_base)\n", "\n", "abc_cache.current_manifest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene expression matrices\n", "\n", "The Whole Mouse Brain (WMB) and Whole Human Brain (WHB) datasets are formatted similarly. Each package consists of anndata h5ad files with minimal metadata. For each dataset, there are two h5ad files: one storing the raw counts and the other storing the log2-normalized counts.\n", "\n", "To load the data by gene for either the mouse or the human dataset, we need two pieces of metadata, the ``cell`` table and the ``gene`` table, in addition to our instantiated AbcProjectCache object. These metadata can be found in the directories WMB-10X and WHB-10Xv3 for mouse and human respectively. Below we use the human brain data in our example." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cell_metadata.csv: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 705M/705M [00:33<00:00, 21.0MMB/s]\n" ] }, { "data": { "text/html": [ "
| cell_label | cell_barcode | barcoded_cell_sample_label | library_label | feature_matrix_label | entity | brain_section_label | library_method | donor_label | donor_sex | dataset_label | x | y | cluster_alias | region_of_interest_label | anatomical_division_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10X386_2:CATGGATTCTCGACGG | CATGGATTCTCGACGG | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | 7.533584 | -15.230048 | 20 | Human MoAN | Myelencephalon |
| 10X383_5:TCTTGCGGTGAATTGA | TCTTGCGGTGAATTGA | 10X383_5 | LKTX_210818_02_E01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.94 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | 2.307856 | -15.542040 | 20 | Human MoSR | Myelencephalon |
| 10X386_2:CTCATCGGTCGAGCAA | CTCATCGGTCGAGCAA | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | 6.740066 | -16.186017 | 17 | Human MoAN | Myelencephalon |
| 10X378_8:TTGGATGAGACAAGCC | TTGGATGAGACAAGCC | 10X378_8 | LKTX_210809_01_H01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.93 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | 5.926133 | -20.015151 | 18 | Human PnAN | Pons |
| 10X387_7:TGAACGTAGTATTCCG | TGAACGTAGTATTCCG | 10X387_7 | LKTX_210825_02_G01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | 5.622083 | -13.561958 | 16 | Human MoAN | Myelencephalon |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10X194_8:GAAATGAGTTCGGCTG | GAAATGAGTTCGGCTG | 10X194_8 | LKTX_190529_02_H01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.50 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | -32.154318 | 21.585480 | 3264 | Human SN | Midbrain |
| 10X350_4:TTTACCATCGCACGAC | TTTACCATCGCACGAC | 10X350_4 | LKTX_210421_03_D01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CB.62 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | -29.906175 | 23.914979 | 3265 | Human CbDN | Cerebellum |
| 10X225_1:AGAAGCGTCCATATGG | AGAAGCGTCCATATGG | 10X225_1 | LKTX_190913_02_A01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.51 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | -32.091915 | 21.212210 | 3264 | Human PAG | Midbrain |
| 10X221_5:TTGAACGCAGCCTTCT | TTGAACGCAGCCTTCT | 10X221_5 | LKTX_190830_01_E01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.49 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | -30.665837 | 22.993120 | 3265 | Human STG | Cerebral cortex |
| 10X385_3:CTACCCAGTGGCGCTT | CTACCCAGTGGCGCTT | 10X385_3 | LKTX_210818_04_C01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CX.49 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | -30.579284 | 22.853419 | 3264 | Human GPi | Basal nuclei |

3369219 rows × 15 columns
| gene_identifier | gene_symbol | biotype | name |
|---|---|---|---|
| ENSG00000000003 | TSPAN6 | protein_coding | tetraspanin 6 |
| ENSG00000000005 | TNMD | protein_coding | tenomodulin |
| ENSG00000000419 | DPM1 | protein_coding | dolichyl-phosphate mannosyltransferase subunit... |
| ENSG00000000457 | SCYL3 | protein_coding | SCY1 like pseudokinase 3 |
| ENSG00000000460 | C1orf112 | protein_coding | chromosome 1 open reading frame 112 |
| ... | ... | ... | ... |
| ENSG00000288638 | AL627443.4 | lncRNA | novel transcript |
| ENSG00000288639 | AC093246.1 | protein_coding | novel protein |
| ENSG00000288642 | CDR1 | protein_coding | cerebellar degeneration related protein 1 |
| ENSG00000288643 | AC114982.3 | protein_coding | novel transcript |
| ENSG00000288645 | AC084756.2 | protein_coding | novel protein |

59357 rows × 3 columns
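
Loading expression for only a handful of genes starts from their symbols in the gene table above. The following is a minimal sketch of that lookup using plain pandas on a toy table built from three of the rows shown; `select_genes` is a hypothetical helper written for illustration and is not part of `abc_atlas_access`.

```python
import pandas as pd

# Toy gene table: three rows copied from the gene table shown above.
gene = pd.DataFrame(
    {
        "gene_symbol": ["TSPAN6", "TNMD", "DPM1"],
        "biotype": ["protein_coding"] * 3,
    },
    index=pd.Index(
        ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
        name="gene_identifier",
    ),
)


def select_genes(gene_table, symbols):
    """Return the rows of gene_table whose gene_symbol is in symbols.

    Hypothetical helper: raises KeyError if any requested symbol is absent,
    so a typo is caught before any expensive matrix read is attempted.
    """
    mask = gene_table["gene_symbol"].isin(symbols)
    found = set(gene_table.loc[mask, "gene_symbol"])
    missing = sorted(set(symbols) - found)
    if missing:
        raise KeyError(f"gene symbols not found: {missing}")
    return gene_table[mask]


selected = select_genes(gene, ["TSPAN6", "DPM1"])
print(selected.index.tolist())
```

Keeping the requested list short is what keeps memory use low: only the columns for these identifiers need to be read from the expression matrices.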
| gene_symbol | PTPRC | SLC17A6 | SLC32A1 | SLC17A7 | TTR | PLP1 | AQP4 |
|---|---|---|---|---|---|---|---|
| cell_label | | | | | | | |
| 10X362_3:TCAGTGAGTATTGACC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.177927 | 0.0 |
| 10X362_5:TCCGTGTGTGAAAGTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.262379 | 0.0 |
| 10X362_5:CACGGGTAGAGCAGAA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.240114 | 0.0 |
| 10X362_5:GATTCTTGTATGTCAC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.314513 | 0.0 |
| 10X362_6:AGGACTTGTATCCTTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.736156 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10X194_8:GAAATGAGTTCGGCTG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.210833 | 0.0 |
| 10X350_4:TTTACCATCGCACGAC | 9.587301 | 0.0 | 0.0 | 0.0 | 0.0 | 8.006084 | 0.0 |
| 10X225_1:AGAAGCGTCCATATGG | 8.567961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10X221_5:TTGAACGCAGCCTTCT | 10.119619 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10X385_3:CTACCCAGTGGCGCTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

888263 rows × 7 columns
| cell_label | cell_barcode | barcoded_cell_sample_label | library_label | feature_matrix_label | entity | brain_section_label | library_method | donor_label | donor_sex | dataset_label | ... | cluster_alias | region_of_interest_label | anatomical_division_label | PTPRC | SLC17A6 | SLC32A1 | SLC17A7 | TTR | PLP1 | AQP4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10X386_2:CATGGATTCTCGACGG | CATGGATTCTCGACGG | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | ... | 20 | Human MoAN | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10X383_5:TCTTGCGGTGAATTGA | TCTTGCGGTGAATTGA | 10X383_5 | LKTX_210818_02_E01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.94 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 20 | Human MoSR | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10X386_2:CTCATCGGTCGAGCAA | CTCATCGGTCGAGCAA | 10X386_2 | LKTX_210825_01_B01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | ... | 17 | Human MoAN | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10X378_8:TTGGATGAGACAAGCC | TTGGATGAGACAAGCC | 10X378_8 | LKTX_210809_01_H01 | WHB-10Xv3-Neurons | nuclei | H19.30.002.BS.93 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 18 | Human PnAN | Pons | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10X387_7:TGAACGTAGTATTCCG | TGAACGTAGTATTCCG | 10X387_7 | LKTX_210825_02_G01 | WHB-10Xv3-Neurons | nuclei | H19.30.001.CX.51 | 10Xv3 | H19.30.001 | M | WHB-10Xv3 | ... | 16 | Human MoAN | Myelencephalon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10X194_8:GAAATGAGTTCGGCTG | GAAATGAGTTCGGCTG | 10X194_8 | LKTX_190529_02_H01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.50 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | ... | 3264 | Human SN | Midbrain | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.210833 | 0.0 |
| 10X350_4:TTTACCATCGCACGAC | TTTACCATCGCACGAC | 10X350_4 | LKTX_210421_03_D01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CB.62 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 3265 | Human CbDN | Cerebellum | 9.587301 | 0.0 | 0.0 | 0.0 | 0.0 | 8.006084 | 0.0 |
| 10X225_1:AGAAGCGTCCATATGG | AGAAGCGTCCATATGG | 10X225_1 | LKTX_190913_02_A01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.51 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | ... | 3264 | Human PAG | Midbrain | 8.567961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10X221_5:TTGAACGCAGCCTTCT | TTGAACGCAGCCTTCT | 10X221_5 | LKTX_190830_01_E01 | WHB-10Xv3-Nonneurons | nuclei | H18.30.002.CX.49 | 10Xv3 | H18.30.002 | M | WHB-10Xv3 | ... | 3265 | Human STG | Cerebral cortex | 10.119619 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10X385_3:CTACCCAGTGGCGCTT | CTACCCAGTGGCGCTT | 10X385_3 | LKTX_210818_04_C01 | WHB-10Xv3-Nonneurons | nuclei | H19.30.002.CX.49 | 10Xv3 | H19.30.002 | M | WHB-10Xv3 | ... | 3264 | Human GPi | Basal nuclei | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

3369219 rows × 22 columns
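
The combined table above has one row per cell in the metadata (3369219), while expression values were available for only 888263 cells, so the gene columns are NaN for the remaining cells. A left join on the `cell_label` index produces exactly this pattern. The sketch below reproduces it on two toy rows whose values are copied from the tables above; the variable names are illustrative and the frames are stand-ins, not the notebook's actual objects.

```python
import pandas as pd

# Toy cell metadata: two cells, indexed by cell_label.
cell = pd.DataFrame(
    {
        "cluster_alias": [20, 3265],
        "anatomical_division_label": ["Myelencephalon", "Cerebellum"],
    },
    index=pd.Index(
        ["10X386_2:CATGGATTCTCGACGG", "10X350_4:TTTACCATCGCACGAC"],
        name="cell_label",
    ),
)

# Toy expression table: only one of the two cells has values loaded.
expression = pd.DataFrame(
    {"PTPRC": [9.587301], "PLP1": [8.006084]},
    index=pd.Index(["10X350_4:TTTACCATCGCACGAC"], name="cell_label"),
)

# DataFrame.join defaults to a left join on the index, so every metadata
# row is kept and cells without expression values are filled with NaN.
merged = cell.join(expression)

# Boolean mask of the cells that actually carry expression values.
has_expression = merged["PTPRC"].notna()
print(merged[has_expression].index.tolist())
```

Filtering on such a mask (or joining with `how="inner"`) restricts later analysis to cells that have expression values.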
\n", "