{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "IPython magic command to render matplotlib plots." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing 10x RNA-seq gene expression data\n", "\n", "This notebook provides examples and functions for accessing the 10X expression matrix data stored in the ABC Atlas. Loading and analyzing these files requires a large amount of memory if care is not taken. In this notebook, we present an example of how to access the expression of specific genes from the data. The functions used below could easily be parallelized when processing data at scale; however, we keep them simple here.\n", "\n", "Care should still be taken not to attempt to load too many genes from the expression matrices." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pathlib import Path\n", "import numpy as np\n", "import anndata\n", "\n", "from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache\n", "from abc_atlas_access.abc_atlas_cache.anndata_utils import get_gene_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will interact with the data using the **AbcProjectCache**. This cache object tracks which data has been downloaded and serves the path to the requested data on disk. For metadata, the cache can also directly serve up a pandas DataFrame. 
See the ``getting_started`` notebook for more details on using the cache, including how to install it if you have not already done so.\n", "\n", "**Change the ``download_base`` variable to the location where you have downloaded the data on your system.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'releases/20240831/manifest.json'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "download_base = Path('../../data/abc_atlas')\n", "abc_cache = AbcProjectCache.from_cache_dir(download_base)\n", "\n", "abc_cache.current_manifest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene expression matrices\n", "\n", "The Whole Mouse Brain (WMB) and Whole Human Brain (WHB) datasets are formatted similarly. Each package is formatted as anndata h5ad files with minimal metadata. For each dataset, there are two h5ad files: one storing the raw counts and the other storing the log2-normalized counts.\n", "\n", "To load the data by gene for either the mouse or the human dataset, we need two pieces of metadata, the ``cell`` table and the ``gene`` table, in addition to our instantiated AbcProjectCache object. These metadata can be found in the directories WMB-10X and WHB-10Xv3 for mouse and human, respectively. Below we use the human brain data in our example." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "cell_metadata.csv: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 705M/705M [00:33<00:00, 21.0MMB/s]\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " cell_barcode barcoded_cell_sample_label \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG CATGGATTCTCGACGG 10X386_2 \n", "10X383_5:TCTTGCGGTGAATTGA TCTTGCGGTGAATTGA 10X383_5 \n", "10X386_2:CTCATCGGTCGAGCAA CTCATCGGTCGAGCAA 10X386_2 \n", "10X378_8:TTGGATGAGACAAGCC TTGGATGAGACAAGCC 10X378_8 \n", "10X387_7:TGAACGTAGTATTCCG TGAACGTAGTATTCCG 10X387_7 \n", "... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG GAAATGAGTTCGGCTG 10X194_8 \n", "10X350_4:TTTACCATCGCACGAC TTTACCATCGCACGAC 10X350_4 \n", "10X225_1:AGAAGCGTCCATATGG AGAAGCGTCCATATGG 10X225_1 \n", "10X221_5:TTGAACGCAGCCTTCT TTGAACGCAGCCTTCT 10X221_5 \n", "10X385_3:CTACCCAGTGGCGCTT CTACCCAGTGGCGCTT 10X385_3 \n", "\n", " library_label feature_matrix_label entity \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei \n", "10X383_5:TCTTGCGGTGAATTGA LKTX_210818_02_E01 WHB-10Xv3-Neurons nuclei \n", "10X386_2:CTCATCGGTCGAGCAA LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei \n", "10X378_8:TTGGATGAGACAAGCC LKTX_210809_01_H01 WHB-10Xv3-Neurons nuclei \n", "10X387_7:TGAACGTAGTATTCCG LKTX_210825_02_G01 WHB-10Xv3-Neurons nuclei \n", "... ... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG LKTX_190529_02_H01 WHB-10Xv3-Nonneurons nuclei \n", "10X350_4:TTTACCATCGCACGAC LKTX_210421_03_D01 WHB-10Xv3-Nonneurons nuclei \n", "10X225_1:AGAAGCGTCCATATGG LKTX_190913_02_A01 WHB-10Xv3-Nonneurons nuclei \n", "10X221_5:TTGAACGCAGCCTTCT LKTX_190830_01_E01 WHB-10Xv3-Nonneurons nuclei \n", "10X385_3:CTACCCAGTGGCGCTT LKTX_210818_04_C01 WHB-10Xv3-Nonneurons nuclei \n", "\n", " brain_section_label library_method donor_label \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG H19.30.001.CX.51 10Xv3 H19.30.001 \n", "10X383_5:TCTTGCGGTGAATTGA H19.30.002.BS.94 10Xv3 H19.30.002 \n", "10X386_2:CTCATCGGTCGAGCAA H19.30.001.CX.51 10Xv3 H19.30.001 \n", "10X378_8:TTGGATGAGACAAGCC H19.30.002.BS.93 10Xv3 H19.30.002 \n", "10X387_7:TGAACGTAGTATTCCG H19.30.001.CX.51 10Xv3 H19.30.001 \n", "... ... ... ... 
\n", "10X194_8:GAAATGAGTTCGGCTG H18.30.002.CX.50 10Xv3 H18.30.002 \n", "10X350_4:TTTACCATCGCACGAC H19.30.002.CB.62 10Xv3 H19.30.002 \n", "10X225_1:AGAAGCGTCCATATGG H18.30.002.CX.51 10Xv3 H18.30.002 \n", "10X221_5:TTGAACGCAGCCTTCT H18.30.002.CX.49 10Xv3 H18.30.002 \n", "10X385_3:CTACCCAGTGGCGCTT H19.30.002.CX.49 10Xv3 H19.30.002 \n", "\n", " donor_sex dataset_label x y \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG M WHB-10Xv3 7.533584 -15.230048 \n", "10X383_5:TCTTGCGGTGAATTGA M WHB-10Xv3 2.307856 -15.542040 \n", "10X386_2:CTCATCGGTCGAGCAA M WHB-10Xv3 6.740066 -16.186017 \n", "10X378_8:TTGGATGAGACAAGCC M WHB-10Xv3 5.926133 -20.015151 \n", "10X387_7:TGAACGTAGTATTCCG M WHB-10Xv3 5.622083 -13.561958 \n", "... ... ... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG M WHB-10Xv3 -32.154318 21.585480 \n", "10X350_4:TTTACCATCGCACGAC M WHB-10Xv3 -29.906175 23.914979 \n", "10X225_1:AGAAGCGTCCATATGG M WHB-10Xv3 -32.091915 21.212210 \n", "10X221_5:TTGAACGCAGCCTTCT M WHB-10Xv3 -30.665837 22.993120 \n", "10X385_3:CTACCCAGTGGCGCTT M WHB-10Xv3 -30.579284 22.853419 \n", "\n", " cluster_alias region_of_interest_label \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG 20 Human MoAN \n", "10X383_5:TCTTGCGGTGAATTGA 20 Human MoSR \n", "10X386_2:CTCATCGGTCGAGCAA 17 Human MoAN \n", "10X378_8:TTGGATGAGACAAGCC 18 Human PnAN \n", "10X387_7:TGAACGTAGTATTCCG 16 Human MoAN \n", "... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG 3264 Human SN \n", "10X350_4:TTTACCATCGCACGAC 3265 Human CbDN \n", "10X225_1:AGAAGCGTCCATATGG 3264 Human PAG \n", "10X221_5:TTGAACGCAGCCTTCT 3265 Human STG \n", "10X385_3:CTACCCAGTGGCGCTT 3264 Human GPi \n", "\n", " anatomical_division_label \n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG Myelencephalon \n", "10X383_5:TCTTGCGGTGAATTGA Myelencephalon \n", "10X386_2:CTCATCGGTCGAGCAA Myelencephalon \n", "10X378_8:TTGGATGAGACAAGCC Pons \n", "10X387_7:TGAACGTAGTATTCCG Myelencephalon \n", "... ... 
\n", "10X194_8:GAAATGAGTTCGGCTG Midbrain \n", "10X350_4:TTTACCATCGCACGAC Cerebellum \n", "10X225_1:AGAAGCGTCCATATGG Midbrain \n", "10X221_5:TTGAACGCAGCCTTCT Cerebral cortex \n", "10X385_3:CTACCCAGTGGCGCTT Basal nuclei \n", "\n", "[3369219 rows x 15 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cell = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='cell_metadata').set_index('cell_label')\n", "cell" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "gene.csv: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.23M/4.23M [00:00<00:00, 7.51MMB/s]\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " gene_symbol biotype \\\n", "gene_identifier \n", "ENSG00000000003 TSPAN6 protein_coding \n", "ENSG00000000005 TNMD protein_coding \n", "ENSG00000000419 DPM1 protein_coding \n", "ENSG00000000457 SCYL3 protein_coding \n", "ENSG00000000460 C1orf112 protein_coding \n", "... ... ... \n", "ENSG00000288638 AL627443.4 lncRNA \n", "ENSG00000288639 AC093246.1 protein_coding \n", "ENSG00000288642 CDR1 protein_coding \n", "ENSG00000288643 AC114982.3 protein_coding \n", "ENSG00000288645 AC084756.2 protein_coding \n", "\n", " name \n", "gene_identifier \n", "ENSG00000000003 tetraspanin 6 \n", "ENSG00000000005 tenomodulin \n", "ENSG00000000419 dolichyl-phosphate mannosyltransferase subunit... \n", "ENSG00000000457 SCY1 like pseudokinase 3 \n", "ENSG00000000460 chromosome 1 open reading frame 112 \n", "... ... \n", "ENSG00000288638 novel transcript \n", "ENSG00000288639 novel protein \n", "ENSG00000288642 cerebellar degeneration related protein 1 \n", "ENSG00000288643 novel transcript \n", "ENSG00000288645 novel protein \n", "\n", "[59357 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gene = abc_cache.get_metadata_dataframe(directory='WHB-10Xv3', file_name='gene').set_index('gene_identifier')\n", "gene" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading specific genes from the data\n", "\n", "The Whole Human Brain dataset [(Siletti et al. 2023)](https://www.science.org/doi/10.1126/science.add7046) consists of two sub-datasets: Neuron cells and Non-neuron cells. The neuron files are 30 GB in size and if we were to attempt to slice the dataset by gene we would have to load >30 GB once the data is uncompressed into memory. To avoid this, we load the data in chunks and recombine them into a pandas dataframe with all cells and the requested genes loaded.\n", "\n", "Below, we specify a select set of genes from the data that we'll load into our output data frame. 
These are the same genes that are available in the example file ``example_genes_all_cells_expression`` in the WHB-10Xv3 metadata directory." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "gene_names = ['SLC17A6', 'SLC17A7', 'SLC32A1', 'PTPRC', 'PLP1', 'AQP4', 'TTR']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a function that can be used to load specific genes from the full set of data in the WHB dataset. Reduce the chunk size if your system still runs out of memory, and reduce the number of genes if problems persist. Note that the function can load either the raw expression data or the log2 data. The full code of this function can be found [here](https://github.com/AllenInstitute/abc_atlas_access/blob/7a53a08cff0f07e9b67911c0db04fa6932fa6e9d/src/abc_atlas_access/abc_atlas_cache/anndata_utils.py#L9)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mSignature:\u001b[0m\n", "\u001b[0mget_gene_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mabc_atlas_cache\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mabc_atlas_access\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mabc_atlas_cache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mabc_project_cache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mAbcProjectCache\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mall_cells\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mpandas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mframe\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mall_genes\u001b[0m\u001b[0;34m:\u001b[0m 
\u001b[0mpandas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mframe\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mselected_genes\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdata_type\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'log2'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mchunk_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m8192\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m\n", "Load expression matrix data from the ABC Atlas and extract data for\n", "specific genes.\n", "\n", "Method will load all expression data required to process across multiple\n", "files to extract the full set of genes. This may result in downloading\n", "potentially ~100 GB of data.\n", "\n", "Parameters\n", "----------\n", "abc_atlas_cache: AbcProjectCache\n", " An AbcProjectCache instance object to handle downloading and serving\n", " the path to the expression matrix data.\n", "all_cells: pandas.DataFrame\n", " cells metadata loaded as a pandas Dataframe from the AbcProjectCache\n", " indexed on cell_label.\n", "all_genes: pandas.DataFrame\n", " genes metadata loaded as a pandas Dataframe from the AbcProjectCache\n", " indexed on gene_identifier.\n", "selected_genes: list of strings\n", " List of gene_symbols that are a subset of those in the full genes\n", " DataFrame.\n", "data_type: str (Default: \"log2\")\n", " Kind of expression matrix to load either \"log2\" or \"raw\". 
Defaults to\n", " \"log2\".\n", "chunk_size: int (Default: 8192)\n", " Size of the chunk to load from the anndata files. Adjust this size if\n", " needed based on memory/file io. Default: 8192.\n", "\n", "Returns\n", "-------\n", "output_gene_data: pandas.DataFrame\n", " Subset of gene data indexed by cell.\n", "\u001b[0;31mFile:\u001b[0m ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/anndata_utils.py\n", "\u001b[0;31mType:\u001b[0m function" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "?get_gene_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code commented out below will create a gene expression DataFrame over the full WHB dataset. Processing the full dataset takes around 10 minutes; however, downloading the full data can take up to an hour depending on your download speed." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\ngene_data = get_gene_data(\\n abc_atlas_cache=abc_cache,\\n all_cells=cell,\\n all_genes=gene,\\n selected_genes=gene_names\\n)'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "gene_data = get_gene_data(\n", " abc_atlas_cache=abc_cache,\n", " all_cells=cell,\n", " all_genes=gene,\n", " selected_genes=gene_names\n", ")\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a quicker example, we'll load only the Non-neurons from the WHB dataset. This should take a few minutes to process and around 10 minutes to download, depending on your connection speed.
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading file: WHB-10Xv3-Nonneurons\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WHB-10Xv3-Nonneurons-log2.h5ad: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4.78G/4.78G [17:16<00:00, 4.62MMB/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " - time taken: 237.755479\n", "total time taken: 381.04928600000005\n", "\ttotal cells: 888263 processed cells: 888263\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "gene_symbol PTPRC SLC17A6 SLC32A1 SLC17A7 TTR PLP1 \\\n", "cell_label \n", "10X362_3:TCAGTGAGTATTGACC 0.0 0.0 0.0 0.0 0.0 10.177927 \n", "10X362_5:TCCGTGTGTGAAAGTT 0.0 0.0 0.0 0.0 0.0 9.262379 \n", "10X362_5:CACGGGTAGAGCAGAA 0.0 0.0 0.0 0.0 0.0 11.240114 \n", "10X362_5:GATTCTTGTATGTCAC 0.0 0.0 0.0 0.0 0.0 8.314513 \n", "10X362_6:AGGACTTGTATCCTTT 0.0 0.0 0.0 0.0 0.0 9.736156 \n", "... ... ... ... ... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG 0.0 0.0 0.0 0.0 0.0 10.210833 \n", "10X350_4:TTTACCATCGCACGAC 9.587301 0.0 0.0 0.0 0.0 8.006084 \n", "10X225_1:AGAAGCGTCCATATGG 8.567961 0.0 0.0 0.0 0.0 0.0 \n", "10X221_5:TTGAACGCAGCCTTCT 10.119619 0.0 0.0 0.0 0.0 0.0 \n", "10X385_3:CTACCCAGTGGCGCTT 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "gene_symbol AQP4 \n", "cell_label \n", "10X362_3:TCAGTGAGTATTGACC 0.0 \n", "10X362_5:TCCGTGTGTGAAAGTT 0.0 \n", "10X362_5:CACGGGTAGAGCAGAA 0.0 \n", "10X362_5:GATTCTTGTATGTCAC 0.0 \n", "10X362_6:AGGACTTGTATCCTTT 0.0 \n", "... ... \n", "10X194_8:GAAATGAGTTCGGCTG 0.0 \n", "10X350_4:TTTACCATCGCACGAC 0.0 \n", "10X225_1:AGAAGCGTCCATATGG 0.0 \n", "10X221_5:TTGAACGCAGCCTTCT 0.0 \n", "10X385_3:CTACCCAGTGGCGCTT 0.0 \n", "\n", "[888263 rows x 7 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nonneuron_cells = cell[cell['feature_matrix_label'] == 'WHB-10Xv3-Nonneurons']\n", "gene_data = get_gene_data(\n", " abc_atlas_cache=abc_cache,\n", " all_cells=nonneuron_cells,\n", " all_genes=gene,\n", " selected_genes=gene_names\n", ")\n", "gene_data[pd.notna(gene_data[gene_data.columns[0]])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned DataFrame is indexed by ``cell_label`` and can thus be joined with the ``cell`` DataFrame for further analysis." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cell_barcode barcoded_cell_sample_label \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG CATGGATTCTCGACGG 10X386_2 \n", "10X383_5:TCTTGCGGTGAATTGA TCTTGCGGTGAATTGA 10X383_5 \n", "10X386_2:CTCATCGGTCGAGCAA CTCATCGGTCGAGCAA 10X386_2 \n", "10X378_8:TTGGATGAGACAAGCC TTGGATGAGACAAGCC 10X378_8 \n", "10X387_7:TGAACGTAGTATTCCG TGAACGTAGTATTCCG 10X387_7 \n", "... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG GAAATGAGTTCGGCTG 10X194_8 \n", "10X350_4:TTTACCATCGCACGAC TTTACCATCGCACGAC 10X350_4 \n", "10X225_1:AGAAGCGTCCATATGG AGAAGCGTCCATATGG 10X225_1 \n", "10X221_5:TTGAACGCAGCCTTCT TTGAACGCAGCCTTCT 10X221_5 \n", "10X385_3:CTACCCAGTGGCGCTT CTACCCAGTGGCGCTT 10X385_3 \n", "\n", " library_label feature_matrix_label entity \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei \n", "10X383_5:TCTTGCGGTGAATTGA LKTX_210818_02_E01 WHB-10Xv3-Neurons nuclei \n", "10X386_2:CTCATCGGTCGAGCAA LKTX_210825_01_B01 WHB-10Xv3-Neurons nuclei \n", "10X378_8:TTGGATGAGACAAGCC LKTX_210809_01_H01 WHB-10Xv3-Neurons nuclei \n", "10X387_7:TGAACGTAGTATTCCG LKTX_210825_02_G01 WHB-10Xv3-Neurons nuclei \n", "... ... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG LKTX_190529_02_H01 WHB-10Xv3-Nonneurons nuclei \n", "10X350_4:TTTACCATCGCACGAC LKTX_210421_03_D01 WHB-10Xv3-Nonneurons nuclei \n", "10X225_1:AGAAGCGTCCATATGG LKTX_190913_02_A01 WHB-10Xv3-Nonneurons nuclei \n", "10X221_5:TTGAACGCAGCCTTCT LKTX_190830_01_E01 WHB-10Xv3-Nonneurons nuclei \n", "10X385_3:CTACCCAGTGGCGCTT LKTX_210818_04_C01 WHB-10Xv3-Nonneurons nuclei \n", "\n", " brain_section_label library_method donor_label \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG H19.30.001.CX.51 10Xv3 H19.30.001 \n", "10X383_5:TCTTGCGGTGAATTGA H19.30.002.BS.94 10Xv3 H19.30.002 \n", "10X386_2:CTCATCGGTCGAGCAA H19.30.001.CX.51 10Xv3 H19.30.001 \n", "10X378_8:TTGGATGAGACAAGCC H19.30.002.BS.93 10Xv3 H19.30.002 \n", "10X387_7:TGAACGTAGTATTCCG H19.30.001.CX.51 10Xv3 H19.30.001 \n", "... ... ... ... 
\n", "10X194_8:GAAATGAGTTCGGCTG H18.30.002.CX.50 10Xv3 H18.30.002 \n", "10X350_4:TTTACCATCGCACGAC H19.30.002.CB.62 10Xv3 H19.30.002 \n", "10X225_1:AGAAGCGTCCATATGG H18.30.002.CX.51 10Xv3 H18.30.002 \n", "10X221_5:TTGAACGCAGCCTTCT H18.30.002.CX.49 10Xv3 H18.30.002 \n", "10X385_3:CTACCCAGTGGCGCTT H19.30.002.CX.49 10Xv3 H19.30.002 \n", "\n", " donor_sex dataset_label ... cluster_alias \\\n", "cell_label ... \n", "10X386_2:CATGGATTCTCGACGG M WHB-10Xv3 ... 20 \n", "10X383_5:TCTTGCGGTGAATTGA M WHB-10Xv3 ... 20 \n", "10X386_2:CTCATCGGTCGAGCAA M WHB-10Xv3 ... 17 \n", "10X378_8:TTGGATGAGACAAGCC M WHB-10Xv3 ... 18 \n", "10X387_7:TGAACGTAGTATTCCG M WHB-10Xv3 ... 16 \n", "... ... ... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG M WHB-10Xv3 ... 3264 \n", "10X350_4:TTTACCATCGCACGAC M WHB-10Xv3 ... 3265 \n", "10X225_1:AGAAGCGTCCATATGG M WHB-10Xv3 ... 3264 \n", "10X221_5:TTGAACGCAGCCTTCT M WHB-10Xv3 ... 3265 \n", "10X385_3:CTACCCAGTGGCGCTT M WHB-10Xv3 ... 3264 \n", "\n", " region_of_interest_label \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG Human MoAN \n", "10X383_5:TCTTGCGGTGAATTGA Human MoSR \n", "10X386_2:CTCATCGGTCGAGCAA Human MoAN \n", "10X378_8:TTGGATGAGACAAGCC Human PnAN \n", "10X387_7:TGAACGTAGTATTCCG Human MoAN \n", "... ... \n", "10X194_8:GAAATGAGTTCGGCTG Human SN \n", "10X350_4:TTTACCATCGCACGAC Human CbDN \n", "10X225_1:AGAAGCGTCCATATGG Human PAG \n", "10X221_5:TTGAACGCAGCCTTCT Human STG \n", "10X385_3:CTACCCAGTGGCGCTT Human GPi \n", "\n", " anatomical_division_label PTPRC SLC17A6 \\\n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG Myelencephalon NaN NaN \n", "10X383_5:TCTTGCGGTGAATTGA Myelencephalon NaN NaN \n", "10X386_2:CTCATCGGTCGAGCAA Myelencephalon NaN NaN \n", "10X378_8:TTGGATGAGACAAGCC Pons NaN NaN \n", "10X387_7:TGAACGTAGTATTCCG Myelencephalon NaN NaN \n", "... ... ... ... 
\n", "10X194_8:GAAATGAGTTCGGCTG Midbrain 0.0 0.0 \n", "10X350_4:TTTACCATCGCACGAC Cerebellum 9.587301 0.0 \n", "10X225_1:AGAAGCGTCCATATGG Midbrain 8.567961 0.0 \n", "10X221_5:TTGAACGCAGCCTTCT Cerebral cortex 10.119619 0.0 \n", "10X385_3:CTACCCAGTGGCGCTT Basal nuclei 0.0 0.0 \n", "\n", " SLC32A1 SLC17A7 TTR PLP1 AQP4 \n", "cell_label \n", "10X386_2:CATGGATTCTCGACGG NaN NaN NaN NaN NaN \n", "10X383_5:TCTTGCGGTGAATTGA NaN NaN NaN NaN NaN \n", "10X386_2:CTCATCGGTCGAGCAA NaN NaN NaN NaN NaN \n", "10X378_8:TTGGATGAGACAAGCC NaN NaN NaN NaN NaN \n", "10X387_7:TGAACGTAGTATTCCG NaN NaN NaN NaN NaN \n", "... ... ... ... ... ... \n", "10X194_8:GAAATGAGTTCGGCTG 0.0 0.0 0.0 10.210833 0.0 \n", "10X350_4:TTTACCATCGCACGAC 0.0 0.0 0.0 8.006084 0.0 \n", "10X225_1:AGAAGCGTCCATATGG 0.0 0.0 0.0 0.0 0.0 \n", "10X221_5:TTGAACGCAGCCTTCT 0.0 0.0 0.0 0.0 0.0 \n", "10X385_3:CTACCCAGTGGCGCTT 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[3369219 rows x 22 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cells_with_genes = cell.join(gene_data)\n", "cells_with_genes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 4 }