{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IPython magic command to render matplotlib plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 10x RNA-seq gene expression data (part 2a)\n",
    "\n",
    "In part 1, we explore two examples looking at the expression of canonical neurotransmitter transporter genes and gene Tac2 in the thalamus. In this notebook, we will prepare data so that we can repeat the examples for all cells spanning the whole brain. This notebook takes ~ 5 minutes to run.\n",
    "\n",
    "The results from this notebook has already been cached and saved. As such, you can skip this notebook and continue with part 2b. A cleaner example of accessing genes from the expression matricies can also be found in the ``general_accessing_10x_snRNASeq_tutorial.ipynb`` notebook.\n",
    "\n",
    "You need to be connected to the internet to run this notebook and have run through the [getting started notebook](https://alleninstitute.github.io/abc_atlas_access/notebooks/getting_started.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import anndata\n",
    "import time\n",
    "\n",
    "from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will interact with the data using the **AbcProjectCache**. This cache object tracks which data has been downloaded and serves the path to the requsted data on disk. For metadata, the cache can also directly serve a up a Pandas Dataframe. See the ``getting_started`` notebook for more details on using the cache including installing it if it has not already been.\n",
    "\n",
    "**Change the download_base variable to where you have downloaded the data in your system.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'releases/20240831/manifest.json'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "download_base = Path('../../data/abc_atlas')\n",
    "abc_cache = AbcProjectCache.from_cache_dir(download_base)\n",
    "\n",
    "abc_cache.current_manifest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "cell = abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='cell_metadata')\n",
    "cell.set_index('cell_label', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gene expression matrices\n",
    "\n",
    "The large 4 million cell dataset has been divided into 23 packages to make data transfer and download more efficient. Each package is formatted as annadata h5ad file with minimal metadata. In this next section, we provide example code on how to open the file and connect with the rich cell level metadata discussed above.\n",
    "\n",
    "For each subset, there are two h5ad files one storing the raw counts and the other log normalization of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>cell_count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dataset_label</th>\n",
       "      <th>feature_matrix_label</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>WMB-10XMulti</th>\n",
       "      <th>WMB-10XMulti</th>\n",
       "      <td>1687</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"10\" valign=\"top\">WMB-10Xv2</th>\n",
       "      <th>WMB-10Xv2-CTXsp</th>\n",
       "      <td>43985</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-HPF</th>\n",
       "      <td>207281</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-HY</th>\n",
       "      <td>99879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-Isocortex-1</th>\n",
       "      <td>248776</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-Isocortex-2</th>\n",
       "      <td>249360</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-Isocortex-3</th>\n",
       "      <td>249356</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-Isocortex-4</th>\n",
       "      <td>248784</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-MB</th>\n",
       "      <td>29781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-OLF</th>\n",
       "      <td>192182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv2-TH</th>\n",
       "      <td>130555</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"13\" valign=\"top\">WMB-10Xv3</th>\n",
       "      <th>WMB-10Xv3-CB</th>\n",
       "      <td>181723</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-CTXsp</th>\n",
       "      <td>78223</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-HPF</th>\n",
       "      <td>181055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-HY</th>\n",
       "      <td>162296</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-Isocortex-1</th>\n",
       "      <td>227670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-Isocortex-2</th>\n",
       "      <td>227537</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-MB</th>\n",
       "      <td>337101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-MY</th>\n",
       "      <td>191746</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-OLF</th>\n",
       "      <td>88560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-P</th>\n",
       "      <td>143157</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-PAL</th>\n",
       "      <td>108046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-STR</th>\n",
       "      <td>283782</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WMB-10Xv3-TH</th>\n",
       "      <td>130454</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     cell_count\n",
       "dataset_label feature_matrix_label             \n",
       "WMB-10XMulti  WMB-10XMulti                 1687\n",
       "WMB-10Xv2     WMB-10Xv2-CTXsp             43985\n",
       "              WMB-10Xv2-HPF              207281\n",
       "              WMB-10Xv2-HY                99879\n",
       "              WMB-10Xv2-Isocortex-1      248776\n",
       "              WMB-10Xv2-Isocortex-2      249360\n",
       "              WMB-10Xv2-Isocortex-3      249356\n",
       "              WMB-10Xv2-Isocortex-4      248784\n",
       "              WMB-10Xv2-MB                29781\n",
       "              WMB-10Xv2-OLF              192182\n",
       "              WMB-10Xv2-TH               130555\n",
       "WMB-10Xv3     WMB-10Xv3-CB               181723\n",
       "              WMB-10Xv3-CTXsp             78223\n",
       "              WMB-10Xv3-HPF              181055\n",
       "              WMB-10Xv3-HY               162296\n",
       "              WMB-10Xv3-Isocortex-1      227670\n",
       "              WMB-10Xv3-Isocortex-2      227537\n",
       "              WMB-10Xv3-MB               337101\n",
       "              WMB-10Xv3-MY               191746\n",
       "              WMB-10Xv3-OLF               88560\n",
       "              WMB-10Xv3-P                143157\n",
       "              WMB-10Xv3-PAL              108046\n",
       "              WMB-10Xv3-STR              283782\n",
       "              WMB-10Xv3-TH               130454"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matrices = cell.groupby(['dataset_label', 'feature_matrix_label'])[['library_label']].count()\n",
    "matrices.columns  = ['cell_count']\n",
    "matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example use cases\n",
    "\n",
    "In this section, we explore two use cases. The first example looks at the expression of nine canonical neurotransmitter transporter genes and the second the expression of gene Tac2.\n",
    "\n",
    "To support these use cases, we will create a smaller submatrix (all cells and 10 genes) and write to file for resue in part 2b. *Note this operation takes around 5 minutes*.\n",
    "\n",
    "A cleaner example of loading data from the expression matricies can be found in the ``general_accessing_10x_snRNASeq_tutorial.ipynb`` notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['WMB-10XMulti/log2', 'WMB-10XMulti/raw']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "abc_cache.list_data_files('WMB-10XMulti')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WMB-10XMulti-log2.h5ad: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 89.3M/89.3M [00:04<00:00, 19.0MMB/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/chris.morrison/src/data/abc_atlas/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "file = abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti/log2')\n",
    "print(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "ad = anndata.read_h5ad(file,backed='r')\n",
    "gene = ad.var"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gene_symbol</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gene_identifier</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000037771</th>\n",
       "      <td>Slc32a1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000070570</th>\n",
       "      <td>Slc17a7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000039728</th>\n",
       "      <td>Slc6a5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000030500</th>\n",
       "      <td>Slc17a6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000055368</th>\n",
       "      <td>Slc6a2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000019935</th>\n",
       "      <td>Slc17a8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000025400</th>\n",
       "      <td>Tac2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000020838</th>\n",
       "      <td>Slc6a4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000021609</th>\n",
       "      <td>Slc6a3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ENSMUSG00000100241</th>\n",
       "      <td>Slc18a3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   gene_symbol\n",
       "gene_identifier               \n",
       "ENSMUSG00000037771     Slc32a1\n",
       "ENSMUSG00000070570     Slc17a7\n",
       "ENSMUSG00000039728      Slc6a5\n",
       "ENSMUSG00000030500     Slc17a6\n",
       "ENSMUSG00000055368      Slc6a2\n",
       "ENSMUSG00000019935     Slc17a8\n",
       "ENSMUSG00000025400        Tac2\n",
       "ENSMUSG00000020838      Slc6a4\n",
       "ENSMUSG00000021609      Slc6a3\n",
       "ENSMUSG00000100241     Slc18a3"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ntgenes = ['Slc17a7', 'Slc17a6', 'Slc17a8', 'Slc32a1', 'Slc6a5', 'Slc18a3', 'Slc6a3', 'Slc6a4', 'Slc6a2']\n",
    "exgenes = ['Tac2']\n",
    "gnames = ntgenes + exgenes\n",
    "pred = [x in gnames for x in gene.gene_symbol]\n",
    "gene_filtered = gene[pred]\n",
    "gene_filtered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WMB-10XMulti\n",
      " - time taken:  0.125119999999999\n",
      "WMB-10Xv2-CTXsp\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WMB-10Xv2-CTXsp-log2.h5ad: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.74G/1.74G [01:17<00:00, 22.5MMB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " - time taken:  2.433600999999996\n",
      "WMB-10Xv2-HPF\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WMB-10Xv2-HPF-log2.h5ad: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6.10G/6.10G [15:59<00:00, 6.35MMB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " - time taken:  9.911672999999979\n",
      "total time taken:  208.38595400000003\n"
     ]
    }
   ],
   "source": [
    "# create empty gene expression dataframe\n",
    "gdata = pd.DataFrame(index=cell.index, columns=gene_filtered.index)\n",
    "count = 0\n",
    "total_start = time.process_time()\n",
    "\n",
    "for matindex in matrices.index:\n",
    "    \n",
    "    ds = matindex[0]\n",
    "    mp = matindex[1]\n",
    "    \n",
    "    print(mp)\n",
    "    \n",
    "    file = abc_cache.get_data_path(directory=ds, file_name=mp + '/log2')\n",
    "    \n",
    "    pred = (cell['feature_matrix_label'] == mp)\n",
    "    cell_filtered = cell[pred]\n",
    "    \n",
    "    start = time.process_time()\n",
    "    ad = anndata.read_h5ad(file, backed='r')\n",
    "    exp = ad[cell_filtered.index, gene_filtered.index].to_df()\n",
    "    gdata.loc[ exp.index, gene_filtered.index ] = exp\n",
    "    print(\" - time taken: \", time.process_time() - start)\n",
    "    \n",
    "    ad.file.close()\n",
    "    del ad\n",
    "    \n",
    "    count += 1\n",
    "    \n",
    "    if count > 2 :\n",
    "       break\n",
    "        \n",
    "print(\"total time taken: \", time.process_time() - total_start)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "252953\n"
     ]
    }
   ],
   "source": [
    "# change columns from index to gene symbol\n",
    "gdata.columns = gene_filtered.gene_symbol\n",
    "pred = pd.notna(gdata[gdata.columns[0]])\n",
    "gdata = gdata[pred].copy(deep=True)\n",
    "print(len(gdata))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}