Getting started

Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset. No account or login is required. The S3 bucket is located at arn:aws:s3:::allen-brain-cell-atlas. You will need to be connected to the internet to run this notebook.

Each data release has an associated manifest.json, which lists the specific versions of all directories and files that are part of the release. We recommend using the AbcProjectCache to download the data.

Expression matrices are stored in the anndata h5ad format and need to be downloaded to a local file system before use.

This notebook shows how to use the AbcProjectCache to download the data required for the tutorials.

Below we install the Python library used throughout these tutorials into the current Python environment.

pip install -U git+https://github.com/alleninstitute/abc_atlas_access
Collecting git+https://github.com/alleninstitute/abc_atlas_access
  Cloning https://github.com/alleninstitute/abc_atlas_access to /tmp/pip-req-build-xec2y8u2
  Running command git clone --quiet https://github.com/alleninstitute/abc_atlas_access /tmp/pip-req-build-xec2y8u2
  Resolved https://github.com/alleninstitute/abc_atlas_access to commit 8b52e7ccc086a7932c9d9289ffe18111630f333f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: anndata in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (0.10.9)
Requirement already satisfied: boto3 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (1.35.42)
Requirement already satisfied: ghp-import in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.1.0)
Requirement already satisfied: matplotlib in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (3.9.2)
Requirement already satisfied: moto in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (5.0.17)
Requirement already satisfied: numpy in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.1.2)
Requirement already satisfied: pandas in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.2.3)
Requirement already satisfied: pydantic in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.9.2)
Requirement already satisfied: pytest in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (8.3.3)
Requirement already satisfied: requests in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.32.3)
Requirement already satisfied: scipy in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (1.14.1)
Requirement already satisfied: simpleitk in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.4.0)
Requirement already satisfied: tqdm in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (4.66.5)
Requirement already satisfied: array-api-compat!=1.5,>1.4 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (1.9)
Requirement already satisfied: h5py>=3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (3.12.1)
Requirement already satisfied: natsort in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (8.4.0)
Requirement already satisfied: packaging>=20.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (24.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pandas->abc_atlas_access==0.2.0) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pandas->abc_atlas_access==0.2.0) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pandas->abc_atlas_access==0.2.0) (2024.2)
Requirement already satisfied: botocore<1.36.0,>=1.35.42 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from boto3->abc_atlas_access==0.2.0) (1.35.42)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from boto3->abc_atlas_access==0.2.0) (1.0.1)
Requirement already satisfied: s3transfer<0.11.0,>=0.10.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from boto3->abc_atlas_access==0.2.0) (0.10.3)
Requirement already satisfied: contourpy>=1.0.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (4.54.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (1.4.7)
Requirement already satisfied: pillow>=8 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (11.0.0)
Requirement already satisfied: pyparsing>=2.3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (3.2.0)
Requirement already satisfied: cryptography>=3.3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (43.0.1)
Requirement already satisfied: xmltodict in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (0.14.2)
Requirement already satisfied: werkzeug!=2.2.0,!=2.2.1,>=0.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (3.0.4)
Requirement already satisfied: responses>=0.15.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (0.25.3)
Requirement already satisfied: Jinja2>=2.10.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (3.1.4)
Requirement already satisfied: charset-normalizer<4,>=2 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (2024.8.30)
Requirement already satisfied: annotated-types>=0.6.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pydantic->abc_atlas_access==0.2.0) (0.7.0)
Requirement already satisfied: pydantic-core==2.23.4 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pydantic->abc_atlas_access==0.2.0) (2.23.4)
Requirement already satisfied: typing-extensions>=4.6.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pydantic->abc_atlas_access==0.2.0) (4.12.2)
Requirement already satisfied: iniconfig in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pytest->abc_atlas_access==0.2.0) (2.0.0)
Requirement already satisfied: pluggy<2,>=1.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pytest->abc_atlas_access==0.2.0) (1.5.0)
Requirement already satisfied: cffi>=1.12 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from cryptography>=3.3.1->moto->abc_atlas_access==0.2.0) (1.17.1)
Requirement already satisfied: MarkupSafe>=2.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from Jinja2>=2.10.1->moto->abc_atlas_access==0.2.0) (2.1.3)
Requirement already satisfied: six>=1.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas->abc_atlas_access==0.2.0) (1.16.0)
Requirement already satisfied: pyyaml in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from responses>=0.15.0->moto->abc_atlas_access==0.2.0) (6.0.2)
Requirement already satisfied: pycparser in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from cffi>=1.12->cryptography>=3.3.1->moto->abc_atlas_access==0.2.0) (2.22)
Building wheels for collected packages: abc_atlas_access
  Building wheel for abc_atlas_access (pyproject.toml) ... done
  Created wheel for abc_atlas_access: filename=abc_atlas_access-0.2.0-py3-none-any.whl size=21329 sha256=083b81a227621479a6b9f839df3d92395811566595dc0b9d75b8c1fd0400e274
  Stored in directory: /tmp/pip-ephem-wheel-cache-it5hd7hl/wheels/10/64/b1/5ba3e93d1c252bf1b997c46ee8b4aaa4c21e4e5888caeaea20
Successfully built abc_atlas_access
Installing collected packages: abc_atlas_access
  Attempting uninstall: abc_atlas_access
    Found existing installation: abc_atlas_access 0.1.2
    Uninstalling abc_atlas_access-0.1.2:
      Successfully uninstalled abc_atlas_access-0.1.2
Successfully installed abc_atlas_access-0.2.0
Note: you may need to restart the kernel to use updated packages.

After installing these new packages, we need to restart the Python kernel in this notebook. This can be done either by selecting Restart Kernel... under the Kernel drop-down menu above or by running the cell below.

get_ipython().kernel.do_shutdown(restart=True)
{'status': 'ok', 'restart': True}

Import the modules used in the rest of this notebook.

from pathlib import Path
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

Using the cache

Below we show how to set up the cache to download from S3, how to list and switch to a different data release, and how to list the available directories, their sizes, and the files in each directory.

Set up the AbcProjectCache object by specifying a directory and calling from_cache_dir as shown below. We also print which version of the manifest the cache currently has loaded. from_cache_dir automatically instantiates the cache and configures it either to download data through an AWS S3-enabled cache or to load it through a local read-only cache, depending on whether the user has write access. The latter is useful when accessing the data directly through an s3fs-fuse or similar mount of the AWS S3 bucket, such as on CodeOcean.

Users can also explicitly request a download-enabled or read-only local cache by using the functions from_s3_cache and from_local_cache, respectively.

download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)

abc_cache.current_manifest
/Users/chris.morrison/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:630: OutdatedManifestWarning: You are loading releases/20240831/manifest.json. A more up to date version of the dataset -- releases/20241115/manifest.json -- exists online. To see the changes between the two versions of the dataset, run
type.compare_manifests('releases/20240831/manifest.json', 'releases/20241115/manifest.json')
To load another version of the dataset, run
type.load_manifest('releases/20241115/manifest.json')
  warnings.warn(msg, OutdatedManifestWarning)
'releases/20240831/manifest.json'
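The selection rule that from_cache_dir is described as applying (download-enabled when the user has write access, read-only otherwise) can be sketched with the standard library alone. This is an illustration only, not the library's implementation: choose_cache_mode is a hypothetical helper, and using os.access as the write-access test is an assumption.

```python
import os
import tempfile
from pathlib import Path


def choose_cache_mode(cache_dir: Path) -> str:
    """Return 's3' if the cache location is writable, else 'local'.

    Hypothetical helper that only mirrors the rule described in the text.
    """
    # The cache directory may not exist yet, so test the closest
    # existing ancestor instead.
    probe = cache_dir.resolve()
    while not probe.exists():
        probe = probe.parent
    return "s3" if os.access(probe, os.W_OK) else "local"


with tempfile.TemporaryDirectory() as tmp:
    # A fresh temporary directory is normally writable.
    mode = choose_cache_mode(Path(tmp) / "abc_atlas")
    print(mode)
```

Depending on the result, one could then call from_s3_cache or from_local_cache explicitly instead of relying on from_cache_dir.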

List all of the different releases available to and usable by the cache object we have just loaded.

abc_cache.list_manifest_file_names
['releases/20230630/manifest.json',
 'releases/20230830/manifest.json',
 'releases/20231215/manifest.json',
 'releases/20240330/manifest.json',
 'releases/20240831/manifest.json',
 'releases/20241115/manifest.json']

We can switch to a specific manifest and release version of the data using the load_manifest method. This determines which version of the released data the cache will download and return to the user. The cache keeps track of which version was last used across sessions. After instantiating a cache, the current manifest can be viewed through current_manifest. Note that a warning is issued if the manifest loaded by the cache is older than the most recent manifest available.

Below we show an example of loading an older manifest. Any of the strings returned by list_manifest_file_names is a valid manifest. However, we'll stick to the latest manifest for this tutorial to avoid confusion.

abc_cache.load_manifest('releases/20230630/manifest.json')
print("old manifest loaded:", abc_cache.current_manifest)

# Return to the latest manifest
abc_cache.load_latest_manifest()
print("after latest manifest loaded:", abc_cache.current_manifest)
/Users/chris.morrison/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:648: OutdatedManifestWarning: You are loading
releases/20241115/manifest.json
which is newer than the most recent manifest file you have previously been working with
releases/20240831/manifest.json
It is possible that some data files have changed between these two data releases, which will force you to re-download those data files (currently downloaded files will not be overwritten). To continue using releases/20240831/manifest.json, run
type.load_manifest('releases/20240831/manifest.json')
  warnings.warn(msg, OutdatedManifestWarning)
old manifest loaded: releases/20230630/manifest.json
after latest manifest loaded: releases/20241115/manifest.json
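The "older than the most recent manifest" check behind these warnings can be illustrated without the library. The sketch below assumes that manifest names always follow the releases/YYYYMMDD/manifest.json pattern seen in the output above. The helper names release_date, latest_manifest and is_outdated are hypothetical, and the list is copied from the list_manifest_file_names output rather than fetched.

```python
import re

# Assumed naming pattern, taken from the manifest names printed above.
MANIFEST_PATTERN = re.compile(r"^releases/(\d{8})/manifest\.json$")


def release_date(manifest_name: str) -> str:
    """Extract the YYYYMMDD release date from a manifest name."""
    match = MANIFEST_PATTERN.match(manifest_name)
    if match is None:
        raise ValueError(f"Unexpected manifest name: {manifest_name!r}")
    return match.group(1)


def latest_manifest(manifest_names):
    """Return the manifest with the most recent release date."""
    # YYYYMMDD strings sort chronologically when compared as text.
    return max(manifest_names, key=release_date)


def is_outdated(current, manifest_names):
    """True if a newer manifest than `current` is available."""
    newest_date = release_date(latest_manifest(manifest_names))
    return release_date(current) < newest_date


available = [
    'releases/20230630/manifest.json',
    'releases/20230830/manifest.json',
    'releases/20231215/manifest.json',
    'releases/20240330/manifest.json',
    'releases/20240831/manifest.json',
    'releases/20241115/manifest.json',
]
newest = latest_manifest(available)
print(newest)
print(is_outdated('releases/20240831/manifest.json', available))
```

In practice load_latest_manifest does this selection for you. The sketch is only meant to make the comparison explicit.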

We can list all available directories in the release we loaded using list_directories, as shown below. We can then list all the available data and metadata files in those directories. Note that the cache will raise an exception if the requested kind of file (data files, e.g. h5ad expression_matrices or nii.gz image_volumes, or metadata files, e.g. csv files) is not available in the directory.

abc_cache.list_directories
['Allen-CCF-2020',
 'MERFISH-C57BL6J-638850',
 'MERFISH-C57BL6J-638850-CCF',
 'MERFISH-C57BL6J-638850-imputed',
 'MERFISH-C57BL6J-638850-sections',
 'WHB-10Xv3',
 'WHB-taxonomy',
 'WMB-10X',
 'WMB-10XMulti',
 'WMB-10Xv2',
 'WMB-10Xv3',
 'WMB-neighborhoods',
 'WMB-taxonomy',
 'Zhuang-ABCA-1',
 'Zhuang-ABCA-1-CCF',
 'Zhuang-ABCA-2',
 'Zhuang-ABCA-2-CCF',
 'Zhuang-ABCA-3',
 'Zhuang-ABCA-3-CCF',
 'Zhuang-ABCA-4',
 'Zhuang-ABCA-4-CCF']
abc_cache.list_data_files('WMB-10Xv2')
['WMB-10Xv2-CTXsp/log2',
 'WMB-10Xv2-CTXsp/raw',
 'WMB-10Xv2-HPF/log2',
 'WMB-10Xv2-HPF/raw',
 'WMB-10Xv2-HY/log2',
 'WMB-10Xv2-HY/raw',
 'WMB-10Xv2-Isocortex-1/log2',
 'WMB-10Xv2-Isocortex-1/raw',
 'WMB-10Xv2-Isocortex-2/log2',
 'WMB-10Xv2-Isocortex-2/raw',
 'WMB-10Xv2-Isocortex-3/log2',
 'WMB-10Xv2-Isocortex-3/raw',
 'WMB-10Xv2-Isocortex-4/log2',
 'WMB-10Xv2-Isocortex-4/raw',
 'WMB-10Xv2-MB/log2',
 'WMB-10Xv2-MB/raw',
 'WMB-10Xv2-OLF/log2',
 'WMB-10Xv2-OLF/raw',
 'WMB-10Xv2-TH/log2',
 'WMB-10Xv2-TH/raw']
abc_cache.list_metadata_files('WMB-taxonomy')
['cluster',
 'cluster_annotation_term',
 'cluster_annotation_term_set',
 'cluster_annotation_term_with_counts',
 'cluster_to_cluster_annotation_membership',
 'cluster_to_cluster_annotation_membership_color',
 'cluster_to_cluster_annotation_membership_pivoted']

Before we start downloading data, we can check how much total data is in a given directory for both data files and metadata files.

abc_cache.get_directory_data_size('WMB-10Xv2')
'104.16 GB'
abc_cache.get_directory_metadata_size('WMB-taxonomy')
'4.65 MB'
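Because the sizes come back as human-readable strings, a small amount of parsing is needed to compare them with the free space on disk. The sketch below is one way to do that with the standard library. parse_size and fits_on_disk are hypothetical helpers, and treating the units as powers of 1000 is an assumption (the library may use powers of 1024).

```python
import shutil
from pathlib import Path

# Assumption: decimal units. Swap in powers of 1024 if the library uses those.
UNIT_FACTORS = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}


def parse_size(size: str) -> int:
    """Convert a string such as '104.16 GB' to a number of bytes."""
    value, unit = size.split()
    return round(float(value) * UNIT_FACTORS[unit.upper()])


def fits_on_disk(size: str, target_dir: Path) -> bool:
    """True if `size` is no larger than the free space at target_dir."""
    free_bytes = shutil.disk_usage(target_dir).free
    return parse_size(size) <= free_bytes


# Size strings copied from the output above.
wmb_bytes = parse_size("104.16 GB")
taxonomy_bytes = parse_size("4.65 MB")
print(wmb_bytes, taxonomy_bytes)
print(fits_on_disk("4.65 MB", Path(".")))
```

In a notebook, the string passed to fits_on_disk would come from get_directory_data_size or get_directory_metadata_size, and target_dir would be the cache directory.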

Downloading files

The next set of examples shows how to download data to the directory you specified when setting up the cache object. There are two main ways of downloading the data: individually by file or by full directory.

Downloading all data files or metadata files in a directory

Here we show how to download the full set of data files or metadata files contained in a directory in the release. Use list_directories as a guide to what data is available. Below we download all the data in two directories we know to be small. Once all files have been downloaded, a list of Paths to the downloaded files is returned.

Be aware that several directories are large (more than 100 GB). If a directory totals more than 10 GB, the cache warns the user when a download of that directory's data is requested.

allen_ccf_list = abc_cache.get_directory_data('Allen-CCF-2020')
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list)
annotation_10.nii.gz: 100%|████████████████████████████████████████████████| 27.5M/27.5M [00:04<00:00, 5.82MMB/s]
annotation_boundary_10.nii.gz: 100%|███████████████████████████████████████| 27.4M/27.4M [00:04<00:00, 6.03MMB/s]
average_template_10.nii.gz: 100%|████████████████████████████████████████████| 343M/343M [00:56<00:00, 6.11MMB/s]
Allen-CCF-2020 data files:
	 [PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]

wmb_taxonomy_list = abc_cache.get_directory_metadata('WMB-taxonomy')
print("WMB-taxonomy metadata files:\n\t", wmb_taxonomy_list)
cluster.csv: 100%|████████████████████████████████████████████████████████████| 131k/131k [00:00<00:00, 930kMB/s]
cluster_annotation_term.csv: 100%|███████████████████████████████████████████| 861k/861k [00:00<00:00, 3.29MMB/s]
cluster_annotation_term_set.csv: 100%|█████████████████████████████████████| 1.11k/1.11k [00:00<00:00, 14.5kMB/s]
cluster_annotation_term_with_counts.csv: 100%|███████████████████████████████| 902k/902k [00:00<00:00, 3.57MMB/s]
cluster_to_cluster_annotation_membership.csv: 100%|████████████████████████| 2.21M/2.21M [00:00<00:00, 4.34MMB/s]
WMB-taxonomy metadata files:
	 [PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term_set.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_annotation_term_with_counts.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_color.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_pivoted.csv')]

Note that once the files have been downloaded successfully, calling get_directory_data or get_directory_metadata again returns the list of local paths without redownloading the files.

allen_ccf_list = abc_cache.get_directory_data('Allen-CCF-2020')
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list, "\n\n")
wmb_taxonomy_list = abc_cache.get_directory_metadata('WMB-taxonomy')
print("WMB-taxonomy metadata files:\n\t", wmb_taxonomy_list)
Allen-CCF-2020 data files:
	 [PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')] 


WMB-taxonomy metadata files:
	 [PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term_set.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_annotation_term_with_counts.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_color.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_pivoted.csv')]
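The "download once, then reuse the local copy" behaviour seen above can be pictured with a small standard-library sketch. It is not the library's code: fetch_once and fake_download are hypothetical names, and the fake download simply writes a tiny file so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_once(local_path: Path, download: Callable[[Path], None]) -> Path:
    """Call download(local_path) only when the file is not already present."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download(local_path)
    return local_path


calls: List[Path] = []


def fake_download(path: Path) -> None:
    # Stand-in for a real network transfer.
    calls.append(path)
    path.write_text("gene_identifier,gene_symbol\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata" / "gene.csv"
    first = fetch_once(target, fake_download)   # triggers the download
    second = fetch_once(target, fake_download)  # reuses the local file
    download_count = len(calls)
    same_path = first == second

print(download_count, same_path)
```

The second call returns the same path without invoking the download function, which is why rerunning the cells above prints the paths immediately and shows no progress bars.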

Downloading individual files

The option also exists to download files individually. We can use list_directories and the methods list_data_files and list_metadata_files to guide us as to what is available to download. Below we will download one metadata file from the WMB-10X directory/dataset and one expression matrix data file from the WMB-10XMulti directory/dataset.

Downloading individual metadata files

abc_cache.list_metadata_files('WMB-10X')
['cell_metadata',
 'cell_metadata_with_cluster_annotation',
 'example_genes_all_cells_expression',
 'gene',
 'region_of_interest_metadata']
abc_cache.get_metadata_path(directory='WMB-10X', file_name='gene')
gene.csv: 100%|████████████████████████████████████████████████████████████| 2.30M/2.30M [00:00<00:00, 4.04MMB/s]
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-10X/20241115/gene.csv')

The cache can also return a dataframe for metadata files. They are loaded with a generic index. Note that this method accepts additional arguments, which are passed on to pandas.read_csv. Examples of this are used throughout the notebooks in this repo.

abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene')
       gene_identifier     gene_symbol  name                               mapped_ncbi_identifier  comment
0      ENSMUSG00000051951  Xkr4         X-linked Kx blood group related 4  NCBIGene:497097         NaN
1      ENSMUSG00000089699  Gm1992       predicted gene 1992                NaN                     NaN
2      ENSMUSG00000102331  Gm19938      predicted gene, 19938              NaN                     NaN
3      ENSMUSG00000102343  Gm37381      predicted gene, 37381              NaN                     NaN
4      ENSMUSG00000025900  Rp1          retinitis pigmentosa 1 (human)     NCBIGene:19888          NaN
...    ...                 ...          ...                                ...                     ...
32280  ENSMUSG00000095523  AC124606.1   PRAME family member 8-like         NCBIGene:100038995      no expression
32281  ENSMUSG00000095475  AC133095.2   uncharacterized LOC545763          NCBIGene:545763         no expression
32282  ENSMUSG00000094855  AC133095.1   uncharacterized LOC620639          NCBIGene:620639         no expression
32283  ENSMUSG00000095019  AC234645.1   NaN                                NaN                     no expression
32284  ENSMUSG00000095041  AC149090.1   NaN                                NaN                     NaN

32285 rows × 5 columns

Downloading individual data files

abc_cache.list_data_files('WMB-10XMulti')
['WMB-10XMulti/log2', 'WMB-10XMulti/raw']

Note how log2 or raw is appended to the file name returned by the function above and used below. If we do not specify this suffix in the input, the code raises an error describing the ambiguity.

abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti/log2')
WMB-10XMulti-log2.h5ad: 100%|██████████████████████████████████████████████| 89.3M/89.3M [00:14<00:00, 6.11MMB/s]
PosixPath('/Users/chris.morrison/src/data/abc_atlas/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad')
abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[18], line 1
----> 1 abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti')

File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/abc_project_cache.py:508, in AbcProjectCache.get_data_path(self, directory, file_name, force_download, skip_hash_check)
    503     data_path = self._get_local_path(
    504         directory=directory,
    505         file_name=file_name
    506     )
    507 else:
--> 508     data_path = self.cache.download_data(
    509         directory=directory,
    510         file_name=file_name,
    511         force_download=force_download,
    512         skip_hash_check=skip_hash_check
    513     )
    514 return data_path

File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:836, in CloudCacheBase.download_data(self, directory, file_name, force_download, skip_hash_check)
    802 def download_data(
    803     self,
    804     directory: str,
   (...)
    807     skip_hash_check: bool = False
    808 ) -> Path:
    809     """
    810     Return the local path to a data file, downloading the file
    811     if necessary
   (...)
    834         If the file cannot be downloaded
    835     """
--> 836     super_attributes = self.data_path(directory=directory,
    837                                       file_name=file_name)
    838     file_attributes = super_attributes['file_attributes']
    839     # If the file exists, check that it was downloaded successfully.

File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:403, in BasicLocalCache.data_path(self, directory, file_name)
    374 def data_path(self, directory: str, file_name: str) -> dict:
    375     """
    376     Return the local path to a data file, and test for the
    377     file's existence
   (...)
    401         If the file cannot be downloaded
    402     """
--> 403     output = self._get_file_path(directory=directory, file_name=file_name)
    405     return output

File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:329, in BasicLocalCache._get_file_path(self, directory, file_name)
    300 def _get_file_path(self, directory: str, file_name: str) -> dict:
    301     """
    302     Return the local path to a data file, and test for the
    303     file's existence.
   (...)
    327         If the file cannot be downloaded
    328     """
--> 329     file_attributes = self._manifest.get_file_attributes(
    330         directory=directory,
    331         file_name=file_name
    332     )
    333     exists = self._file_exists(file_attributes)
    334     local_path = file_attributes.local_path

File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/manifest.py:238, in Manifest.get_file_attributes(self, directory, file_name)
    226             file_attributes = self._create_file_attributes(
    227                 remote_path=files_data[kind]["files"][file_type][
    228                     'url'],
   (...)
    235                 file_hash=files_data[kind]["files"][file_type]['file_hash']  # noqa: E501
    236             )
    237         elif kind is None and "files" not in files_data.keys():
--> 238             raise KeyError(
    239                 f"File {file_name} found in directory but multiple "
    240                 f"files found: {list(files_data.keys())}. Please "
    241                 "specify the file name as one of "
    242                 f"{['%s/%s' % (file_name, key) for key in files_data.keys()]}"  # noqa: E501
    243             )
    244 if file_attributes is None:
    245     raise KeyError(
    246         f"File {file_name} not found in directory {directory}."
    247     )

KeyError: "File WMB-10XMulti found in directory but multiple files found: ['log2', 'raw']. Please specify the file name as one of ['WMB-10XMulti/log2', 'WMB-10XMulti/raw']"
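One way to avoid the KeyError above is to validate a name against the output of list_data_files before requesting the file. The sketch below shows that check in isolation. resolve_data_file is a hypothetical helper, and the available list is copied from the list_data_files output above so that the example runs without the library.

```python
def resolve_data_file(available, name):
    """Return `name` if it is an exact entry, otherwise raise KeyError.

    An ambiguous base name (one that only matches as a prefix followed
    by '/') produces an error listing the valid candidates.
    """
    if name in available:
        return name
    candidates = [entry for entry in available if entry.startswith(name + "/")]
    if candidates:
        raise KeyError(f"{name!r} is ambiguous; choose one of {candidates}")
    raise KeyError(f"{name!r} not found; available: {list(available)}")


# Copied from the list_data_files('WMB-10XMulti') output above.
available = ['WMB-10XMulti/log2', 'WMB-10XMulti/raw']

chosen = resolve_data_file(available, 'WMB-10XMulti/log2')
print(chosen)

try:
    resolve_data_file(available, 'WMB-10XMulti')
    ambiguous = False
except KeyError as error:
    ambiguous = True
    print(error)
```

In a notebook, the validated name would then be passed as file_name to get_data_path.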

Advanced Options

Forcing the cache to redownload data

For all methods that download files, the option exists to force the cache to redownload the file(s) by passing force_download=True. This can be useful if a downloaded file has become corrupted or been accidentally deleted or changed. Below are examples of using it while downloading an individual file or a full directory of files.

abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene', force_download=True)
gene.csv: 100%|████████████████████████████████████████████████████████████| 2.30M/2.30M [00:00<00:00, 3.74MMB/s]
       gene_identifier     gene_symbol  name                               mapped_ncbi_identifier  comment
0      ENSMUSG00000051951  Xkr4         X-linked Kx blood group related 4  NCBIGene:497097         NaN
1      ENSMUSG00000089699  Gm1992       predicted gene 1992                NaN                     NaN
2      ENSMUSG00000102331  Gm19938      predicted gene, 19938              NaN                     NaN
3      ENSMUSG00000102343  Gm37381      predicted gene, 37381              NaN                     NaN
4      ENSMUSG00000025900  Rp1          retinitis pigmentosa 1 (human)     NCBIGene:19888          NaN
...    ...                 ...          ...                                ...                     ...
32280  ENSMUSG00000095523  AC124606.1   PRAME family member 8-like         NCBIGene:100038995      no expression
32281  ENSMUSG00000095475  AC133095.2   uncharacterized LOC545763          NCBIGene:545763         no expression
32282  ENSMUSG00000094855  AC133095.1   uncharacterized LOC620639          NCBIGene:620639         no expression
32283  ENSMUSG00000095019  AC234645.1   NaN                                NaN                     no expression
32284  ENSMUSG00000095041  AC149090.1   NaN                                NaN                     NaN

32285 rows × 5 columns

allen_ccf_list = abc_cache.get_directory_data(directory='Allen-CCF-2020', force_download=True)
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list)
annotation_10.nii.gz: 100%|████████████████████████████████████████████████| 27.5M/27.5M [00:04<00:00, 5.82MMB/s]
annotation_boundary_10.nii.gz: 100%|███████████████████████████████████████| 27.4M/27.4M [00:04<00:00, 5.92MMB/s]
average_template_10.nii.gz: 100%|████████████████████████████████████████████| 343M/343M [00:53<00:00, 6.36MMB/s]
Allen-CCF-2020 data files:
	 [PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]

Skipping the file hashing check

When a download completes, a hash of the downloaded file is computed and checked against the expected hash in the manifest. While this check is recommended, it can add overhead to the download process. The skip_hash_check option lets the user skip computing the hash and assume the download completed successfully.

abc_cache.get_metadata_dataframe(directory='WMB-neighborhoods', file_name='UMAP20230830-TH-EPI-Glut', skip_hash_check=True)
UMAP20230830-TH-EPI-Glut.csv: 100%|████████████████████████████████████████| 6.46M/6.46M [00:01<00:00, 5.91MMB/s]
        cell_label                x          y
0       CTCACACTCGTAGATC-044_D01  -4.603476  -6.148670
1       CCGTACTCATCCAACA-036_D01  -4.817812  -6.366151
2       AGCGGTCCATGGGAAC-037_A01  -4.798783  -6.577992
3       GATGAGGCATGTTCCC-037_B01  -5.188138  -5.892220
4       TAGTGGTAGGCGACAT-037_B01  -4.715829  -6.606307
...     ...                       ...        ...
126166  TTTGTTGTCCGACATA-290_B01  10.042111  10.349521
126167  TTTGTTGTCGTCTACC-294_B05  -1.630137  9.033476
126168  TTTGTTGTCGTTCCTG-463_A05  -6.848272  12.645908
126169  TTTGTTGTCGTTGCCT-621_A02  -6.982306  14.718120
126170  TTTGTTGTCTTTCGAT-574_A02  -5.292696  6.804039

126171 rows × 3 columns

abc_cache.get_directory_metadata(directory='Allen-CCF-2020', skip_hash_check=True)
parcellation.csv: 100%|█████████████████████████████████████████████████████| 41.2k/41.2k [00:00<00:00, 766kMB/s]
parcellation_term.csv: 100%|█████████████████████████████████████████████████| 177k/177k [00:00<00:00, 1.87MMB/s]
parcellation_term_set.csv: 100%|███████████████████████████████████████████████| 628/628 [00:00<00:00, 9.80kMB/s]
parcellation_term_set_membership.csv: 100%|███████████████████████████████████| 114k/114k [00:00<00:00, 918kMB/s]
parcellation_term_with_counts.csv: 100%|█████████████████████████████████████| 137k/137k [00:00<00:00, 1.20MMB/s]
parcellation_to_parcellation_term_membership.csv: 100%|██████████████████████| 680k/680k [00:00<00:00, 4.86MMB/s]
parcellation_to_parcellation_term_membership_acronym.csv: 100%|█████████████| 22.3k/22.3k [00:00<00:00, 588kMB/s]
parcellation_to_parcellation_term_membership_blue.csv: 100%|████████████████| 16.4k/16.4k [00:00<00:00, 321kMB/s]
parcellation_to_parcellation_term_membership_color.csv: 100%|███████████████| 30.5k/30.5k [00:00<00:00, 382kMB/s]
parcellation_to_parcellation_term_membership_green.csv: 100%|███████████████| 16.5k/16.5k [00:00<00:00, 163kMB/s]
parcellation_to_parcellation_term_membership_name.csv: 100%|████████████████| 75.8k/75.8k [00:00<00:00, 609kMB/s]
parcellation_to_parcellation_term_membership_red.csv: 100%|█████████████████| 16.0k/16.0k [00:00<00:00, 325kMB/s]
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_term.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_term_set.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_term_set_membership.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_term_with_counts.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_to_parcellation_term_membership.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_acronym.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_blue.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_color.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_green.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_name.csv'),
 PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_red.csv')]
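
To give a feel for what the hash check in this section costs, here is a minimal sketch of verifying a file against an expected hash. It is an illustration only, not the AbcProjectCache implementation: the function names file_hash and matches_expected are hypothetical, and SHA-256 is an assumption, since this section does not say which algorithm the manifest uses.

```python
import hashlib


def file_hash(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory.

    The algorithm is an assumption; the manifest may use a different one.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected(path, expected_hex, algorithm="sha256"):
    """Return True when the file's hash equals the expected hex digest."""
    return file_hash(path, algorithm) == expected_hex.lower()
```

Every byte of the file has to be read to compute the digest, so the overhead grows with file size. That is why skipping the check mostly pays off for large files, at the price of not noticing a corrupted download.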