Getting started#
Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset. No account or login is required. The S3 bucket's ARN is arn:aws:s3:::allen-brain-cell-atlas. You will need to be connected to the internet to run this notebook.
Each data release has an associated manifest.json, which lists the specific versions of all directories and files that are part of the release. We recommend using the AbcProjectCache to download the data.
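As a rough illustration of where a release manifest lives, the sketch below derives the bucket name from the ARN above and builds the object locations for one manifest. The helper names are hypothetical, and the HTTPS form assumes the bucket is reachable through the standard virtual-hosted-style S3 endpoint; the AbcProjectCache handles all of this for you.

```python
# Sketch only: derive object locations from the bucket ARN given in the text.
BUCKET_ARN = "arn:aws:s3:::allen-brain-cell-atlas"


def bucket_name_from_arn(arn: str) -> str:
    """Return the bucket name, i.e. the resource part of an S3 bucket ARN."""
    prefix = "arn:aws:s3:::"
    if not arn.startswith(prefix):
        raise ValueError(f"not an S3 bucket ARN: {arn!r}")
    return arn[len(prefix):]


def manifest_locations(release: str, arn: str = BUCKET_ARN) -> dict:
    """Build the s3:// URI and an HTTPS URL for one release manifest (hypothetical helper)."""
    bucket = bucket_name_from_arn(arn)
    key = f"releases/{release}/manifest.json"
    return {
        "s3_uri": f"s3://{bucket}/{key}",
        # Assumption: virtual-hosted-style URL on the global S3 endpoint.
        "https_url": f"https://{bucket}.s3.amazonaws.com/{key}",
    }


locations = manifest_locations("20241115")
print(locations["s3_uri"])
print(locations["https_url"])
```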
Expression matrices are stored in the anndata h5ad format and need to be downloaded to a local file system before use.
This notebook shows how to use the AbcProjectCache to download the data required for the tutorials.
Below we install the Python library used throughout these tutorials into the current Python environment.
pip install -U git+https://github.com/alleninstitute/abc_atlas_access
Collecting git+https://github.com/alleninstitute/abc_atlas_access
Cloning https://github.com/alleninstitute/abc_atlas_access to /tmp/pip-req-build-xec2y8u2
Running command git clone --quiet https://github.com/alleninstitute/abc_atlas_access /tmp/pip-req-build-xec2y8u2
Resolved https://github.com/alleninstitute/abc_atlas_access to commit 8b52e7ccc086a7932c9d9289ffe18111630f333f
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: anndata in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (0.10.9)
Requirement already satisfied: boto3 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (1.35.42)
Requirement already satisfied: ghp-import in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.1.0)
Requirement already satisfied: matplotlib in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (3.9.2)
Requirement already satisfied: moto in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (5.0.17)
Requirement already satisfied: numpy in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.1.2)
Requirement already satisfied: pandas in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.2.3)
Requirement already satisfied: pydantic in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.9.2)
Requirement already satisfied: pytest in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (8.3.3)
Requirement already satisfied: requests in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.32.3)
Requirement already satisfied: scipy in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (1.14.1)
Requirement already satisfied: simpleitk in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (2.4.0)
Requirement already satisfied: tqdm in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from abc_atlas_access==0.2.0) (4.66.5)
Requirement already satisfied: array-api-compat!=1.5,>1.4 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (1.9)
Requirement already satisfied: h5py>=3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (3.12.1)
Requirement already satisfied: natsort in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (8.4.0)
Requirement already satisfied: packaging>=20.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from anndata->abc_atlas_access==0.2.0) (24.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pandas->abc_atlas_access==0.2.0) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pandas->abc_atlas_access==0.2.0) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pandas->abc_atlas_access==0.2.0) (2024.2)
Requirement already satisfied: botocore<1.36.0,>=1.35.42 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from boto3->abc_atlas_access==0.2.0) (1.35.42)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from boto3->abc_atlas_access==0.2.0) (1.0.1)
Requirement already satisfied: s3transfer<0.11.0,>=0.10.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from boto3->abc_atlas_access==0.2.0) (0.10.3)
Requirement already satisfied: contourpy>=1.0.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (4.54.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (1.4.7)
Requirement already satisfied: pillow>=8 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (11.0.0)
Requirement already satisfied: pyparsing>=2.3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from matplotlib->abc_atlas_access==0.2.0) (3.2.0)
Requirement already satisfied: cryptography>=3.3.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (43.0.1)
Requirement already satisfied: xmltodict in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (0.14.2)
Requirement already satisfied: werkzeug!=2.2.0,!=2.2.1,>=0.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (3.0.4)
Requirement already satisfied: responses>=0.15.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (0.25.3)
Requirement already satisfied: Jinja2>=2.10.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from moto->abc_atlas_access==0.2.0) (3.1.4)
Requirement already satisfied: charset-normalizer<4,>=2 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from requests->abc_atlas_access==0.2.0) (2024.8.30)
Requirement already satisfied: annotated-types>=0.6.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pydantic->abc_atlas_access==0.2.0) (0.7.0)
Requirement already satisfied: pydantic-core==2.23.4 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pydantic->abc_atlas_access==0.2.0) (2.23.4)
Requirement already satisfied: typing-extensions>=4.6.1 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pydantic->abc_atlas_access==0.2.0) (4.12.2)
Requirement already satisfied: iniconfig in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pytest->abc_atlas_access==0.2.0) (2.0.0)
Requirement already satisfied: pluggy<2,>=1.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from pytest->abc_atlas_access==0.2.0) (1.5.0)
Requirement already satisfied: cffi>=1.12 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from cryptography>=3.3.1->moto->abc_atlas_access==0.2.0) (1.17.1)
Requirement already satisfied: MarkupSafe>=2.0 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from Jinja2>=2.10.1->moto->abc_atlas_access==0.2.0) (2.1.3)
Requirement already satisfied: six>=1.5 in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas->abc_atlas_access==0.2.0) (1.16.0)
Requirement already satisfied: pyyaml in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from responses>=0.15.0->moto->abc_atlas_access==0.2.0) (6.0.2)
Requirement already satisfied: pycparser in /allen/aibs/informatics/chris.morrison/miniconda/envs/abc_atlas_access/lib/python3.11/site-packages (from cffi>=1.12->cryptography>=3.3.1->moto->abc_atlas_access==0.2.0) (2.22)
Building wheels for collected packages: abc_atlas_access
Building wheel for abc_atlas_access (pyproject.toml) ... done
Created wheel for abc_atlas_access: filename=abc_atlas_access-0.2.0-py3-none-any.whl size=21329 sha256=083b81a227621479a6b9f839df3d92395811566595dc0b9d75b8c1fd0400e274
Stored in directory: /tmp/pip-ephem-wheel-cache-it5hd7hl/wheels/10/64/b1/5ba3e93d1c252bf1b997c46ee8b4aaa4c21e4e5888caeaea20
Successfully built abc_atlas_access
Installing collected packages: abc_atlas_access
Attempting uninstall: abc_atlas_access
Found existing installation: abc_atlas_access 0.1.2
Uninstalling abc_atlas_access-0.1.2:
Successfully uninstalled abc_atlas_access-0.1.2
Successfully installed abc_atlas_access-0.2.0
Note: you may need to restart the kernel to use updated packages.
After installing these new packages, we need to restart the Python kernel in this notebook. This can be done either by selecting Restart Kernel... under the Kernel drop-down menu above or by uncommenting and running the cell below.
get_ipython().kernel.do_shutdown(restart=True)
{'status': 'ok', 'restart': True}
IPython magic command to render matplotlib plots.
from pathlib import Path
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
Using the cache#
Below we show how to set up the cache to download from S3, how to list and switch between data releases, and how to list the available directories, their sizes, and the files in each directory.
Set up the AbcProjectCache object by specifying a directory and calling from_cache_dir as shown below. We also print which version of the manifest the cache currently has loaded. This call automatically instantiates the cache and configures it either to download data through an AWS S3-enabled cache or to load it through a local read-only cache, depending on whether the user has write access. The latter is useful when accessing the data through an s3fs-fuse or similar mount of the AWS S3 bucket, such as on CodeOcean.
Users can also explicitly request a download-enabled or a read-only local cache by using the functions from_s3_cache and from_local_cache, respectively.
download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_cache_dir(download_base)
abc_cache.current_manifest
/Users/chris.morrison/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:630: OutdatedManifestWarning: You are loading releases/20240831/manifest.json. A more up to date version of the dataset -- releases/20241115/manifest.json -- exists online. To see the changes between the two versions of the dataset, run
type.compare_manifests('releases/20240831/manifest.json', 'releases/20241115/manifest.json')
To load another version of the dataset, run
type.load_manifest('releases/20241115/manifest.json')
warnings.warn(msg, OutdatedManifestWarning)
'releases/20240831/manifest.json'
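To make the choice between a download-enabled and a read-only cache more concrete, here is a minimal sketch of a write-access check. It is not the library's implementation: choose_cache_mode is a hypothetical helper, and it assumes that "write access" refers to the cache directory (or its closest existing parent).

```python
import os
import tempfile
from pathlib import Path


def choose_cache_mode(cache_dir: Path) -> str:
    """Hypothetical helper: 'download' if the location is writable, else 'read_only'."""
    probe = Path(cache_dir)
    # Walk up to the closest existing ancestor so a not-yet-created directory still works.
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return "download" if os.access(probe, os.W_OK) else "read_only"


with tempfile.TemporaryDirectory() as tmp:
    # A fresh temporary directory is writable, so downloads would be enabled here.
    mode = choose_cache_mode(Path(tmp) / "abc_atlas")
    print(mode)
```

On a read-only mount of the bucket the same check would come back as 'read_only', which corresponds to the case where from_local_cache is the appropriate constructor.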
List all of the releases available to the cache object we have just loaded.
abc_cache.list_manifest_file_names
['releases/20230630/manifest.json',
'releases/20230830/manifest.json',
'releases/20231215/manifest.json',
'releases/20240330/manifest.json',
'releases/20240831/manifest.json',
'releases/20241115/manifest.json']
We can switch to a specific manifest and release version of the data using the load_manifest method. This determines which version of the released data the cache will download and return to the user. The cache keeps track of which version was last used across sessions. After instantiating a cache, the current manifest can be viewed with the current_manifest attribute. Note that a warning is issued if the manifest loaded by the cache is older than the most recent manifest available.
Below we show an example of loading an older manifest. Any of the strings returned by list_manifest_file_names is a valid manifest; however, we'll stick to the most recent manifest for this tutorial to avoid confusion.
abc_cache.load_manifest('releases/20230630/manifest.json')
print("old manifest loaded:", abc_cache.current_manifest)
# Return to the latest manifest
abc_cache.load_latest_manifest()
print("after latest manifest loaded:", abc_cache.current_manifest)
/Users/chris.morrison/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:648: OutdatedManifestWarning: You are loading
releases/20241115/manifest.json
which is newer than the most recent manifest file you have previously been working with
releases/20240831/manifest.json
It is possible that some data files have changed between these two data releases, which will force you to re-download those data files (currently downloaded files will not be overwritten). To continue using releases/20240831/manifest.json, run
type.load_manifest('releases/20240831/manifest.json')
warnings.warn(msg, OutdatedManifestWarning)
old manifest loaded: releases/20230630/manifest.json
after latest manifest loaded: releases/20241115/manifest.json
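The "outdated manifest" check described above can be pictured as a comparison of the release dates embedded in the manifest names. The following is a sketch under that assumption, using the names listed earlier; release_date and is_outdated are hypothetical helpers, not part of abc_atlas_access.

```python
import re

# Manifest names as returned by list_manifest_file_names in this notebook.
manifest_names = [
    "releases/20230630/manifest.json",
    "releases/20230830/manifest.json",
    "releases/20231215/manifest.json",
    "releases/20240330/manifest.json",
    "releases/20240831/manifest.json",
    "releases/20241115/manifest.json",
]


def release_date(manifest_name: str) -> str:
    """Extract the YYYYMMDD release date from a manifest name."""
    match = re.fullmatch(r"releases/(\d{8})/manifest\.json", manifest_name)
    if match is None:
        raise ValueError(f"unexpected manifest name: {manifest_name!r}")
    return match.group(1)


def is_outdated(current: str, available: list) -> bool:
    """True if a newer release than `current` exists among `available`."""
    latest = max(available, key=release_date)
    # YYYYMMDD strings sort chronologically, so string comparison is enough.
    return release_date(current) < release_date(latest)


latest = max(manifest_names, key=release_date)
print(latest)
print(is_outdated("releases/20240831/manifest.json", manifest_names))
```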
We can list all available directories in the release we loaded using the attribute below. We can then list all the available data and metadata files in those directories. Note that the cache will raise an exception if the requested kind of file (data files [e.g. h5ad expression_matrices, nii.gz image_volumes] or metadata files [e.g. csv files]) is not available in the directory.
abc_cache.list_directories
['Allen-CCF-2020',
'MERFISH-C57BL6J-638850',
'MERFISH-C57BL6J-638850-CCF',
'MERFISH-C57BL6J-638850-imputed',
'MERFISH-C57BL6J-638850-sections',
'WHB-10Xv3',
'WHB-taxonomy',
'WMB-10X',
'WMB-10XMulti',
'WMB-10Xv2',
'WMB-10Xv3',
'WMB-neighborhoods',
'WMB-taxonomy',
'Zhuang-ABCA-1',
'Zhuang-ABCA-1-CCF',
'Zhuang-ABCA-2',
'Zhuang-ABCA-2-CCF',
'Zhuang-ABCA-3',
'Zhuang-ABCA-3-CCF',
'Zhuang-ABCA-4',
'Zhuang-ABCA-4-CCF']
abc_cache.list_data_files('WMB-10Xv2')
['WMB-10Xv2-CTXsp/log2',
'WMB-10Xv2-CTXsp/raw',
'WMB-10Xv2-HPF/log2',
'WMB-10Xv2-HPF/raw',
'WMB-10Xv2-HY/log2',
'WMB-10Xv2-HY/raw',
'WMB-10Xv2-Isocortex-1/log2',
'WMB-10Xv2-Isocortex-1/raw',
'WMB-10Xv2-Isocortex-2/log2',
'WMB-10Xv2-Isocortex-2/raw',
'WMB-10Xv2-Isocortex-3/log2',
'WMB-10Xv2-Isocortex-3/raw',
'WMB-10Xv2-Isocortex-4/log2',
'WMB-10Xv2-Isocortex-4/raw',
'WMB-10Xv2-MB/log2',
'WMB-10Xv2-MB/raw',
'WMB-10Xv2-OLF/log2',
'WMB-10Xv2-OLF/raw',
'WMB-10Xv2-TH/log2',
'WMB-10Xv2-TH/raw']
abc_cache.list_metadata_files('WMB-taxonomy')
['cluster',
'cluster_annotation_term',
'cluster_annotation_term_set',
'cluster_annotation_term_with_counts',
'cluster_to_cluster_annotation_membership',
'cluster_to_cluster_annotation_membership_color',
'cluster_to_cluster_annotation_membership_pivoted']
Before we start downloading data, we can check how much total data is in a given directory for both data files and metadata files.
abc_cache.get_directory_data_size('WMB-10Xv2')
'104.16 GB'
abc_cache.get_directory_metadata_size('WMB-taxonomy')
'4.65 MB'
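Since the sizes come back as human-readable strings, a small parser can help when deciding programmatically whether a directory is worth downloading. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes), which may not match how the library computes the strings, and is_large is a hypothetical helper that uses the 10 GB warning threshold this notebook mentions for directory downloads.

```python
# Assumption: decimal (SI) unit factors; the library may round or scale differently.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def size_to_bytes(size: str) -> float:
    """Convert a string such as '104.16 GB' into a number of bytes."""
    value, unit = size.split()
    return float(value) * UNIT_FACTORS[unit]


def is_large(size: str, threshold: str = "10 GB") -> bool:
    """Hypothetical helper: True if `size` exceeds `threshold`."""
    return size_to_bytes(size) > size_to_bytes(threshold)


print(is_large("104.16 GB"))  # True: WMB-10Xv2 data is well above 10 GB
print(is_large("4.65 MB"))    # False: WMB-taxonomy metadata is small
```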
Downloading files#
The next set of examples shows how to download data to the directory you specified when setting up the cache object. There are two main ways of downloading the data: individually by file or by full directory.
Downloading all data files or metadata files in a directory.#
Here we show how to download the full set of data files or metadata files contained in a directory in the release. Use the output of list_directories as a guide to what data is available. Here we download all the data in two directories we know to be small. Once all files are downloaded, a list of Paths to the downloaded files is returned.
Be aware that several directories are large, more than 100 GB. If a directory is over 10 GB in total, the cache warns the user when a download of that directory's data is requested.
allen_ccf_list = abc_cache.get_directory_data('Allen-CCF-2020')
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list)
annotation_10.nii.gz: 100%|████████████████████████████████████████████████| 27.5M/27.5M [00:04<00:00, 5.82MMB/s]
annotation_boundary_10.nii.gz: 100%|███████████████████████████████████████| 27.4M/27.4M [00:04<00:00, 6.03MMB/s]
average_template_10.nii.gz: 100%|████████████████████████████████████████████| 343M/343M [00:56<00:00, 6.11MMB/s]
Allen-CCF-2020 data files:
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]
wmb_taxonomy_list = abc_cache.get_directory_metadata('WMB-taxonomy')
print("WMB-taxonomy metadata files:\n\t", wmb_taxonomy_list)
cluster.csv: 100%|████████████████████████████████████████████████████████████| 131k/131k [00:00<00:00, 930kMB/s]
cluster_annotation_term.csv: 100%|███████████████████████████████████████████| 861k/861k [00:00<00:00, 3.29MMB/s]
cluster_annotation_term_set.csv: 100%|█████████████████████████████████████| 1.11k/1.11k [00:00<00:00, 14.5kMB/s]
cluster_annotation_term_with_counts.csv: 100%|███████████████████████████████| 902k/902k [00:00<00:00, 3.57MMB/s]
cluster_to_cluster_annotation_membership.csv: 100%|████████████████████████| 2.21M/2.21M [00:00<00:00, 4.34MMB/s]
WMB-taxonomy metadata files:
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term_set.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_annotation_term_with_counts.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_color.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_pivoted.csv')]
Note that, after the files have been downloaded successfully, running the get_directory_data or get_directory_metadata methods again returns the list of local paths without re-downloading the files.
allen_ccf_list = abc_cache.get_directory_data('Allen-CCF-2020')
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list, "\n\n")
wmb_taxonomy_list = abc_cache.get_directory_metadata('WMB-taxonomy')
print("WMB-taxonomy metadata files:\n\t", wmb_taxonomy_list)
Allen-CCF-2020 data files:
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]
WMB-taxonomy metadata files:
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_annotation_term_set.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_annotation_term_with_counts.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_color.csv'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_pivoted.csv')]
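The reuse of already-downloaded files follows a common "download only if missing" pattern. The sketch below illustrates that pattern in isolation; get_cached and fake_fetch are hypothetical stand-ins, and the real cache does more than this (for example, it tracks manifest versions).

```python
import tempfile
from pathlib import Path
from typing import Callable


def get_cached(local_path: Path, fetch: Callable[[], bytes]) -> Path:
    """Return local_path, calling fetch() only when the file is not already on disk."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(fetch())
    return local_path


calls = []


def fake_fetch() -> bytes:
    """Stand-in for a real download; records each call and returns placeholder bytes."""
    calls.append(1)
    return b"placeholder"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata" / "WMB-taxonomy" / "cluster.csv"
    get_cached(target, fake_fetch)  # first call fetches and writes the file
    get_cached(target, fake_fetch)  # second call finds the file and skips the fetch
    download_count = len(calls)

print(download_count)
```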
Downloading individual files.#
The option also exists to download files individually. We can use list_directories and the methods list_data_files and list_metadata_files to guide us as to what is available to download. Below we will download one metadata file from the WMB-10X directory/dataset and one expression matrix data file from the WMB-10XMulti directory/dataset.
Downloading individual metadata files#
abc_cache.list_metadata_files('WMB-10X')
['cell_metadata',
'cell_metadata_with_cluster_annotation',
'example_genes_all_cells_expression',
'gene',
'region_of_interest_metadata']
abc_cache.get_metadata_path(directory='WMB-10X', file_name='gene')
gene.csv: 100%|████████████████████████████████████████████████████████████| 2.30M/2.30M [00:00<00:00, 4.04MMB/s]
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/WMB-10X/20241115/gene.csv')
The cache can also return a dataframe for metadata objects. They are loaded with a generic index. Note that this method accepts additional arguments, which are passed on to the pandas.read_csv function. Examples of this are used throughout the notebooks in this repo.
abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene')
 | gene_identifier | gene_symbol | name | mapped_ncbi_identifier | comment |
---|---|---|---|---|---|
0 | ENSMUSG00000051951 | Xkr4 | X-linked Kx blood group related 4 | NCBIGene:497097 | NaN |
1 | ENSMUSG00000089699 | Gm1992 | predicted gene 1992 | NaN | NaN |
2 | ENSMUSG00000102331 | Gm19938 | predicted gene, 19938 | NaN | NaN |
3 | ENSMUSG00000102343 | Gm37381 | predicted gene, 37381 | NaN | NaN |
4 | ENSMUSG00000025900 | Rp1 | retinitis pigmentosa 1 (human) | NCBIGene:19888 | NaN |
... | ... | ... | ... | ... | ... |
32280 | ENSMUSG00000095523 | AC124606.1 | PRAME family member 8-like | NCBIGene:100038995 | no expression |
32281 | ENSMUSG00000095475 | AC133095.2 | uncharacterized LOC545763 | NCBIGene:545763 | no expression |
32282 | ENSMUSG00000094855 | AC133095.1 | uncharacterized LOC620639 | NCBIGene:620639 | no expression |
32283 | ENSMUSG00000095019 | AC234645.1 | NaN | NaN | no expression |
32284 | ENSMUSG00000095041 | AC149090.1 | NaN | NaN | NaN |
32285 rows × 5 columns
Downloading individual data files#
abc_cache.list_data_files('WMB-10XMulti')
['WMB-10XMulti/log2', 'WMB-10XMulti/raw']
Note how log2 and raw are appended to the file names returned by the function above and used below. If we do not specify one of them in the input, the code raises an error describing the ambiguity.
abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti/log2')
WMB-10XMulti-log2.h5ad: 100%|██████████████████████████████████████████████| 89.3M/89.3M [00:14<00:00, 6.11MMB/s]
PosixPath('/Users/chris.morrison/src/data/abc_atlas/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad')
abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[18], line 1
----> 1 abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti')
File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/abc_project_cache.py:508, in AbcProjectCache.get_data_path(self, directory, file_name, force_download, skip_hash_check)
503 data_path = self._get_local_path(
504 directory=directory,
505 file_name=file_name
506 )
507 else:
--> 508 data_path = self.cache.download_data(
509 directory=directory,
510 file_name=file_name,
511 force_download=force_download,
512 skip_hash_check=skip_hash_check
513 )
514 return data_path
File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:836, in CloudCacheBase.download_data(self, directory, file_name, force_download, skip_hash_check)
802 def download_data(
803 self,
804 directory: str,
(...)
807 skip_hash_check: bool = False
808 ) -> Path:
809 """
810 Return the local path to a data file, downloading the file
811 if necessary
(...)
834 If the file cannot be downloaded
835 """
--> 836 super_attributes = self.data_path(directory=directory,
837 file_name=file_name)
838 file_attributes = super_attributes['file_attributes']
839 # If the file exists, check that it was downloaded successfully.
File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:403, in BasicLocalCache.data_path(self, directory, file_name)
374 def data_path(self, directory: str, file_name: str) -> dict:
375 """
376 Return the local path to a data file, and test for the
377 file's existence
(...)
401 If the file cannot be downloaded
402 """
--> 403 output = self._get_file_path(directory=directory, file_name=file_name)
405 return output
File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/cloud_cache.py:329, in BasicLocalCache._get_file_path(self, directory, file_name)
300 def _get_file_path(self, directory: str, file_name: str) -> dict:
301 """
302 Return the local path to a data file, and test for the
303 file's existence.
(...)
327 If the file cannot be downloaded
328 """
--> 329 file_attributes = self._manifest.get_file_attributes(
330 directory=directory,
331 file_name=file_name
332 )
333 exists = self._file_exists(file_attributes)
334 local_path = file_attributes.local_path
File ~/src/abc_atlas_access/src/abc_atlas_access/abc_atlas_cache/manifest.py:238, in Manifest.get_file_attributes(self, directory, file_name)
226 file_attributes = self._create_file_attributes(
227 remote_path=files_data[kind]["files"][file_type][
228 'url'],
(...)
235 file_hash=files_data[kind]["files"][file_type]['file_hash'] # noqa: E501
236 )
237 elif kind is None and "files" not in files_data.keys():
--> 238 raise KeyError(
239 f"File {file_name} found in directory but multiple "
240 f"files found: {list(files_data.keys())}. Please "
241 "specify the file name as one of "
242 f"{['%s/%s' % (file_name, key) for key in files_data.keys()]}" # noqa: E501
243 )
244 if file_attributes is None:
245 raise KeyError(
246 f"File {file_name} not found in directory {directory}."
247 )
KeyError: "File WMB-10XMulti found in directory but multiple files found: ['log2', 'raw']. Please specify the file name as one of ['WMB-10XMulti/log2', 'WMB-10XMulti/raw']"
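The lookup behaviour behind this error can be sketched with a toy manifest fragment. The nested dictionary below is an assumed, simplified structure for illustration, the raw file name is inferred by analogy with the log2 file shown above, and resolve is a hypothetical helper rather than the library's code.

```python
# Toy manifest fragment: base file name -> kind -> local file name (assumed structure).
toy_files = {
    "WMB-10XMulti": {
        "log2": "WMB-10XMulti-log2.h5ad",
        "raw": "WMB-10XMulti-raw.h5ad",  # name assumed by analogy with the log2 file
    }
}


def resolve(file_name: str, files: dict) -> str:
    """Resolve 'base/kind' to a file, raising KeyError when the kind is missing or ambiguous."""
    base, _, kind = file_name.partition("/")
    if base not in files:
        raise KeyError(f"File {file_name} not found.")
    kinds = files[base]
    if not kind:
        if len(kinds) == 1:
            # Only one candidate, so no suffix is needed.
            return next(iter(kinds.values()))
        options = [f"{base}/{key}" for key in kinds]
        raise KeyError(f"File {base} is ambiguous; specify one of {options}")
    if kind not in kinds:
        raise KeyError(f"File {file_name} not found.")
    return kinds[kind]


print(resolve("WMB-10XMulti/log2", toy_files))
try:
    resolve("WMB-10XMulti", toy_files)
except KeyError as err:
    print("ambiguous request:", err)
```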
Advanced Options#
Forcing the cache to redownload data#
For all methods that download files, the option exists to force the cache to re-download the file(s). This can be useful if a downloaded file has become corrupted or has been accidentally deleted or changed. Below are examples of using it while downloading an individual file or a full directory of files.
abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene', force_download=True)
gene.csv: 100%|████████████████████████████████████████████████████████████| 2.30M/2.30M [00:00<00:00, 3.74MMB/s]
 | gene_identifier | gene_symbol | name | mapped_ncbi_identifier | comment |
---|---|---|---|---|---|
0 | ENSMUSG00000051951 | Xkr4 | X-linked Kx blood group related 4 | NCBIGene:497097 | NaN |
1 | ENSMUSG00000089699 | Gm1992 | predicted gene 1992 | NaN | NaN |
2 | ENSMUSG00000102331 | Gm19938 | predicted gene, 19938 | NaN | NaN |
3 | ENSMUSG00000102343 | Gm37381 | predicted gene, 37381 | NaN | NaN |
4 | ENSMUSG00000025900 | Rp1 | retinitis pigmentosa 1 (human) | NCBIGene:19888 | NaN |
... | ... | ... | ... | ... | ... |
32280 | ENSMUSG00000095523 | AC124606.1 | PRAME family member 8-like | NCBIGene:100038995 | no expression |
32281 | ENSMUSG00000095475 | AC133095.2 | uncharacterized LOC545763 | NCBIGene:545763 | no expression |
32282 | ENSMUSG00000094855 | AC133095.1 | uncharacterized LOC620639 | NCBIGene:620639 | no expression |
32283 | ENSMUSG00000095019 | AC234645.1 | NaN | NaN | no expression |
32284 | ENSMUSG00000095041 | AC149090.1 | NaN | NaN | NaN |
32285 rows × 5 columns
allen_ccf_list = abc_cache.get_directory_data(directory='Allen-CCF-2020', force_download=True)
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list)
annotation_10.nii.gz: 100%|████████████████████████████████████████████████| 27.5M/27.5M [00:04<00:00, 5.82MMB/s]
annotation_boundary_10.nii.gz: 100%|███████████████████████████████████████| 27.4M/27.4M [00:04<00:00, 5.92MMB/s]
average_template_10.nii.gz: 100%|████████████████████████████████████████████| 343M/343M [00:53<00:00, 6.36MMB/s]
Allen-CCF-2020 data files:
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/data/abc_atlas/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]
Skipping the file hashing check#
When a download is completed, a hash of the downloaded file is computed and checked against the expected hash in the manifest. While this check is recommended, it can add overhead to the download process. skip_hash_check allows the user to skip computing the hash and assume the download has completed successfully.
abc_cache.get_metadata_dataframe(directory='WMB-neighborhoods', file_name='UMAP20230830-TH-EPI-Glut', skip_hash_check=True)
UMAP20230830-TH-EPI-Glut.csv: 100%|████████████████████████████████████████| 6.46M/6.46M [00:01<00:00, 5.91MMB/s]
 | cell_label | x | y |
---|---|---|---|
0 | CTCACACTCGTAGATC-044_D01 | -4.603476 | -6.148670 |
1 | CCGTACTCATCCAACA-036_D01 | -4.817812 | -6.366151 |
2 | AGCGGTCCATGGGAAC-037_A01 | -4.798783 | -6.577992 |
3 | GATGAGGCATGTTCCC-037_B01 | -5.188138 | -5.892220 |
4 | TAGTGGTAGGCGACAT-037_B01 | -4.715829 | -6.606307 |
... | ... | ... | ... |
126166 | TTTGTTGTCCGACATA-290_B01 | 10.042111 | 10.349521 |
126167 | TTTGTTGTCGTCTACC-294_B05 | -1.630137 | 9.033476 |
126168 | TTTGTTGTCGTTCCTG-463_A05 | -6.848272 | 12.645908 |
126169 | TTTGTTGTCGTTGCCT-621_A02 | -6.982306 | 14.718120 |
126170 | TTTGTTGTCTTTCGAT-574_A02 | -5.292696 | 6.804039 |
126171 rows × 3 columns
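For readers curious what a post-download hash check involves, here is a minimal sketch. It uses SHA-256 purely for illustration, since the algorithm the library uses is not stated in this notebook, and file_hash and verify_download are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_download(path: Path, expected_hash: str, skip_hash_check: bool = False) -> bool:
    """Return True if the file matches the expected hash, or if the check is skipped."""
    if skip_hash_check:
        return True  # trust the download without reading the file again
    return file_hash(path) == expected_hash


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"
    content = b"cell_label,x,y\n"
    path.write_bytes(content)
    expected = hashlib.sha256(content).hexdigest()
    ok = verify_download(path, expected)
    mismatch = verify_download(path, "0" * 64)
    skipped = verify_download(path, "0" * 64, skip_hash_check=True)

print(ok, mismatch, skipped)
```

The overhead mentioned above comes from reading the whole file a second time, which is why skipping the check is mainly attractive for very large files.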
abc_cache.get_directory_metadata(directory='Allen-CCF-2020', skip_hash_check=True)
parcellation.csv: 100%|█████████████████████████████████████████████████████| 41.2k/41.2k [00:00<00:00, 766kMB/s]
parcellation_term.csv: 100%|█████████████████████████████████████████████████| 177k/177k [00:00<00:00, 1.87MMB/s]
parcellation_term_set.csv: 100%|███████████████████████████████████████████████| 628/628 [00:00<00:00, 9.80kMB/s]
parcellation_term_set_membership.csv: 100%|███████████████████████████████████| 114k/114k [00:00<00:00, 918kMB/s]
parcellation_term_with_counts.csv: 100%|█████████████████████████████████████| 137k/137k [00:00<00:00, 1.20MMB/s]
parcellation_to_parcellation_term_membership.csv: 100%|██████████████████████| 680k/680k [00:00<00:00, 4.86MMB/s]
parcellation_to_parcellation_term_membership_acronym.csv: 100%|█████████████| 22.3k/22.3k [00:00<00:00, 588kMB/s]
parcellation_to_parcellation_term_membership_blue.csv: 100%|████████████████| 16.4k/16.4k [00:00<00:00, 321kMB/s]
parcellation_to_parcellation_term_membership_color.csv: 100%|███████████████| 30.5k/30.5k [00:00<00:00, 382kMB/s]
parcellation_to_parcellation_term_membership_green.csv: 100%|███████████████| 16.5k/16.5k [00:00<00:00, 163kMB/s]
parcellation_to_parcellation_term_membership_name.csv: 100%|████████████████| 75.8k/75.8k [00:00<00:00, 609kMB/s]
parcellation_to_parcellation_term_membership_red.csv: 100%|█████████████████| 16.0k/16.0k [00:00<00:00, 325kMB/s]
[PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_term.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_term_set.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_term_set_membership.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_term_with_counts.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/parcellation_to_parcellation_term_membership.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_acronym.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_blue.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_color.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_green.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_name.csv'),
PosixPath('/Users/chris.morrison/src/data/abc_atlas/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_red.csv')]