Getting started#
Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset. No account or login is required. The S3 bucket is located at arn:aws:s3:::allen-brain-cell-atlas. You will need to be connected to the internet to run this notebook.
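For readers who want to point other S3 tools at the bucket, here is a minimal sketch of turning the ARN above into a bucket name and an s3:// URI. The helper names are hypothetical, and the sketch assumes that a manifest name such as releases/20240330/manifest.json is also the object key inside the bucket.

```python
def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN such as arn:aws:s3:::name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # The resource part may be "bucket" or "bucket/key"; keep the bucket only.
    return parts[5].split("/", 1)[0]


def s3_uri(arn: str, key: str) -> str:
    """Build an s3:// URI for an object key inside the bucket named by the ARN."""
    return f"s3://{bucket_from_arn(arn)}/{key.lstrip('/')}"


ARN = "arn:aws:s3:::allen-brain-cell-atlas"
print(bucket_from_arn(ARN))
# Assumption: the manifest name doubles as the object key in the bucket.
print(s3_uri(ARN, "releases/20240330/manifest.json"))
```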
Each data release has an associated manifest.json, which lists the specific versions of all directories and files that are part of the release. We recommend using the AbcProjectCache to download the data.
Expression matrices are stored in the anndata h5ad format and need to be downloaded to a local file system before use.
This notebook shows how to use the AbcProjectCache to download the data required for the tutorials.
Below we install the Python library used throughout these tutorials into the current Python environment.
pip install -U git+https://github.com/alleninstitute/abc_atlas_access
Note: you may need to restart the kernel to use updated packages.
After installing these new packages, we need to restart the Python kernel in this notebook. This can be done either by selecting Restart Kernel... under the Kernel drop-down menu above or by uncommenting and running the cell below.
# get_ipython().kernel.do_shutdown()
from pathlib import Path
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
Using the cache#
Below we show how to set up the cache to download from S3, how to list the available data releases and switch between them, and how to list the available directories, their sizes, and the files they contain.
Set up the AbcProjectCache object by specifying a download directory and calling from_s3_cache as shown below. We also print which version of the manifest the cache currently has loaded.
download_base = Path('../../abc_download_root') # Path to where you would like to write the downloaded data.
abc_cache = AbcProjectCache.from_s3_cache(download_base)
abc_cache.current_manifest
'releases/20240330/manifest.json'
List all of the releases available to the cache object we have just loaded.
abc_cache.list_manifest_file_names
['releases/20230630/manifest.json',
'releases/20230830/manifest.json',
'releases/20231215/manifest.json',
'releases/20240330/manifest.json']
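The manifest names above embed the release date, so they can be ordered without contacting S3. The following is a small sketch under the assumption that every name follows the releases/YYYYMMDD/manifest.json pattern shown in the output; the helper names are illustrative and not part of abc_atlas_access.

```python
from datetime import date, datetime


def release_date(manifest_name: str) -> date:
    """Parse the YYYYMMDD tag out of 'releases/YYYYMMDD/manifest.json'."""
    tag = manifest_name.split("/")[1]
    return datetime.strptime(tag, "%Y%m%d").date()


def latest_manifest(manifest_names: list[str]) -> str:
    """Return the manifest name with the most recent release date."""
    return max(manifest_names, key=release_date)


names = [
    "releases/20230630/manifest.json",
    "releases/20230830/manifest.json",
    "releases/20231215/manifest.json",
    "releases/20240330/manifest.json",
]
print(release_date(names[0]))
print(latest_manifest(names))
```

With a live cache, the same helpers could be applied to the value of abc_cache.list_manifest_file_names.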
We can switch to a specific manifest, and therefore a specific release of the data, using the load_manifest method. This determines which version of the released data the cache will download and return to the user. The cache keeps track of which version was last used across sessions. After instantiating a cache, the loaded manifest can be viewed with current_manifest. Note that a warning is issued if the manifest loaded by the cache is older than the most recent manifest available.
Below we show an example of loading an older manifest. Any of the strings returned by list_manifest_file_names is a valid manifest; however, we'll stick to the latest manifest for this tutorial to avoid confusion.
abc_cache.load_manifest('releases/20230630/manifest.json')
print("old manifest loaded:", abc_cache.current_manifest)
# Return to the latest manifest
abc_cache.load_latest_manifest()
print("after latest manifest loaded:", abc_cache.current_manifest)
/Users/chris.morrison/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/cloud_cache.py:567: OutdatedManifestWarning:
The manifest file you are loading is not the most up to date manifest file available for this dataset. The most up to data manifest file available for this dataset is
releases/20240330/manifest.json
To see the differences between these manifests,run
type.compare_manifests('releases/20240330/manifest.json', 'releases/20230630/manifest.json')
To see all of the manifest files currently downloaded onto your local system, run
self.list_all_downloaded_manifests()
If you just want to load the latest manifest, run
self.load_latest_manifest()
warnings.warn(msg, OutdatedManifestWarning)
old manifest loaded: releases/20230630/manifest.json
after latest manifest loaded: releases/20240330/manifest.json
We can list all available directories in the release we loaded using list_directories, as shown below. We can then list the data and metadata files available in those directories. Note that the cache will raise an exception if the requested kind of file (data files [e.g. h5ad expression_matrices, nii.gz image_volumes] or metadata files [e.g. csv files]) is not available in the directory.
abc_cache.list_directories
['Allen-CCF-2020',
'MERFISH-C57BL6J-638850',
'MERFISH-C57BL6J-638850-CCF',
'MERFISH-C57BL6J-638850-sections',
'WHB-10Xv3',
'WHB-taxonomy',
'WMB-10X',
'WMB-10XMulti',
'WMB-10Xv2',
'WMB-10Xv3',
'WMB-neighborhoods',
'WMB-taxonomy',
'Zhuang-ABCA-1',
'Zhuang-ABCA-1-CCF',
'Zhuang-ABCA-2',
'Zhuang-ABCA-2-CCF',
'Zhuang-ABCA-3',
'Zhuang-ABCA-3-CCF',
'Zhuang-ABCA-4',
'Zhuang-ABCA-4-CCF']
abc_cache.list_data_files('WMB-10Xv2')
['WMB-10Xv2-CTXsp/log2',
'WMB-10Xv2-CTXsp/raw',
'WMB-10Xv2-HPF/log2',
'WMB-10Xv2-HPF/raw',
'WMB-10Xv2-HY/log2',
'WMB-10Xv2-HY/raw',
'WMB-10Xv2-Isocortex-1/log2',
'WMB-10Xv2-Isocortex-1/raw',
'WMB-10Xv2-Isocortex-2/log2',
'WMB-10Xv2-Isocortex-2/raw',
'WMB-10Xv2-Isocortex-3/log2',
'WMB-10Xv2-Isocortex-3/raw',
'WMB-10Xv2-Isocortex-4/log2',
'WMB-10Xv2-Isocortex-4/raw',
'WMB-10Xv2-MB/log2',
'WMB-10Xv2-MB/raw',
'WMB-10Xv2-OLF/log2',
'WMB-10Xv2-OLF/raw',
'WMB-10Xv2-TH/log2',
'WMB-10Xv2-TH/raw']
abc_cache.list_metadata_files('WMB-taxonomy')
['cluster',
'cluster_annotation_term',
'cluster_annotation_term_set',
'cluster_annotation_term_with_counts',
'cluster_to_cluster_annotation_membership',
'cluster_to_cluster_annotation_membership_color',
'cluster_to_cluster_annotation_membership_pivoted']
Before we start downloading data, we can check how much total data is in a given directory for both data files and metadata files.
abc_cache.get_directory_data_size('WMB-10Xv2')
'104.16 GB'
abc_cache.get_directory_metadata_size('WMB-taxonomy')
'4.65 MB'
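The size strings returned above can be compared against the free space at the download location before a large download is started. The sketch below assumes that the strings always have the form "<number> <unit>" and that the units are decimal (1 GB = 10^9 bytes); neither is confirmed by the library, and the helper names are illustrative.

```python
import shutil
from pathlib import Path

# Assumption: decimal units. Adjust if the library reports binary units.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def size_to_bytes(size: str) -> int:
    """Convert a string such as '104.16 GB' to a number of bytes."""
    value, unit = size.split()
    return round(float(value) * _UNITS[unit.upper()])


def fits_on_disk(size: str, target: Path, safety_factor: float = 1.1) -> bool:
    """Check whether `size` (plus a safety margin) fits in the free space at `target`."""
    free = shutil.disk_usage(target).free
    return size_to_bytes(size) * safety_factor <= free


print(size_to_bytes("104.16 GB"))
print(fits_on_disk("4.65 MB", Path(".")))
```

In the notebook, the first argument could be the value returned by get_directory_data_size and the second an existing download directory.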
Downloading files#
The next set of examples shows how to download data to the directory you specified when setting up the cache object. There are two main ways of downloading the data: individually by file or by full directory.
Downloading all data files or metadata files in a directory.#
Here we show how to download the full set of data files or metadata files contained in a directory in the release. Use list_directories as a guide to what data is available. Below we download all the data in two directories we know to be small. Once all files have been downloaded, a list of Paths to the downloaded files is returned.
Be aware that several directories are large (over 100 GB). If the total size of a directory exceeds 10 GB, the cache warns the user when a download of that directory is requested.
allen_ccf_list = abc_cache.get_directory_data('Allen-CCF-2020')
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list)
annotation_10.nii.gz: 100%|██████████| 27.5M/27.5M [00:01<00:00, 22.8MMB/s]
annotation_boundary_10.nii.gz: 100%|██████████| 27.4M/27.4M [00:01<00:00, 19.3MMB/s]
average_template_10.nii.gz: 100%|██████████| 343M/343M [00:11<00:00, 29.2MMB/s]
Allen-CCF-2020 data files:
[PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]
wmb_taxonomy_list = abc_cache.get_directory_metadata('WMB-taxonomy')
print("WMB-taxonomy metadata files:\n\t", wmb_taxonomy_list)
cluster.csv: 100%|██████████| 131k/131k [00:00<00:00, 838kMB/s]
cluster_annotation_term.csv: 100%|██████████| 861k/861k [00:00<00:00, 5.60MMB/s]
cluster_annotation_term_set.csv: 100%|██████████| 1.11k/1.11k [00:00<00:00, 13.7kMB/s]
cluster_annotation_term_with_counts.csv: 100%|██████████| 902k/902k [00:00<00:00, 5.30MMB/s]
cluster_to_cluster_annotation_membership.csv: 100%|██████████| 2.21M/2.21M [00:00<00:00, 15.0MMB/s]
cluster_to_cluster_annotation_membership_color.csv: 100%|██████████| 239k/239k [00:00<00:00, 1.99MMB/s]
cluster_to_cluster_annotation_membership_pivoted.csv: 100%|██████████| 531k/531k [00:00<00:00, 3.41MMB/s]
WMB-taxonomy metadata files:
[PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster_annotation_term.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster_annotation_term_set.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/views/cluster_annotation_term_with_counts.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_color.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_pivoted.csv')]
Note that, once the files have been downloaded successfully, running the get_directory_data or get_directory_metadata methods again returns the list of local paths without redownloading the files.
allen_ccf_list = abc_cache.get_directory_data('Allen-CCF-2020')
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list, "\n\n")
wmb_taxonomy_list = abc_cache.get_directory_metadata('WMB-taxonomy')
print("WMB-taxonomy metadata files:\n\t", wmb_taxonomy_list)
Allen-CCF-2020 data files:
[PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]
WMB-taxonomy metadata files:
[PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster_annotation_term.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster_annotation_term_set.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/views/cluster_annotation_term_with_counts.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/cluster_to_cluster_annotation_membership.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_color.csv'), PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-taxonomy/20231215/views/cluster_to_cluster_annotation_membership_pivoted.csv')]
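Because the directory-level methods return plain lists of paths, it can be convenient to index them by file name. The following sketch uses only pathlib; the helper name is hypothetical, and the example paths are shortened, relative versions of the ones printed above.

```python
from pathlib import PurePosixPath


def index_by_name(paths):
    """Map a file's base name (text before the first dot) to its path."""
    indexed = {}
    for path in paths:
        # 'annotation_10.nii.gz' -> 'annotation_10'
        key = path.name.split(".", 1)[0]
        if key in indexed:
            raise ValueError(f"duplicate file name: {key}")
        indexed[key] = path
    return indexed


paths = [
    PurePosixPath("image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz"),
    PurePosixPath("image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz"),
    PurePosixPath("image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz"),
]
volumes = index_by_name(paths)
print(volumes["average_template_10"])
```

Splitting on the first dot is a simplification: a file whose base name itself contains a dot would need a different rule.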
Downloading individual files.#
Files can also be downloaded individually. We can use list_directories and the methods list_data_files and list_metadata_files to see what is available to download. Below we download one metadata file from the WMB-10X directory/dataset and one expression matrix data file from the WMB-10XMulti directory/dataset.
Downloading individual metadata files#
abc_cache.list_metadata_files('WMB-10X')
['cell_metadata',
'cell_metadata_with_cluster_annotation',
'example_genes_all_cells_expression',
'gene',
'region_of_interest_metadata']
abc_cache.get_metadata_path(directory='WMB-10X', file_name='gene')
gene.csv: 100%|██████████| 2.30M/2.30M [00:00<00:00, 4.95MMB/s]
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/WMB-10X/20231215/gene.csv')
The cache can also return metadata files as a pandas DataFrame. They are loaded with a generic index. Note that this method accepts additional arguments, which are passed to pandas.read_csv. Examples of this are used throughout the notebooks in this repo.
abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene')
| | gene_identifier | gene_symbol | name | mapped_ncbi_identifier | comment |
---|---|---|---|---|---|
0 | ENSMUSG00000051951 | Xkr4 | X-linked Kx blood group related 4 | NCBIGene:497097 | NaN |
1 | ENSMUSG00000089699 | Gm1992 | predicted gene 1992 | NaN | NaN |
2 | ENSMUSG00000102331 | Gm19938 | predicted gene, 19938 | NaN | NaN |
3 | ENSMUSG00000102343 | Gm37381 | predicted gene, 37381 | NaN | NaN |
4 | ENSMUSG00000025900 | Rp1 | retinitis pigmentosa 1 (human) | NCBIGene:19888 | NaN |
... | ... | ... | ... | ... | ... |
32280 | ENSMUSG00000095523 | AC124606.1 | PRAME family member 8-like | NCBIGene:100038995 | no expression |
32281 | ENSMUSG00000095475 | AC133095.2 | uncharacterized LOC545763 | NCBIGene:545763 | no expression |
32282 | ENSMUSG00000094855 | AC133095.1 | uncharacterized LOC620639 | NCBIGene:620639 | no expression |
32283 | ENSMUSG00000095019 | AC234645.1 | NaN | NaN | no expression |
32284 | ENSMUSG00000095041 | AC149090.1 | NaN | NaN | NaN |
32285 rows × 5 columns
Downloading individual data files#
abc_cache.list_data_files('WMB-10XMulti')
['WMB-10XMulti/log2', 'WMB-10XMulti/raw']
Note how log2 or raw is appended to the file name returned by the above function and used below. If we omit this suffix, the code raises an error describing the ambiguity, as the second call below shows.
abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti/log2')
WMB-10XMulti-log2.h5ad: 100%|██████████| 89.3M/89.3M [00:03<00:00, 24.0MMB/s]
PosixPath('/Users/chris.morrison/src/abc_download_root/expression_matrices/WMB-10XMulti/20230830/WMB-10XMulti-log2.h5ad')
abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[20], line 1
----> 1 abc_cache.get_data_path(directory='WMB-10XMulti', file_name='WMB-10XMulti')
File ~/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/abc_project_cache.py:477, in AbcProjectCache.get_data_path(self, directory, file_name, force_download, skip_hash_check)
472 data_path = self._get_local_path(
473 directory=directory,
474 file_name=file_name
475 )
476 else:
--> 477 data_path = self.cache.download_data(
478 directory=directory,
479 file_name=file_name,
480 force_download=force_download,
481 skip_hash_check=skip_hash_check
482 )
483 return data_path
File ~/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/cloud_cache.py:822, in CloudCacheBase.download_data(self, directory, file_name, force_download, skip_hash_check)
788 def download_data(
789 self,
790 directory: str,
(...)
793 skip_hash_check: bool = False
794 ) -> Path:
795 """
796 Return the local path to a data file, downloading the file
797 if necessary
(...)
820 If the file cannot be downloaded
821 """
--> 822 super_attributes = self.data_path(directory=directory,
823 file_name=file_name)
824 file_attributes = super_attributes['file_attributes']
825 # If the file exists, check that it was downloaded successfully.
File ~/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/cloud_cache.py:395, in BasicLocalCache.data_path(self, directory, file_name)
366 def data_path(self, directory: str, file_name: str) -> dict:
367 """
368 Return the local path to a data file, and test for the
369 file's existence
(...)
393 If the file cannot be downloaded
394 """
--> 395 output = self._get_file_path(directory=directory, file_name=file_name)
397 return output
File ~/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/cloud_cache.py:321, in BasicLocalCache._get_file_path(self, directory, file_name)
292 def _get_file_path(self, directory: str, file_name: str) -> dict:
293 """
294 Return the local path to a data file, and test for the
295 file's existence.
(...)
319 If the file cannot be downloaded
320 """
--> 321 file_attributes = self._manifest.get_file_attributes(
322 directory=directory,
323 file_name=file_name
324 )
325 exists = self._file_exists(file_attributes)
326 local_path = file_attributes.local_path
File ~/src/miniconda3/envs/abc_atlas_access/lib/python3.11/site-packages/abc_atlas_access/abc_atlas_cache/manifest.py:238, in Manifest.get_file_attributes(self, directory, file_name)
226 file_attributes = self._create_file_attributes(
227 remote_path=files_data[kind]["files"][file_type][
228 'url'],
(...)
235 file_hash=files_data[kind]["files"][file_type]['file_hash'] # noqa: E501
236 )
237 elif kind is None and "files" not in files_data.keys():
--> 238 raise KeyError(
239 f"File {file_name} found in directory but multiple "
240 f"files found: {list(files_data.keys())}. Please "
241 "specify the file name as one of "
242 f"{['%s/%s' % (file_name, key) for key in files_data.keys()]}" # noqa: E501
243 )
244 if file_attributes is None:
245 raise KeyError(
246 f"File {file_name} not found in directory {directory}."
247 )
KeyError: "File WMB-10XMulti found in directory but multiple files found: ['log2', 'raw']. Please specify the file name as one of ['WMB-10XMulti/log2', 'WMB-10XMulti/raw']"
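One way to avoid the ambiguity error above is to resolve the full file name from the output of list_data_files before requesting a path. This is a minimal sketch with a hypothetical helper; it assumes the "<matrix>/<normalization>" naming shown in this notebook.

```python
def resolve_data_file(available, matrix, normalization=None):
    """Return the full data file name for a matrix, e.g. 'WMB-10XMulti/log2'."""
    candidates = [
        name for name in available
        if name == matrix or name.startswith(matrix + "/")
    ]
    if normalization is not None:
        candidates = [n for n in candidates if n.endswith("/" + normalization)]
    if not candidates:
        raise KeyError(
            f"no data file matching {matrix!r} (normalization={normalization!r})"
        )
    if len(candidates) > 1:
        raise ValueError(f"{matrix!r} is ambiguous; choose one of {candidates}")
    return candidates[0]


# Values copied from the list_data_files('WMB-10XMulti') output above.
available = ["WMB-10XMulti/log2", "WMB-10XMulti/raw"]
print(resolve_data_file(available, "WMB-10XMulti", normalization="log2"))
```

The resolved name can then be passed as file_name to get_data_path.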
Advanced Options#
Forcing the cache to redownload data#
All methods that download files have an option to force the cache to redownload the file(s). This can be useful if a downloaded file has been corrupted or accidentally deleted or changed. Below are examples of using force_download when downloading an individual file and a full directory of files.
abc_cache.get_metadata_dataframe(directory='WMB-10X', file_name='gene', force_download=True)
gene.csv: 100%|██████████| 2.30M/2.30M [00:00<00:00, 5.64MMB/s]
| | gene_identifier | gene_symbol | name | mapped_ncbi_identifier | comment |
---|---|---|---|---|---|
0 | ENSMUSG00000051951 | Xkr4 | X-linked Kx blood group related 4 | NCBIGene:497097 | NaN |
1 | ENSMUSG00000089699 | Gm1992 | predicted gene 1992 | NaN | NaN |
2 | ENSMUSG00000102331 | Gm19938 | predicted gene, 19938 | NaN | NaN |
3 | ENSMUSG00000102343 | Gm37381 | predicted gene, 37381 | NaN | NaN |
4 | ENSMUSG00000025900 | Rp1 | retinitis pigmentosa 1 (human) | NCBIGene:19888 | NaN |
... | ... | ... | ... | ... | ... |
32280 | ENSMUSG00000095523 | AC124606.1 | PRAME family member 8-like | NCBIGene:100038995 | no expression |
32281 | ENSMUSG00000095475 | AC133095.2 | uncharacterized LOC545763 | NCBIGene:545763 | no expression |
32282 | ENSMUSG00000094855 | AC133095.1 | uncharacterized LOC620639 | NCBIGene:620639 | no expression |
32283 | ENSMUSG00000095019 | AC234645.1 | NaN | NaN | no expression |
32284 | ENSMUSG00000095041 | AC149090.1 | NaN | NaN | NaN |
32285 rows × 5 columns
allen_ccf_list = abc_cache.get_directory_data(directory='Allen-CCF-2020', force_download=True)
print("Allen-CCF-2020 data files:\n\t", allen_ccf_list)
annotation_10.nii.gz: 100%|██████████| 27.5M/27.5M [00:00<00:00, 42.1MMB/s]
annotation_boundary_10.nii.gz: 100%|██████████| 27.4M/27.4M [00:00<00:00, 38.3MMB/s]
average_template_10.nii.gz: 100%|██████████| 343M/343M [00:07<00:00, 48.1MMB/s]
Allen-CCF-2020 data files:
[PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/annotation_10.nii.gz'), PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/annotation_boundary_10.nii.gz'), PosixPath('/Users/chris.morrison/src/abc_download_root/image_volumes/Allen-CCF-2020/20230630/average_template_10.nii.gz')]
Skipping the file hashing check#
When a download completes, a hash of the downloaded file is computed and checked against the expected hash in the manifest. While this check is recommended, it can add overhead to the download process. skip_hash_check allows the user to skip computing the hash and assume the download completed successfully.
abc_cache.get_metadata_dataframe(directory='WMB-neighborhoods', file_name='UMAP20230830-TH-EPI-Glut', skip_hash_check=True)
UMAP20230830-TH-EPI-Glut.csv: 100%|██████████| 6.46M/6.46M [00:00<00:00, 16.1MMB/s]
| | cell_label | x | y |
---|---|---|---|
0 | CTCACACTCGTAGATC-044_D01 | -4.603476 | -6.148670 |
1 | CCGTACTCATCCAACA-036_D01 | -4.817812 | -6.366151 |
2 | AGCGGTCCATGGGAAC-037_A01 | -4.798783 | -6.577992 |
3 | GATGAGGCATGTTCCC-037_B01 | -5.188138 | -5.892220 |
4 | TAGTGGTAGGCGACAT-037_B01 | -4.715829 | -6.606307 |
... | ... | ... | ... |
126166 | TTTGTTGTCCGACATA-290_B01 | 10.042111 | 10.349521 |
126167 | TTTGTTGTCGTCTACC-294_B05 | -1.630137 | 9.033476 |
126168 | TTTGTTGTCGTTCCTG-463_A05 | -6.848272 | 12.645908 |
126169 | TTTGTTGTCGTTGCCT-621_A02 | -6.982306 | 14.718120 |
126170 | TTTGTTGTCTTTCGAT-574_A02 | -5.292696 | 6.804039 |
126171 rows × 3 columns
abc_cache.get_directory_metadata(directory='Allen-CCF-2020', skip_hash_check=True)
parcellation.csv: 100%|██████████| 41.2k/41.2k [00:00<00:00, 606kMB/s]
parcellation_term.csv: 100%|██████████| 177k/177k [00:00<00:00, 1.27MMB/s]
parcellation_term_set.csv: 100%|██████████| 628/628 [00:00<00:00, 8.04kMB/s]
parcellation_term_set_membership.csv: 100%|██████████| 114k/114k [00:00<00:00, 956kMB/s]
parcellation_term_with_counts.csv: 100%|██████████| 137k/137k [00:00<00:00, 908kMB/s]
parcellation_to_parcellation_term_membership.csv: 100%|██████████| 680k/680k [00:00<00:00, 3.71MMB/s]
parcellation_to_parcellation_term_membership_acronym.csv: 100%|██████████| 22.3k/22.3k [00:00<00:00, 96.8kMB/s]
parcellation_to_parcellation_term_membership_blue.csv: 100%|██████████| 16.4k/16.4k [00:00<00:00, 232kMB/s]
parcellation_to_parcellation_term_membership_color.csv: 100%|██████████| 30.5k/30.5k [00:00<00:00, 432kMB/s]
parcellation_to_parcellation_term_membership_green.csv: 100%|██████████| 16.5k/16.5k [00:00<00:00, 234kMB/s]
parcellation_to_parcellation_term_membership_name.csv: 100%|██████████| 75.8k/75.8k [00:00<00:00, 315kMB/s]
parcellation_to_parcellation_term_membership_red.csv: 100%|██████████| 16.0k/16.0k [00:00<00:00, 215kMB/s]
[PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/parcellation.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/parcellation_term.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/parcellation_term_set.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/parcellation_term_set_membership.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_term_with_counts.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/parcellation_to_parcellation_term_membership.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_acronym.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_blue.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_color.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_green.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_name.csv'),
PosixPath('/Users/chris.morrison/src/abc_download_root/metadata/Allen-CCF-2020/20230630/views/parcellation_to_parcellation_term_membership_red.csv')]
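To give a sense of what the hash check described in this section involves, here is a generic sketch of hashing a file in fixed-size chunks with hashlib. The notebook does not say which algorithm the cache uses, so blake2b is only an assumed default here, and the function names are illustrative rather than the library's own.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm="blake2b", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected(path, expected_hash, algorithm="blake2b"):
    """Compare a file's digest with an expected hex digest."""
    return file_digest(path, algorithm) == expected_hash


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"cell_label,x,y\n")
    expected = hashlib.blake2b(b"cell_label,x,y\n").hexdigest()
    print(matches_expected(sample, expected))
```

Reading the whole file once per check is what makes hashing expensive for multi-gigabyte expression matrices, which is the overhead skip_hash_check avoids.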