Getting started#

Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset. No account or login is required. The S3 bucket is located at arn:aws:s3:::allen-brain-cell-atlas. You will need to be connected to the internet to run this notebook.

Each release has an associated manifest.json that lists the specific versions of all directories and files that are part of the release. We recommend using the manifest as the starting point for data download and usage.

Expression matrices are stored in the anndata h5ad format and need to be downloaded to a local file system before use.

The AWS Command Line Interface (AWS CLI) is a simple option for downloading specific directories and files from S3. Download and installation instructions can be found here: https://aws.amazon.com/cli/.

This notebook shows how to format AWS CLI commands to download the data required for the tutorials. You can copy those commands into a terminal shell, or optionally run them directly in this notebook by uncommenting the “subprocess.run” lines in the code.

import requests
import json
import os
import pathlib
import subprocess
import time

Using the file manifest#

Let’s open the manifest.json file associated with the current release.

version = '20230830'
url = 'https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/releases/%s/manifest.json' % version
manifest = json.loads(requests.get(url).text)
print("version: ", manifest['version'])
version:  20230830

At the top level, the manifest consists of the release version tag, the S3 resource_uri, and two dictionaries: directory_listing and file_listing. A simple option for downloading data is to use the AWS CLI to fetch specific directories or files. All the example notebooks in this repository assume that data has been downloaded locally with the same file organization as specified by the “relative_path” field in the manifest.

print("version:", manifest['version'])
print("resource_uri:", manifest['resource_uri'])
version: 20230830
resource_uri: s3://allen-brain-cell-atlas/
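Since the manifest is small, one optional convenience (not part of the release tooling) is to cache it locally so later sessions can skip the network fetch. A minimal sketch, using a toy dictionary standing in for the real manifest fetched above:

```python
import json
import pathlib
import tempfile

def cache_manifest(manifest, cache_dir):
    """Write the manifest to cache_dir, keyed by its version tag."""
    path = pathlib.Path(cache_dir) / ('manifest_%s.json' % manifest['version'])
    path.write_text(json.dumps(manifest))
    return path

# toy manifest standing in for the real one
toy = {'version': '20230830', 'resource_uri': 's3://allen-brain-cell-atlas/'}
path = cache_manifest(toy, tempfile.mkdtemp())
print(json.loads(path.read_text())['version'])  # 20230830
```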

Let’s look at the information associated with the spatial transcriptomics dataset MERFISH-C57BL6J-638850. This dataset has two related directories: expression_matrices, containing a set of h5ad files, and metadata, containing a set of csv files. Use the view_link URL to browse the directories in a web browser.

expression_matrices = manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['expression_matrices']
print(expression_matrices)
print(expression_matrices['view_link'])
{'version': '20230830', 'relative_path': 'expression_matrices/MERFISH-C57BL6J-638850/20230830', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/MERFISH-C57BL6J-638850/20230830/', 'view_link': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/MERFISH-C57BL6J-638850/20230830/', 'total_size': 15255179148}
https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/MERFISH-C57BL6J-638850/20230830/
metadata = manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['metadata']
print(metadata)
print(metadata['view_link'])
{'version': '20230830', 'relative_path': 'metadata/MERFISH-C57BL6J-638850/20230830', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/metadata/MERFISH-C57BL6J-638850/20230830/', 'view_link': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850/20230830/', 'total_size': 1942603772}
https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850/20230830/
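The manifest already provides ready-made HTTPS url fields for each directory. If you instead start from the s3:// resource_uri plus a relative_path (for example, to fetch a single file over plain HTTPS with requests rather than the AWS CLI), the mapping can be sketched as follows; the us-west-2 region is an assumption taken from the URLs shown above:

```python
def to_https(resource_uri, relative_path, region='us-west-2'):
    """Map an s3://bucket/ URI plus a relative path to its public HTTPS URL.

    The region default is an assumption copied from the manifest's url fields.
    """
    bucket = resource_uri[len('s3://'):].strip('/')
    return 'https://%s.s3.%s.amazonaws.com/%s' % (bucket, region, relative_path)

print(to_https('s3://allen-brain-cell-atlas/',
               'metadata/MERFISH-C57BL6J-638850/20230830/'))
# https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/metadata/MERFISH-C57BL6J-638850/20230830/
```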

Directory sizes are also reported as part of the manifest.json. WARNING: the expression matrices directories can get very large (> 100 GB).

GB = 1024.0 ** 3

for r in manifest['directory_listing']:
    r_dict = manifest['directory_listing'][r]
    for d in r_dict['directories']:
        d_dict = r_dict['directories'][d]
        print(d_dict['relative_path'], ":", '%0.2f GB' % (d_dict['total_size'] / GB))
        
expression_matrices/MERFISH-C57BL6J-638850/20230830 : 14.21 GB
metadata/MERFISH-C57BL6J-638850/20230830 : 1.81 GB
expression_matrices/MERFISH-C57BL6J-638850-sections/20230630 : 14.31 GB
expression_matrices/WMB-10Xv2/20230630 : 104.16 GB
expression_matrices/WMB-10Xv3/20230630 : 176.41 GB
expression_matrices/WMB-10XMulti/20230830 : 0.21 GB
metadata/WMB-10X/20230830 : 2.39 GB
metadata/WMB-taxonomy/20230830 : 0.00 GB
metadata/WMB-neighborhoods/20230830 : 3.00 GB
image_volumes/Allen-CCF-2020/20230630 : 0.37 GB
metadata/Allen-CCF-2020/20230630 : 0.00 GB
image_volumes/MERFISH-C57BL6J-638850-CCF/20230630 : 0.11 GB
metadata/MERFISH-C57BL6J-638850-CCF/20230830 : 0.59 GB
expression_matrices/Zhuang-ABCA-1/20230830 : 3.09 GB
metadata/Zhuang-ABCA-1/20230830 : 1.33 GB
metadata/Zhuang-ABCA-1-CCF/20230830 : 0.21 GB
expression_matrices/Zhuang-ABCA-2/20230830 : 1.30 GB
metadata/Zhuang-ABCA-2/20230830 : 0.57 GB
metadata/Zhuang-ABCA-2-CCF/20230830 : 0.08 GB
expression_matrices/Zhuang-ABCA-3/20230830 : 1.69 GB
metadata/Zhuang-ABCA-3/20230830 : 0.74 GB
metadata/Zhuang-ABCA-3-CCF/20230830 : 0.12 GB
expression_matrices/Zhuang-ABCA-4/20230830 : 0.16 GB
metadata/Zhuang-ABCA-4/20230830 : 0.08 GB
metadata/Zhuang-ABCA-4-CCF/20230830 : 0.01 GB
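The per-directory sizes can also be summed to estimate the total footprint before downloading anything. A minimal sketch, using a toy two-entry excerpt of directory_listing with sizes copied from the printed output above:

```python
# toy excerpt of manifest['directory_listing']; sizes copied from the output above
directory_listing = {
    'MERFISH-C57BL6J-638850': {'directories': {
        'expression_matrices': {'total_size': 15255179148},
        'metadata': {'total_size': 1942603772},
    }},
}

GB = 1024.0 ** 3
total = sum(d['total_size']
            for r in directory_listing.values()
            for d in r['directories'].values())
print('total: %0.2f GB' % (total / GB))  # total: 16.02 GB
```

Run against the full manifest, the same two-line sum reports the size of an entire release.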

Downloading files for the tutorial notebooks#

Suppose you would like to download data to the local path ../../abc_download_root.

download_base = '../../abc_download_root'

Downloading all metadata directories#

Since the metadata directories are relatively small, we will download all of them. We loop through the manifest and download each metadata directory using the AWS CLI sync command. This should take < 5 minutes.

for r in manifest['directory_listing'] :
    
    r_dict =  manifest['directory_listing'][r]
    
    for d in r_dict['directories'] :
        
        if d != 'metadata' :
            continue
        d_dict = r_dict['directories'][d]
        local_path = os.path.join( download_base, d_dict['relative_path'])
        local_path = pathlib.Path( local_path )
        remote_path = manifest['resource_uri'] + d_dict['relative_path']
        
        command = "aws s3 sync --no-sign-request %s %s" % (remote_path, local_path)
        print(command)
        
        start = time.perf_counter()
        # Uncomment to download directories
        #result = subprocess.run(command.split(),stdout=subprocess.PIPE)
        #print("time taken: ", time.perf_counter() - start)
  
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/MERFISH-C57BL6J-638850/20230830 ../../abc_download_root/metadata/MERFISH-C57BL6J-638850/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-10X/20230830 ../../abc_download_root/metadata/WMB-10X/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-taxonomy/20230830 ../../abc_download_root/metadata/WMB-taxonomy/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-neighborhoods/20230830 ../../abc_download_root/metadata/WMB-neighborhoods/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Allen-CCF-2020/20230630 ../../abc_download_root/metadata/Allen-CCF-2020/20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/MERFISH-C57BL6J-638850-CCF/20230830 ../../abc_download_root/metadata/MERFISH-C57BL6J-638850-CCF/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-1/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-1/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-1-CCF/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-1-CCF/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-2/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-2/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-2-CCF/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-2-CCF/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-3/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-3/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-3-CCF/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-3-CCF/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-4/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-4/20230830
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Zhuang-ABCA-4-CCF/20230830 ../../abc_download_root/metadata/Zhuang-ABCA-4-CCF/20230830
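If you do run the commands from Python rather than a terminal, a small wrapper that checks the return code makes failures visible; shlex.split is also safer than str.split when a local path contains spaces. A sketch, exercised here with a harmless echo stand-in rather than a real aws s3 sync:

```python
import shlex
import subprocess

def run_command(command):
    """Run one shell command, raising with stderr attached if it fails."""
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout

# harmless stand-in; substitute one of the printed aws s3 sync commands
print(run_command('echo sync complete'))  # sync complete
```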

Downloading one 10x expression matrix#

The prerequisite for running the 10x part 1 notebook is to have downloaded the log2 version of the “WMB-10Xv2-TH” matrix (4 GB). The download takes ~1 minute depending on your network speed.

We define a simple helper function to create the required AWS command. You can copy the command into a terminal shell to run it, or optionally run it inside this notebook by uncommenting the “subprocess.run” line of code.

def download_file( file_dict ) :
    
    print(file_dict['relative_path'],file_dict['size'])
    local_path = os.path.join( download_base, file_dict['relative_path'] )
    local_path = pathlib.Path( local_path )
    remote_path = manifest['resource_uri'] + file_dict['relative_path']

    command = "aws s3 cp --no-sign-request %s %s" % (remote_path, local_path)
    print(command)

    start = time.perf_counter()
    # Uncomment to download file
    #result = subprocess.run(command.split(),stdout=subprocess.PIPE)
    #print("time taken: ", time.perf_counter() - start)
expression_matrices = manifest['file_listing']['WMB-10Xv2']['expression_matrices']
file_dict = expression_matrices['WMB-10Xv2-TH']['log2']['files']['h5ad']
print('size:',file_dict['size'])
download_file( file_dict )
size: 4038679930
expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad 4038679930
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad ../../abc_download_root/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad
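After a large transfer, it is worth checking the local file against the size recorded in the manifest (the file_dict['size'] printed above). A sketch of such a check, demonstrated on a throwaway temp file standing in for the real h5ad:

```python
import os
import pathlib
import tempfile

def verify_download(local_path, expected_size):
    """True if the file exists and its size matches the manifest entry."""
    p = pathlib.Path(local_path)
    return p.exists() and p.stat().st_size == expected_size

# throwaway stand-in for e.g. WMB-10Xv2-TH-log2.h5ad and its manifest size
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'x' * 1024)
ok = verify_download(f.name, 1024)
os.unlink(f.name)
print(ok)  # True
```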

Downloading the MERFISH expression matrix#

The prerequisite for running the MERFISH part 1 notebook is to have downloaded the log2 version of the “C57BL6J-638850” matrix (7 GB). The download takes ~3 minutes depending on your network speed.

datasets = ['MERFISH-C57BL6J-638850']
for d in datasets :
    expression_matrices = manifest['file_listing'][d]['expression_matrices']
    file_dict = expression_matrices['C57BL6J-638850']['log2']['files']['h5ad']
    print('size:',file_dict['size'])
    download_file( file_dict )
size: 7627589574
expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad 7627589574
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad ../../abc_download_root/expression_matrices/MERFISH-C57BL6J-638850/20230830/C57BL6J-638850-log2.h5ad

The prerequisite for running the Zhuang MERFISH notebook is to have downloaded the log2 version of the expression matrices for all 4 brain specimens.

datasets = ['Zhuang-ABCA-1','Zhuang-ABCA-2','Zhuang-ABCA-3','Zhuang-ABCA-4']
for d in datasets :
    expression_matrices = manifest['file_listing'][d]['expression_matrices']
    file_dict = expression_matrices[d]['log2']['files']['h5ad']
    print('size:',file_dict['size'])
    download_file( file_dict )
size: 2128478610
expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad 2128478610
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad ../../abc_download_root/expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad
size: 871420938
expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad 871420938
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad ../../abc_download_root/expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad
size: 1160586154
expression_matrices/Zhuang-ABCA-3/20230830/Zhuang-ABCA-3-log2.h5ad 1160586154
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-3/20230830/Zhuang-ABCA-3-log2.h5ad ../../abc_download_root/expression_matrices/Zhuang-ABCA-3/20230830/Zhuang-ABCA-3-log2.h5ad
size: 106739752
expression_matrices/Zhuang-ABCA-4/20230830/Zhuang-ABCA-4-log2.h5ad 106739752
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/Zhuang-ABCA-4/20230830/Zhuang-ABCA-4-log2.h5ad ../../abc_download_root/expression_matrices/Zhuang-ABCA-4/20230830/Zhuang-ABCA-4-log2.h5ad
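Before starting a tutorial, you can check which of its prerequisite files are still missing locally. A sketch, with the relative paths copied from the printed commands above:

```python
import pathlib

def missing_files(download_base, relative_paths):
    """Return the manifest-relative paths not yet present under download_base."""
    base = pathlib.Path(download_base)
    return [p for p in relative_paths if not (base / p).exists()]

# relative paths copied from the printed aws s3 cp commands
paths = [
    'expression_matrices/Zhuang-ABCA-1/20230830/Zhuang-ABCA-1-log2.h5ad',
    'expression_matrices/Zhuang-ABCA-2/20230830/Zhuang-ABCA-2-log2.h5ad',
    'expression_matrices/Zhuang-ABCA-3/20230830/Zhuang-ABCA-3-log2.h5ad',
    'expression_matrices/Zhuang-ABCA-4/20230830/Zhuang-ABCA-4-log2.h5ad',
]
print(missing_files('../../abc_download_root', paths))
```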

Downloading all image volumes#

The prerequisite for running the CCF and MERFISH-to-CCF registration notebooks is to have downloaded the two sets of image volumes.

for r in manifest['directory_listing'] :
    
    r_dict =  manifest['directory_listing'][r]
    
    for d in r_dict['directories'] :
        
        if d != 'image_volumes' :
            continue
        d_dict = r_dict['directories'][d]
        local_path = os.path.join( download_base, d_dict['relative_path'])
        local_path = pathlib.Path( local_path )
        remote_path = manifest['resource_uri'] + d_dict['relative_path']
        
        command = "aws s3 sync --no-sign-request %s %s" % (remote_path, local_path)
        print(command)
        
        start = time.perf_counter()
        # Uncomment to download directories
        #result = subprocess.run(command.split(),stdout=subprocess.PIPE)
        #print("time taken: ", time.perf_counter() - start)
  
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/image_volumes/Allen-CCF-2020/20230630 ../../abc_download_root/image_volumes/Allen-CCF-2020/20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/image_volumes/MERFISH-C57BL6J-638850-CCF/20230630 ../../abc_download_root/image_volumes/MERFISH-C57BL6J-638850-CCF/20230630