Getting Experimental Metadata from DANDI

It can be helpful to review general information about the experimental sessions that produced your data. Because each NWB file typically represents one session, examining a dandiset's files gives an overview of its sessions. The metadata fields that are available vary depending on who produced the NWB file. In this notebook, the NWB files in one of the Allen Institute's dandisets are opened, and some basic information from each file is collected into a table of the experimental sessions and their properties.

Environment Setup

⚠️ Note: If running in a new environment, run this cell once and then restart the kernel. ⚠️

try:
    from databook_utils.dandi_utils import dandi_stream_open
except ImportError:
    !git clone https://github.com/AllenInstitute/openscope_databook.git
    %cd openscope_databook
    %pip install -e .
import os

import h5py
import pandas as pd
import remfile

from dandi import dandiapi
from fsspec.implementations.cached import CachingFileSystem
from pynwb import NWBHDF5IO

%matplotlib inline

Getting Dandiset Metadata

To view other data, change dandiset_id to be the id of the dandiset you’re interested in. If the dandiset is embargoed, set dandi_api_key to your DANDI API key.

dandiset_id = "000248"
dandi_api_key = None
my_dandiset = dandiapi.DandiAPIClient(token=dandi_api_key).get_dandiset(dandiset_id)
print(f"Got dandiset {my_dandiset}")
Got dandiset DANDI:000248/draft
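
Typing an API key directly into a notebook makes it easy to share by accident. One alternative, sketched below, reads the key from an environment variable and falls back to None for public dandisets. The helper `load_api_key` is hypothetical, and the variable name `DANDI_API_KEY` is assumed here because it is the name conventionally used by the DANDI command-line client; adjust it to match your own setup.

```python
import os

def load_api_key(env=os.environ, var_name="DANDI_API_KEY"):
    """Return the key stored in `var_name`, or None if it is unset or blank."""
    # strip() guards against stray whitespace from copy-pasting the key
    key = env.get(var_name, "").strip()
    return key or None

# None means "no token", which is what a public dandiset needs
dandi_api_key = load_api_key()
```

The result can be passed as `token=dandi_api_key` exactly as in the cell above.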

Get NWB Info

Below are two definitions of the function get_nwb_info, each tailored to our NWB files: the first (commented out) is for our Ophys dandisets and the second is for our Ecephys dandisets. Each retrieves a series of important metadata values from the NWB file object. The code for accessing the fields of interest to you will likely be slightly different for your files. The function can easily be altered to extract any other information from an NWB file, as long as you're familiar with the internal layout of your files. If you change what the function returns, make sure to change the columns argument of the pandas DataFrame below to match.

# get experimental information from within ophys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
# def get_nwb_info(nwb):
#         session_time = getattr(nwb, "session_start_time", None)

#         metadata_obj = getattr(nwb, "lab_meta_data", {})
#         metadata = metadata_obj.get("metadata", None)
#         session_id = getattr(metadata, "ophys_session_id", None)
#         experiment_id = getattr(metadata, "ophys_experiment_id", None)

#         fov_height = getattr(metadata, "field_of_view_height", None)
#         fov_width = getattr(metadata, "field_of_view_width", None)
#         imaging_depth = getattr(metadata, "imaging_depth", None)
#         group = getattr(metadata, "imaging_plane_group", None)
#         group_count = getattr(metadata, "imaging_plane_group_count", None)
#         container_id = getattr(metadata, "experiment_container_id", None)
        
#         subject = getattr(nwb, "subject", None)
#         specimen_name = getattr(subject, "subject_id", None)
#         age = getattr(subject, "age", None)
#         sex = getattr(subject, "sex", None)
#         genotype = getattr(subject, "genotype", None)
        
#         try: n_rois = nwb.processing["ophys"]["dff"].roi_response_series["traces"].data.shape[1]
#         except: n_rois = None
#         try: location = list(nwb.imaging_planes.values())[0].location
#         except: location = None
        
#         intervals = getattr(nwb, "intervals", {})
#         stim_types = set(intervals.keys())
#         stim_tables = [intervals[table_name] for table_name in intervals]
#         # gets highest value among final "stop times" of all stim tables in intervals
#         session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1])

#         return [session_time, session_id, experiment_id, container_id, group, group_count, imaging_depth, location, fov_height, fov_width, specimen_name, sex, age, genotype, stim_types, n_rois, session_end]
# get experimental information from within ecephys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
def get_nwb_info(nwb):
        session_time = getattr(nwb, "session_start_time", None)

        subject = getattr(nwb, "subject", None)
        specimen_name = getattr(subject, "specimen_name", None)
        age = getattr(subject, "age_in_days", None)
        sex = getattr(subject, "sex", None)
        genotype = getattr(subject, "genotype", None)

        probes = set(getattr(nwb, "devices", {}).keys())
        units = getattr(nwb, "units", [])
        n_units = len(units) if hasattr(units, '__len__') else 0        
        
        intervals = getattr(nwb, "intervals", {})
        stim_types = set(intervals.keys())
        stim_tables = [intervals[table_name] for table_name in intervals]
        # gets highest value among final "stop times" of all stim tables in intervals
        # default=None avoids a ValueError when no stim table qualifies
        session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1], default=None)

        return [session_time, specimen_name, sex, age, genotype, probes, stim_types, n_units, session_end]
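
The getattr-with-default pattern can be tried without opening any file. The sketch below uses `types.SimpleNamespace` objects as stand-ins for NWB files (they are not real NWB objects, and `subject_fields` is a hypothetical helper) to show that a missing attribute, or an attribute looked up on None, yields the default instead of raising an error.

```python
from types import SimpleNamespace

# Stand-in objects, NOT real NWB files: one with a subject, one without.
complete = SimpleNamespace(subject=SimpleNamespace(sex="F", age_in_days=101.0))
sparse = SimpleNamespace()  # has no `subject` attribute at all

def subject_fields(nwb):
    # a missing attribute returns the default instead of raising AttributeError
    subject = getattr(nwb, "subject", None)
    # getattr(None, ...) also falls through to the default, so the chain is safe
    sex = getattr(subject, "sex", None)
    age = getattr(subject, "age_in_days", None)
    return sex, age

print(subject_fields(complete))  # ('F', 101.0)
print(subject_fields(sparse))    # (None, None)

# max() raises ValueError on an empty sequence unless a default is supplied
stop_times = []
session_end = max(stop_times, default=None)
print(session_end)  # None
```

The cost of this leniency is that a misspelled attribute name silently produces None, so it is worth checking a column that comes back entirely empty.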

Getting Table

Here, each relevant file in the dandiset is streamed and opened remotely, the information of interest is extracted with the function get_nwb_info defined above, and the result is added to a table of sessions and their metadata. Files that contain data for a single probe rather than an entire session are skipped.

nwb_table = []
# skip files that aren't main session files
files = [asset for asset in my_dandiset.get_assets() if "probe" not in asset.path]
# swap this with line above for one of our ophys dandisets
# files = [asset for asset in my_dandiset.get_assets() if "raw" not in asset.path]
n_files = len(files)
print(f"{n_files} files retrieved")

for i, file in enumerate(files):
    print(f"Examining file {i+1}/{n_files}: {file.identifier}")    
    # get basic file metadata
    row = [file.identifier, file.size, file.path]
    
    # the download URL redirects; the "Location" header holds the direct file URL
    response = file.client.session.head(file.base_download_url)
    file_url = response.headers["Location"]

    # open and read nwb file with streaming
    rem_file = remfile.File(file_url)
    h5py_file = h5py.File(rem_file, "r")
    io = NWBHDF5IO(file=h5py_file, mode="r", load_namespaces=True)
    nwb = io.read()

    # extract experimental info from within file
    row += get_nwb_info(nwb)
    nwb_table.append(row)
    # close the remote file before moving on to the next one
    io.close()
    del nwb

    # don't run full loop if running in test environment
    if os.environ.get("TESTING", False):
        break
12 files retrieved
Examining file 1/12: 9ab6bfff-70ed-44dc-b384-96b4cef2b566
Examining file 2/12: 15bbb781-f912-4630-b9fe-f3df864290ad
Examining file 3/12: f7b65765-41b5-4605-9585-95338c3a9e5a
Examining file 4/12: 6b321a1c-55c4-4de5-8a25-373f2c5a4bc8
Examining file 5/12: 0073a783-5b41-42a8-882a-2960554d4e43
Examining file 6/12: 1b7aaf88-eabd-46f5-8f2a-c1fecba823f2
Examining file 7/12: 3e3d9ce2-4d45-4f41-a957-532cbdf5bc39
Examining file 8/12: f1a26076-0a3e-43c8-b138-5320bca7a23f
Examining file 9/12: bc340647-7b72-4ec8-aa86-ac236da36713
Examining file 10/12: efb53d2c-2d51-4a26-9b36-5d3d3ff7e19e
Examining file 11/12: e61edcf3-26e9-44e1-9017-3ee558197da5
Examining file 12/12: 233e6f43-d51e-47f0-9cda-838444c274f6
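
The file filter can be checked on plain path strings before any file is streamed. In this sketch the first path is one of this dandiset's session files, while the second is a made-up probe-level file name used only for illustration; `is_session_file` is a hypothetical helper. Unlike the list comprehension above, which tests the whole path, this version tests only the file name, so a directory that happens to contain the token does not exclude a file.

```python
# The first path is a real session file from this dandiset; the second is a
# made-up probe-level file name used only to demonstrate the filter.
asset_paths = [
    "sub-633229/sub-633229_ses-1199247593_ogen.nwb",
    "sub-633229/sub-633229_ses-1199247593_probe-0_ecephys.nwb",  # hypothetical
]

def is_session_file(path, skip_token="probe"):
    """Keep a path unless `skip_token` appears in its file name."""
    filename = path.split("/")[-1]
    return skip_token not in filename

session_paths = [p for p in asset_paths if is_session_file(p)]
print(session_paths)
```

Passing `skip_token="raw"` gives the equivalent of the commented-out ophys filter.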
# convert table to pandas dataframe
dandiset_files = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session_time", "sub_name", "sub_sex", "sub_age", "sub_genotype", "probes", "stim types", "#_units", "session_length"))
# swap this with line above for one of our ophys dandisets
# dandiset_files = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session_time", "session_id", "experiment_id", "container_id", "group", "group_count", "imaging_depth", "location", "fov_height", "fov_width", "specimen_name", "sub_sex", "sub_age", "sub_genotype", "stim_types", "#_rois", "session_end"))
dandiset_files
identifier size path session_time sub_name sub_sex sub_age sub_genotype probes stim types #_units session_length
0 9ab6bfff-70ed-44dc-b384-96b4cef2b566 3308934228 sub-633229/sub-633229_ses-1199247593_ogen.nwb 2022-08-17 00:00:00-07:00 633229 F 101.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 3026 7279.58784
1 15bbb781-f912-4630-b9fe-f3df864290ad 2773818936 sub-631510/sub-631510_ses-1196157974_ogen.nwb 2022-08-03 00:00:00-07:00 631510 F 99.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2386 7339.21131
2 f7b65765-41b5-4605-9585-95338c3a9e5a 2717870004 sub-620334/sub-620334_ses-1189887297_ogen.nwb 2022-07-06 00:00:00-07:00 620334 M 154.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2092 7279.89479
3 6b321a1c-55c4-4de5-8a25-373f2c5a4bc8 3036007774 sub-620333/sub-620333_ses-1188137866_ogen.nwb 2022-06-30 00:00:00-07:00 620333 M 148.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2593 7283.08716
4 0073a783-5b41-42a8-882a-2960554d4e43 2108745405 sub-631570/sub-631570_ses-1194857009_ogen.nwb 2022-07-28 00:00:00-07:00 631570 F 92.0 Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 1789 7278.92195
5 1b7aaf88-eabd-46f5-8f2a-c1fecba823f2 2598789798 sub-625555/sub-625555_ses-1183070926_ogen.nwb 2022-06-09 00:00:00-07:00 625555 F 90.0 Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2621 7278.57204
6 3e3d9ce2-4d45-4f41-a957-532cbdf5bc39 3619691457 sub-625554/sub-625554_ses-1181330601_ogen.nwb 2022-06-01 00:00:00-07:00 625554 M 82.0 Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2930 7315.43515
7 f1a26076-0a3e-43c8-b138-5320bca7a23f 2469141920 sub-619296/sub-619296_ses-1187930705_ogen.nwb 2022-06-29 00:00:00-07:00 619296 M 154.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 1918 7278.13709
8 bc340647-7b72-4ec8-aa86-ac236da36713 2709636662 sub-630506/sub-630506_ses-1192952695_ogen.nwb 2022-07-20 00:00:00-07:00 630506 F 92.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2517 7279.14674
9 efb53d2c-2d51-4a26-9b36-5d3d3ff7e19e 2803551453 sub-625545/sub-625545_ses-1182865981_ogen.nwb 2022-06-08 00:00:00-07:00 625545 M 89.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2793 7279.21336
10 e61edcf3-26e9-44e1-9017-3ee558197da5 2453451408 sub-619293/sub-619293_ses-1184980079_ogen.nwb 2022-06-16 00:00:00-07:00 619293 M 141.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2136 7454.53444
11 233e6f43-d51e-47f0-9cda-838444c274f6 2557113684 sub-637484/sub-637484_ses-1208667752_ogen.nwb 2022-09-08 00:00:00-07:00 637484 M 92.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2373 7349.25154
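
If get_nwb_info is edited to return more or fewer values, the column names must change with it, otherwise building the DataFrame fails with an error that does not say which row is at fault. A small check such as the following (the helper `check_rows` is hypothetical) can be run on the list of rows first to report the mismatch explicitly.

```python
columns = ("identifier", "size", "path", "session_time", "sub_name", "sub_sex",
           "sub_age", "sub_genotype", "probes", "stim types", "#_units",
           "session_length")

def check_rows(rows, columns):
    """Raise a descriptive error if any row's length differs from the header's."""
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(
                f"row {i} has {len(row)} values but there are {len(columns)} columns"
            )

# placeholder row with the right number of values; passes silently
good_row = [None] * len(columns)
check_rows([good_row], columns)
```

In the notebook this would be called as `check_rows(nwb_table, columns)` just before the DataFrame is created.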
# summary for ophys files
# n_sessions = len(dandiset_files["session_id"].value_counts())
# the ophys table names its subject column "specimen_name" rather than "sub_name"
# subjects_info = dandiset_files.groupby(["specimen_name", "sub_sex"]).size().reset_index().to_dict()
# m_count = len([sex for sex in subjects_info["sub_sex"].values() if sex == "M"])
# f_count = len([sex for sex in subjects_info["sub_sex"].values() if sex == "F"])
# print("Dandiset Overview:")
# print(len(dandiset_files), "dandiset_files")
# print(len(subjects_info["specimen_name"]), "subjects", m_count, "males", f_count,"females")

# summary for ecephys files
m_count = len(dandiset_files["sub_sex"][dandiset_files["sub_sex"] == "M"])
f_count = len(dandiset_files["sub_sex"][dandiset_files["sub_sex"] == "F"])
sst_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("Sst") >= 1])
pval_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("Pval") >= 1])
wt_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("wt/wt") >= 1])
print("Dandiset Overview:")
print(len(dandiset_files), "dandiset_files")
print(len(set(dandiset_files["sub_name"])), "subjects", m_count, "males,", f_count, "females")
print(sst_count, "sst,", pval_count, "pval,", wt_count, "wt")
Dandiset Overview:
12 dandiset_files
12 subjects 7 males, 5 females
9 sst, 3 pval, 0 wt
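
The ecephys summary counts rows, which equals the number of subjects here only because each subject has a single session. For dandisets where one animal contributes several sessions, a per-subject count may be more appropriate. The sketch below uses three rows copied from the table above and assumes, for illustration, that the Cre driver is the text before the first hyphen of the genotype string.

```python
import pandas as pd

# Three rows copied from the table above (subject columns only)
sample = pd.DataFrame({
    "sub_name": ["633229", "620334", "631570"],
    "sub_sex": ["F", "M", "F"],
    "sub_genotype": [
        "Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
        "Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
        "Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt",
    ],
})

# keep one row per subject so repeated sessions from one animal count once
subjects = sample.drop_duplicates(subset="sub_name")
sex_counts = subjects["sub_sex"].value_counts()

# assumption: the driver line is the text before the first "-" in the genotype
driver = subjects["sub_genotype"].str.split("-").str[0]
driver_counts = driver.value_counts()

print(sex_counts.to_dict())
print(driver_counts.to_dict())
```

`value_counts` also reports categories that were not anticipated, which a fixed list of substring checks would miss.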
# output all session metadata to local CSV file
dandiset_files.to_csv("dandiset_files.csv")

Selecting Files

Pandas syntax can be used to filter the table above and select individual sessions.

selected_files = dandiset_files[dandiset_files["size"] <= 2_500_000_000]
# column names must match the table exactly, including underscores
# selected_files = dandiset_files[dandiset_files["sub_sex"] == "F"]
# selected_files = dandiset_files[dandiset_files["#_units"] > 2900]
selected_files
identifier size path session_time sub_name sub_sex sub_age sub_genotype probes stim types #_units session_length
4 0073a783-5b41-42a8-882a-2960554d4e43 2108745405 sub-631570/sub-631570_ses-1194857009_ogen.nwb 2022-07-28 00:00:00-07:00 631570 F 92.0 Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 1789 7278.92195
7 f1a26076-0a3e-43c8-b138-5320bca7a23f 2469141920 sub-619296/sub-619296_ses-1187930705_ogen.nwb 2022-06-29 00:00:00-07:00 619296 M 154.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 1918 7278.13709
10 e61edcf3-26e9-44e1-9017-3ee558197da5 2453451408 sub-619293/sub-619293_ses-1184980079_ogen.nwb 2022-06-16 00:00:00-07:00 619293 M 141.0 Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt {probeC, probeE, OptogeneticStimulusDevice, pr... {RFCI_presentations, invalid_times, ICkcfg1_pr... 2136 7454.53444
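
Several conditions can be combined in one selection. The sketch below uses a three-row stand-in table with values copied from the table above. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Stand-in table with three sessions copied from the table above
sample = pd.DataFrame({
    "size": [3308934228, 2108745405, 2469141920],
    "sub_sex": ["F", "F", "M"],
    "#_units": [3026, 1789, 1918],
})

# AND: files no larger than 2.5 GB that come from female subjects
mask = (sample["size"] <= 2_500_000_000) & (sample["sub_sex"] == "F")
small_and_female = sample[mask]

# OR: sessions with many units, or sessions from male subjects
many_units_or_male = sample[(sample["#_units"] > 2900) | (sample["sub_sex"] == "M")]

print(small_and_female.index.tolist())
print(many_units_or_male.index.tolist())
```

The original row labels are preserved by the selection, which is why the filtered table above shows indices 4, 7, and 10 rather than 0, 1, and 2.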

Downloading Selected Files

To download the files, we use the same method explained in Downloading an NWB File. Using the paths of the sessions selected above, only the files of interest are downloaded. Set download_loc to the relative path of the directory where the files should be saved. Note that large files can take a long time to download.

download_loc = "."
selected_paths = set(selected_files.path)
selected_paths
{'sub-619293/sub-619293_ses-1184980079_ogen.nwb',
 'sub-619296/sub-619296_ses-1187930705_ogen.nwb',
 'sub-631570/sub-631570_ses-1194857009_ogen.nwb'}
for dandi_filepath in selected_paths:
    filename = dandi_filepath.split("/")[-1]
    file = my_dandiset.get_asset_by_path(dandi_filepath)
    file.download(f"{download_loc}/{filename}")
    print(f"Downloaded file to {download_loc}/{filename}")
Downloaded file to ./sub-619293_ses-1184980079_ogen.nwb
Downloaded file to ./sub-619296_ses-1187930705_ogen.nwb
Downloaded file to ./sub-631570_ses-1194857009_ogen.nwb
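
Because these files are several gigabytes each, it can be worth skipping any that are already present when the cell is re-run. The helper below is a hypothetical sketch that only checks whether a file with the same name exists in the download location; it does not verify that an earlier download finished or that the contents match.

```python
from pathlib import Path

def pending_downloads(dandi_paths, download_loc="."):
    """Return (dandi_path, local_path) pairs whose local file does not exist yet."""
    pending = []
    # sorted() gives a stable order, since a set has none
    for dandi_path in sorted(dandi_paths):
        local_path = Path(download_loc) / dandi_path.split("/")[-1]
        if not local_path.exists():
            pending.append((dandi_path, local_path))
    return pending
```

The download loop would then iterate over `pending_downloads(selected_paths, download_loc)` and pass each local path to `file.download`.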