Getting Experimental Metadata from DANDI

It can be helpful to view general information about the experimental sessions that produced your data. Because each NWB file typically represents one session, examining a dandiset’s files gives an overview of its sessions. The metadata available varies with who produced the NWB file. In this notebook, the NWB files in one of the Allen Institute’s dandisets are opened, and basic information from each is collected into a table of experimental sessions and their properties.

Environment Setup

⚠️ Note: If running in a new environment, run this cell once and then restart the kernel. ⚠️

import warnings
warnings.filterwarnings('ignore')

try:
    from databook_utils.dandi_utils import dandi_stream_open
except ImportError:
    !git clone https://github.com/AllenInstitute/openscope_databook.git
    %cd openscope_databook
    %pip install -e .

import os

import h5py
import pandas as pd
import remfile

from dandi import dandiapi
from pynwb import NWBHDF5IO

%matplotlib inline

Getting Dandiset Metadata

To view other data, change dandiset_id to be the id of the dandiset you’re interested in. If the dandiset is embargoed, set dandi_api_key to your DANDI API key.

dandiset_id = "000248"
dandi_api_key = None

my_dandiset = dandiapi.DandiAPIClient(token=dandi_api_key).get_dandiset(dandiset_id)
print(f"Got dandiset {my_dandiset}")

Got dandiset DANDI:000248/0.240502.2344
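
For embargoed dandisets, the key does not have to be typed into the notebook itself. The following is a minimal sketch that reads it from the environment instead. The variable name DANDI_API_KEY is an assumption here, and resolve_api_key is a hypothetical helper, not part of the dandi package.

```python
import os

def resolve_api_key(environ=os.environ, var_name="DANDI_API_KEY"):
    """Return the API key stored in `var_name`, or None if it is unset or blank."""
    # var_name is an assumed variable name; change it to match your own setup
    key = environ.get(var_name, "").strip()
    return key or None

# stays None when the variable is unset, matching the default used above
dandi_api_key = resolve_api_key()
```

The result can be passed as token=dandi_api_key exactly as in the cell above.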

Get NWB Info

Below are two versions of the metadata extraction function, get_ophys_nwb_info and get_ephys_nwb_info, tailored to our Ophys and Ecephys NWB files respectively. Each retrieves a series of important metadata values from the NWB file object. The code for accessing the fields of interest to you will likely be slightly different for your files. Either function can be altered to extract any other information from an NWB file, as long as you’re familiar with the internal layout of your files. If you change the values a function returns, also change the columns argument of the matching pandas DataFrame below so that the two stay in sync. Set nwb_type below to "ephys" or "ophys" depending on the type of the NWB files of interest.

nwb_type = "ephys" # or, "ophys"

# get experimental information from within ophys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
def get_ophys_nwb_info(nwb):
        session_time = getattr(nwb, "session_start_time", None)

        metadata_obj = getattr(nwb, "lab_meta_data", {})
        metadata = metadata_obj.get("metadata", None)
        session_id = getattr(metadata, "ophys_session_id", None)
        experiment_id = getattr(metadata, "ophys_experiment_id", None)

        fov_height = getattr(metadata, "field_of_view_height", None)
        fov_width = getattr(metadata, "field_of_view_width", None)
        imaging_depth = getattr(metadata, "imaging_depth", None)
        group = getattr(metadata, "imaging_plane_group", None)
        group_count = getattr(metadata, "imaging_plane_group_count", None)
        container_id = getattr(metadata, "experiment_container_id", None)
        
        subject = getattr(nwb, "subject", None)
        specimen_name = getattr(subject, "subject_id", None)
        age = getattr(subject, "age", None)
        sex = getattr(subject, "sex", None)
        genotype = getattr(subject, "genotype", None)
        
        # not every file has dF/F traces or an imaging plane, so fall back to None
        try:
            n_rois = nwb.processing["ophys"]["dff"].roi_response_series["traces"].data.shape[1]
        except Exception:
            n_rois = None
        try:
            location = list(nwb.imaging_planes.values())[0].location
        except Exception:
            location = None
        
        intervals = getattr(nwb, "intervals", {})
        stim_types = set(intervals.keys())
        stim_tables = [intervals[table_name] for table_name in intervals]
        # gets highest value among final "stop times" of all stim tables in intervals
        # default=None avoids a ValueError when no stim table qualifies
        session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1], default=None)

        return [session_time, specimen_name, session_id, experiment_id, container_id, group, group_count, imaging_depth, location, fov_height, fov_width, specimen_name, sex, age, genotype, stim_types, n_rois, session_end]

# get experimental information from within ecephys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
def get_ephys_nwb_info(nwb):
        session_time = getattr(nwb, "session_start_time", None)

        subject = getattr(nwb, "subject", None)
        specimen_name = getattr(subject, "specimen_name", None)
        age = getattr(subject, "age_in_days", None)
        sex = getattr(subject, "sex", None)
        genotype = getattr(subject, "genotype", None)

        probes = set(getattr(nwb, "devices", {}).keys())
        units = getattr(nwb, "units", [])
        n_units = len(units) if hasattr(units, '__len__') else 0        
        
        intervals = getattr(nwb, "intervals", {})
        stim_types = set(intervals.keys())
        stim_tables = [intervals[table_name] for table_name in intervals]
        # gets highest value among final "stop times" of all stim tables in intervals
        # default=None avoids a ValueError when no stim table qualifies
        session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1], default=None)

        return [session_time, specimen_name, sex, age, genotype, probes, stim_types, n_units, session_end]
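
The getattr pattern used in both functions can be tried out without opening a real file. The sketch below uses SimpleNamespace objects as stand-ins for NWB files. The helper get_subject_fields and the stand-in objects are hypothetical and only mimic the attributes accessed here.

```python
from types import SimpleNamespace

def get_subject_fields(nwb, fields=("subject_id", "sex", "genotype")):
    """Collect subject attributes, substituting None for any that are absent."""
    subject = getattr(nwb, "subject", None)
    # getattr on None falls through to the default, so a missing subject is safe
    return [getattr(subject, field, None) for field in fields]

# stand-ins for NWB file objects (not real files)
partial_file = SimpleNamespace(subject=SimpleNamespace(subject_id="633229", sex="F"))
empty_file = SimpleNamespace()

print(get_subject_fields(partial_file))  # ['633229', 'F', None]
print(get_subject_fields(empty_file))    # [None, None, None]
```

Because every lookup has a default, a file that lacks a field produces None in that column rather than raising an error partway through the loop.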

Getting Table

Here, each relevant file in the dandiset is streamed and opened remotely, and the information of interest is extracted with the function selected by nwb_type (get_ephys_nwb_info or get_ophys_nwb_info, defined above). The result is added as a row to a table of sessions and their metadata. Files that hold data for a single probe rather than an entire session are skipped.

nwb_table = []
# skip files that aren't main session files
files = [asset for asset in my_dandiset.get_assets() if "probe" not in asset.path]
# swap this with line above for one of our ophys dandisets
# files = [asset for asset in my_dandiset.get_assets() if "raw" not in asset.path]
n_files = len(files)
print(f"{n_files} files retrieved")

extract_nwb_info = get_ephys_nwb_info if nwb_type == "ephys" else get_ophys_nwb_info

for i, file in enumerate(files):
    print(f"Examining file {i+1}/{n_files}: {file.identifier}")    
    # get basic file metadata
    row = [file.identifier, file.size, file.path]
    
    base_url = file.client.session.head(file.base_download_url)
    file_url = base_url.headers["Location"]

    # open and read nwb file with streaming
    rem_file = remfile.File(file_url)
    h5py_file = h5py.File(rem_file, "r")
    io = NWBHDF5IO(file=h5py_file, mode="r", load_namespaces=True)
    nwb = io.read()

    # extract experimental info from within file
    row += extract_nwb_info(nwb)
    nwb_table.append(row)
    del nwb
    # close the IO object before opening the next file
    io.close()

    # don't run full loop if running in test environment
    if os.environ.get("TESTING", False):
        break

12 files retrieved
Examining file 1/12: 9ab6bfff-70ed-44dc-b384-96b4cef2b566
Examining file 2/12: 15bbb781-f912-4630-b9fe-f3df864290ad
Examining file 3/12: f7b65765-41b5-4605-9585-95338c3a9e5a
Examining file 4/12: 6b321a1c-55c4-4de5-8a25-373f2c5a4bc8
Examining file 5/12: 0073a783-5b41-42a8-882a-2960554d4e43
Examining file 6/12: 1b7aaf88-eabd-46f5-8f2a-c1fecba823f2
Examining file 7/12: 3e3d9ce2-4d45-4f41-a957-532cbdf5bc39
Examining file 8/12: f1a26076-0a3e-43c8-b138-5320bca7a23f
Examining file 9/12: bc340647-7b72-4ec8-aa86-ac236da36713
Examining file 10/12: efb53d2c-2d51-4a26-9b36-5d3d3ff7e19e
Examining file 11/12: e61edcf3-26e9-44e1-9017-3ee558197da5
Examining file 12/12: 233e6f43-d51e-47f0-9cda-838444c274f6
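
Streaming many large files can fail partway through, for instance when a connection drops or a file lacks an expected table. The following is a minimal sketch of one way to keep the loop going. safe_extract is a hypothetical helper, not part of the databook, and n_fields must equal the number of values the chosen extraction function returns (9 for get_ephys_nwb_info above).

```python
def safe_extract(extract_fn, nwb, n_fields):
    """Run extract_fn(nwb); on failure return n_fields placeholders so the row keeps its width."""
    try:
        return extract_fn(nwb)
    except Exception as err:
        print(f"Could not extract metadata: {err}")
        return [None] * n_fields

def failing_extractor(nwb):
    # stand-in for any extraction step that raises
    return [max([])]

# the three leading values are placeholders for identifier, size and path
row = ["example-id", 0, "example/path.nwb"] + safe_extract(failing_extractor, None, n_fields=9)
print(len(row))  # 12
```

In the loop above, this would replace the direct call, as in row += safe_extract(extract_nwb_info, nwb, 9). Rows filled with None then mark the files that need a closer look.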

# convert table to pandas dataframe
if nwb_type == "ephys":
    dandiset_files = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session_time", "sub_name", "sub_sex", "sub_age", "sub_genotype", "probes", "stim types", "#_units", "session_length"))
else:
    dandiset_files = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session_time", "sub_name", "session_id", "experiment_id", "container_id", "group", "group_count", "imaging_depth", "location", "fov_height", "fov_width", "specimen_name", "sub_sex", "sub_age", "sub_genotype", "stim_types", "#_rois", "session_end"))
dandiset_files

	identifier	size	path	session_time	sub_name	sub_sex	sub_age	sub_genotype	probes	stim types	#_units	session_length
0	9ab6bfff-70ed-44dc-b384-96b4cef2b566	3308934228	sub-633229/sub-633229_ses-1199247593_ogen.nwb	2022-08-17 00:00:00-07:00	633229	F	101.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	3026	7279.58784
1	15bbb781-f912-4630-b9fe-f3df864290ad	2773818936	sub-631510/sub-631510_ses-1196157974_ogen.nwb	2022-08-03 00:00:00-07:00	631510	F	99.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2386	7339.21131
2	f7b65765-41b5-4605-9585-95338c3a9e5a	2717870004	sub-620334/sub-620334_ses-1189887297_ogen.nwb	2022-07-06 00:00:00-07:00	620334	M	154.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2092	7279.89479
3	6b321a1c-55c4-4de5-8a25-373f2c5a4bc8	3036007774	sub-620333/sub-620333_ses-1188137866_ogen.nwb	2022-06-30 00:00:00-07:00	620333	M	148.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2593	7283.08716
4	0073a783-5b41-42a8-882a-2960554d4e43	2108745405	sub-631570/sub-631570_ses-1194857009_ogen.nwb	2022-07-28 00:00:00-07:00	631570	F	92.0	Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	1789	7278.92195
5	1b7aaf88-eabd-46f5-8f2a-c1fecba823f2	2598789798	sub-625555/sub-625555_ses-1183070926_ogen.nwb	2022-06-09 00:00:00-07:00	625555	F	90.0	Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2621	7278.57204
6	3e3d9ce2-4d45-4f41-a957-532cbdf5bc39	3619691457	sub-625554/sub-625554_ses-1181330601_ogen.nwb	2022-06-01 00:00:00-07:00	625554	M	82.0	Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2930	7315.43515
7	f1a26076-0a3e-43c8-b138-5320bca7a23f	2469141920	sub-619296/sub-619296_ses-1187930705_ogen.nwb	2022-06-29 00:00:00-07:00	619296	M	154.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	1918	7278.13709
8	bc340647-7b72-4ec8-aa86-ac236da36713	2709636662	sub-630506/sub-630506_ses-1192952695_ogen.nwb	2022-07-20 00:00:00-07:00	630506	F	92.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2517	7279.14674
9	efb53d2c-2d51-4a26-9b36-5d3d3ff7e19e	2803551453	sub-625545/sub-625545_ses-1182865981_ogen.nwb	2022-06-08 00:00:00-07:00	625545	M	89.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2793	7279.21336
10	e61edcf3-26e9-44e1-9017-3ee558197da5	2453451408	sub-619293/sub-619293_ses-1184980079_ogen.nwb	2022-06-16 00:00:00-07:00	619293	M	141.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2136	7454.53444
11	233e6f43-d51e-47f0-9cda-838444c274f6	2557113684	sub-637484/sub-637484_ses-1208667752_ogen.nwb	2022-09-08 00:00:00-07:00	637484	M	92.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2373	7349.25154
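
When an extraction function is edited, its return list and the columns tuple can drift apart. A quick check before building the DataFrame reports which rows are off, as in the sketch below. find_mismatched_rows is a hypothetical helper and the rows are placeholders.

```python
def find_mismatched_rows(rows, columns):
    """Return (index, length) for every row whose length differs from len(columns)."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != len(columns)]

# column names copied from the ephys branch above
ephys_columns = ("identifier", "size", "path", "session_time", "sub_name", "sub_sex",
                 "sub_age", "sub_genotype", "probes", "stim types", "#_units", "session_length")

# placeholder rows: the first has 12 values, the second is one value short
rows = [
    [None] * 12,
    [None] * 11,
]
mismatched = find_mismatched_rows(rows, ephys_columns)
print(mismatched)  # [(1, 11)]
```

An empty result means every row in nwb_table lines up with the column names.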

if nwb_type == "ephys":
    m_count = len(dandiset_files["sub_sex"][dandiset_files["sub_sex"] == "M"])
    f_count = len(dandiset_files["sub_sex"][dandiset_files["sub_sex"] == "F"])
    sst_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("Sst") >= 1])
    pval_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("Pval") >= 1])
    wt_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("wt/wt") >= 1])
    print("Dandiset Overview:")
    print(len(dandiset_files), "dandiset_files")
    print(len(set(dandiset_files["sub_name"])), "subjects", m_count, "males,", f_count, "females")
    print(sst_count, "sst,", pval_count, "pval,", wt_count, "wt")
else:
    n_sessions = len(dandiset_files["session_id"].value_counts())
    subjects_info = dandiset_files.groupby(["sub_name", "sub_sex"]).size().reset_index().to_dict()
    m_count = len([sex for sex in subjects_info["sub_sex"].values() if sex == "M"])
    f_count = len([sex for sex in subjects_info["sub_sex"].values() if sex == "F"])
    print("Dandiset Overview:")
    print(len(dandiset_files), "dandiset_files")
    print(len(subjects_info["sub_name"]), "subjects", m_count, "males", f_count,"females")

Dandiset Overview:
12 dandiset_files
12 subjects 7 males, 5 females
9 sst, 3 pval, 0 wt
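
In the ephys branch the sex counts are taken per file. That matches the per-subject counts here only because each subject has a single session. The sketch below shows one way to count once per subject using only the standard library. count_by_subject is a hypothetical helper, and the last record is an invented repeat session added to show the difference.

```python
from collections import Counter

def count_by_subject(records, key):
    """Count values of `key` once per subject, given dicts with a 'sub_name' entry."""
    first_seen = {}
    for record in records:
        # keep the first record seen for each subject; later sessions are ignored
        first_seen.setdefault(record["sub_name"], record[key])
    return Counter(first_seen.values())

# three rows copied from the table above, plus one hypothetical repeat session
records = [
    {"sub_name": "633229", "sub_sex": "F"},
    {"sub_name": "631510", "sub_sex": "F"},
    {"sub_name": "620334", "sub_sex": "M"},
    {"sub_name": "620334", "sub_sex": "M"},  # hypothetical second session
]
sex_counts = count_by_subject(records, "sub_sex")
print(sex_counts["M"], "males,", sex_counts["F"], "females")  # 1 males, 2 females
```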

# output all session metadata to local CSV file
dandiset_files.to_csv("dandiset_files.csv")

Selecting Files

Pandas syntax can be used to filter the table above and select individual sessions.

selected_files = dandiset_files[dandiset_files["size"] <= 2_500_000_000]
# selected_files = dandiset_files[dandiset_files["sub_sex"] == "F"]
# selected_files = dandiset_files[dandiset_files["#_units"] > 2900]
selected_files

	identifier	size	path	session_time	sub_name	sub_sex	sub_age	sub_genotype	probes	stim types	#_units	session_length
4	0073a783-5b41-42a8-882a-2960554d4e43	2108745405	sub-631570/sub-631570_ses-1194857009_ogen.nwb	2022-07-28 00:00:00-07:00	631570	F	92.0	Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	1789	7278.92195
7	f1a26076-0a3e-43c8-b138-5320bca7a23f	2469141920	sub-619296/sub-619296_ses-1187930705_ogen.nwb	2022-06-29 00:00:00-07:00	619296	M	154.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	1918	7278.13709
10	e61edcf3-26e9-44e1-9017-3ee558197da5	2453451408	sub-619293/sub-619293_ses-1184980079_ogen.nwb	2022-06-16 00:00:00-07:00	619293	M	141.0	Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt	{probeC, probeB, probeE, probeA, probeD, probe...	{spontaneous_presentations, ICwcfg1_presentati...	2136	7454.53444
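
Conditions can also be combined. The sketch below builds a small stand-in table from three rows of the output above (only the columns needed here) and filters it on two criteria at once. The name example_files is hypothetical.

```python
import pandas as pd

# stand-in table with values copied from rows 0, 4 and 7 of the table above
example_files = pd.DataFrame({
    "size": [3308934228, 2108745405, 2469141920],
    "sub_sex": ["F", "F", "M"],
    "#_units": [3026, 1789, 1918],
})

# each condition needs its own parentheses; & means "and", | means "or"
small_and_female = example_files[
    (example_files["size"] <= 2_500_000_000) & (example_files["sub_sex"] == "F")
]
large_or_many_units = example_files[
    (example_files["size"] > 3_000_000_000) | (example_files["#_units"] > 1900)
]

print(list(small_and_female.index))     # [1]
print(list(large_or_many_units.index))  # [0, 2]
```

The same expressions work on dandiset_files, since it uses these column names.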

Downloading Selected Files

To download the files, we use the same method that is explained in Downloading an NWB File. Applying it to the paths of the sessions selected above downloads only the files of interest. Set download_loc to the relative path of the directory where the files should be saved. Note that large files can take a long time to download.

download_loc = "."

selected_paths = set(selected_files.path)
selected_paths

{'sub-619293/sub-619293_ses-1184980079_ogen.nwb',
 'sub-619296/sub-619296_ses-1187930705_ogen.nwb',
 'sub-631570/sub-631570_ses-1194857009_ogen.nwb'}

for dandi_filepath in selected_paths:
    filename = dandi_filepath.split("/")[-1]
    file = my_dandiset.get_asset_by_path(dandi_filepath)
    file.download(f"{download_loc}/{filename}")
    print(f"Downloaded file to {download_loc}/{filename}")

Downloaded file to ./sub-631570_ses-1194857009_ogen.nwb
Downloaded file to ./sub-619293_ses-1184980079_ogen.nwb
Downloaded file to ./sub-619296_ses-1187930705_ogen.nwb
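
Since these files are several gigabytes each, it can be worth skipping any that are already on disk when the cell is rerun. The sketch below shows one way to do this. paths_to_download is a hypothetical helper, and the check looks only at the file name, so a partially downloaded file would also be skipped.

```python
import tempfile
from pathlib import Path

def paths_to_download(dandi_paths, download_loc="."):
    """Return the asset paths whose file name is not already present in download_loc."""
    target_dir = Path(download_loc)
    # existence check only: file size and integrity are not verified
    return [p for p in dandi_paths if not (target_dir / p.split("/")[-1]).exists()]

# demonstrate with a temporary directory that already holds one of the files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub-619293_ses-1184980079_ogen.nwb").touch()
    remaining = paths_to_download(
        ["sub-619293/sub-619293_ses-1184980079_ogen.nwb",
         "sub-619296/sub-619296_ses-1187930705_ogen.nwb"],
        download_loc=tmp,
    )

print(remaining)  # ['sub-619296/sub-619296_ses-1187930705_ogen.nwb']
```

Looping over paths_to_download(selected_paths, download_loc) instead of selected_paths would leave existing files untouched.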