Getting Experimental Metadata from DANDI#
It can be helpful to view general information about the experimental sessions that produced your data. Since each NWB file typically represents one session, a dandiset’s files can be examined to get an overview of its sessions. The exact contents can vary, depending on who produced the NWB file. In this notebook, NWB files within one of the Allen Institute’s datasets are opened, and some basic information from each is used to build a table of the experimental sessions and their properties.
Environment Setup#
⚠️Note: If running on a new environment, run this cell once and then restart the kernel⚠️
import warnings
warnings.filterwarnings('ignore')
try:
    from databook_utils.dandi_utils import dandi_stream_open
except ImportError:
    !git clone https://github.com/AllenInstitute/openscope_databook.git
    %cd openscope_databook
    %pip install -e .
import os
import h5py
import pandas as pd
import remfile
from dandi import dandiapi
from pynwb import NWBHDF5IO
%matplotlib inline
Getting Dandiset Metadata#
To view other data, change dandiset_id to be the ID of the dandiset you’re interested in. If the dandiset is embargoed, set dandi_api_key to your DANDI API key.
dandiset_id = "000248"
dandi_api_key = None
my_dandiset = dandiapi.DandiAPIClient(token=dandi_api_key).get_dandiset(dandiset_id)
print(f"Got dandiset {my_dandiset}")
Got dandiset DANDI:000248/0.240502.2344
Get NWB Info#
Below are two versions of the function that collects this information, get_ophys_nwb_info and get_ephys_nwb_info. They are tailored to our NWB files: our Ophys and our Ecephys datasets, respectively. Each retrieves a series of important metadata values from the NWB file object. It is likely that the code for accessing the fields of interest to you will be slightly different for your files. Either function can be altered to extract any other information from an NWB file, as long as you’re familiar with the internal layout of your files. If you do alter one, make sure to change the columns field in the pandas dataframe below to reflect the changes to the function. nwb_type can be set below to ‘ephys’ or ‘ophys’ depending on the type of the NWB files of interest.
nwb_type = "ephys" # or, "ophys"
# get experimental information from within ophys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
def get_ophys_nwb_info(nwb):
    session_time = getattr(nwb, "session_start_time", None)
    metadata_obj = getattr(nwb, "lab_meta_data", {})
    metadata = metadata_obj.get("metadata", None)
    session_id = getattr(metadata, "ophys_session_id", None)
    experiment_id = getattr(metadata, "ophys_experiment_id", None)
    fov_height = getattr(metadata, "field_of_view_height", None)
    fov_width = getattr(metadata, "field_of_view_width", None)
    imaging_depth = getattr(metadata, "imaging_depth", None)
    group = getattr(metadata, "imaging_plane_group", None)
    group_count = getattr(metadata, "imaging_plane_group_count", None)
    container_id = getattr(metadata, "experiment_container_id", None)
    subject = getattr(nwb, "subject", None)
    specimen_name = getattr(subject, "subject_id", None)
    age = getattr(subject, "age", None)
    sex = getattr(subject, "sex", None)
    genotype = getattr(subject, "genotype", None)
    try:
        n_rois = nwb.processing["ophys"]["dff"].roi_response_series["traces"].data.shape[1]
    except Exception:
        n_rois = None
    try:
        location = list(nwb.imaging_planes.values())[0].location
    except Exception:
        location = None
    intervals = getattr(nwb, "intervals", {})
    stim_types = set(intervals.keys())
    stim_tables = [intervals[table_name] for table_name in intervals]
    # gets highest value among final "stop times" of all stim tables in intervals
    # default=None avoids a ValueError when no table has more than one row
    session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1], default=None)
    return [session_time, specimen_name, session_id, experiment_id, container_id, group, group_count, imaging_depth, location, fov_height, fov_width, specimen_name, sex, age, genotype, stim_types, n_rois, session_end]
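The getattr pattern used in this function can be tried without opening a file. The following is a minimal sketch that uses types.SimpleNamespace objects as stand-ins for an NWB file and its subject. The stand-in objects, their values, and the helper name read_subject_fields are hypothetical and only illustrate how missing attributes fall back to None.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for NWB file objects; real files expose many more fields.
complete = SimpleNamespace(subject=SimpleNamespace(sex="F", genotype="wt/wt"))
no_subject = SimpleNamespace()  # a file that has no subject at all


def read_subject_fields(nwb):
    # getattr returns the default instead of raising AttributeError
    subject = getattr(nwb, "subject", None)
    # getattr on None also falls back to the default, so the chain stays safe
    sex = getattr(subject, "sex", None)
    age = getattr(subject, "age", None)
    return [sex, age]


print(read_subject_fields(complete))    # ['F', None]
print(read_subject_fields(no_subject))  # [None, None]
```

Because every lookup has a default, each row of the final table has the same length even when a file lacks some fields.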
# get experimental information from within ecephys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
def get_ephys_nwb_info(nwb):
    session_time = getattr(nwb, "session_start_time", None)
    subject = getattr(nwb, "subject", None)
    specimen_name = getattr(subject, "specimen_name", None)
    age = getattr(subject, "age_in_days", None)
    sex = getattr(subject, "sex", None)
    genotype = getattr(subject, "genotype", None)
    probes = set(getattr(nwb, "devices", {}).keys())
    units = getattr(nwb, "units", [])
    n_units = len(units) if hasattr(units, '__len__') else 0
    intervals = getattr(nwb, "intervals", {})
    stim_types = set(intervals.keys())
    stim_tables = [intervals[table_name] for table_name in intervals]
    # gets highest value among final "stop times" of all stim tables in intervals
    # default=None avoids a ValueError when no table has more than one row
    session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1], default=None)
    return [session_time, specimen_name, sex, age, genotype, probes, stim_types, n_units, session_end]
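The session-end calculation takes the final stop time of every interval table that has more than one row and keeps the largest. A minimal sketch of that logic follows, with plain lists standing in for the stop_time columns. The table names other than spontaneous_presentations, all the values, and the helper name latest_stop_time are hypothetical. Since max raises ValueError on an empty sequence, the sketch passes default=None.

```python
# Hypothetical stop-time columns, in seconds, for three interval tables.
stop_times = {
    "spontaneous_presentations": [10.0, 250.5, 7279.6],
    "flash_presentations": [30.0, 6100.2],
    "invalid_times": [42.0],  # single-row table, skipped by the length check
}


def latest_stop_time(tables):
    # Take the final stop time of each table with more than one row, then the maximum.
    # default=None avoids a ValueError when no table qualifies.
    return max((times[-1] for times in tables.values() if len(times) > 1), default=None)


print(latest_stop_time(stop_times))  # 7279.6
print(latest_stop_time({}))          # None
```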
Getting Table#
Here, each relevant file in the dandiset is streamed and opened remotely, the information of interest is extracted with get_ephys_nwb_info or get_ophys_nwb_info (chosen by nwb_type), defined above, and the result is added to a table of sessions and their metadata. Files that hold data for a specific probe rather than an entire session are skipped.
nwb_table = []
# skip files that aren't main session files
files = [asset for asset in my_dandiset.get_assets() if "probe" not in asset.path]
# swap this with line above for one of our ophys dandisets
# files = [asset for asset in my_dandiset.get_assets() if "raw" not in asset.path]
n_files = len(files)
print(f"{n_files} files retrieved")
extract_nwb_info = get_ephys_nwb_info if nwb_type == "ephys" else get_ophys_nwb_info
for i, file in enumerate(files):
    print(f"Examining file {i+1}/{n_files}: {file.identifier}")
    # get basic file metadata
    row = [file.identifier, file.size, file.path]
    base_url = file.client.session.head(file.base_download_url)
    file_url = base_url.headers["Location"]
    # open and read nwb file with streaming
    rem_file = remfile.File(file_url)
    h5py_file = h5py.File(rem_file, "r")
    io = NWBHDF5IO(file=h5py_file, mode="r", load_namespaces=True)
    nwb = io.read()
    # extract experimental info from within file
    row += extract_nwb_info(nwb)
    nwb_table.append(row)
    del nwb
    # don't run full loop if running in test environment
    if os.environ.get("TESTING", False):
        break
12 files retrieved
Examining file 1/12: 9ab6bfff-70ed-44dc-b384-96b4cef2b566
Examining file 2/12: 15bbb781-f912-4630-b9fe-f3df864290ad
Examining file 3/12: f7b65765-41b5-4605-9585-95338c3a9e5a
Examining file 4/12: 6b321a1c-55c4-4de5-8a25-373f2c5a4bc8
Examining file 5/12: 0073a783-5b41-42a8-882a-2960554d4e43
Examining file 6/12: 1b7aaf88-eabd-46f5-8f2a-c1fecba823f2
Examining file 7/12: 3e3d9ce2-4d45-4f41-a957-532cbdf5bc39
Examining file 8/12: f1a26076-0a3e-43c8-b138-5320bca7a23f
Examining file 9/12: bc340647-7b72-4ec8-aa86-ac236da36713
Examining file 10/12: efb53d2c-2d51-4a26-9b36-5d3d3ff7e19e
Examining file 11/12: e61edcf3-26e9-44e1-9017-3ee558197da5
Examining file 12/12: 233e6f43-d51e-47f0-9cda-838444c274f6
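The file filter used before the loop is plain substring matching on each asset's path. A minimal sketch follows, with types.SimpleNamespace objects standing in for DANDI assets so it runs offline. The first path appears in this notebook's output, while the two probe paths are hypothetical examples of the naming being filtered out.

```python
from types import SimpleNamespace

# Stand-ins for DANDI assets; only the path attribute matters for the filter.
# The two "probe" paths are hypothetical.
assets = [
    SimpleNamespace(path="sub-633229/sub-633229_ses-1199247593_ogen.nwb"),
    SimpleNamespace(path="sub-633229/sub-633229_ses-1199247593_probe-0_ecephys.nwb"),
    SimpleNamespace(path="sub-633229/sub-633229_ses-1199247593_probe-1_ecephys.nwb"),
]

# Same condition as in the notebook: keep files whose path lacks "probe"
session_files = [asset for asset in assets if "probe" not in asset.path]
probe_files = [asset for asset in assets if "probe" in asset.path]

print(len(session_files), "session file(s),", len(probe_files), "probe file(s)")
```

Because the match is on the whole path, a subject or folder name containing "probe" would also be excluded, so it is worth checking the file count printed by the loop.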
# convert table to pandas dataframe
if nwb_type == "ephys":
    dandiset_files = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session_time", "sub_name", "sub_sex", "sub_age", "sub_genotype", "probes", "stim types", "#_units", "session_length"))
else:
    dandiset_files = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session_time", "sub_name", "session_id", "experiment_id", "container_id", "group", "group_count", "imaging_depth", "location", "fov_height", "fov_width", "specimen_name", "sub_sex", "sub_age", "sub_genotype", "stim_types", "#_rois", "session_end"))
dandiset_files
index | identifier | size | path | session_time | sub_name | sub_sex | sub_age | sub_genotype | probes | stim types | #_units | session_length |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9ab6bfff-70ed-44dc-b384-96b4cef2b566 | 3308934228 | sub-633229/sub-633229_ses-1199247593_ogen.nwb | 2022-08-17 00:00:00-07:00 | 633229 | F | 101.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 3026 | 7279.58784 |
1 | 15bbb781-f912-4630-b9fe-f3df864290ad | 2773818936 | sub-631510/sub-631510_ses-1196157974_ogen.nwb | 2022-08-03 00:00:00-07:00 | 631510 | F | 99.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2386 | 7339.21131 |
2 | f7b65765-41b5-4605-9585-95338c3a9e5a | 2717870004 | sub-620334/sub-620334_ses-1189887297_ogen.nwb | 2022-07-06 00:00:00-07:00 | 620334 | M | 154.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2092 | 7279.89479 |
3 | 6b321a1c-55c4-4de5-8a25-373f2c5a4bc8 | 3036007774 | sub-620333/sub-620333_ses-1188137866_ogen.nwb | 2022-06-30 00:00:00-07:00 | 620333 | M | 148.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2593 | 7283.08716 |
4 | 0073a783-5b41-42a8-882a-2960554d4e43 | 2108745405 | sub-631570/sub-631570_ses-1194857009_ogen.nwb | 2022-07-28 00:00:00-07:00 | 631570 | F | 92.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 1789 | 7278.92195 |
5 | 1b7aaf88-eabd-46f5-8f2a-c1fecba823f2 | 2598789798 | sub-625555/sub-625555_ses-1183070926_ogen.nwb | 2022-06-09 00:00:00-07:00 | 625555 | F | 90.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2621 | 7278.57204 |
6 | 3e3d9ce2-4d45-4f41-a957-532cbdf5bc39 | 3619691457 | sub-625554/sub-625554_ses-1181330601_ogen.nwb | 2022-06-01 00:00:00-07:00 | 625554 | M | 82.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2930 | 7315.43515 |
7 | f1a26076-0a3e-43c8-b138-5320bca7a23f | 2469141920 | sub-619296/sub-619296_ses-1187930705_ogen.nwb | 2022-06-29 00:00:00-07:00 | 619296 | M | 154.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 1918 | 7278.13709 |
8 | bc340647-7b72-4ec8-aa86-ac236da36713 | 2709636662 | sub-630506/sub-630506_ses-1192952695_ogen.nwb | 2022-07-20 00:00:00-07:00 | 630506 | F | 92.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2517 | 7279.14674 |
9 | efb53d2c-2d51-4a26-9b36-5d3d3ff7e19e | 2803551453 | sub-625545/sub-625545_ses-1182865981_ogen.nwb | 2022-06-08 00:00:00-07:00 | 625545 | M | 89.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2793 | 7279.21336 |
10 | e61edcf3-26e9-44e1-9017-3ee558197da5 | 2453451408 | sub-619293/sub-619293_ses-1184980079_ogen.nwb | 2022-06-16 00:00:00-07:00 | 619293 | M | 141.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2136 | 7454.53444 |
11 | 233e6f43-d51e-47f0-9cda-838444c274f6 | 2557113684 | sub-637484/sub-637484_ses-1208667752_ogen.nwb | 2022-09-08 00:00:00-07:00 | 637484 | M | 92.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2373 | 7349.25154 |
if nwb_type == "ephys":
    m_count = len(dandiset_files["sub_sex"][dandiset_files["sub_sex"] == "M"])
    f_count = len(dandiset_files["sub_sex"][dandiset_files["sub_sex"] == "F"])
    sst_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("Sst") >= 1])
    pval_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("Pval") >= 1])
    wt_count = len(dandiset_files[dandiset_files["sub_genotype"].str.count("wt/wt") >= 1])
    print("Dandiset Overview:")
    print(len(dandiset_files), "dandiset_files")
    print(len(set(dandiset_files["sub_name"])), "subjects", m_count, "males,", f_count, "females")
    print(sst_count, "sst,", pval_count, "pval,", wt_count, "wt")
else:
    n_sessions = len(dandiset_files["session_id"].value_counts())
    subjects_info = dandiset_files.groupby(["sub_name", "sub_sex"]).size().reset_index().to_dict()
    m_count = len([sex for sex in subjects_info["sub_sex"].values() if sex == "M"])
    f_count = len([sex for sex in subjects_info["sub_sex"].values() if sex == "F"])
    print("Dandiset Overview:")
    print(len(dandiset_files), "dandiset_files")
    print(len(subjects_info["sub_name"]), "subjects", m_count, "males", f_count,"females")
Dandiset Overview:
12 dandiset_files
12 subjects 7 males, 5 females
9 sst, 3 pval, 0 wt
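In the ephys branch above, the sex and genotype counts are taken over table rows, which matches the number of subjects here because each subject has one session. For a dandiset where a subject contributes several sessions, one possible approach is to reduce the table to one row per subject first. The following is a sketch with a small hypothetical table, not part of the original notebook.

```python
import pandas as pd

# Hypothetical table: subject "A" contributes two sessions.
sessions = pd.DataFrame({
    "sub_name": ["A", "A", "B", "C"],
    "sub_sex": ["M", "M", "F", "M"],
    "sub_genotype": ["Sst-IRES-Cre/wt", "Sst-IRES-Cre/wt", "Pvalb-IRES-Cre/wt", "wt/wt"],
})

# Keep one row per subject so repeated sessions are not counted twice.
subjects = sessions.drop_duplicates(subset="sub_name")

sex_counts = subjects["sub_sex"].value_counts()
sst_count = int(subjects["sub_genotype"].str.contains("Sst", regex=False).sum())

print(len(sessions), "sessions,", len(subjects), "subjects")
print(int(sex_counts.get("M", 0)), "males,", int(sex_counts.get("F", 0)), "females")
print(sst_count, "sst")
```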
# output all session metadata to local CSV file
dandiset_files.to_csv("dandiset_files.csv")
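One thing to keep in mind when reading this CSV back: columns that hold Python sets, such as probes and stim types, are stored as text. A minimal sketch follows of how such a column could be parsed back into sets with ast.literal_eval. It uses a small hypothetical table and an in-memory buffer instead of a file on disk.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row table with a set-valued column, like "probes" above.
table = pd.DataFrame({
    "path": ["sub-A/sub-A_ses-1.nwb", "sub-B/sub-B_ses-2.nwb"],
    "probes": [{"probeA", "probeB"}, {"probeA"}],
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
table.to_csv(buffer, index=False)

# Read naively: the sets come back as strings.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read with a converter: each cell is parsed back into a Python set.
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"probes": ast.literal_eval})

print(type(naive["probes"][0]).__name__)   # str
print(type(parsed["probes"][0]).__name__)  # set
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a file.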
Selecting Files#
Pandas syntax can be used to filter the table above and select individual sessions.
selected_files = dandiset_files[dandiset_files["size"] <= 2_500_000_000]
# selected_files = dandiset_files[dandiset_files["sub_sex"] == "F"]
# selected_files = dandiset_files[dandiset_files["#_units"] > 2900]
selected_files
index | identifier | size | path | session_time | sub_name | sub_sex | sub_age | sub_genotype | probes | stim types | #_units | session_length |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0073a783-5b41-42a8-882a-2960554d4e43 | 2108745405 | sub-631570/sub-631570_ses-1194857009_ogen.nwb | 2022-07-28 00:00:00-07:00 | 631570 | F | 92.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 1789 | 7278.92195 |
7 | f1a26076-0a3e-43c8-b138-5320bca7a23f | 2469141920 | sub-619296/sub-619296_ses-1187930705_ogen.nwb | 2022-06-29 00:00:00-07:00 | 619296 | M | 154.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 1918 | 7278.13709 |
10 | e61edcf3-26e9-44e1-9017-3ee558197da5 | 2453451408 | sub-619293/sub-619293_ses-1184980079_ogen.nwb | 2022-06-16 00:00:00-07:00 | 619293 | M | 141.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeC, probeB, probeE, probeA, probeD, probe... | {spontaneous_presentations, ICwcfg1_presentati... | 2136 | 7454.53444 |
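Several conditions can also be combined in one selection. The following is a minimal sketch on a three-row stand-in table whose values are copied from rows 0, 4, and 7 of the session table. The variable names are illustrative only.

```python
import pandas as pd

# Stand-in table; sizes are in bytes, as reported by DANDI.
table = pd.DataFrame({
    "size": [3308934228, 2108745405, 2469141920],
    "sub_sex": ["F", "F", "M"],
    "#_units": [3026, 1789, 1918],
})

# Each condition is wrapped in parentheses and combined with & (and) or | (or).
small_and_female = table[(table["size"] <= 2_500_000_000) & (table["sub_sex"] == "F")]
large_or_many_units = table[(table["size"] > 3_000_000_000) | (table["#_units"] > 2900)]

print(small_and_female)
print(large_or_many_units)
```

The parentheses are required because & and | bind more tightly than comparison operators in Python.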
Downloading Selected Files#
To download the files, we use the same method that is explained in Downloading an NWB File. This can be used with the paths from the selected sessions above to download just the files of interest. Set download_loc to the relative path where the files should be downloaded. Note that if the files are large, this can take a long time.
download_loc = "."
selected_paths = set(selected_files.path)
selected_paths
{'sub-619293/sub-619293_ses-1184980079_ogen.nwb',
'sub-619296/sub-619296_ses-1187930705_ogen.nwb',
'sub-631570/sub-631570_ses-1194857009_ogen.nwb'}
for dandi_filepath in selected_paths:
    filename = dandi_filepath.split("/")[-1]
    file = my_dandiset.get_asset_by_path(dandi_filepath)
    file.download(f"{download_loc}/{filename}")
    print(f"Downloaded file to {download_loc}/{filename}")
Downloaded file to ./sub-631570_ses-1194857009_ogen.nwb
Downloaded file to ./sub-619293_ses-1184980079_ogen.nwb
Downloaded file to ./sub-619296_ses-1187930705_ogen.nwb
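Since these downloads can be slow, it may be worth checking whether a file is already present before requesting it again. The following sketch only covers the path handling and runs without network access. The helper names local_destination and needs_download are hypothetical, and the existence check does not verify that an earlier download completed.

```python
from pathlib import Path, PurePosixPath


def local_destination(dandi_filepath, download_loc="."):
    # Asset paths in a dandiset use forward slashes, so PurePosixPath extracts the filename
    filename = PurePosixPath(dandi_filepath).name
    return Path(download_loc) / filename


def needs_download(dandi_filepath, download_loc="."):
    # Hypothetical helper: True when no file with that name exists locally yet
    return not local_destination(dandi_filepath, download_loc).exists()


dest = local_destination("sub-619293/sub-619293_ses-1184980079_ogen.nwb", "downloads")
print(dest.name)  # sub-619293_ses-1184980079_ogen.nwb
```

In the download loop, the call to file.download could then be wrapped in a needs_download check for the same path.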