CAVE Quickstart

The Connectome Annotation Versioning Engine (CAVE) is a suite of tools developed at the Allen Institute and Seung Lab to manage large connectomics data.

Initial Setup

Before using any programmatic access to the data, you first need to set up your CAVEclient token.

CAVEclient

Most programmatic access to the CAVE services occurs through CAVEclient, a Python client to access various types of data from the online services.

Full documentation for CAVEclient is available here.

To initialize a CAVEclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONS public data, we use the datastack name minnie65_public.

from caveclient import CAVEclient
datastack_name = 'minnie65_public'
client = CAVEclient(datastack_name)

# Show the description of the datastack
client.info.get_datastack_info()['description']

'This is the publicly released version of the minnie65 volume and segmentation. '

Materialization versions

Data in CAVE is timestamped and periodically versioned: each materialization version corresponds to a specific timestamp. Individual versions are made publicly available. The materialization service provides annotation queries to the dataset and is available under client.materialize.

Periodic updates are made to the public datastack, and these include updates to the available tables. Some cells will have a different pt_root_id from one version to the next because they have undergone proofreading.

It is worth checking the version of the data you are using, and specifying the version for analysis consistency.

# see the available materialization versions
client.materialize.get_versions()

[1078, 117, 661, 343, 1181, 795, 943]

And these are their associated timestamps (all timestamps are in UTC):

for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")

Version 1078: 2024-06-05 10:10:01.203215+00:00
Version 117: 2021-06-11 08:10:00.215114+00:00
Version 661: 2023-04-06 20:17:09.199182+00:00
Version 343: 2022-02-24 08:10:00.184668+00:00
Version 1181: 2024-09-16 10:10:01.121167+00:00
Version 795: 2023-08-23 08:10:01.404268+00:00
Version 943: 2024-01-22 08:10:01.497934+00:00

# set materialization version, for consistency
materialization = 1181 # current public as of 9/16/2024
client.version = materialization
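Rather than hard-coding a version number, one option is to pick the most recent version by timestamp. The sketch below works on a version-to-timestamp mapping copied from the output above; latest_version is a hypothetical helper for illustration, not part of CAVEclient.

```python
from datetime import datetime

# Version -> timestamp strings, copied from the output above.
version_timestamps = {
    1078: "2024-06-05 10:10:01.203215+00:00",
    117: "2021-06-11 08:10:00.215114+00:00",
    661: "2023-04-06 20:17:09.199182+00:00",
    343: "2022-02-24 08:10:00.184668+00:00",
    1181: "2024-09-16 10:10:01.121167+00:00",
    795: "2023-08-23 08:10:01.404268+00:00",
    943: "2024-01-22 08:10:01.497934+00:00",
}


def latest_version(timestamps):
    """Return the version with the most recent timestamp (hypothetical helper)."""
    return max(timestamps, key=lambda v: datetime.fromisoformat(timestamps[v]))


newest = latest_version(version_timestamps)
print(newest)
```

With a live client, the same mapping could be built from client.materialize.get_versions() and client.materialize.get_timestamp(version), as in the loop above. Pinning an explicit number is still the safer choice when an analysis has to be reproducible.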

CAVEclient Basics

The most frequent use of the CAVEclient is to query the database for annotations like synapses. All database functions are under the client.materialize property. To see what tables are available, use the get_tables function:

client.materialize.get_tables()

['nucleus_alternative_points',
 'allen_column_mtypes_v2',
 'bodor_pt_cells',
 'aibs_metamodel_mtypes_v661_v2',
 'allen_v1_column_types_slanted_ref',
 'aibs_column_nonneuronal_ref',
 'nucleus_ref_neuron_svm',
 'apl_functional_coreg_vess_fwd',
 'vortex_compartment_targets',
 'baylor_log_reg_cell_type_coarse_v1',
 'functional_properties_v3_bcm',
 'l5et_column',
 'pt_synapse_targets',
 'coregistration_auto_phase3_fwd_apl_vess_combined',
 'coregistration_manual_v4',
 'vortex_manual_myelination_v0',
 'synapses_pni_2',
 'nucleus_detection_v0',
 'vortex_manual_nodes_of_ranvier',
 'vortex_astrocyte_proofreading_status',
 'bodor_pt_target_proofread',
 'nucleus_functional_area_assignment',
 'coregistration_auto_phase3_fwd',
 'synapse_target_structure',
 'proofreading_status_and_strategy',
 'aibs_metamodel_celltypes_v661']

For each table, you can see the metadata describing that table. For example, let’s look at the nucleus_detection_v0 table:

client.materialize.get_table_metadata('nucleus_detection_v0')

{'aligned_volume': 'minnie65_phase3',
 'created': '2020-11-02T18:56:35.530100',
 'id': 45664,
 'schema': 'nucleus_detection',
 'table_name': 'nucleus_detection_v0',
 'valid': True,
 'schema_type': 'nucleus_detection',
 'user_id': '121',
 'description': 'A table of nuclei detections from a nucleus detection model developed by Shang Mu, Leila Elabbady, Gayathri Mahalingam and Forrest Collman. Pt is the centroid of the nucleus detection. id corresponds to the flat_segmentation_source segmentID. Only included nucleus detections of volume>25 um^3, below which detections are false positives, though some false positives above that threshold remain. ',
 'notice_text': None,
 'reference_table': None,
 'flat_segmentation_source': 'precomputed://https://bossdb-open-data.s3.amazonaws.com/iarpa_microns/minnie/minnie65/nuclei',
 'write_permission': 'PRIVATE',
 'read_permission': 'PUBLIC',
 'last_modified': '2022-10-25T19:24:28.559914',
 'segmentation_source': '',
 'pcg_table_name': 'minnie3_v1',
 'last_updated': '2024-10-24T22:00:00.145632',
 'voxel_resolution': [4.0, 4.0, 40.0]}

You get a dictionary of values. Two fields are particularly important: description, which describes the contents of the table in plain text, and voxel_resolution, which gives the resolution of the coordinates stored in the table, in nm/voxel.
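To make the role of voxel_resolution concrete, here is a minimal sketch that converts a voxel coordinate to nanometers by multiplying each axis by its resolution. The metadata dictionary is trimmed to the fields used here, and voxels_to_nm is a hypothetical helper, not a CAVEclient function.

```python
# Trimmed copy of the metadata shown above (only the fields used here).
metadata = {
    "table_name": "nucleus_detection_v0",
    "voxel_resolution": [4.0, 4.0, 40.0],  # nm per voxel along x, y, z
}


def voxels_to_nm(point, resolution):
    """Scale a voxel-space (x, y, z) point to nanometers, axis by axis."""
    return [coord * res for coord, res in zip(point, resolution)]


# An example position in voxel units, as stored in the table.
pt_position = [381312, 273984, 19993]
pt_position_nm = voxels_to_nm(pt_position, metadata["voxel_resolution"])
print(pt_position_nm)  # [1525248.0, 1095936.0, 799720.0]
```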

Annotation tables

You can also find a semantic description of the most commonly used tables at the Annotation Tables page.

Querying Tables

To get the contents of a table, use the query_table function. This returns the whole contents of a table without any filtering, up to a maximum of 200,000 rows. The table is returned as a Pandas DataFrame, so you can immediately use standard Pandas functions on it.

# this table holds nucleus detections, so name the result accordingly
nucleus_df = client.materialize.query_table('nucleus_detection_v0')
nucleus_df.head()

	id	created	superceded_id	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
0	730537	2020-09-28 22:40:41.780734+00:00	NaN	t	32.307937	0	0	[381312, 273984, 19993]	[nan, nan, nan]	[nan, nan, nan]
1	373879	2020-09-28 22:40:41.781788+00:00	NaN	t	229.045043	96218056992431305	864691136090135607	[228816, 239776, 19593]	[nan, nan, nan]	[nan, nan, nan]
2	601340	2020-09-28 22:40:41.782714+00:00	NaN	t	426.138010	0	0	[340000, 279152, 20946]	[nan, nan, nan]	[nan, nan, nan]
3	201858	2020-09-28 22:40:41.783784+00:00	NaN	t	93.753836	84955554103121097	864691135373893678	[146848, 213600, 26267]	[nan, nan, nan]	[nan, nan, nan]
4	600774	2020-09-28 22:40:41.785273+00:00	NaN	t	135.189791	0	0	[339120, 276112, 19442]	[nan, nan, nan]	[nan, nan, nan]

Caution

While most tables are small enough to be returned in full, the synapse table has hundreds of millions of rows and is too large to download this way.
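Because query_table stops at 200,000 rows, a result that is exactly that large may have been cut off. The guard below is a minimal sketch under that assumption; may_be_truncated is a hypothetical helper, and the limit is the value stated above rather than one read from the server.

```python
QUERY_ROW_LIMIT = 200_000  # maximum rows per query, as stated above


def may_be_truncated(n_rows, limit=QUERY_ROW_LIMIT):
    """Return True when a result has at least as many rows as the limit."""
    return n_rows >= limit


print(may_be_truncated(1_500))
print(may_be_truncated(200_000))
```

In practice you would call it as may_be_truncated(len(df)) right after a query and, if it returns True, narrow the query with filters instead of trusting the result as complete.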

Tables have a collection of columns: some specify a point in space (columns ending in _position), some a root id (ending in _root_id), and others contain additional information about the object at that point. Before describing some of the most important tables in the database, it’s useful to know about a few advanced options that apply when querying any table.

desired_resolution : This parameter allows you to convert the columns specifying spatial points to different resolutions. Many tables are stored at a resolution of 4x4x40 nm/voxel, for example, but you can convert to nanometers by setting desired_resolution=[1,1,1].
split_positions : This parameter allows you to split the columns specifying spatial points into separate columns for each dimension. The new column names will be the original column name with _x, _y, and _z appended.
select_columns : This parameter allows you to get only a subset of columns from the table. Once you know exactly what you want, this can save you some cleanup.
limit : This parameter allows you to limit the number of rows returned. If you are just testing out a query or trying to inspect the kind of data within a table, you can set this to a small number to make sure it works before downloading the whole table. Note that this will show a warning so that you don’t accidentally limit your query when you don’t mean to.

For example, using all of these together:

nucleus_df = client.materialize.query_table(
    'nucleus_detection_v0',
    split_positions=True,                        # one column per x, y, z
    desired_resolution=[1, 1, 1],                # return positions in nm
    select_columns=['pt_position', 'pt_root_id'],
    limit=10,                                    # small test query
)
nucleus_df

	pt_position_x	pt_position_y	pt_position_z	pt_root_id
0	241856.0	374464.0	838720.0	0
1	227200.0	389120.0	797160.0	0
2	230144.0	422336.0	795320.0	0
3	239488.0	386432.0	794120.0	0
4	239744.0	423488.0	803120.0	864691136050815731
5	245888.0	384512.0	800120.0	0
6	249792.0	391680.0	807080.0	0
7	243328.0	403008.0	794280.0	0
8	247872.0	386816.0	805320.0	0
9	260352.0	416640.0	802360.0	864691135013273238
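To illustrate what desired_resolution and split_positions do to a single row, the sketch below emulates both in plain Python. It shows the arithmetic only and is not CAVEclient's implementation; rescale_and_split is a hypothetical helper, and the stored resolution of 4x4x40 nm/voxel is taken from the table metadata.

```python
def rescale_and_split(row, column, stored_resolution, desired_resolution):
    """Emulate desired_resolution and split_positions for one position column."""
    # Copy every field except the position column being split.
    out = {key: value for key, value in row.items() if key != column}
    for axis, coord, stored, desired in zip(
        "xyz", row[column], stored_resolution, desired_resolution
    ):
        # Convert to nanometers, then to units of the desired resolution.
        out[f"{column}_{axis}"] = coord * stored / desired
    return out


# One row in stored (voxel) units.
row = {"pt_root_id": 864691136090135607, "pt_position": [228816, 239776, 19593]}
converted = rescale_and_split(row, "pt_position", [4, 4, 40], [1, 1, 1])
print(converted)
```

The single pt_position column becomes pt_position_x, pt_position_y, and pt_position_z, each expressed in nanometers.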

Filtering Queries

Filtering tables so that you only get data about certain rows back is a very common operation. While there are filtering options in the query_table function (see documentation for more details), a more unified filter interface is available through a “table manager” interface.

Rather than passing a table name to the query_table function, client.materialize.tables has a subproperty for each table in the database that can be used to filter that table.

The general pattern for usage is

client.materialize.tables.{table_name}({filter options}).query({format and timestamp options})

where {table_name} is the name of the table you want to filter, {filter options} is a collection of arguments for filtering the query, and {format and timestamp options} are those parameters controlling the format and timestamp of the query.

For example, let’s look at the table aibs_metamodel_celltypes_v661, which has cell type predictions across the dataset. We can get the whole table as a DataFrame:

cell_type_df = client.materialize.tables.aibs_metamodel_celltypes_v661().query()
cell_type_df.head()

	id	created	valid	volume	pt_supervoxel_id	pt_root_id	id_ref	created_ref	valid_ref	target_id	classification_system	cell_type	pt_position	bb_start_position	bb_end_position
0	336365	2020-09-28 22:42:48.966292+00:00	t	272.488202	93606511657924288	864691136274724621	36916	2023-12-19 22:47:18.659864+00:00	t	336365	excitatory_neuron	5P-IT	[209760, 180832, 27076]	[nan, nan, nan]	[nan, nan, nan]
1	110648	2020-09-28 22:45:09.650639+00:00	t	328.533443	79385153184885329	864691135489403194	1070	2023-12-19 22:38:00.472115+00:00	t	110648	excitatory_neuron	23P	[106448, 129632, 25410]	[nan, nan, nan]	[nan, nan, nan]
2	112071	2020-09-28 22:43:34.088785+00:00	t	272.929423	79035988248401958	864691136147292311	1099	2023-12-19 22:38:00.898837+00:00	t	112071	excitatory_neuron	23P	[103696, 149472, 15583]	[nan, nan, nan]	[nan, nan, nan]
3	197927	2020-09-28 22:43:10.652649+00:00	t	91.308851	84529699506051734	864691136050858227	13259	2023-12-19 22:41:14.417986+00:00	t	197927	nonneuron	oligo	[143600, 186192, 26471]	[nan, nan, nan]	[nan, nan, nan]
4	198087	2020-09-28 22:41:36.677186+00:00	t	161.744978	83756261929388963	864691135809440972	13271	2023-12-19 22:41:14.685474+00:00	t	198087	nonneuron	astrocyte	[137952, 190944, 27361]	[nan, nan, nan]	[nan, nan, nan]

and we can add similar formatting options as in the last section to the query function:

cell_type_df = client.materialize.tables.aibs_metamodel_celltypes_v661().query(
    split_positions=True,
    desired_resolution=[1, 1, 1],
    select_columns=['pt_position', 'pt_root_id', 'cell_type'],
    limit=10,
)
cell_type_df

	cell_type	pt_position_x	pt_position_y	pt_position_z	pt_root_id
0	23P	257600.0	487936.0	802760.0	864691135724233643
1	23P	260992.0	493568.0	801560.0	864691136436395166
2	NGC	256256.0	466432.0	831040.0	864691135462260637
3	23P	255744.0	480640.0	833200.0	864691136723556861
4	23P	262144.0	505856.0	824880.0	864691135776658528
5	23P	257536.0	521728.0	804440.0	864691135941166708
6	23P	251840.0	552896.0	832320.0	864691135545065768
7	23P	251136.0	546048.0	821320.0	864691135479369926
8	23P	256000.0	626368.0	814000.0	864691135697633557
9	astrocyte	324096.0	417920.0	658880.0	864691135937358133

However, now we can also filter the table to get only cells that are predicted to have cell type "BC" (for “basket cell”).

my_cell_type = "BC"
client.materialize.tables.aibs_metamodel_celltypes_v661(cell_type=my_cell_type).query()

	id_ref	created_ref	valid_ref	target_id	classification_system	cell_type	id	created	valid	volume	pt_supervoxel_id	pt_root_id	pt_position	bb_start_position	bb_end_position
0	43009	2023-12-19 22:48:53.577191+00:00	t	369908	inhibitory_neuron	BC	369908	2020-09-28 22:40:41.814964+00:00	t	332.862751	96002690286851358	864691136522768017	[227104, 207840, 20841]	[nan, nan, nan]	[nan, nan, nan]
1	12051	2023-12-19 22:40:57.133228+00:00	t	193846	inhibitory_neuron	BC	193846	2020-09-28 22:40:41.897904+00:00	t	306.148966	82838443188669165	864691135684976823	[131568, 168496, 16452]	[nan, nan, nan]	[nan, nan, nan]
2	83044	2023-12-19 22:58:50.269173+00:00	t	615735	inhibitory_neuron	BC	615735	2020-09-28 22:40:41.957345+00:00	t	314.539540	112181247505371364	864691136311774525	[344880, 161104, 17084]	[nan, nan, nan]	[nan, nan, nan]
3	48718	2023-12-19 22:50:21.192138+00:00	t	401681	inhibitory_neuron	BC	401681	2020-09-28 22:40:42.066718+00:00	t	497.801462	98465046644219429	864691136052141043	[245232, 203952, 21268]	[nan, nan, nan]	[nan, nan, nan]
4	82324	2023-12-19 22:58:39.896999+00:00	t	613047	inhibitory_neuron	BC	613047	2020-09-28 22:40:41.982376+00:00	t	242.159780	113234168401651200	864691136065413528	[352688, 141616, 25312]	[nan, nan, nan]	[nan, nan, nan]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3360	8968	2023-12-19 22:40:09.246333+00:00	t	170777	inhibitory_neuron	BC	170777	2020-09-28 22:45:25.310708+00:00	t	499.103662	81230957054577082	864691135065994564	[119600, 250560, 15373]	[nan, nan, nan]	[nan, nan, nan]
3361	15548	2023-12-19 22:41:48.382554+00:00	t	208056	inhibitory_neuron	BC	208056	2020-09-28 22:45:25.401800+00:00	t	521.621668	84540007091735344	864691135801456226	[143472, 262944, 23693]	[nan, nan, nan]	[nan, nan, nan]
3362	79472	2023-12-19 22:57:53.993099+00:00	t	591219	inhibitory_neuron	BC	591219	2020-09-28 22:45:25.526753+00:00	t	567.517839	110216764830845707	864691135279126177	[330320, 204752, 25060]	[nan, nan, nan]	[nan, nan, nan]
3363	55791	2023-12-19 22:52:02.582669+00:00	t	438586	inhibitory_neuron	BC	438586	2020-09-28 22:45:25.430745+00:00	t	529.501389	99807894274485381	864691135395662581	[254912, 247440, 23680]	[nan, nan, nan]	[nan, nan, nan]
3364	50504	2023-12-19 22:50:48.576826+00:00	t	419363	inhibitory_neuron	BC	419363	2020-09-28 22:45:25.436862+00:00	t	530.642698	99716496901116512	864691136691390838	[254416, 90336, 20469]	[nan, nan, nan]	[nan, nan, nan]

3365 rows × 15 columns

or maybe we just want the cell types for a particular collection of root ids:

my_root_ids = [864691135771677771, 864691135560505569, 864691136723556861]
client.materialize.tables.aibs_metamodel_celltypes_v661(pt_root_id=my_root_ids).query()

	id	created	valid	volume	pt_supervoxel_id	pt_root_id	id_ref	created_ref	valid_ref	target_id	classification_system	cell_type	pt_position	bb_start_position	bb_end_position
0	19116	2020-09-28 22:41:51.767906+00:00	t	301.426115	74737997899501359	864691135771677771	11282	2023-12-19 22:40:43.249642+00:00	t	19116	excitatory_neuron	23P	[72576, 108656, 20291]	[nan, nan, nan]	[nan, nan, nan]
1	21783	2020-09-28 22:41:59.966574+00:00	t	263.637074	75795590176519004	864691135560505569	15681	2023-12-19 22:41:50.365399+00:00	t	21783	excitatory_neuron	23P	[80128, 124000, 16563]	[nan, nan, nan]	[nan, nan, nan]
2	4074	2020-09-28 22:42:41.341179+00:00	t	313.678234	73543309863605007	864691136723556861	50080	2023-12-19 22:50:42.474168+00:00	t	4074	excitatory_neuron	23P	[63936, 120160, 20830]	[nan, nan, nan]	[nan, nan, nan]
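The two filters above differ only in the kind of value passed: a single value keeps rows equal to it, while a list keeps rows matching any of its members. The sketch below emulates that behavior on plain dictionaries. filter_rows is a hypothetical stand-in rather than the table manager's implementation, and the rows are abbreviated from the outputs above.

```python
def filter_rows(rows, **filters):
    """Keep rows matching every filter: equality for scalars, membership for lists."""

    def matches(row):
        for column, wanted in filters.items():
            if isinstance(wanted, (list, tuple, set)):
                if row[column] not in wanted:
                    return False
            elif row[column] != wanted:
                return False
        return True

    return [row for row in rows if matches(row)]


# Abbreviated rows: root id and predicted cell type only.
rows = [
    {"pt_root_id": 864691135771677771, "cell_type": "23P"},
    {"pt_root_id": 864691136522768017, "cell_type": "BC"},
    {"pt_root_id": 864691136723556861, "cell_type": "23P"},
]

basket_cells = filter_rows(rows, cell_type="BC")
selected = filter_rows(rows, pt_root_id=[864691135771677771, 864691136723556861])
print(len(basket_cells), len(selected))
```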

You can get a list of all parameters that can be used for querying with the standard IPython/Jupyter docstring functionality, e.g. client.materialize.tables.aibs_metamodel_celltypes_v661.

Caution

Use of this functionality will show a brief warning that the interface is experimental. This is because the interface is still being developed and may change in the near future in response to user feedback.

Querying Proofread neurons

Proofread neurons

Proofreading is necessary to obtain accurate reconstructions of a cell. In the MICrONS dataset, the general rule is that the dendrites of cells with a single cell body are sufficiently proofread to trust synaptic connections onto the cell. Axons, on the other hand, require so much proofreading that only ~1,000 cells have axons that were proofread, to various degrees, well enough that their outputs can be used for analysis.

The table proofreading_status_and_strategy contains proofreading information about ~1,300 neurons. This website provides the most detailed overview. In brief, axons annotated with any strategy_axon were cleaned of false mergers, but not all were fully extended. The most important distinction is that axons annotated with axon_column_truncated were only proofread within a certain volume, whereas the others were proofread without such bias.

proof_all_df = client.materialize.query_table("proofreading_status_and_strategy", desired_resolution=[1, 1, 1], split_positions=True)

proof_all_df["strategy_axon"].value_counts()

strategy_axon
axon_partially_extended    979
axon_column_truncated      233
none                       185
axon_interareal            144
axon_fully_extended         80
Name: count, dtype: int64

We can filter our query to only return rows that match a condition by adding a filter to our query:

proof_df = client.materialize.query_table(
    "proofreading_status_and_strategy",
    # keep only rows whose strategy_axon is one of these values
    filter_in_dict={
        "strategy_axon": [
            "axon_partially_extended",
            "axon_fully_extended",
            "axon_interareal",
            "axon_column_truncated",
        ]
    },
    desired_resolution=[1, 1, 1],
    split_positions=True,
)

proof_df["strategy_axon"].value_counts()

strategy_axon
axon_column_truncated      598
axon_partially_extended    341
axon_interareal            146
axon_fully_extended         77
Name: count, dtype: int64
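Given the distinction drawn above between axon_column_truncated and the other strategies, it can be useful to tally the two groups separately. This minimal sketch uses the counts copied from the filtered output above; split_by_truncation is a hypothetical helper for illustration.

```python
# Counts copied from the filtered value_counts() output above.
strategy_counts = {
    "axon_column_truncated": 598,
    "axon_partially_extended": 341,
    "axon_interareal": 146,
    "axon_fully_extended": 77,
}


def split_by_truncation(counts, truncated_label="axon_column_truncated"):
    """Return (truncated, not_truncated) totals from a strategy -> count mapping."""
    truncated = counts.get(truncated_label, 0)
    not_truncated = sum(
        n for label, n in counts.items() if label != truncated_label
    )
    return truncated, not_truncated


truncated, not_truncated = split_by_truncation(strategy_counts)
print(truncated, not_truncated)
```

With a DataFrame in hand, the same mapping could be obtained from proof_df["strategy_axon"].value_counts().to_dict().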