The Connectome Annotation Versioning Engine (CAVE) is a suite of tools developed at the Allen Institute and Seung Lab to manage large connectomics data.
To initialize a caveclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONs public data, we use the datastack name minnie65_public.
import os
from caveclient import CAVEclient

datastack_name = 'minnie65_public'
client = CAVEclient(datastack_name)

# Show the description of the datastack
client.info.get_datastack_info()['description']
'This is the publicly released version of the minnie65 volume and segmentation. '
CAVEclient Basics
The most frequent use of the CAVEclient is to query the database for annotations like synapses. All database functions are under the client.materialize property. To see what tables are available, use the get_tables function; to see a table's metadata, pass its name to get_table_metadata, e.g. client.materialize.get_table_metadata('nucleus_detection_v0'):
{'schema': 'nucleus_detection',
'id': 27096,
'created': '2020-11-02T18:56:35.530100',
'table_name': 'nucleus_detection_v0__minnie3_v1',
'valid': True,
'aligned_volume': 'minnie65_phase3',
'schema_type': 'nucleus_detection',
'user_id': '121',
'description': 'A table of nuclei detections from a nucleus detection model developed by Shang Mu, Leila Elabbady, Gayathri Mahalingam and Forrest Collman. Pt is the centroid of the nucleus detection. id corresponds to the flat_segmentation_source segmentID. Only included nucleus detections of volume>25 um^3, below which detections are false positives, though some false positives above that threshold remain. ',
'notice_text': None,
'reference_table': None,
'flat_segmentation_source': 'precomputed://https://bossdb-open-data.s3.amazonaws.com/iarpa_microns/minnie/minnie65/nuclei',
'write_permission': 'PRIVATE',
'read_permission': 'PUBLIC',
'last_modified': '2022-10-25T19:24:28.559914',
'segmentation_source': '',
'pcg_table_name': 'minnie3_v1',
'last_updated': '2024-01-23T23:00:00.080429',
'annotation_table': 'nucleus_detection_v0',
'voxel_resolution': [4.0, 4.0, 40.0]}
You get back a dictionary of values. Two fields are particularly important: description, which offers a text description of the contents of the table, and voxel_resolution, which defines the units of the coordinates in the table, in nm/voxel.
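To make the role of voxel_resolution concrete, here is a minimal sketch of converting a point from voxel coordinates to nanometers. The example point is illustrative; the resolution matches the metadata above.

```python
# Positions in a table are stored in voxel coordinates at the table's
# voxel_resolution (nm/voxel). To convert a point to nanometers,
# multiply elementwise. (Example point chosen for illustration.)
voxel_resolution = [4.0, 4.0, 40.0]    # nm/voxel, from the metadata above
pt_position = [179140, 188230, 20239]  # a point in voxel coordinates
pt_nm = [p * r for p, r in zip(pt_position, voxel_resolution)]
print(pt_nm)  # [716560.0, 752920.0, 809560.0]
```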
Querying Tables
To get the contents of a table, use the query_table function. This returns the entire table without any filtering, up to a maximum of 200,000 rows. The table is returned as a Pandas DataFrame, and you can immediately use standard Pandas functions on it.
While most tables are small enough to be returned in full, the synapse table has hundreds of millions of rows and is too large to download this way.
Tables have a collection of columns, some of which specify a point in space (columns ending in _position), some a root id (ending in _root_id), and others that contain further information about the object at that point. Before describing some of the most important tables in the database, it’s useful to know about a few advanced options that apply when querying any table.
desired_resolution : This parameter allows you to convert the columns specifying spatial points to different resolutions. Many tables are stored at a resolution of 4x4x40 nm/voxel, for example, but you can convert to nanometers by setting desired_resolution=[1,1,1].
split_positions : This parameter allows you to split the columns specifying spatial points into separate columns for each dimension. The new column names will be the original column name with _x, _y, and _z appended.
select_columns : This parameter allows you to get only a subset of columns from the table. Once you know exactly what you want, this can save you some cleanup.
limit : This parameter allows you to limit the number of rows returned. If you are just testing out a query or trying to inspect the kind of data within a table, you can set this to a small number to make sure it works before downloading the whole table. Note that this will show a warning so that you don’t accidentally limit your query when you don’t mean to.
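The effect of desired_resolution and split_positions amounts to transformations like the following, sketched here with pandas on synthetic data (the real client performs these conversions for you when it returns the DataFrame):

```python
import pandas as pd

# Toy table whose positions are stored at 4x4x40 nm/voxel (synthetic data).
df = pd.DataFrame({'pt_position': [[1000, 2000, 100], [1500, 2500, 110]]})

# desired_resolution=[1, 1, 1]: convert voxel coordinates to nanometers
stored_resolution = [4, 4, 40]
df['pt_position'] = df['pt_position'].apply(
    lambda p: [c * r for c, r in zip(p, stored_resolution)]
)

# split_positions=True: one column per dimension, named {column}_x/_y/_z
df[['pt_position_x', 'pt_position_y', 'pt_position_z']] = pd.DataFrame(
    df['pt_position'].tolist(), index=df.index
)
print(df[['pt_position_x', 'pt_position_y', 'pt_position_z']])
```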
Filtering tables so that you only get data about certain rows back is a very common operation. While there are filtering options in the query_table function (see documentation for more details), a more unified filter interface is available through a “table manager” interface.
Rather than passing a table name to the query_table function, client.materialize.tables has a subproperty for each table in the database that can be used to filter that table.
The general pattern is client.materialize.tables.{table_name}({filter options}).query({format and timestamp options}), where {table_name} is the name of the table you want to filter, {filter options} is a collection of arguments for filtering the query, and {format and timestamp options} are the parameters controlling the format and timestamp of the query.
For example, let’s look at the table aibs_soma_nuc_metamodel_preds_v117, which has cell type predictions across the dataset. We can get the whole table as a DataFrame:
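A minimal sketch of such a query is below. It is guarded with a try/except so it degrades gracefully when caveclient or the CAVE service is unavailable; the datastack and table names are those used above.

```python
# Fetch the full cell-type prediction table as a DataFrame.
# Guarded: requires caveclient and network access to the CAVE service.
ct_df = None
try:
    from caveclient import CAVEclient
    client = CAVEclient('minnie65_public')
    ct_df = client.materialize.query_table('aibs_soma_nuc_metamodel_preds_v117')
    print(f"Rows returned: {len(ct_df)}")
except Exception as exc:
    print(f"Query skipped: {exc}")
```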
You can get a list of all parameters that can be used for querying with the standard IPython/Jupyter docstring functionality, e.g. by appending ? to client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117.
Caution
Use of this functionality will show a brief warning that the interface is experimental. This is because the interface is still being developed and may change in the near future in response to user feedback.
Querying Synapses
While synapses are stored like any other table in the database, in this case synapses_pni_2, this table is much larger than any other at more than 337 million rows, and it works best when queried in a different way.
The synapse_query function allows you to query the synapse table more conveniently than other tables. In particular, the pre_ids and post_ids arguments let you specify which root id (or collection of root ids) you want to query, with pre_ids indicating the collection of presynaptic neurons and post_ids the collection of postsynaptic neurons.
Using both pre_ids and post_ids in one call is effectively a logical AND, returning only those synapses from neurons in the list of pre_ids that target neurons in the list of post_ids.
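Such an AND query might look like the sketch below (root ids chosen for illustration; guarded so it can be skipped when caveclient or the service is unavailable):

```python
# Sketch: a logical-AND synapse query, restricting both the presynaptic
# and postsynaptic root ids. Requires caveclient and network access.
pair_df = None
try:
    from caveclient import CAVEclient
    client = CAVEclient('minnie65_public')
    pair_df = client.materialize.synapse_query(
        pre_ids=864691135808473885,   # presynaptic root id (illustrative)
        post_ids=864691135546540484,  # postsynaptic root id (illustrative)
    )
    print(f"Synapses between the pair: {len(pair_df)}")
except Exception as exc:
    print(f"Query skipped: {exc}")
```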
Let’s look at one particular example.
my_root_id = 864691135808473885
syn_df = client.materialize.synapse_query(pre_ids=my_root_id)
print(f"Total number of output synapses for {my_root_id}: {len(syn_df)}")
syn_df.head()
Total number of output synapses for 864691135808473885: 1498
|   | id | created | superceded_id | valid | size | pre_pt_supervoxel_id | pre_pt_root_id | post_pt_supervoxel_id | post_pt_root_id | pre_pt_position | post_pt_position | ctr_pt_position |
|---|----|---------|---------------|-------|------|----------------------|----------------|-----------------------|-----------------|-----------------|------------------|-----------------|
| 0 | 158405512 | 2020-11-04 06:48:59.403833+00:00 | NaN | t | 420 | 89385416926790697 | 864691135808473885 | 89385416926797494 | 864691135546540484 | [179076, 188248, 20233] | [179156, 188220, 20239] | [179140, 188230, 20239] |
| 1 | 185549462 | 2020-11-04 06:49:10.903020+00:00 | NaN | t | 4832 | 91356016507479890 | 864691135808473885 | 91356016507470163 | 864691135884799088 | [193168, 190452, 19262] | [193142, 190404, 19257] | [193180, 190432, 19254] |
| 2 | 138110803 | 2020-11-04 06:49:46.758528+00:00 | NaN | t | 3176 | 87263084540201919 | 864691135808473885 | 87263084540199587 | 864691135195078186 | [163440, 104292, 19808] | [163498, 104348, 19806] | [163460, 104356, 19804] |
| 3 | 155339535 | 2020-11-04 09:53:22.361558+00:00 | NaN | t | 5624 | 88540717319827050 | 864691135808473885 | 88540717319834759 | 864691136039974142 | [173050, 186398, 21570] | [173026, 186518, 21573] | [173100, 186472, 21569] |
| 4 | 148262628 | 2020-11-04 06:53:27.294021+00:00 | NaN | t | 3536 | 88189766885093187 | 864691135808473885 | 88189835604584343 | 864691135250533976 | [170154, 193170, 21123] | [170046, 193240, 21123] | [170118, 193220, 21128] |
Note that synapse queries always return every synapse between the neurons in the query, even if there are multiple synapses between the same pair of neurons.
A common pattern to generate a list of connections between unique pairs of neurons is to group by the root ids of the presynaptic and postsynaptic neurons and then count the number of synapses between them. For example, to get the number of synapses from this neuron onto every other neuron, ordered by synapse count:
(
    syn_df
    .groupby(['pre_pt_root_id', 'post_pt_root_id'])
    .count()[['id']]
    .rename(columns={'id': 'syn_count'})
    .sort_values(by='syn_count', ascending=False)
)
# Note that the 'id' part here is just a way to quickly extract one column.
# This could be any of the remaining column names, but 'id' is often
# convenient because it is common to all tables.
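The same pattern on a toy table makes the shape of the result concrete (synthetic data, not from the dataset):

```python
import pandas as pd

# Toy synapse table: two synapses from neuron 100 onto 200, one onto 300.
syn = pd.DataFrame({
    'id': [1, 2, 3],
    'pre_pt_root_id': [100, 100, 100],
    'post_pt_root_id': [200, 200, 300],
})

# Group by the pre/post pair and count synapses per unique connection.
conn = (
    syn
    .groupby(['pre_pt_root_id', 'post_pt_root_id'])
    .count()[['id']]
    .rename(columns={'id': 'syn_count'})
    .sort_values(by='syn_count', ascending=False)
)
print(conn)
```

The result is indexed by the (pre, post) pair, with one row per unique connection.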