The Connectome Annotation Versioning Engine (CAVE) is a suite of tools developed at the Allen Institute and Seung Lab to manage large connectomics data.
To initialize a caveclient, we give it a datastack, which is a name that defines a particular combination of imagery, segmentation, and annotation database. For the MICrONs public data, we use the datastack name minnie65_public.
import os
from caveclient import CAVEclient

datastack_name = 'minnie65_public'
client = CAVEclient(datastack_name)

# Show the description of the datastack
client.info.get_datastack_info()['description']
'This is the publicly released version of the minnie65 volume and segmentation. '
CAVEclient Basics
The most frequent use of the CAVEclient is to query the database for annotations like synapses. All database functions are under the client.materialize property. To see what tables are available, use the get_tables function; to see a table's metadata, pass its name to get_table_metadata, e.g. client.materialize.get_table_metadata('nucleus_detection_v0'):
{'schema': 'nucleus_detection',
'id': 27096,
'created': '2020-11-02T18:56:35.530100',
'table_name': 'nucleus_detection_v0__minnie3_v1',
'valid': True,
'aligned_volume': 'minnie65_phase3',
'schema_type': 'nucleus_detection',
'user_id': '121',
'description': 'A table of nuclei detections from a nucleus detection model developed by Shang Mu, Leila Elabbady, Gayathri Mahalingam and Forrest Collman. Pt is the centroid of the nucleus detection. id corresponds to the flat_segmentation_source segmentID. Only included nucleus detections of volume>25 um^3, below which detections are false positives, though some false positives above that threshold remain. ',
'notice_text': None,
'reference_table': None,
'flat_segmentation_source': 'precomputed://https://bossdb-open-data.s3.amazonaws.com/iarpa_microns/minnie/minnie65/nuclei',
'write_permission': 'PRIVATE',
'read_permission': 'PUBLIC',
'last_modified': '2022-10-25T19:24:28.559914',
'segmentation_source': '',
'pcg_table_name': 'minnie3_v1',
'last_updated': '2024-01-23T23:00:00.080429',
'annotation_table': 'nucleus_detection_v0',
'voxel_resolution': [4.0, 4.0, 40.0]}
You get back a dictionary of values. Two fields are particularly important: description, which offers a text description of the contents of the table, and voxel_resolution, which defines the units of the coordinates in the table, in nm/voxel.
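To make the role of voxel_resolution concrete, here is a minimal sketch of converting a point from voxel coordinates to nanometers. The example point is illustrative; the resolution matches the metadata above.

```python
# Positions in a table are stored in voxel coordinates at the table's
# voxel_resolution (nm/voxel). To convert a point to nanometers,
# multiply elementwise. (Example point chosen for illustration.)
voxel_resolution = [4.0, 4.0, 40.0]    # nm/voxel, from the metadata above
pt_position = [179140, 188230, 20239]  # a point in voxel coordinates
pt_nm = [p * r for p, r in zip(pt_position, voxel_resolution)]
print(pt_nm)  # [716560.0, 752920.0, 809560.0]
```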
Querying Tables
To get the contents of a table, use the query_table function. This returns the entire table without any filtering, up to a maximum of 200,000 rows. The table is returned as a Pandas DataFrame, and you can immediately use standard Pandas functions on it.
While most tables are small enough to be returned in full, the synapse table has hundreds of millions of rows and is too large to download this way.
Tables have a collection of columns, some of which specify a point in space (columns ending in _position), some a root id (ending in _root_id), and others that contain further information about the object at that point. Before describing some of the most important tables in the database, it’s useful to know about a few advanced options that apply when querying any table.
desired_resolution : This parameter allows you to convert the columns specifying spatial points to different resolutions. Many tables are stored at a resolution of 4x4x40 nm/voxel, for example, but you can convert to nanometers by setting desired_resolution=[1,1,1].
split_positions : This parameter allows you to split the columns specifying spatial points into separate columns for each dimension. The new column names will be the original column name with _x, _y, and _z appended.
select_columns : This parameter allows you to get only a subset of columns from the table. Once you know exactly what you want, this can save you some cleanup.
limit : This parameter allows you to limit the number of rows returned. If you are just testing out a query or trying to inspect the kind of data within a table, you can set this to a small number to make sure it works before downloading the whole table. Note that this will show a warning so that you don’t accidentally limit your query when you don’t mean to.
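The effect of desired_resolution and split_positions amounts to transformations like the following, sketched here with pandas on synthetic data (the real client performs these conversions for you when it returns the DataFrame):

```python
import pandas as pd

# Toy table whose positions are stored at 4x4x40 nm/voxel (synthetic data).
df = pd.DataFrame({'pt_position': [[1000, 2000, 100], [1500, 2500, 110]]})

# desired_resolution=[1, 1, 1]: convert voxel coordinates to nanometers
stored_resolution = [4, 4, 40]
df['pt_position'] = df['pt_position'].apply(
    lambda p: [c * r for c, r in zip(p, stored_resolution)]
)

# split_positions=True: one column per dimension, named {column}_x/_y/_z
df[['pt_position_x', 'pt_position_y', 'pt_position_z']] = pd.DataFrame(
    df['pt_position'].tolist(), index=df.index
)
print(df[['pt_position_x', 'pt_position_y', 'pt_position_z']])
```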
Filtering tables so that you only get data about certain rows back is a very common operation. While there are filtering options in the query_table function (see documentation for more details), a more unified filter interface is available through a “table manager” interface.
Rather than passing a table name to the query_table function, client.materialize.tables has a subproperty for each table in the database that can be used to filter that table.
The general pattern is client.materialize.tables.{table_name}({filter options}).query({format and timestamp options}), where {table_name} is the name of the table you want to filter, {filter options} is a collection of arguments for filtering the query, and {format and timestamp options} are the parameters controlling the format and timestamp of the query.
For example, let’s look at the table aibs_soma_nuc_metamodel_preds_v117, which has cell type predictions across the dataset. We can get the whole table as a DataFrame:
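A minimal sketch of such a query is below. It is guarded with a try/except so it degrades gracefully when caveclient or the CAVE service is unavailable; the datastack and table names are those used above.

```python
# Fetch the full cell-type prediction table as a DataFrame.
# Guarded: requires caveclient and network access to the CAVE service.
ct_df = None
try:
    from caveclient import CAVEclient
    client = CAVEclient('minnie65_public')
    ct_df = client.materialize.query_table('aibs_soma_nuc_metamodel_preds_v117')
    print(f"Rows returned: {len(ct_df)}")
except Exception as exc:
    print(f"Query skipped: {exc}")
```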
You can get a list of all parameters that can be used for querying with the standard IPython/Jupyter docstring functionality, e.g. by appending ? to client.materialize.tables.aibs_soma_nuc_metamodel_preds_v117.
Caution
Use of this functionality will show a brief warning that the interface is experimental. This is because the interface is still being developed and may change in the near future in response to user feedback.
Querying Synapses
While synapses are stored like any other table in the database, in this case synapses_pni_2, this table is much larger than any other at more than 337 million rows, and it works best when queried in a different way.
The synapse_query function allows you to query the synapse table more conveniently than other tables. In particular, the pre_ids and post_ids arguments let you specify which root id (or collection of root ids) you want to query, with pre_ids indicating the collection of presynaptic neurons and post_ids the collection of postsynaptic neurons.
Using both pre_ids and post_ids in one call is effectively a logical AND, returning only those synapses from neurons in the list of pre_ids that target neurons in the list of post_ids.
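Such an AND query might look like the sketch below (root ids chosen for illustration; guarded so it can be skipped when caveclient or the service is unavailable):

```python
# Sketch: a logical-AND synapse query, restricting both the presynaptic
# and postsynaptic root ids. Requires caveclient and network access.
pair_df = None
try:
    from caveclient import CAVEclient
    client = CAVEclient('minnie65_public')
    pair_df = client.materialize.synapse_query(
        pre_ids=864691135808473885,   # presynaptic root id (illustrative)
        post_ids=864691135546540484,  # postsynaptic root id (illustrative)
    )
    print(f"Synapses between the pair: {len(pair_df)}")
except Exception as exc:
    print(f"Query skipped: {exc}")
```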
Let’s look at one particular example.
my_root_id = 864691135808473885
syn_df = client.materialize.synapse_query(pre_ids=my_root_id)
print(f"Total number of output synapses for {my_root_id}: {len(syn_df)}")
syn_df.head()
Total number of output synapses for 864691135808473885: 1498
|   | id | created | superceded_id | valid | size | pre_pt_supervoxel_id | pre_pt_root_id | post_pt_supervoxel_id | post_pt_root_id | pre_pt_position | post_pt_position | ctr_pt_position |
|---|----|---------|---------------|-------|------|----------------------|----------------|-----------------------|-----------------|-----------------|------------------|-----------------|
| 0 | 158405512 | 2020-11-04 06:48:59.403833+00:00 | NaN | t | 420 | 89385416926790697 | 864691135808473885 | 89385416926797494 | 864691135546540484 | [179076, 188248, 20233] | [179156, 188220, 20239] | [179140, 188230, 20239] |
| 1 | 185549462 | 2020-11-04 06:49:10.903020+00:00 | NaN | t | 4832 | 91356016507479890 | 864691135808473885 | 91356016507470163 | 864691135884799088 | [193168, 190452, 19262] | [193142, 190404, 19257] | [193180, 190432, 19254] |
| 2 | 138110803 | 2020-11-04 06:49:46.758528+00:00 | NaN | t | 3176 | 87263084540201919 | 864691135808473885 | 87263084540199587 | 864691135195078186 | [163440, 104292, 19808] | [163498, 104348, 19806] | [163460, 104356, 19804] |
| 3 | 155339535 | 2020-11-04 09:53:22.361558+00:00 | NaN | t | 5624 | 88540717319827050 | 864691135808473885 | 88540717319834759 | 864691136039974142 | [173050, 186398, 21570] | [173026, 186518, 21573] | [173100, 186472, 21569] |
| 4 | 148262628 | 2020-11-04 06:53:27.294021+00:00 | NaN | t | 3536 | 88189766885093187 | 864691135808473885 | 88189835604584343 | 864691135250533976 | [170154, 193170, 21123] | [170046, 193240, 21123] | [170118, 193220, 21128] |
Note that synapse queries always return every synapse between the neurons in the query, even if there are multiple synapses between the same pair of neurons.
A common pattern to generate a list of connections between unique pairs of neurons is to group by the root ids of the presynaptic and postsynaptic neurons and then count the number of synapses between them. For example, to get the number of synapses from this neuron onto every other neuron, ordered by synapse count:
(
    syn_df
    .groupby(['pre_pt_root_id', 'post_pt_root_id'])
    .count()[['id']]
    .rename(columns={'id': 'syn_count'})
    .sort_values(by='syn_count', ascending=False)
)
# Note that the 'id' part here is just a way to quickly extract one column.
# This could be any of the remaining column names, but 'id' is often
# convenient because it is common to all tables.
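The same pattern on a toy table makes the shape of the result concrete (synthetic data, not from the dataset):

```python
import pandas as pd

# Toy synapse table: two synapses from neuron 100 onto 200, one onto 300.
syn = pd.DataFrame({
    'id': [1, 2, 3],
    'pre_pt_root_id': [100, 100, 100],
    'post_pt_root_id': [200, 200, 300],
})

# Group by the pre/post pair and count synapses per unique connection.
conn = (
    syn
    .groupby(['pre_pt_root_id', 'post_pt_root_id'])
    .count()[['id']]
    .rename(columns={'id': 'syn_count'})
    .sort_values(by='syn_count', ascending=False)
)
print(conn)
```

The result is indexed by the (pre, post) pair, with one row per unique connection.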