1. Introduction

1.1 Background and Motivation

The increasing scale of scientific and industrial 3D data presents significant challenges for data storage, access, and analysis. Modern applications routinely generate datasets containing millions to trillions of vector objects—point clouds from LiDAR and microscopy, meshes from 3D reconstruction, skeletons from connectomics, and streamlines from tractography. These datasets often require:

Distributed storage and access: Data must be stored in cloud object stores (S3, GCS, Azure Blob) and accessed from distributed computing environments
Spatial querying: Efficient retrieval of data within spatial regions without loading entire datasets
Incremental updates: Ability to add or modify data without regenerating entire files
Multi-resolution visualization: Support for level-of-detail rendering at different scales
Rich metadata: Complex annotations at multiple levels (points, objects, groups)
Concurrent access: Multiple processes or users reading and writing simultaneously

Traditional file formats (LAS, PLY, OBJ, STL) were designed for single-file, single-machine workflows. They lack spatial indexing, multi-resolution support, and distributed write capabilities. While newer formats like 3D Tiles and I3S address visualization needs, they are optimized for streaming and lack the flexibility needed for scientific computing and analysis workflows.

The Zarr Vectors (ZV) format addresses these limitations by providing a cloud-native, spatially-indexed format that supports distributed uncoordinated reads and writes, native multi-resolution representations, fragment- and object-level re-use across resolution levels, and rich metadata—all built on the proven Zarr v3 storage foundation.

1.2 Purpose and Goals

The Zarr Vector Format specification defines a standardized, cloud-native format for storing very large N-dimensional vector objects. The format is designed to:

Enable scalable storage: Support datasets ranging from thousands to trillions of vector objects
Facilitate distributed access: Enable efficient spatial queries and selective data loading from cloud storage
Support concurrent operations: Allow multiple processes to read and write data without coordination
Provide multi-resolution support: Native support for hierarchical level-of-detail representations
Enable rich metadata: Support arbitrary metadata at point, object, and group levels
Maintain format flexibility: Support diverse geometry types (point clouds, meshes, skeletons, streamlines) with extensibility for custom types
Ensure interoperability: Provide clear migration paths from existing formats and compatibility with common tools

The format is designed to be:

Cloud-native: Optimized for object storage backends (S3, GCS, Azure Blob) and filesystems
Analysis-friendly: Supports scientific computing workflows, not just visualization
Extensible: Allows custom geometry types and metadata schemas
Efficient: Enables selective access patterns to minimize data transfer and memory usage

1.3 Design Principles

The Zarr Vector Format is built on several core design principles:

1.3.1 Cloud-Native Architecture

The format uses Zarr as its underlying storage mechanism, enabling:

Storage in cloud object stores, distributed filesystems, or local filesystems
Efficient chunked storage with independent chunk access
Support for various storage backends through Zarr’s abstraction layer

1.3.2 Spatial Indexing as First-Class Feature

Spatial organization is fundamental to the format:

Data is organized by spatial chunks in N-dimensional space
Spatial queries can be answered by identifying relevant chunks
Chunk boundaries enable efficient spatial filtering
Spatial indexing supports both regular grids and (future) hierarchical structures
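
As an informal illustration of how a spatial query reduces to chunk selection on a regular grid, the sketch below lists the chunk indices that overlap a bounding box. The function name `chunks_intersecting`, the grid anchored at the origin, and the half-open chunk extents are assumptions made for this example, not definitions from the specification.

```python
from itertools import product
from math import floor


def chunks_intersecting(bbox_min, bbox_max, chunk_shape):
    """Return grid indices of all chunks that overlap a closed bounding box.

    Assumes a regular grid anchored at the origin, where chunk index k on an
    axis covers the half-open interval [k * size, (k + 1) * size).
    """
    # First and last chunk index touched by the box on each axis.
    first = [floor(lo / size) for lo, size in zip(bbox_min, chunk_shape)]
    last = [floor(hi / size) for hi, size in zip(bbox_max, chunk_shape)]
    # The Cartesian product of the per-axis ranges enumerates every hit.
    axis_ranges = [range(a, b + 1) for a, b in zip(first, last)]
    return list(product(*axis_ranges))


hits = chunks_intersecting((90.0, 10.0, 0.0), (130.0, 40.0, 20.0), (64.0, 64.0, 64.0))
print(hits)  # [(1, 0, 0), (2, 0, 0)]
```

Only the chunks returned here would need to be fetched. Points inside those chunks but outside the box would still have to be filtered after loading.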

1.3.3 Multi-Resolution Native Support

Multi-resolution is built into the format structure:

Multiple resolution levels are stored as separate Zarr groups
Each level can have different spatial chunk sizes
Downsampling strategies are specified in metadata
Separate levels enable progressive loading and level-of-detail visualization

1.3.4 Distributed Uncoordinated Writes

The format supports concurrent write operations:

Writers can operate on different spatial chunks independently
No global coordination required for most operations
Chunk-level atomicity through Zarr’s storage model
Explicit mechanisms for handling cross-chunk objects
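
One way to picture why writers on different spatial chunks need no coordination is to bin the input by chunk and give each chunk to exactly one writer. The sketch below is a hypothetical work-splitting scheme, not something the format mandates: `bin_points_by_chunk` and `assign_chunks_to_writers` are illustrative names, the grid is assumed to be anchored at the origin, and objects that span several chunks are not handled.

```python
from collections import defaultdict
from math import floor


def bin_points_by_chunk(points, chunk_shape):
    """Group points by the index of the spatial chunk that contains them."""
    bins = defaultdict(list)
    for point in points:
        key = tuple(floor(x / size) for x, size in zip(point, chunk_shape))
        bins[key].append(point)
    return dict(bins)


def assign_chunks_to_writers(chunk_keys, n_writers):
    """Give every chunk exactly one owner, so no two writers share a chunk."""
    return {key: i % n_writers for i, key in enumerate(sorted(chunk_keys))}


points = [(1.0, 2.0, 3.0), (70.0, 2.0, 3.0), (5.0, 5.0, 5.0), (130.0, 70.0, 0.0)]
bins = bin_points_by_chunk(points, (64.0, 64.0, 64.0))
owners = assign_chunks_to_writers(bins, n_writers=2)
print(owners)  # {(0, 0, 0): 0, (1, 0, 0): 1, (2, 1, 0): 0}
```

Because the assignment is a pure function of the sorted chunk keys, every worker can compute it independently and arrive at the same answer without exchanging messages.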

1.3.5 Rich Metadata at Multiple Levels

Metadata can be attached at various granularities:

Point-level: Attributes per vertex (e.g., gene expression, color, intensity)
Object-level: Properties of geometric entities (e.g., mesh material, streamline properties)
Group-level: Collections of objects (e.g., cell types, anatomical regions)
Format-level: Global properties (coordinate systems, resolution levels, geometry types)

1.3.6 Efficient Selective Access Patterns

The format enables targeted data retrieval:

Spatial queries: Load only chunks intersecting a bounding box
Attribute queries: Load only specific attribute channels
Object queries: Load only vertices belonging to specific objects
Multi-resolution queries: Load appropriate resolution level for the task

1.3.7 Geometry Type Flexibility

The format supports diverse geometry types with a unified storage model:

Point clouds: Unstructured collections of points
Meshes: Connected surfaces or volumes (triangular, quad, tetrahedral)
Skeletons: Graph structures representing topology
Streamlines/Polylines: Ordered sequences of connected points
Custom types: Extensible through metadata

1.4 Relationship to Other Formats

The Zarr Vector Format builds upon and extends concepts from several existing formats:

1.4.1 TRX Format (Tractography)

The TRX format (https://github.com/tee-ar-ex/trx-spec) provides the conceptual foundation:

Path offsets: TRX’s efficient storage of variable-length streamlines inspired the per-chunk fragment-index encoding
Channel dimension: TRX’s approach to storing vertex attributes with a channel dimension is adopted
Metadata model: TRX’s flexible metadata structure influenced the multi-level metadata design
Compatibility: When spatial indexing is collapsed to a single dimension, ZV closely aligns with TRX
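
The offsets idea behind variable-length streamline storage can be sketched as a flat vertex list plus an array of start positions. This is a generic illustration of the technique only: the helper names are hypothetical, the trailing sentinel offset is a convenience chosen for this example, and neither TRX's on-disk layout nor ZV's byte-level fragment index is reproduced here.

```python
from itertools import accumulate


def pack_streamlines(streamlines):
    """Flatten variable-length streamlines into one vertex list plus offsets.

    offsets[i] is the position of the first vertex of streamline i, and
    offsets[-1] is the total vertex count (a sentinel used in this sketch).
    """
    vertices = [vertex for line in streamlines for vertex in line]
    offsets = [0, *accumulate(len(line) for line in streamlines)]
    return vertices, offsets


def get_streamline(vertices, offsets, i):
    """Recover streamline i by slicing the flat vertex list."""
    return vertices[offsets[i]:offsets[i + 1]]


streamlines = [
    [(0, 0, 0), (1, 0, 0)],
    [(5, 5, 5)],
    [(2, 2, 2), (3, 3, 3), (4, 4, 4)],
]
vertices, offsets = pack_streamlines(streamlines)
print(offsets)  # [0, 2, 3, 6]
print(get_streamline(vertices, offsets, 2))  # [(2, 2, 2), (3, 3, 3), (4, 4, 4)]
```

Storing one flat array and one offsets array keeps the data in two fixed-dtype buffers regardless of how many streamlines there are or how their lengths vary.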

1.4.2 OME-Zarr

OME-Zarr’s multi-resolution approach is adapted for vector data:

Multi-resolution structure: OME-Zarr’s pyramid approach is extended to vector geometries
Zarr backend: Both formats leverage Zarr’s chunked storage
Metadata standards: OME-Zarr’s metadata conventions (RFC 4, RFC 5) are followed for coordinate reference systems
Extension model: OME-Zarr’s extensibility model influenced the design
Pyramid chunk-size growth: OME-Zarr image pyramids use larger voxels at coarser levels, so a chunk with the same voxel shape covers a larger physical region. ZV has no voxel concept, so it instead lets each pyramid level override chunk_shape directly (v0.7+); coarser levels can use larger chunks while staying nested in the level-0 grid.
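
The nesting of coarser chunks in the level-0 grid can be sketched as follows, assuming that "nested" means each axis of a level's chunk_shape is a whole-number multiple of the level-0 value. That reading, and the helper names `nesting_factors` and `parent_chunk`, are assumptions for illustration; the normative rule is given in the multi-resolution section of the specification.

```python
def nesting_factors(base_shape, level_shape, tol=1e-9):
    """Return the per-axis integer factor by which a level's chunks grow.

    Raises ValueError if any axis is not a whole-number multiple of the
    level-0 chunk size, since such chunks would straddle level-0 boundaries.
    """
    factors = []
    for base, coarse in zip(base_shape, level_shape):
        ratio = coarse / base
        if ratio < 1 or abs(ratio - round(ratio)) > tol:
            raise ValueError(f"{coarse} is not a whole multiple of {base}")
        factors.append(round(ratio))
    return tuple(factors)


def parent_chunk(base_index, factors):
    """Map a level-0 chunk index to the coarser chunk that contains it."""
    # Floor division also handles negative indices correctly in Python.
    return tuple(i // f for i, f in zip(base_index, factors))


factors = nesting_factors((64.0, 64.0, 64.0), (256.0, 256.0, 128.0))
print(factors)                           # (4, 4, 2)
print(parent_chunk((5, 3, 1), factors))  # (1, 0, 0)
```

Under this assumption every level-0 chunk falls inside exactly one coarser chunk, so a reader can move between levels with integer arithmetic alone.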

1.4.3 Zarr Specification

Zarr provides the storage foundation:

Chunked arrays: Zarr’s efficient chunked storage enables spatial indexing
Ragged objects: Per-chunk byte payloads (vertices, fragment-index, links, manifests) carry their own record framing inside Zarr v3 1-D uint8 arrays, so Zarr’s variable-length-chunk feature is not used
Store abstraction: Zarr’s store interface enables cloud-native storage
Metadata: Zarr v3’s zarr.json files carry format metadata under namespaced keys (zarr_vectors, zarr_vectors_level, zv_array)

1.4.4 Precomputed Mesh and Annotation Formats

The Neuroglancer precomputed mesh format influenced multi-resolution mesh design:

Multi-resolution meshes: Similar approach to storing meshes at multiple levels of detail
Spatial chunking: Concept of dividing meshes into spatial regions
Draco compression: Use of Draco for mesh compression

The companion precomputed annotation format covers points, lines, axis-aligned bounding boxes, and ellipsoids with per-annotation properties and per-segment relationships. ZV expresses the same primitives through geometry_types, per-vertex and per-object attributes, and groups. Coarsening differs: ZV aggregates per object, whereas the precomputed format takes a random subsample up to a limit. See Appendix L for the field-by-field mapping.

1.4.5 Traditional Formats (LAS, PLY, OBJ, STL)

While ZV addresses limitations of traditional formats, it maintains conceptual compatibility:

Geometry representation: Standard concepts (vertices, faces, edges) are preserved
Attribute storage: Traditional attribute concepts map to ZV’s attribute arrays
Migration paths: Clear conversion strategies from traditional formats

1.4.6 Visualization Formats (3D Tiles, I3S)

ZV complements visualization-focused formats:

Different goals: 3D Tiles/I3S optimize for web streaming; ZV optimizes for analysis
Shared concepts: Spatial indexing and multi-resolution are common themes
Interoperability: ZV data can be converted to visualization formats when needed

1.5 Key Features

The Zarr Vector Format provides several key features that distinguish it from existing formats:

Spatial Chunking: Data is organized into spatial chunks, enabling efficient spatial queries
Fragment-Indexed Chunks: Variable-length per-chunk structures (vertices, links, attributes) are partitioned by a compact byte-level fragment index that supports vertex / link / fragment re-use within and across objects
Selective Attribute Access: Attributes are stored with a channel dimension, enabling loading of specific attributes
Object Groups: Object identity is preserved across pyramid levels; group-level metadata partitions objects (cell types, brain regions, tracts, …) without imposing a hierarchy
Cross-Chunk Linking: Explicit cross-chunk link arrays handle objects that span multiple spatial chunks
Multi-Resolution Pyramids: Per-object pyramids with optional cross-pyramid-level link arrays for fragment-level LOD selection, plus per-level chunk_shape overrides for scalable coarse levels
Distributed Writes: Uncoordinated writes to different spatial chunks enable parallel processing; chunk-local fragment numbering removes the need for global coordination
Flexible Compression: Support for various compression codecs, including Draco for geometry
Rich Metadata: NGFF-compatible metadata at multiple levels with a LinkML schema
Extensibility: Support for custom geometry types, custom attributes, and writer-side capability tokens

1.6 Target Audiences

This specification is intended for:

Format Implementers: Developers creating libraries and tools to read/write ZV files
Application Developers: Developers building applications that work with large-scale vector data
Data Scientists: Researchers and analysts working with point clouds, meshes, and related data
Infrastructure Engineers: Engineers designing storage and compute systems for 3D data
Format Evaluators: Those considering ZV for their use cases and comparing it to alternatives

1.7 Document Structure

This specification is organized as follows:

Sections 1-4: Introduction, scope, terminology, and data model—essential reading for understanding the format
Sections 5-8: Core technical specifications—Zarr store structure, spatial indexing, arrays, and metadata
Sections 9-12: Advanced features—multi-resolution, cross-chunk linking, compression, and geometry types
Section 13: Conformance and validation—requirements for compliant implementations
Section 14: Examples—practical examples demonstrating format usage
Section 15: Appendices—reference material, schemas, algorithms, and migration guides

Reading Paths:

Quick Start: Read Sections 1, 2, 4, and 14 (examples)
Implementation: Read Sections 1-8, 13, and relevant appendices
Advanced Usage: Read all sections, focusing on 9-12 for specific features
Migration: Read Sections 1-4, then Appendix G (Migration Guide)

1.8 Conventions

Throughout this specification:

Normative text: Requirements that must be followed for conformance
Informative text: Explanatory material, examples, and recommendations
Code examples: Provided in pseudocode or Python-like syntax
Metadata examples: Provided in JSON format
References: Links to external specifications and resources