
File System

File system abstractions for data synchronization.


BaseFileSystem dataclass

BaseFileSystem()

partition

partition(
    size_bytes_limit=None,
    object_count_limit=None,
    raise_error_if_criteria_not_met=False,
)

Partitions the root tree folder structure into a list of nodes.

Partitioning is constrained by optional limits on total size and object count.

Parameters:

Name Type Description Default
size_bytes_limit Optional[int]

If specified, the total size of each partition (in bytes) must not exceed this value.

None
object_count_limit Optional[int]

If specified, the number of objects in each partition must not exceed this value.

None
raise_error_if_criteria_not_met bool

If True, raises an error when some nodes cannot meet the criteria. In practice this matters mainly for the size limit, when a single object is larger than size_bytes_limit.

False

Raises:

Type Description
ValueError

Raised if raise_error_if_criteria_not_met is True and the criteria cannot be met.

Returns:

Type Description
List[Node]

List of nodes representing the partition.

Source code in src/aibs_informatics_aws_utils/data_sync/file_system.py
def partition(
    self,
    size_bytes_limit: Optional[int] = None,
    object_count_limit: Optional[int] = None,
    raise_error_if_criteria_not_met: bool = False,
) -> List[Node]:
    """Partitions the root tree folder structure into a list of nodes.

    Partitioning is guided by constraints by size and object count.

    Args:
        size_bytes_limit: If specified, partitions must be less than the specified value.
        object_count_limit: If specified, partitions must contain fewer objects than
            the specified value.
        raise_error_if_criteria_not_met: If True, raises error if nodes cannot meet
            criteria. In actuality, this is more relevant for size limitations where
            an object size is greater than the size limit.

    Raises:
        ValueError: Thrown if raise_error_if_criteria_not_met is true and criteria not met.

    Returns:
        List of nodes representing the partition.
    """
    unchecked_nodes = {self.node}
    size_bytes_exceeding_obj_nodes = []

    partitioned_nodes: List[Node] = []
    logger.info(
        f"Partitioning nodes with size_bytes_limit={size_bytes_limit} "
        f"and object_count_limit={object_count_limit}"
    )

    while unchecked_nodes:
        unchecked_node = unchecked_nodes.pop()
        if (size_bytes_limit and unchecked_node.size_bytes > size_bytes_limit) or (
            object_count_limit and unchecked_node.object_count > object_count_limit
        ):
            if unchecked_node.has_children():
                unchecked_nodes.update(unchecked_node.children.values())
            else:
                size_bytes_exceeding_obj_nodes.append(unchecked_node)
        else:
            partitioned_nodes.append(unchecked_node)

    if size_bytes_exceeding_obj_nodes:
        msg = (
            f"Found {len(size_bytes_exceeding_obj_nodes)} objects that exceed the "
            f"partition size limit {size_bytes_limit}."
        )
        if raise_error_if_criteria_not_met:
            raise ValueError(msg)
        logger.warning(msg)
        partitioned_nodes.extend(size_bytes_exceeding_obj_nodes)
    logger.info(f"Partitioned {len(partitioned_nodes)} nodes.")
    return partitioned_nodes
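
The splitting behaviour can be tried out without an actual file system. The following is a minimal sketch, not the library's implementation: SketchNode and partition_sketch are hypothetical stand-ins that copy only the fields and the loop shown in the listing, and they use a list where the original uses a set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for Node with only the fields partitioning reads."""

    name: str
    size_bytes: int = 0
    object_count: int = 0
    children: Dict[str, "SketchNode"] = field(default_factory=dict)


def partition_sketch(
    root: SketchNode,
    size_bytes_limit: Optional[int] = None,
    object_count_limit: Optional[int] = None,
) -> List[SketchNode]:
    """Split nodes that exceed a limit into their children, top down."""
    pending = [root]
    accepted: List[SketchNode] = []
    oversized: List[SketchNode] = []
    while pending:
        node = pending.pop()
        too_big = bool(size_bytes_limit) and node.size_bytes > size_bytes_limit
        too_many = bool(object_count_limit) and node.object_count > object_count_limit
        if too_big or too_many:
            if node.children:
                # A folder over the limit is replaced by its children.
                pending.extend(node.children.values())
            else:
                # A single object over the limit cannot be split further.
                oversized.append(node)
        else:
            accepted.append(node)
    # Like the original with raise_error_if_criteria_not_met=False,
    # oversized objects are still returned as their own partitions.
    return accepted + oversized


logs = SketchNode(
    "logs", 40, 2,
    {"a.log": SketchNode("a.log", 15, 1), "b.log": SketchNode("b.log", 25, 1)},
)
root = SketchNode(
    "root", 550, 4,
    {
        "small.txt": SketchNode("small.txt", 10, 1),
        "big.bin": SketchNode("big.bin", 500, 1),
        "logs": logs,
    },
)

by_size = sorted(n.name for n in partition_sketch(root, size_bytes_limit=100))
by_count = sorted(n.name for n in partition_sketch(root, object_count_limit=1))
print(by_size)
print(by_count)
```

With a 100-byte limit the logs folder stays whole because its 40 bytes fit, while big.bin is returned on its own even though it exceeds the limit. With no limits at all, the root node is returned as the single partition.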

Node dataclass

Node(
    path_part,
    parent=None,
    children=dict(),
    size_bytes=0,
    object_count=0,
    last_modified=BEGINNING_OF_TIME,
    is_path_part_prefix=False,
    is_path_part_suffix=False,
)

Represents an object or folder in a file system path.

Attributes:

Name Type Description
path_part str

Specifies the key part of the fs path (an edge) to this node.

parent Optional['Node']

The parent node to which this node is connected. Defaults to None.

children Dict[str, 'Node']

Child nodes that exist under this path prefix.

size_bytes int

The size (in bytes) of all objects under this path prefix.

object_count int

The number of objects under this path prefix.

last_modified datetime

The most recent modification time of any object under this path prefix.
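
To illustrate how these attributes relate, here is a minimal sketch of a linked tree in which each node stores one path part and folder statistics are the sums over its children. MiniNode, attach, full_path and rollup are hypothetical names chosen for illustration, and joining parts with "/" is an assumption; the actual Node class may work differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MiniNode:
    """Hypothetical, reduced version of Node for illustration only."""

    path_part: str
    # Excluded from repr and comparison to avoid parent/child cycles.
    parent: Optional["MiniNode"] = field(default=None, repr=False, compare=False)
    children: Dict[str, "MiniNode"] = field(default_factory=dict)
    size_bytes: int = 0
    object_count: int = 0

    def attach(self, child: "MiniNode") -> "MiniNode":
        """Link a child under this node, keyed by its path part."""
        child.parent = self
        self.children[child.path_part] = child
        return child

    def full_path(self) -> str:
        """Rebuild the path by walking parent links up to the root."""
        parts = []
        node: Optional["MiniNode"] = self
        while node is not None:
            parts.append(node.path_part)
            node = node.parent
        return "/".join(reversed(parts))

    def rollup(self) -> None:
        """Recompute folder statistics as sums over all descendants."""
        if not self.children:
            return  # Leaf objects keep the values they were created with.
        for child in self.children.values():
            child.rollup()
        self.size_bytes = sum(c.size_bytes for c in self.children.values())
        self.object_count = sum(c.object_count for c in self.children.values())


root = MiniNode("data")
raw = root.attach(MiniNode("raw"))
a_csv = raw.attach(MiniNode("a.csv", size_bytes=100, object_count=1))
raw.attach(MiniNode("b.csv", size_bytes=50, object_count=1))
root.attach(MiniNode("readme.txt", size_bytes=5, object_count=1))
root.rollup()

print(a_csv.full_path(), root.size_bytes, root.object_count)
```

Storing only the path part on each node keeps the tree compact, and the full path is recovered from the parent links when needed.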

S3FileSystem dataclass

S3FileSystem(bucket, key)

Bases: BaseFileSystem

Generates a file system tree structure for an S3 path, with size and object count statistics.

Attributes:

Name Type Description
bucket str

The S3 bucket to describe.

key str

The S3 key to describe.
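
The kind of per-prefix statistics described here can be sketched from a flat object listing. This is an illustration under stated assumptions, not how S3FileSystem is implemented: prefix_stats is a hypothetical function, the records are hand-written (key, size, last modified) tuples standing in for an S3 listing, and every key is assumed to start with the given root key.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, datetime]  # (key, size in bytes, last modified)
Stats = Tuple[int, int, datetime]  # (size_bytes, object_count, last_modified)


def prefix_stats(root_key: str, listing: Iterable[Record]) -> Dict[str, Stats]:
    """Aggregate size, object count and latest modification per path prefix."""
    root = root_key.rstrip("/")
    stats: Dict[str, Stats] = {}
    for key, size, modified in listing:
        # Assumes key starts with root_key; split the remainder into parts.
        relative = key[len(root_key):].strip("/")
        parts = relative.split("/") if relative else []
        prefix = root
        prefixes = [prefix]
        for part in parts:
            prefix = f"{prefix}/{part}"
            prefixes.append(prefix)
        # Every ancestor prefix, and the object itself, counts this object.
        for p in prefixes:
            total, count, latest = stats.get(p, (0, 0, modified))
            stats[p] = (total + size, count + 1, max(latest, modified))
    return stats


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


listing = [
    ("data/raw/a.csv", 100, utc(2024, 1, 1)),
    ("data/raw/b.csv", 50, utc(2024, 3, 1)),
    ("data/readme.txt", 5, utc(2023, 6, 1)),
]
stats = prefix_stats("data/", listing)
for prefix in sorted(stats):
    print(prefix, stats[prefix][:2])
```

Each entry of the result corresponds to one node of the tree: the root prefix carries the totals for everything beneath it, and each object appears with an object count of 1.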