mlreco.iotools.collates module

Collate functions are middleware between parsers and datasets. They are passed to torch.utils.data.DataLoader as the collate_fn argument. There are two collate functions: one for sparse and one for dense input data.

class mlreco.iotools.collates.VolumeBoundaries(definitions)[source]

Bases: object

VolumeBoundaries is a helper class to deal with multiple detector volumes. Assume you have N volumes that you want to process independently, but your input data file does not separate between them (maybe it is hard to make the separation at simulation level, e.g. in Supera). You can specify in the configuration of the collate function where the volume boundaries are and this helper class will take care of the following:

1. Relabel batch ids: this will introduce “virtual” batch ids to account for each volume in each batch.

2. Shift coordinates: voxel coordinates are shifted such that the origin is always the bottom left corner of a volume. In other words, it ensures the voxel coordinate phase space is the same regardless of which volume we are processing. That way you can train on a single volume (subpart of the detector, e.g. cryostat or TPC) and process later however many volumes make up your detector.

3. Sort coordinates: concatenating the coordinates of N volumes is not guaranteed to yield the same ordering as the stored coordinates of label tensors, which by default already cover all volumes. Hence we apply np.lexsort to the coordinates after steps 1 and 2. We sort by: batch id, z, y, x, in this order.
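The three steps above can be sketched for a single boundary along x. This is a minimal illustration, not the class's implementation: the function name split_sketch, the virtual batch id formula (batch id * number of volumes + volume index) and the rule that a voxel on the boundary belongs to the upper volume are all assumptions.

```python
import numpy as np

def split_sketch(voxels, x_boundary, num_volumes=2):
    """Hypothetical illustration of the three steps for one boundary along x.

    voxels: (N, 4) array of (batch id, x, y, z).
    """
    voxels = np.asarray(voxels, dtype=np.float64)
    out = voxels.copy()
    # Volume index: 0 below the boundary, 1 at or above it (assumed convention).
    volume = (voxels[:, 1] >= x_boundary).astype(int)
    # 1. Relabel batch ids: one virtual batch id per (batch, volume) pair (assumed formula).
    out[:, 0] = voxels[:, 0] * num_volumes + volume
    # 2. Shift coordinates so that every volume starts at x = 0.
    out[:, 1] = voxels[:, 1] - volume * x_boundary
    # 3. Sort by batch id, then z, y, x. np.lexsort uses the LAST key as the primary key.
    perm = np.lexsort((out[:, 1], out[:, 2], out[:, 3], out[:, 0]))
    return out, perm

example = [[0, 5, 1, 1], [0, 12, 0, 0], [0, 2, 0, 0], [1, 11, 3, 2]]
new_voxels, perm = split_sketch(example, x_boundary=10.0)
```

Note that np.lexsort takes its keys in reverse priority, so the batch id column is passed last.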

An example configuration:

```yaml
collate:
  collate_fn: CollateSparse
  boundaries: [[1376.3], None, None]
```

boundaries defines the different volumes. Its length equals the number of spatial dimensions. For each spatial dimension, None means that there is no boundary along that axis. A list of floats specifies the volume boundaries along that axis in voxel units. The list of volumes is inferred from this list of boundaries (“meshgrid” style, taking all possible combinations of the boundaries to generate all the volumes).
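The “meshgrid” enumeration can be illustrated with a short sketch. The helper enumerate_volumes is hypothetical, and closing the outermost intervals with infinite edges is an assumption made here for illustration; the class may represent volume ranges differently.

```python
import itertools
import math

def enumerate_volumes(boundaries):
    """Hypothetical helper: list every volume as a tuple of per-axis intervals.

    boundaries: one entry per spatial axis, either None or a list of floats.
    """
    per_axis = []
    for cuts in boundaries:
        # Assumption: the outermost intervals are open-ended.
        edges = [-math.inf] + sorted(cuts or []) + [math.inf]
        # Consecutive edges give the intervals along this axis.
        per_axis.append(list(zip(edges[:-1], edges[1:])))
    # The Cartesian product of the per-axis intervals yields all volumes.
    return list(itertools.product(*per_axis))

volumes = enumerate_volumes([[1376.3], None, None])
```

With the configuration above this yields two volumes, split along the first axis at 1376.3. One boundary along x and two along y would yield 2 × 3 = 6 volumes.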

__init__(definitions)[source]

See explanation of boundaries above.

Parameters

definitions (list) –

num_volumes()[source]
Returns

Number of volumes inferred from the boundaries.

Return type

int

virtual_batch_ids(entry=0)[source]
Parameters

entry (int, optional) – Which entry of the dataset you are trying to access.

Returns

List of virtual batch ids that correspond to this entry.

Return type

list
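As an illustration of what such a list could look like, the sketch below assumes that virtual batch ids are contiguous and that each dataset entry reserves one id per volume. Both the helper name and the formula are assumptions, not taken from the source.

```python
def virtual_batch_ids_sketch(num_volumes, entry=0):
    """Hypothetical: virtual batch ids reserved for one dataset entry."""
    # Assumption: ids are contiguous, with num_volumes ids per entry.
    start = entry * num_volumes
    return list(range(start, start + num_volumes))

ids_for_entry_3 = virtual_batch_ids_sketch(num_volumes=2, entry=3)
```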

translate(voxels, volume)[source]

Reverses what the split method does: translates voxel coordinates that lie in the range of volume 0 into the range of the volume given as argument.

Parameters
  • voxels (np.ndarray) – Expected shape is (D_0, …, D_N, self.dim) with N >= 0. In other words, voxels can be a list of coordinates or a single coordinate with shape (self.dim,).

  • volume (int) –

Returns

Translated voxels array, using internally computed shifts.

Return type

np.ndarray

untranslate(voxels, volume)[source]

Reverses what the translate method does: translates voxel coordinates that lie in the range of the full detector into the range of a single volume, for the volume given as argument.

Parameters
  • voxels (np.ndarray) – Expected shape is (D_0, …, D_N, self.dim) with N >= 0. In other words, voxels can be a list of coordinates or a single coordinate with shape (self.dim,).

  • volume (int) –

Returns

Translated voxels array, using internally computed shifts.

Return type

np.ndarray
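The relationship between translate and untranslate can be sketched with NumPy broadcasting, which handles any leading shape (D_0, …, D_N) as long as the last axis holds the coordinates. The shifts array and both function names are hypothetical stand-ins for the internally computed shifts.

```python
import numpy as np

# Assumed per-volume shifts in voxel units, one row per volume (hypothetical values).
shifts = np.array([[0.0, 0.0, 0.0],
                   [1376.0, 0.0, 0.0]])

def translate_sketch(voxels, volume):
    # Broadcasting adds the shift along the last axis for any leading shape.
    return np.asarray(voxels, dtype=np.float64) + shifts[volume]

def untranslate_sketch(voxels, volume):
    # Inverse operation: subtract the same shift.
    return np.asarray(voxels, dtype=np.float64) - shifts[volume]

single = np.array([1.0, 2.0, 3.0])   # one coordinate, shape (3,)
many = np.zeros((2, 5, 3))           # nested list of coordinates
round_trip = untranslate_sketch(translate_sketch(many, 1), 1)
```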

split(voxels)[source]
Parameters

voxels (np.array, shape (N, 4)) – Contains (batch id, x, y, z) coordinates in this order (for example, when working in 3D).

Returns

  • new_voxels (np.array, shape (N, 4)) – The voxels with shifted coordinates and virtual batch ids. This array is not yet permuted to obey the lexsort.

  • perm (np.array, shape (N,)) – A permutation that can be used to apply the lexsort to both the new voxels and the features or data tensor (which is not passed to this function).
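Because the returned array is not yet sorted, the caller applies the same permutation to the voxels and to any per-voxel data. The sketch below uses hand-written stand-ins for the two return values rather than calling split, so the array contents are purely illustrative.

```python
import numpy as np

# Stand-ins for the outputs of split (hypothetical values).
new_voxels = np.array([[1, 2, 0, 0],
                       [0, 5, 1, 1],
                       [0, 2, 0, 0]])
perm = np.lexsort((new_voxels[:, 1], new_voxels[:, 2],
                   new_voxels[:, 3], new_voxels[:, 0]))

# Per-voxel features, stored in the same row order as new_voxels.
features = np.array([[0.3], [0.7], [0.1]])

# Indexing both arrays with perm keeps voxels and features aligned.
sorted_voxels = new_voxels[perm]
sorted_features = features[perm]
```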


mlreco.iotools.collates.CollateSparse(batch, **kwargs)[source]

Collate sparse input.

Parameters
  • batch (list of dict) – Each list element (a single dictionary) is minibatch data: key-value pairs where each value is the return value of a parser function.

  • boundaries (list, optional, default is None) – A list of volume boundaries, to be set if you want to process distinct volumes independently. See the VolumeBoundaries documentation for more details and explanations.

Returns

A dictionary whose keys are the same as the keys in the input batch and whose values are lists of the data elements in the input.

Return type

dict

Notes

Assumptions:

  • The input batch is a tuple of length >= 1. A tuple of length 0 will fail (IndexError).

  • The dictionaries in the input batch tuple are assumed to have an identical list of keys.
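The documented input/output contract and the two assumptions can be illustrated with a deliberately simplified sketch. It only groups values by key and does not reproduce whatever sparse-specific processing CollateSparse performs; the function name and the sample keys are hypothetical.

```python
def collate_lists_sketch(batch):
    """Hypothetical: turn a sequence of dicts into a dict of lists."""
    # batch[0] raises IndexError on an empty batch, as noted above.
    keys = batch[0].keys()
    # Every sample is assumed to carry the same keys as the first one.
    return {key: [sample[key] for sample in batch] for key in keys}

samples = ({"input_data": [1, 2], "index": 0},
           {"input_data": [3, 4], "index": 1})
collated = collate_lists_sketch(samples)
```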

mlreco.iotools.collates.CollateDense(batch)[source]

Collate dense input.

Very basic collate function that makes a numpy.ndarray for each key.

Parameters

batch (list) –
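A minimal sketch of what "a numpy.ndarray for each key" could mean, assuming every sample holds arrays of identical shape under each key. The function name and the sample key are hypothetical, and this is not the library's implementation.

```python
import numpy as np

def collate_dense_sketch(batch):
    """Hypothetical: stack each key's values into one numpy.ndarray."""
    # Assumption: values under a given key share the same shape across samples.
    return {key: np.array([sample[key] for sample in batch]) for key in batch[0]}

dense_batch = [{"image": np.zeros((4, 4))}, {"image": np.ones((4, 4))}]
dense = collate_dense_sketch(dense_batch)
```

The leading axis of each resulting array indexes the samples in the batch.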