Reworked codec pipelines #1670
Conversation
Force-pushed from 5d4d09a to 450bcc6
@normanrz Overall, this looks quite good to me. A couple of questions I had:
I refactored the codec pipeline quite a bit in the last commit.
Currently, the choice of codec pipeline is hard-coded. I am still looking for a way to specify that. Should that go into …
Force-pushed from 6c8c706 to 019ecc8
I think this PR is ready for a final review. I updated the PR description with the major changes. The only thing missing from my POV is the user-configurable batch size and codec pipeline selection. I'll add that after #1855 lands.
src/zarr/codecs/pipeline/batched.py
@dataclass(frozen=True)
class BatchedCodecPipeline(CodecPipeline):
Since CodecPipeline defines read_batch and write_batch, what is the relationship between those methods and the new functionality offered by this class? I think a clarifying docstring for the class might be useful here, because I don't find the inheritance relationship intuitive (and ditto for the HybridPipeline -- I have no intuition for what that one is for :) )
```diff
@@ -25,7 +25,7 @@ class Endian(Enum):
 @dataclass(frozen=True)
-class BytesCodec(ArrayBytesCodec):
+class BytesCodec(ArrayBytesCodecBatchMixin):
```
Will ArrayBytesCodec be used anywhere other than as the base class for ArrayBytesCodecBatchMixin? If not, we might want to consider simplifying the inheritance structure a bit.
I would anticipate that folks want to build a codec that implements their own batching, e.g. in Rust or on the GPU. That is why we should keep both classes.

I think it would be helpful to have some docstrings. At least, I find it hard to follow the intention without any help :)
I added a few docstrings and implemented the abstract Codec classes with Generics.
@normanrz -- this is an impressive piece of work. I should admit that the size of it made it hard to review, so I just have a few comments. Overall, I think it's a big step forward and I want to get my hands on it, so I favor moving it into the v3 branch ASAP.
This batched codec pipeline divides the chunk batches into batches of a configurable
batch size ("mini-batch"). Fetching, decoding, encoding and storing are performed in
lock step for each mini-batch. Multiple mini-batches are processed concurrently.
"""
this can come later but we're going to want some additional documentation on the behavior here. Reading this, I'm not entirely sure I get it.
Returns
-------
ArraySpec
"""
return chunk_spec

def evolve(self, array_spec: ArraySpec) -> Self:
this doesn't need to be addressed in this PR, but the docstring for this method doesn't describe the behavior I would expect from a method called "evolve" -- my intuition is that object.evolve(property=new_val) would return a copy of object, with property set to new_val, which I think is consistent with how it works in attrs. But codec.evolve here is rather different. Based on this docstring, I would think this method should be called "from_array_spec" or something, to make it clear that we are getting a new codec instance from an array spec (and it would make sense to use .evolve in this method, of course).
What about naming the function evolve_from_array_spec?
I think that works
alternatively, we could make from_array_spec a class method that takes keyword arguments to cover the attributes that the input array spec doesn't convey
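To illustrate the naming being proposed: an `evolve_from_array_spec` method derives a new codec instance from an array spec, with the attrs-style "evolve" step done via `dataclasses.replace`. The `EndianCodec` class and its adaptation rule below are hypothetical, invented only to show the shape of the API:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArraySpec:
    # Minimal stand-in for zarr's ArraySpec; the real class has more fields.
    dtype: str


@dataclass(frozen=True)
class EndianCodec:
    # Hypothetical codec, used only to illustrate the naming discussion.
    endian: str = "little"

    def evolve_from_array_spec(self, array_spec: ArraySpec) -> "EndianCodec":
        # Return a copy of this codec adapted to the array spec; the
        # copy-with-new-value step is dataclasses.replace.
        if array_spec.dtype.startswith(">"):
            return replace(self, endian="big")
        return self
```

The original instance stays unchanged; only the returned copy reflects the array spec.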
CodecOutput = TypeVar("CodecOutput", bound=np.ndarray | BytesLike)

async def batching_helper(
this is a very naive question, but given that this function exists, why do we need to implement batching by writing new methods for codec classes? Unless I'm missing something, the new batching methods just wrap batching_helper around the base encode / decode functionality defined per-codec.
to put the point differently, why can't the codec pipeline class implement batching by calling batching_helper on the encode / decode methods of the codecs it contains?
This doesn't need to be addressed here, so feel free to ignore for now
How would you feel about this?
```python
class _Codec(Generic[CodecInput, CodecOutput], Metadata):
    ...
    async def _decode_single(self, chunk_data: CodecOutput, chunk_spec: ArraySpec) -> CodecInput:
        raise NotImplementedError

    async def decode(
        self, chunk_data_and_specs: Iterable[tuple[CodecOutput | None, ArraySpec]]
    ) -> Iterable[CodecInput | None]:
        return await batching_helper(self._decode_single, chunk_data_and_specs)

    # same for encode
    ...
```
Batch-aware codecs would then only override the decode method and ignore the _decode_single method. That would be fine because _decode_single is a protected method not intended to be part of the public interface. Single-chunk codecs could override _decode_single and won't have to care about the batching. We could drop the batch mixins, then.
sounds good!
Co-authored-by: Joe Hamman <joe@earthmover.io>
* adds wrapper codecs for the v2 codec pipeline
* encode_chunk_key
* refactor ArrayV2 away
* empty zattrs
* Apply suggestions from code review
* unify ArrayMetadata
* abstract ArrayMetadata
* unified Array.create
* use zarr.config for batch_size
* update __init__.py
* ruff

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
Co-authored-by: Joe Hamman <joe@earthmover.io>
```python
@dataclass(frozen=True)
class V2Compressor(ArrayBytesCodecBatchMixin):
    compressor: dict[str, JSON] | None

    is_fixed_size = False

    async def decode_single(
        self,
        chunk_bytes: Buffer,
        chunk_spec: ArraySpec,
    ) -> NDBuffer:
        if chunk_bytes is None:
            return None

        if self.compressor is not None:
            compressor = numcodecs.get_codec(self.compressor)
            chunk_numpy_array = ensure_ndarray(
                await to_thread(compressor.decode, chunk_bytes.as_array_like())
            )
        else:
            chunk_numpy_array = ensure_ndarray(chunk_bytes.as_array_like())

        # ensure correct dtype
        if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
            chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)

        return NDBuffer.from_numpy_array(chunk_numpy_array)

    async def encode_single(
        self,
        chunk_array: NDBuffer,
        _chunk_spec: ArraySpec,
    ) -> Buffer | None:
        chunk_numpy_array = chunk_array.as_numpy_array()
        if self.compressor is not None:
            compressor = numcodecs.get_codec(self.compressor)
            if (
                not chunk_numpy_array.flags.c_contiguous
                and not chunk_numpy_array.flags.f_contiguous
            ):
                chunk_numpy_array = chunk_numpy_array.copy(order="A")
            encoded_chunk_bytes = ensure_bytes(
                await to_thread(compressor.encode, chunk_numpy_array)
            )
        else:
            encoded_chunk_bytes = ensure_bytes(chunk_numpy_array)

        return Buffer.from_bytes(encoded_chunk_bytes)

    def compute_encoded_size(self, _input_byte_length: int, _chunk_spec: ArraySpec) -> int:
        raise NotImplementedError


@dataclass(frozen=True)
class V2Filters(ArrayArrayCodecBatchMixin):
    filters: list[dict[str, JSON]]

    is_fixed_size = False

    async def decode_single(
        self,
        chunk_array: NDBuffer,
        chunk_spec: ArraySpec,
    ) -> NDBuffer:
        chunk_numpy_array = chunk_array.as_numpy_array()
        # apply filters in reverse order
        if self.filters is not None:
            for filter_metadata in self.filters[::-1]:
                filter = numcodecs.get_codec(filter_metadata)
                chunk_numpy_array = await to_thread(filter.decode, chunk_numpy_array)

        # ensure correct chunk shape
        if chunk_numpy_array.shape != chunk_spec.shape:
            chunk_numpy_array = chunk_numpy_array.reshape(
                chunk_spec.shape,
                order=chunk_spec.order,
            )

        return NDBuffer.from_numpy_array(chunk_numpy_array)

    async def encode_single(
        self,
        chunk_array: NDBuffer,
        chunk_spec: ArraySpec,
    ) -> NDBuffer | None:
        chunk_numpy_array = chunk_array.as_numpy_array().ravel(order=chunk_spec.order)

        for filter_metadata in self.filters:
            filter = numcodecs.get_codec(filter_metadata)
            chunk_numpy_array = await to_thread(filter.encode, chunk_numpy_array)

        return NDBuffer.from_numpy_array(chunk_numpy_array)
```
@madsbk I added a very naive Buffer integration in these codecs. Could you please help me get that right?
I'll merge this PR for now. We can work on a new PR to optimize this.
Sounds good!
This PR refactors the codec pipelines in the v3 codebase. There is now a new default implementation: BatchedCodecPipeline, which divides the chunk batches into configurable "mini-batches". Within a mini-batch all steps are run in lock step (e.g. fetching from store, decoding, encoding, writing to store). Multiple mini-batches are processed concurrently.

This PR moves a lot of code from the Array to the codec pipeline, which is an opportunity to share more code between the Array and the ShardingCodec. To make that work, new ByteGetter and ByteSetter protocols are introduced to generalize the existing StorePath.

It also changes the Codec API by making the decode and encode methods take chunk batches.

TODO: