refactor v3 data types #2874


Open

d-v-b wants to merge 119 commits into main

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more of them. Accordingly, this PR refactors data types for the zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, confining the v2 / v3 differences to the places where they actually vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
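To make this concrete, here is a rough sketch (hypothetical names, not this PR's actual API) of a parametrized dtype class that owns its default value and its scalar-to-JSON conversion:

from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class DateTime64:
    unit: str = "ns"  # the parameter that a flat string enum cannot capture

    def to_dtype(self) -> np.dtype:
        return np.dtype(f"datetime64[{self.unit}]")

    def default_value(self) -> np.datetime64:
        return np.datetime64("NaT")

    def to_json_value(self, value: np.datetime64) -> int:
        # NaT is represented as the minimum int64, following numpy
        return int(value.view(np.int64))

# the same class covers every time unit
assert DateTime64(unit="s").to_dtype() == np.dtype("datetime64[s]")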

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions github-actions bot added the needs release notes label Feb 28, 2025
@d-v-b
Contributor Author

d-v-b commented Feb 28, 2025

Copying a comment @nenb made in this Zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc., as instances of the same class. If instead we bind a concrete numpy dtype as a class attribute, we need a separate class for U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
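For illustration, a minimal sketch (hypothetical names) of how a single class carrying the character code U covers every length:

from dataclasses import dataclass
from typing import ClassVar

import numpy as np

@dataclass(frozen=True)
class FixedLengthUnicode:
    char_code: ClassVar[str] = "U"  # the parametric type: U*
    length: int  # the parameter: U2, U3, ...

    def to_dtype(self) -> np.dtype:
        return np.dtype(f"{self.char_code}{self.length}")

# U2 and U3 are instances of the same class, not separate classes
assert FixedLengthUnicode(2).to_dtype() == np.dtype("U2")
assert FixedLengthUnicode(3).to_dtype() == np.dtype("U3")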

name: ClassVar[str]  # string name; feeds into the zarr metadata
dtype_cls: ClassVar[type[TDType]]  # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]  # used internally to classify scalars
default_value: TScalar  # per-dtype default fill value
Contributor Author

Child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (assigned automatically from the generic type parameter), a string kind (which we use internally for classifying scalars), and a default value (putting this here seems simpler than maintaining a function that maps each dtype to its default value, but we could potentially do that instead).

Comment on lines 268 to 283
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
kind = "numeric"

@classmethod
def from_dtype(cls, dtype: TDType) -> Self:
return cls()

def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
return int(data)

def from_json_value(
self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
) -> TScalar:
if check_json_int(data):
return self.to_dtype(endianness=endianness).type(data)
raise TypeError(f"Invalid type: {data}. Expected an integer.")
Contributor Author

I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
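A self-contained sketch of the pattern (simplified relative to the IntWrapperBase quoted above):

from typing import ClassVar

import numpy as np

class IntBase:
    # shared JSON logic lives here; subclasses only bind a concrete dtype
    dtype_name: ClassVar[str]

    def to_json_value(self, data: np.generic) -> int:
        return int(data)

    def from_json_value(self, data: int) -> np.generic:
        return np.dtype(self.dtype_name).type(data)

class Int8(IntBase):
    dtype_name = "int8"

class Int16(IntBase):
    dtype_name = "int16"

assert Int16().from_json_value(7) == np.int16(7)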

@nenb

nenb commented Mar 3, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

@d-v-b
Contributor Author

d-v-b commented Mar 4, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

Thanks for the summary! I have implemented the proposed solution.
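For reference, a sketch of that solution (hypothetical names; the actual implementation in this PR may differ):

from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class Int32:
    # implementation detail for zarr-python; never written to v3 metadata
    endianness: str | None = "little"

    def to_dtype(self) -> np.dtype:
        prefix = {"little": "<", "big": ">", None: "="}[self.endianness]
        return np.dtype(prefix + "i4")

    def to_json(self) -> str:
        # the on-disk v3 name carries no endianness; the codecs handle that
        return "int32"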

@d-v-b d-v-b mentioned this pull request Mar 5, 2025
@d-v-b
Contributor Author

d-v-b commented May 13, 2025

This is ready for another look! A summary of recent changes:

  • I added datetime64 and timedelta64, since we have public v3 specs for them.
  • I put all the data types in an npy module (for numpy). This anticipates a future in which we define non-numpy data types.
  • Within that module, I split the data types by category (int, float, etc.). I think this makes it more readable.
  • I defined a test data type class, which uses a pytest trick to make it simple to define subclasses for each dtype. This was the easiest way to get a lot of parametrized tests with minimal boilerplate, but it does involve a little indirection.
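The test pattern in the last bullet looks roughly like this (a sketch, not the actual test code in this PR):

import numpy as np

class _BaseDTypeTests:
    # no Test prefix, so pytest does not collect the base class itself
    dtype_str: str
    default: object

    def test_default_roundtrip(self) -> None:
        dtype = np.dtype(self.dtype_str)
        assert dtype.type(self.default) == self.default

# one tiny subclass per dtype yields a full test suite for each
class TestInt16(_BaseDTypeTests):
    dtype_str = "int16"
    default = 0

class TestFloat64(_BaseDTypeTests):
    dtype_str = "float64"
    default = 0.0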

I'd like to get this finished this week, so any feedback would be greatly appreciated!

Comment on lines 1 to 2
Adds zarr-specific data type classes. This replaces the direct use of numpy data types for zarr
v2 and a fixed set of string enums for zarr v3. For more on this new feature, see the `documentation </user-guide/data_types.html>`_
Contributor

This needs to explain if there are any breaking changes (do users need to change their code at all?), and if so what they are.

Contributor Author

See the changes added in d8a382a.

I think we will need a separate 3.0 -> 3.1 migration guide as well.

# check the compressors / filters for vlen-utf8
# Note that we are checking for the object dtype name.
return data == "|O"
elif zarr_format == 3:
Contributor Author

note to self: we need to support the string literal "string" as an alias here, for backwards compatibility.

cc @rabernat
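A sketch of what that alias handling could look like (the canonical name used here is an assumption):

def resolve_v3_dtype_name(name: str) -> str:
    # legacy alias kept for backwards compatibility
    aliases = {"string": "variable_length_utf8"}  # assumed canonical name
    return aliases.get(name, name)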

Contributor

@ianhi ianhi left a comment

Have not reviewed everything but commenting early to get this part out:

I think that the float to/from JSON infrastructure for fill values is not complete. As it stands, if xarray were to transition to using from_json_value on the dtype to read fill values at load time, it would not be able to read zarr files it wrote with prior versions of zarr-python, because those versions wrote encoded float fill values.

Comment on lines +119 to +121
# TODO: This check has a lot of assumptions in it! Chiefly, we assume that the
# numpy object dtype contains variable length strings, which is not in general true
# When / if zarr python supports ragged arrays, for example, this check will fail!
Contributor

I'm not sure I fully understand this. When will this not be true, and how will this fail in the future? Will it throw an error?

Contributor Author

@d-v-b d-v-b May 17, 2025

Today, zarr-python only handles the numpy object dtype if the user wants to work with variable-length strings. So we can use the mapping {numpy object dtype : variable-length string zarr dtype}. But there are other uses for the numpy object dtype, like variable-length arrays (ragged arrays) or arbitrary python objects. Both of these were supported by zarr-python 2.

So if we support these uses, we would have 3 or more different zarr data types that are all encoded into numpy arrays with the same numpy datatype (the object data type), and we would thus need more information than just a numpy dtype to choose the correct zarr data type.

I am inclined to completely drop automatic inference for the numpy object data type, and instead accept that it's ambiguous and push the burden of disambiguating it onto users, i.e. if someone has variable-length strings pre-numpy 2, it's on them to choose the zarr data type directly instead of relying on opaque inference from the ambiguous object dtype. But I haven't implemented this yet.
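A sketch of the "no automatic inference" approach described above (names are hypothetical):

import numpy as np

def infer_zarr_dtype(np_dtype: np.dtype, explicit: str | None = None) -> str:
    if np_dtype == np.dtype(object):
        if explicit is None:
            # object arrays could hold variable-length strings, ragged
            # arrays, or arbitrary python objects; refuse to guess
            raise ValueError(
                "The numpy object dtype is ambiguous; pass an explicit "
                "zarr data type instead."
            )
        return explicit
    return np_dtype.name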


self.lazy_load_list.clear()

def register(self: Self, key: str, cls: type[ZDType[TBaseDType, TBaseScalar]]) -> None:
Contributor

It would be nice to have even some minimal docstrings on the methods of this class, as they are user-facing (or at least developer-facing).

This could be a nice place to communicate some of the behavior around dtype insertion order
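One possible wording (the replacement behavior described here is an assumption that would need checking against the actual registry semantics):

def register(self, key: str, cls: type) -> None:
    """Register a ZDType subclass under ``key``.

    Registering a class under an existing ``key`` replaces the previous
    entry, so insertion order determines which dtype wins when several
    could match the same metadata.
    """
    self.contents[key] = cls  # ``contents`` is assumed; adapt as needed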

Comment on lines +140 to +155
def float_from_json_v3(data: JSONFloat) -> float:
"""
Convert a JSON float to a float (v3).

Parameters
----------
data : JSONFloat
The JSON float to convert.

Returns
-------
float
The float value.
"""
# todo: support the v3-specific NaN handling
return float_from_json_v2(data)
Contributor

Per the spec (https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html#permitted-fill-values) this needs to handle the case of an encoded float, for example: "AAAAAAAA+H8=" (discovered in one of the array tests)

Contributor Author

good catch!
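A sketch of handling that case, assuming the example string is base64 over the little-endian bytes of a float64 (which is what "AAAAAAAA+H8=" decodes to: NaN):

import base64
import struct

def float_from_json_v3(data: float | str) -> float:
    if isinstance(data, str):
        if data in ("NaN", "Infinity", "-Infinity"):
            return float(data.replace("Infinity", "inf"))
        # assumed encoding: base64 of the raw little-endian float64 bytes
        return struct.unpack("<d", base64.b64decode(data))[0]
    return float(data)

assert str(float_from_json_v3("AAAAAAAA+H8=")) == "nan"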

return float(data)


def float_to_json_v3(data: float | np.floating[Any]) -> JSONFloat:
Contributor

I'm pretty sure that, for a complete implementation of the v3 spec, this needs to give the user the option of encoding the value as a string (see the comment on the from_json method).
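A sketch of what that option could look like; the hex-string form is one possible reading of the spec, not a confirmed API:

import struct

import numpy as np

def float_to_json_v3(data: float, *, as_string: bool = False) -> float | str:
    if np.isnan(data):
        return "NaN"
    if np.isinf(data):
        return "Infinity" if data > 0 else "-Infinity"
    if as_string:
        # raw-bytes form rendered as a hex string (big-endian float64)
        return "0x" + struct.pack(">d", data).hex()
    return float(data)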

@d-v-b
Contributor Author

d-v-b commented May 17, 2025

Have not reviewed everything but commenting early to get this part out:

I think that the float to/from JSON infrastructure for fill values is not complete. As it stands, if xarray were to transition to using from_json_value on the dtype to read fill values at load time, it would not be able to read zarr files it wrote with prior versions of zarr-python, because those versions wrote encoded float fill values.

Thanks for checking this! Can you give me instructions for replicating the failing xarray tests?
