refactor v3 data types #2874
Conversation
copying a comment @nenb made in this zulip discussion:
A feature of the character code is that it provides a way to distinguish parametric types like |
src/zarr/core/metadata/dtype.py
name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]]  # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (which gets assigned automatically from the generic type parameter), a string kind (we use this for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that matches dtype to default value, but we could potentially do that)
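For illustration, a child class might fill in these attributes like so (the Bool example below, including its kind string and default value, is a sketch assuming the DTypeWrapper base from this diff, not a quote of the PR):

import numpy as np

class Bool(DTypeWrapper[np.dtypes.BoolDType, np.bool_]):
    name = "bool"
    kind = "boolean"
    default_value = np.False_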
src/zarr/core/metadata/dtype.py
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
    kind = "numeric"

    @classmethod
    def from_dtype(cls, dtype: TDType) -> Self:
        return cls()

    def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
        return int(data)

    def from_json_value(
        self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
    ) -> TScalar:
        if check_json_int(data):
            return self.to_dtype(endianness=endianness).type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")
I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
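For instance, a concrete integer wrapper would then only need to pin down its name, its dtype / scalar type parameters, and its default value (the Int32 sketch below assumes the IntWrapperBase from this diff; it is illustrative, not the PR's exact code):

import numpy as np

class Int32(IntWrapperBase[np.dtypes.Int32DType, np.int32]):
    name = "int32"
    default_value = np.int32(0)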
Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make
Thanks for the summary! I have implemented the proposed solution.
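To make the v2 / v3 difference concrete, here are illustrative metadata fragments (hand-written for this discussion, not produced by the PR):

# zarr v2: byte order is part of the dtype string
v2_metadata = {"dtype": ">i4"}  # big-endian 32-bit signed integer

# zarr v3: the data type name carries no byte order; the "bytes" codec supplies it
v3_metadata = {
    "data_type": "int32",
    "codecs": [{"name": "bytes", "configuration": {"endian": "big"}}],
}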
This is ready for another look! A summary of recent changes:
I'd like to get this finished this week, so any feedback would be greatly appreciated!
changes/2874.feature.rst
Adds zarr-specific data type classes. This replaces the direct use of numpy data types for zarr v2 and a fixed set of string enums for zarr v3. For more on this new feature, see the `documentation </user-guide/data_types.html>`_
This needs to explain if there are any breaking changes (do users need to change their code at all?), and if so what they are.
see the changes to the changes file added in d8a382a
I think we will need a separate 3.0 -> 3.1 migration guide as well.
    # check the compressors / filters for vlen-utf8
    # Note that we are checking for the object dtype name.
    return data == "|O"
elif zarr_format == 3:
note to self: we need to support the string literal "string" as an alias here, for backwards compatibility.
cc @rabernat
Have not reviewed everything but commenting early to get this part out:
I think that the float to/from JSON infrastructure for fill values is not complete. As it stands, if xarray were to transition to using from_json_value on the type to read fill values on loading, it will not be able to read zarr files it wrote with prior versions of zarr-python, as it would encode the float fill values.
# TODO: This check has a lot of assumptions in it! Chiefly, we assume that the
# numpy object dtype contains variable length strings, which is not in general true.
# When / if zarr python supports ragged arrays, for example, this check will fail!
I'm not sure I fully understand this: when will this not be true, and in the future how will this fail? Will it throw an error?
Today, zarr-python only handles the numpy object dtype if the user wants to work with variable-length strings, so we can use the mapping {numpy object dtype : variable length string zarr dtype}. But there are other uses for the numpy object dtype, like variable-length arrays (ragged arrays), or arbitrary python objects. Both of these were supported by zarr-python 2.
So if we support these uses, we would have 3 or more different zarr data types that are all encoded into numpy arrays with the same numpy datatype (the object data type), and we would thus need more information than just a numpy dtype to choose the correct zarr data type.
I am inclined to completely drop automatic inference for the numpy object data type, and instead accept that it's ambiguous and push the burden of disambiguating it onto the users, i.e. if someone has variable-length strings pre-numpy 2, it's on them to choose the zarr data type directly instead of relying on opaque inference based on the ambiguous object dtype. But I haven't implemented this yet.
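A standalone illustration of the ambiguity, using plain numpy (independent of zarr-python):

import numpy as np

# Both are legitimate uses of the numpy object dtype, so the dtype alone cannot
# tell us whether the data are variable-length strings or ragged arrays.
strings = np.array(["a", "bc"], dtype=object)
ragged = np.array([[1], [1, 2]], dtype=object)
assert strings.dtype == ragged.dtype == np.dtype("|O")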
        self.lazy_load_list.clear()

    def register(self: Self, key: str, cls: type[ZDType[TBaseDType, TBaseScalar]]) -> None:
It would be nice to have even some minimal docstrings on the methods of this class, as they are user (or at least developer) facing.
This could be a nice place to communicate some of the behavior around dtype insertion order
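As a sketch of the kind of docstring being asked for, here is a simplified stand-in registry (the class body and the insertion-order remark are assumptions for illustration, not the PR's implementation):

class _DTypeRegistrySketch:
    """Hypothetical, simplified stand-in for the data type registry."""

    def __init__(self) -> None:
        self.contents: dict[str, type] = {}

    def register(self, key: str, cls: type) -> None:
        """
        Associate ``key`` with a data type class.

        If dtype resolution walks the registry in insertion order, the order in
        which data types are registered determines which class wins when more
        than one could match; that behavior belongs in this docstring.
        """
        self.contents[key] = cls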
def float_from_json_v3(data: JSONFloat) -> float:
    """
    Convert a JSON float to a float (v3).

    Parameters
    ----------
    data : JSONFloat
        The JSON float to convert.

    Returns
    -------
    float
        The float value.
    """
    # todo: support the v3-specific NaN handling
    return float_from_json_v2(data)
Per the spec (https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html#permitted-fill-values) this needs to handle the case of an encoded float, for example: "AAAAAAAA+H8="
(discovered in one of the array tests)
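For reference, a hypothetical helper (not part of the PR) showing how such an encoded fill value could be decoded, assuming it is the base64 of little-endian IEEE 754 bytes:

import base64
import struct

def float_fill_from_base64(data: str) -> float:
    # Decode a fill value given as a base64 string of raw IEEE 754 bytes.
    raw = base64.b64decode(data)
    fmt = {2: "<e", 4: "<f", 8: "<d"}[len(raw)]  # assumes little-endian byte order
    return struct.unpack(fmt, raw)[0]

print(float_fill_from_base64("AAAAAAAA+H8="))  # nan: bytes 00 00 00 00 00 00 f8 7f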
good catch!
    return float(data)


def float_to_json_v3(data: float | np.floating[Any]) -> JSONFloat:
I'm pretty sure that for a complete implementation of the v3 spec this needs to give the user the option of encoding as a string (see comment on the from_json method).
Thanks for checking this! Can you give me instructions for replicating the failing xarray tests?
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.
In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.
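To illustrate what "parametric" means here, a minimal sketch of a fixed-length byte-string data type that carries its length as state (the class name, methods, and JSON encoding are assumptions for illustration, not this PR's API):

from dataclasses import dataclass
import base64
import numpy as np

@dataclass(frozen=True)
class FixedLengthBytesSketch:
    """Hypothetical parametric dtype: a NULL-padded byte string of fixed length."""

    length: int

    def to_dtype(self) -> np.dtype:
        return np.dtype(f"|S{self.length}")

    def default_value(self) -> np.bytes_:
        return np.bytes_(b"")

    def to_json_value(self, data: np.bytes_) -> str:
        # raw bytes are not valid JSON, so store the scalar as base64 text
        return base64.b64encode(data).decode("ascii")

    def from_json_value(self, data: str) -> np.bytes_:
        return self.to_dtype().type(base64.b64decode(data))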