refactor v3 data types #2874

Open
wants to merge 119 commits into
base: main
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
3979746
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Apr 28, 2025
a210f9f
fully remove v3jsonencoder
d-v-b Apr 28, 2025
8fbd29a
refactor dtype module structure
d-v-b Apr 29, 2025
afc9872
add timedelta64
d-v-b Apr 29, 2025
e1bf901
refactor time dtypes
d-v-b Apr 30, 2025
45f0c88
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 1, 2025
890077e
widen dtype test strategies
d-v-b May 1, 2025
a3f05f0
modify structured dtype fill value rt to avoid to_dict
d-v-b May 2, 2025
4788f05
wip: begin creating isomorphic test suite for dtypes
d-v-b May 2, 2025
d3f9204
finish common tests
d-v-b May 2, 2025
fdf17e3
wip: test infrastructure for dtypes
d-v-b May 7, 2025
4afa42a
wip: use class-based tests for all dtypes
d-v-b May 7, 2025
4990803
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 7, 2025
1458aad
fill out more tests, and adjust sized dtypes
d-v-b May 8, 2025
9673997
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 8, 2025
aa11df4
wip: json schema test
d-v-b May 12, 2025
f706b46
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 12, 2025
52518c2
add casting tests
d-v-b May 13, 2025
4ab1c58
use relative link for changes
d-v-b May 13, 2025
e4c89f3
typo
d-v-b May 13, 2025
e386c2b
make bytes codec dtype logic a bit more literate
d-v-b May 13, 2025
703192c
increase deadline to 500ms
d-v-b May 13, 2025
0fab5e5
fewer commented sections of problematic lru_store_cache section of th…
d-v-b May 13, 2025
2f945bf
add link to gh issue about lru_cache for sharding codec
d-v-b May 13, 2025
63a6af4
attempt to speed up hypothesis tests by reducing max array size
d-v-b May 13, 2025
56e7c84
clean up docs
d-v-b May 13, 2025
eee0d7b
remove placeholder
d-v-b May 13, 2025
1dc8e72
make final example section doctested and more readable
d-v-b May 13, 2025
13ca230
revert change to auto chunking
d-v-b May 13, 2025
2a42205
revert quotation of literal type
d-v-b May 13, 2025
3f775c8
lint
d-v-b May 13, 2025
5320a77
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 13, 2025
b525b8e
fix broken code block
d-v-b May 13, 2025
ec94878
specialize test to handle stringdtype changes coming in numpy 2.3
d-v-b May 13, 2025
3af98aa
add docstring to _TestZDType class
d-v-b May 13, 2025
6388203
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
6ef7924
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
1329c69
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
d8c3672
type hints
d-v-b May 15, 2025
3f4d87a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 16, 2025
d8a382a
expand changelog
d-v-b May 16, 2025
9aa751b
tweak docstring
d-v-b May 16, 2025
9 changes: 9 additions & 0 deletions changes/2874.feature.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
Adds zarr-specific data type classes. This replaces the internal use of numpy data types for zarr
v2 and a fixed set of string enums for zarr v3. This change is largely internal, but it does
change the type of the ``dtype`` and ``data_type`` fields on the ``ArrayV2Metadata`` and
``ArrayV3Metadata`` classes. It also changes the JSON metadata representation of the
variable-length string data type, but the old metadata representation can still be
used when reading arrays. The logic for automatically choosing the chunk encoding for a given data
type has also changed, and this necessitated changes to the ``config`` API.

For more on this new feature, see the `documentation </user-guide/data_types.html>`_.
14 changes: 7 additions & 7 deletions docs/user-guide/arrays.rst
@@ -182,7 +182,7 @@ which can be used to print useful diagnostics, e.g.::
>>> z.info
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -199,7 +199,7 @@ prints additional diagnostics, e.g.::
>>> z.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -246,7 +246,7 @@ built-in delta filter::
The default compressor can be changed using Zarr's
:ref:`user-guide-config`, e.g.::

>>> with zarr.config.set({'array.v2_default_compressor.numeric': {'id': 'blosc'}}):
>>> with zarr.config.set({'array.v2_default_compressor.default': {'id': 'blosc'}}):
... z = zarr.create_array(store={}, shape=(100000000,), chunks=(1000000,), dtype='int32', zarr_format=2)
>>> z.filters
()
@@ -286,7 +286,7 @@ Here is an example using a delta filter with the Blosc compressor::
>>> z.info
Type : Array
Zarr format : 3
Data type : DataType.int32
Data type : Int32(endianness='little')
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
@@ -600,18 +600,18 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
>>> a.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.uint8
Data type : UInt8()
Shape : (10000, 10000)
Shard shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
Read-only : False
Store type : LocalStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Serializer : BytesCodec(endian=None)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000 (95.4M)
No. bytes stored : 3981552
No. bytes stored : 3981473
Storage ratio : 25.1
Shards Initialized : 100

59 changes: 25 additions & 34 deletions docs/user-guide/config.rst
@@ -43,39 +43,30 @@ This is the current default configuration::

>>> zarr.config.pprint()
{'array': {'order': 'C',
'v2_default_compressor': {'bytes': {'checksum': False,
Contributor

If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?

Contributor Author

no, the config in this PR has undergone breaking changes compared to main. We could make those changes backwards-compatible and add deprecation warnings to deprecated keys but this will require some effort.

Contributor

Okay, in that case the release notes definitely need expanding a lot to explain what the breaking changes are.

Contributor

My two cents on breaking changes is we should definitely deprecate where possible, because v3 was already a big breaking change that users (well, at least me 😄 ) are struggling to get used to, so to have more breaking changes without deprecations and migration paths would not be great.

Contributor Author

Agreed, we just need to sketch out how to do deprecations and migrations in our (terrible, IMO) config API

Contributor Author

"terrible" is an exaggeration -- our config API works today, but it has some flaws that make me think it should be overhauled

  • it's untyped
  • it uses raw python dictionaries, so we are missing a dynamic layer for adding indirection / deprecation warnings, etc

I'm not sure how many of these things can be addressed within the scope of donfig itself?

'id': 'zstd',
'level': 0},
'numeric': {'checksum': False,
'id': 'zstd',
'level': 0},
'string': {'checksum': False,
'v2_default_compressor': {'default': {'checksum': False,
'id': 'zstd',
'level': 0}},
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}],
'numeric': None,
'raw': None,
'string': [{'id': 'vlen-utf8'}]},
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'numeric': [{'configuration': {'checksum': False,
'level': 0},
'variable-length-string': {'checksum': False,
'id': 'zstd',
'level': 0}},
'v2_default_filters': {'default': None,
'variable-length-string': [{'id': 'vlen-utf8'}]},
'v3_default_compressors': {'default': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []},
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'},
'numeric': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
'codec_pipeline': {'batch_size': 1,
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
'variable-length-string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'default': [], 'variable-length-string': []},
'v3_default_serializer': {'default': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'variable-length-string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
'codec_pipeline': {'batch_size': 1,
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
'bytes': 'zarr.codecs.bytes.BytesCodec',
'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec',
'endian': 'zarr.codecs.bytes.BytesCodec',
@@ -85,7 +76,7 @@ This is the current default configuration::
'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec',
'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec',
'zstd': 'zarr.codecs.zstd.ZstdCodec'},
'default_zarr_format': 3,
'json_indent': 2,
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
'threading': {'max_workers': None}}
'default_zarr_format': 3,
'json_indent': 2,
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
'threading': {'max_workers': None}}
6 changes: 3 additions & 3 deletions docs/user-guide/consolidated_metadata.rst
@@ -47,7 +47,7 @@ that can be used.:
>>> from pprint import pprint
>>> pprint(dict(sorted(consolidated_metadata.items())))
{'a': ArrayV3Metadata(shape=(1,),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -60,7 +60,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'b': ArrayV3Metadata(shape=(2, 2),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -73,7 +73,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'c': ArrayV3Metadata(shape=(3, 3, 3),
data_type=<DataType.float64: 'float64'>,
data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(3, 3, 3)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
172 changes: 172 additions & 0 deletions docs/user-guide/data_types.rst
@@ -0,0 +1,172 @@
Data types
Member

This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?

Contributor Author

No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.

==========

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. As Zarr Python is tightly integrated with `NumPy <https://numpy.org/doc/stable/>`_,
it's easy to create arrays with NumPy data types:

.. code-block:: python

>>> import zarr
>>> import numpy as np
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
>>> z
<Array memory:... shape=(10,) dtype=uint8>

Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr
implementations in different programming languages. This means Zarr data types must be interpreted
correctly when clients read an array. Each Zarr data type defines procedures for
encoding and decoding both the data type itself and scalars of that data type to and from
Zarr array metadata. These serialization procedures depend on the Zarr format.

Data types in Zarr version 2
-----------------------------

Version 2 of the Zarr format defined its data types relative to
`NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_,
and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data
type is just the NumPy ``str`` attribute of that data type:

.. code-block:: python

>>> import zarr
>>> import numpy as np
>>> import json
>>>
>>> store = {}
>>> np_dtype = np.dtype('int64')
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
>>> dtype_meta
'<i8'
>>> assert dtype_meta == np_dtype.str

.. note::
The ``<`` character in the data type metadata encodes the
`endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_,
or "byte order", of the data type. Following NumPy's example,
in Zarr version 2 each data type has an endianness where applicable.
However, Zarr version 3 data types do not store endianness information.

In addition to defining a representation of the data type itself (which in the example above was
just a simple string ``"<i8"``), Zarr also
defines a metadata representation for scalars associated with each data type. This is necessary
because Zarr arrays have a ``JSON``-serializable ``fill_value`` attribute that defines a scalar value to use when reading
uninitialized chunks of a Zarr array.
Integer and float scalars are stored as ``JSON`` numbers, except for special floats like ``NaN``,
positive infinity, and negative infinity, which are stored as strings.

More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in
``JSON``.
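
As a rough illustration of the rule for float scalars described above, the following
standard-library sketch encodes and decodes float fill values. The helper names
``encode_float_fill_value`` and ``decode_float_fill_value`` are hypothetical and are not part of
Zarr Python; the sketch assumes the string spellings ``"NaN"``, ``"Infinity"``, and
``"-Infinity"`` used by the Zarr specifications.

```python
import json
import math
from typing import Union


def encode_float_fill_value(value: float) -> Union[float, str]:
    """Encode a float scalar as a JSON-compatible value (hypothetical helper).

    Finite floats stay JSON numbers. NaN and the infinities become strings,
    because standard JSON has no literal for them.
    """
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_float_fill_value(data: Union[float, int, str]) -> float:
    """Invert encode_float_fill_value (hypothetical helper)."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    if isinstance(data, str):
        # Raises KeyError for any string other than the three special names.
        return special[data]
    return float(data)


# The encoded value can be passed straight to the json module.
fill_value_json = json.dumps({"fill_value": encode_float_fill_value(math.nan)})
```

A real data type would pair each such encoder with a decoder in the same way, so that a
``fill_value`` read back from metadata matches the scalar that was written.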


Data types in Zarr version 3
-----------------------------

Zarr V3 brings several key changes to how data types are represented:

- Zarr V3 identifies the basic data types as strings like ``"int8"``, ``"int16"``, etc.
  By contrast, Zarr V2 uses the NumPy character code representation for data types:
  in Zarr V2, ``int8`` is represented as ``"|i1"``.
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte
data types are defined with endianness information. Instead, Zarr V3 requires that endianness,
where applicable, is specified in the ``codecs`` attribute of array metadata.
- While some Zarr V3 data types are identified by strings, others can be identified by a ``JSON``
object. For example, consider this specification of a ``datetime`` data type:

.. code-block:: json

{
"name": "numpy.datetime64",
"configuration": {
"unit": "s",
"scale_factor": 10
}
}


Zarr V2 generally uses structured string representations to convey the same information. The
data type given in the previous example would be represented as the string ``">M8[10s]"`` in
Zarr V2. This is more compact, but can be harder to parse.

For more about data types in Zarr V3, see the
`V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html>`_.
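
To make the first two differences in the list above concrete, here is a minimal
standard-library sketch that splits a Zarr V2 data type string into a V3-style name and an
endianness, then places the endianness in a ``bytes`` codec entry. ``split_v2_dtype`` and
``v3_fragments`` are hypothetical helpers, not Zarr Python functions, and the sketch assumes
only fixed-size integer, float, and complex types need to be handled.

```python
import re
from typing import Any, Dict, Optional, Tuple

# NumPy kind characters mapped to the prefix of a Zarr V3 data type name.
# Only a few fixed-size numeric kinds are covered in this sketch.
_KIND_TO_PREFIX = {"i": "int", "u": "uint", "f": "float", "c": "complex"}
_BYTE_ORDER = {"<": "little", ">": "big", "|": None}


def split_v2_dtype(v2_dtype: str) -> Tuple[str, Optional[str]]:
    """Split a V2 dtype string such as '<i8' into (V3-style name, endianness).

    The endianness is 'little', 'big', or None when byte order does not apply.
    """
    match = re.fullmatch(r"([<>|])([iufc])(\d+)", v2_dtype)
    if match is None:
        raise ValueError(f"unsupported dtype string: {v2_dtype!r}")
    order, kind, size = match.groups()
    # V2 strings give the item size in bytes; V3 names give it in bits.
    name = f"{_KIND_TO_PREFIX[kind]}{int(size) * 8}"
    return name, _BYTE_ORDER[order]


def v3_fragments(v2_dtype: str) -> Dict[str, Any]:
    """Build the V3 metadata fields implied by a V2 dtype string.

    The data type name carries no byte order; the byte order moves into
    the configuration of the 'bytes' codec instead.
    """
    name, endian = split_v2_dtype(v2_dtype)
    codec: Dict[str, Any] = {"name": "bytes"}
    if endian is not None:
        codec["configuration"] = {"endian": endian}
    return {"data_type": name, "codecs": [codec]}
```

For example, ``split_v2_dtype("<i8")`` returns ``("int64", "little")``, while
``split_v2_dtype("|i1")`` returns ``("int8", None)`` because a single-byte type has no byte order.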

Data types in Zarr Python
-------------------------

The two Zarr formats that Zarr Python supports specify data types in different ways:
data types in Zarr version 2 are encoded as NumPy-compatible strings, while data types in Zarr version
3 are encoded as either strings or ``JSON`` objects. In addition, Zarr V3 data types carry no
endianness information, unlike Zarr V2 data types.

To abstract over these syntactical and semantic differences, Zarr Python uses a class called
`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to provide Zarr V2 and Zarr V3 compatibility
routines for "native" data types. In this context, a "native" data type is a Python class,
typically defined in another library, that models an array's data type. For example, ``np.uint8`` is a native
data type defined in NumPy, which Zarr Python wraps with a ``ZDType`` subclass called
`UInt8 <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_.

Each data type supported by Zarr Python is modeled by a ``ZDType`` subclass, which provides an
API for the following operations:

- Wrapping / unwrapping a native data type
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata.
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata.


Example Usage
~~~~~~~~~~~~~

Create a ``ZDType`` from a native data type:

.. code-block:: python

>>> from zarr.core.dtype import Int8
>>> import numpy as np
>>> int8 = Int8.from_dtype(np.dtype('int8'))

Convert back to native data type:

.. code-block:: python

>>> native_dtype = int8.to_dtype()
>>> assert native_dtype == np.dtype('int8')

Get the default scalar value for the data type:

.. code-block:: python

>>> default_value = int8.default_value()
>>> assert default_value == np.int8(0)


Serialize to JSON for Zarr V2 and V3:

.. code-block:: python

>>> json_v2 = int8.to_json(zarr_format=2)
>>> json_v2
'|i1'
>>> json_v3 = int8.to_json(zarr_format=3)
>>> json_v3
'int8'

Serialize a scalar value to JSON:

.. code-block:: python

>>> json_value = int8.to_json_value(42, zarr_format=3)
>>> json_value
42

Deserialize a scalar value from JSON:

.. code-block:: python

>>> scalar_value = int8.from_json_value(42, zarr_format=3)
>>> assert scalar_value == np.int8(42)
4 changes: 2 additions & 2 deletions docs/user-guide/groups.rst
@@ -128,7 +128,7 @@ property. E.g.::
>>> bar.info_complete()
Type : Array
Zarr format : 3
Data type : DataType.int64
Data type : Int64(endianness='little')
Shape : (1000000,)
Chunk shape : (100000,)
Order : C
@@ -144,7 +144,7 @@ property. E.g.::
>>> baz.info
Type : Array
Zarr format : 3
Data type : DataType.float32
Data type : Float32(endianness='little')
Shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -8,6 +8,7 @@ User guide

installation
arrays
data_types
groups
attributes
storage