resize(): Improve docs & control of what is modified #1017

Open
hailiangzhang opened this issue May 3, 2022 · 10 comments
Labels
documentation Improvements to the documentation


@hailiangzhang
Contributor

hailiangzhang commented May 3, 2022

Problem description

For a zarr array with shape (4,6) and chunk size (2,3), I did the following:

  1. reduce its shape to (1,1)
  2. increase its shape back to (4,6)

After step 2, the array contained the original values for the whole first chunk, not just the first element.
As an end user, I expected only the first element to be preserved, since I had already shrunk the shape to (1,1) and the chunk size should be transparent to the end user.

Minimal, reproducible code sample

import zarr
import numpy as np

z = zarr.open('data', mode='w', shape=(4, 6), chunks=(2, 3), dtype='i4')
z[:] = 1

print("Original zarr array with shape (4,6):")
print(z[:])

print("\nAfter resizing shape to (1,1):")
z.resize((1,1))
print(z[:])

print("\nAfter resizing shape back to (4,6):")
z.resize((4,6))
print(z[:])

print("\nBut I was expecting it to be:")
arr_expected = np.zeros((4,6), dtype='i4')
arr_expected[0,0] = 1
print(arr_expected[:])

Output

Original zarr array with shape (4,6):
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]]

After resizing shape to (1,1):
[[1]]

After resizing shape back to (4,6):
[[1 1 1 0 0 0]
 [1 1 1 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

But I was expecting it to be:
[[1 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

Version information

  • zarr.version: 2.11.2
@joshmoore
Member

@hailiangzhang, interesting. I can definitely understand your surprise. Reading https://zarr.readthedocs.io/en/stable/api/core.html?highlight=resize#zarr.core.Array.resize, however, it's only clear (at least to me) that out-of-bound chunks are removed, not that the in-bound chunk is rewritten. Assuming this is behavior that someone already relies on, rewriting would probably need an extra argument to resize().
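To make the "rewriting" idea concrete, here is a minimal sketch of how the regions that would need clearing could be computed. It is not part of zarr: the helper name `stale_regions` is hypothetical, and it assumes that a shrink only deletes chunks lying wholly outside the new shape, as observed in this issue.

```python
def stale_regions(old_shape, chunks, new_shape):
    """Hypothetical helper: slices covering cells that lie outside
    new_shape but inside chunks that survive the shrink."""
    # Per dimension, the extent of the surviving chunks: the new length
    # rounded up to a whole number of chunks, clipped to the old length.
    kept = [
        min(old, -(-new // c) * c)  # -(-a // b) is integer ceil(a / b)
        for old, c, new in zip(old_shape, chunks, new_shape)
    ]
    regions = []
    for d, (new, k) in enumerate(zip(new_shape, kept)):
        if k > new:
            # Cells past the new length in dimension d, restricted to the
            # surviving chunks in every other dimension. Regions for
            # different dimensions may overlap, which is harmless here.
            regions.append(tuple(
                slice(new, k) if i == d else slice(0, kept[i])
                for i in range(len(kept))
            ))
    return regions


print(stale_regions((4, 6), (2, 3), (1, 1)))
```

With a zarr array `z`, one could then assign `z[region] = z.fill_value` for each returned region before calling `z.resize(new_shape)`, so that a later grow would not bring the old values back. This costs a rewrite of the boundary chunks, which is the expense an opt-in argument would be trading against.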

@hailiangzhang
Contributor Author

Ah, this is probably not very surprising based on the notes you provided :)

So, in this case, I can add the test to my PR as we originally planned.

In the long term, as you mentioned above, we could probably add an extra argument to resize() for rewriting (which may be useful and less confusing for some end users like me :)

That said, this report is more of a feature request than a bug report, so please feel free to close it if there is no immediate plan to add this feature (or leave it open and someone may be able to add it when they have a chance :).

Thanks again for your comments @joshmoore !

@jakirkham
Member

Deletion can be fairly expensive, so I think this was implemented intentionally to avoid deleting data and instead make resizing a metadata-only change (fairly quick). Maybe it is worth documenting that resize does not necessarily delete the data?

@hailiangzhang
Contributor Author

@jakirkham , agreed (and that's actually what I would imagine:)

Since resize actually deletes data at chunk-level resolution, this could be a little unexpected for end users (who shouldn't need to be aware of the internal data organization), so yes, we could probably explain this more clearly in the documentation (the second sentence is what I added):

If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store. It is noteworthy that the chunks partially falling inside the new array (i.e. boundary chunks) will remain intact, and therefore, any data falling outside of the new array shape but inside the boundary chunks would be recovered by a subsequent resize operation that increases the array shape.

If this looks correct and helpful, I will be happy to send another tiny PR:)
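As a rough illustration of the proposed wording, the sketch below predicts what a shrink followed by a grow leaves behind. It uses plain NumPy rather than zarr, the function name `simulate_shrink_then_grow` is made up for this example, and it assumes the behavior described above: only chunks wholly outside the smaller shape are deleted.

```python
import numpy as np


def simulate_shrink_then_grow(data, chunks, small_shape, fill_value=0):
    """Hypothetical model of resize(small_shape) followed by
    resize(data.shape), assuming boundary chunks are left intact."""
    out = np.full(data.shape, fill_value, dtype=data.dtype)
    # Surviving chunks form a box starting at the origin whose extent is
    # the smaller shape rounded up to whole chunks, clipped to the array.
    kept = tuple(
        slice(0, min(old, -(-new // c) * c))
        for old, c, new in zip(data.shape, chunks, small_shape)
    )
    out[kept] = data[kept]
    return out


original = np.ones((4, 6), dtype="i4")
print(simulate_shrink_then_grow(original, (2, 3), (1, 1)))
print(simulate_shrink_then_grow(original, (1, 1), (1, 1)))
```

For the (2,3) chunking this reproduces the array reported at the top of the issue, with the whole first chunk restored. With (1,1) chunks only the first element survives.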

joshmoore changed the title from "Possible bug with Zarr array "resize"" to "resize(): Improve docs & control of what is modified" on May 4, 2022
@joshmoore
Member

👍 for the doc improvement. I've updated the title of this issue to:

  • resize(): Improve docs & control of what is modified

with the control being potential new arguments, etc.

hailiangzhang mentioned this issue on May 5, 2022
@hailiangzhang
Contributor Author

> 👍 for the doc improvement. I've updated the title of this issue to:
>
>   • resize(): Improve docs & control of what is modified
>
> with the control being potential new arguments, etc.

Cool, I just sent a small PR which adds the comments as described above.
Thanks @joshmoore !

@hailiangzhang
Contributor Author

Hi, since my PR has been merged, feel free to close this ticket (and please let me know if I am supposed to do that:)

@joshmoore
Member

Happy to leave that up to you. If you think new method arguments are worth it, feel free to leave open. Otherwise, feel free to close.

@Jaykold

Jaykold commented Oct 20, 2022

I found out that if you put every item in its own individual chunk, you will get the desired output after resizing.

SAMPLE

import zarr
import numpy as np

z = zarr.open('data', mode='w', shape=(4, 6), chunks=(1, 1), dtype='i4')
z[:] = 1

print("Original zarr array with shape (4,6):")
print(z[:])

print("\nAfter resizing shape to (1,1):")
z.resize((1,1))
print(z[:])

print("\nAfter resizing shape back to (4,6):")
z.resize((4,6))
print(z[:])

OUTPUT

Original zarr array with shape (4,6):
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]]

After resizing shape to (1,1):
[[1]]

After resizing shape back to (4,6):
[[1 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

Caveat: This is not optimal when dealing with a large data array and it will also create a huge number of files.

@joshmoore
Member

Thanks for looking into this, @Jaykold. You're right that having a chunk size of 1 will work around the issue, but doing that for all dimensions with anything other than toy data isn't really an option.
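To put a rough number on why this does not scale, the sketch below counts the chunks in the grid for a given shape and chunking. The helper `chunk_count` and the 10,000 x 10,000 shape are illustrative assumptions, not something taken from zarr or from this issue.

```python
from math import prod


def chunk_count(shape, chunks):
    """Hypothetical helper: number of chunks in the chunk grid."""
    # -(-s // c) is integer ceil(s / c): chunks needed along one dimension.
    return prod(-(-s // c) for s, c in zip(shape, chunks))


print(chunk_count((4, 6), (2, 3)))                    # 4
print(chunk_count((4, 6), (1, 1)))                    # 24
print(chunk_count((10_000, 10_000), (1_000, 1_000)))  # 100
print(chunk_count((10_000, 10_000), (1, 1)))          # 100000000
```

With a directory-backed store, each stored chunk is typically its own file, so single-element chunks on an array of that size could mean on the order of a hundred million files.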
