-
Yes, this is the currently implemented behavior, and it makes sense for the following reason: if every chunk is exactly the same size, then we can easily resize the array without ever having to rewrite chunks. Otherwise, Zarr would have to explicitly keep track of the size of each chunk somewhere. Can you explain what exactly you're trying to optimize for? Is your goal truly to minimize disk space, or some other performance metric? Why not just use smaller chunks? If the goal is to minimize disk space, I'm confident that the right choice of compression algorithm can effectively compress away the missing data from these final chunks.
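To make the resizing argument concrete, here is a minimal sketch of regular-grid chunk arithmetic in plain Python. It does not call Zarr; `chunk_grid` and `locate` are hypothetical helper names, and the 720-row chunk assumes one sample per minute over 12 hours.

```python
import math

def chunk_grid(shape, chunks):
    """Number of chunks along each dimension of a regular chunk grid."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

def locate(index, chunks):
    """Map an array index to (chunk coordinates, offset inside that chunk)."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunks))
    offsets = tuple(i % c for i, c in zip(index, chunks))
    return chunk_coords, offsets

# Assumed chunk shape: 720 time steps (12 h at 1 sample/min) x 1000 positions.
chunks = (720, 1000)

# A store with 1500 positions needs two chunks along the position axis;
# the second one is only half covered by the array.
print(chunk_grid((720, 1500), chunks))  # (1, 2)

# Where an element lives depends only on its index and the chunk shape,
# not on the array shape.
print(locate((100, 1200), chunks))      # ((0, 1), (100, 200))

# Growing the position axis to 1800 changes neither the grid nor the
# location of existing elements, so no stored chunk has to be rewritten.
print(chunk_grid((720, 1800), chunks))  # (1, 2)
```

Because the lookup never consults the array shape, a resize within the existing grid only has to update the shape recorded in the metadata.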
-
We use zarr in a regular program (i.e. not for data analysis but for saving massive, multi-dimensional time series). While the time dimension is always present and keeps growing, the lengths of the other dimensions may vary from store to store.
Let's take a simple time-by-position store. It makes sense to chunk by 12 hours and 1000 positions, because most queries will be in time, but some queries will also be along the position axis.
Some stores may have 1000 positions, others may have 1500 positions, and we don't know this until the data is ready to be written. I thought there was no harm in chunking 1000 positions together: in a store with 1500 positions, I expected the last chunk to be 12 hrs x 500 positions, i.e. no larger than required to fit exactly 12 hrs x 500 positions.
However, it turns out that setting the position chunk size to 500 has a massive positive effect on the size of the last chunks. They are much smaller than the 1000-position chunks with only 500 elements in them. I can only imagine this means that the remaining 500 positions are in fact written, but filled with empty values. Compression does mean the padded chunks are not exactly twice the size, but they aren't far off.
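One way to check how much such padding costs, independent of Zarr, is to compress a padded and an unpadded buffer and compare. The sketch below is only an illustration under assumptions that are not from our actual stores: float64 values, random data, a fill value of 0.0, 720 time steps per chunk, and zlib as the compressor. The outcome depends heavily on the real data, fill value, and compressor.

```python
import random
import struct
import zlib

random.seed(0)

ROWS = 720       # assumed: 12 h at one sample per minute
REAL = 500       # positions that actually hold data
PADDED = 1000    # chunk width along the position axis
FILL = 0.0       # assumed fill value

def row_major_bytes(n_real, width):
    """Pack ROWS x width float64 values; columns past n_real hold FILL."""
    out = bytearray()
    for _ in range(ROWS):
        row = [random.random() for _ in range(n_real)]
        row += [FILL] * (width - n_real)
        out += struct.pack(f"<{width}d", *row)
    return bytes(out)

exact = row_major_bytes(REAL, REAL)      # a 720 x 500 chunk
padded = row_major_bytes(REAL, PADDED)   # a 720 x 1000 chunk, half filler

# Uncompressed, the padded chunk is exactly twice as large.
print(len(padded) / len(exact))

# Compressed ratio: compare this against what is observed on disk.
print(len(zlib.compress(padded)) / len(zlib.compress(exact)))
```

If the second ratio is close to 1 here but close to 2 on disk, the compressor or fill value configured for the store would be the next thing to look at.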
I wouldn't have thought that zarr fills chunks with filler. Since an array cannot be larger than its dimensions, is it really necessary to waste that (disk) space? Or am I doing something wrong?