Refactor storage around abstract file system? #301
Thanks Ryan, pyfilesystem certainly looks like something we should
investigate. And I am certainly open to this general approach if the
underlying libraries are well maintained and performant and we can still
optimise for zarr usage patterns if/where needed.
Just to note here that this approach is basically what @martindurant has
been arguing for, albeit with the underlying filesystem abstraction and
implementations being different. @martindurant what's your view of this?
FWIW I think it will take some time to get enough experience to make a firm
decision in this direction, so I think we should be prepared to live with a
mixture of approaches and some duplication of effort for a while. Obviously
in the long run we should aim to consolidate efforts and remove redundancy
as much as possible.
Also various people (including me) have found it pleasantly straightforward
to implement the MutableMapping interface directly for a new storage
backend, so we shouldn't ignore those positive feelings. Maybe implementing
the pyfilesystem API is similarly straightforward, I don't have the
experience.
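For context, "implementing the MutableMapping interface directly" really is a small amount of code. Here is a hedged, illustrative sketch of a toy in-memory store (not zarr's actual code), just to show the scale of the task:

```python
from collections.abc import MutableMapping


class MemoryStore(MutableMapping):
    """Toy zarr-compatible store: keys are path-like strings, values are bytes."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        # Coerce to bytes so the store always holds raw chunk/metadata payloads
        self._data[key] = bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```

Subclassing `collections.abc.MutableMapping` means `keys()`, `items()`, `get()`, `in`, etc. come for free once these five methods are defined, which is part of why this approach feels so straightforward.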
…On Sun, 23 Sep 2018, 20:28, Ryan Abernathey wrote:
We are recently seeing a lot of new proposals for new storage classes in
zarr (e.g. #299, #294, #293, #252). These are all great ideas. Meanwhile,
we have several working storage layers (s3fs, gcsfs) that *don't* live
inside zarr because they already provide a MutableMapping interface that
zarr can talk to. The situation is fragmented, and we don't seem to have a
clear roadmap for how to handle all these different scenarios. There is
some relevant discussion in #290.
I recently learned about pyfilesystem (https://www.pyfilesystem.org/):
"PyFilesystem is a Python module that provides a common interface to any
filesystem." The index of supported filesystems
(https://www.pyfilesystem.org/page/index-of-filesystems/) provides analogs
for nearly all of the built-in zarr storage options. Plus there are storage
classes for cloud, FTP, Dropbox, etc.
Perhaps one path forward would be to refactor zarr's storage to use
pyfilesystem objects. We would only really need a single storage class
which wraps pyfilesystem and provides the MutableMapping that zarr uses
internally. Then we could remove 80% of storage.py that deals with listing
directories, zip files, etc., since this would be handled by pyfilesystem.
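The "single storage class which wraps pyfilesystem" could look roughly like the sketch below. To keep it self-contained it is demonstrated against a tiny in-memory stand-in for a filesystem object; the method names on the wrapped object (`exists`, `readbytes`, `writebytes`, `remove`, `listkeys`) are assumptions chosen for illustration, not the verified pyfilesystem API:

```python
from collections.abc import MutableMapping


class FSWrapperStore(MutableMapping):
    """Sketch: adapt a filesystem-style object to the MutableMapping
    interface that zarr talks to. The fs method names used here are
    hypothetical stand-ins for whatever the chosen abstraction provides."""

    def __init__(self, fs):
        self.fs = fs

    def __getitem__(self, key):
        if not self.fs.exists(key):
            raise KeyError(key)
        return self.fs.readbytes(key)

    def __setitem__(self, key, value):
        self.fs.writebytes(key, bytes(value))

    def __delitem__(self, key):
        if not self.fs.exists(key):
            raise KeyError(key)
        self.fs.remove(key)

    def __iter__(self):
        return iter(self.fs.listkeys())

    def __len__(self):
        return len(self.fs.listkeys())


class DictFS:
    """Minimal in-memory stand-in for a filesystem object, for the demo."""

    def __init__(self):
        self.files = {}

    def exists(self, path):
        return path in self.files

    def readbytes(self, path):
        return self.files[path]

    def writebytes(self, path, data):
        self.files[path] = data

    def remove(self, path):
        del self.files[path]

    def listkeys(self):
        return list(self.files)
```

The point of the design is visible here: swapping `DictFS` for an S3, FTP, or zip filesystem would require no change to the wrapper, which is the 80% of storage.py that would disappear.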
Once we had a generic filesystem, we could then create a Layout layer,
which describes how the zarr objects are laid out within the filesystem.
For example, today we already have two de facto layouts: DirectoryStore
and NestedDirectoryStore. We could consider others, for example one with
all the metadata in a single file (e.g. #294). The Layout and the
Filesystem could be independent from one another.
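To make the Layout idea concrete: the difference between the two existing de facto layouts is essentially a key-translation function. This is a simplified sketch (the chunk-key regex and function names are mine, not zarr's internals), showing how "flat" vs "nested" can live above any filesystem:

```python
import re

# Simplified: a chunk key segment looks like "0.0", "12.3.4", etc.
_CHUNK_KEY = re.compile(r"^\d+(\.\d+)*$")


def flat_layout(key):
    """DirectoryStore-style layout: keys map to paths unchanged."""
    return key


def nested_layout(key):
    """NestedDirectoryStore-style layout: the dotted chunk index at the
    end of a key becomes nested directories, so "foo/0.0" -> "foo/0/0".
    Metadata keys like "foo/.zarray" pass through untouched."""
    head, _, tail = key.rpartition("/")
    if _CHUNK_KEY.match(tail):
        tail = tail.replace(".", "/")
    return f"{head}/{tail}" if head else tail
```

A Layout would then just be the translation applied before handing the key to the Filesystem, which is what makes the two layers independent.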
For new storage layers like mongodb, redis, etc., we would basically just
say, "go implement a pyfilesystem for that". This has the advantage of:
- reducing the maintenance burden in zarr
- providing more general filesystem objects (that can also be used outside of zarr)
The only con I can think of is performance: it is possible that the
pyfilesystem implementations could have worse performance than the zarr
built-in ones. But this cuts both ways: they could also perform better!
I know this implies a fairly big refactor of zarr. But it could save us
lots of headaches in the long run.
Yes, @martindurant's filesystem_spec is what turned me on to pyfilesystem! (They are discussing similarities here: https://github.com/martindurant/filesystem_spec/issues/5) I don't particularly care which abstract filesystem we pick--it's the principle of outsourcing this functionality to some other, more general software layer. pyfilesystem appears to be pretty mature. But of course I defer to @martindurant's recommendations--he is the real expert on this stuff!
I am, naturally, not an unbiased observer here. Firstly, let me say that my fsspec is an aspirational project without any users as things stand, whereas pyfilesystem is established and used by some people. However, I did consider building within the pyfilesystem framework and discussed some of what I see as its shortfalls with the maintainers, but did not arrive at a satisfactory solution. Note that, although their interfaces to things like Dropbox are interesting, I would say the only "cloud" interface they have is S3, which works by downloading whole files and doesn't give you random access to just chunks (please, someone correct me if I am wrong).

The motivation for fsspec came out of the similarity between the projects I had been involved in (s3fs, gcsfs, adlfs, hdfs3) and the need for abstraction across them in the context of dask. It is important, for instance, that file-system objects be serialisable, so that they can be passed between client and workers; I also wrote MutableMapping interfaces and FUSE backends. These projects had similar, but not identical, APIs, and a certain amount of shim code was required, which ended up within dask, as well as interfacing to arrow's file-systems. For the latter, only JNI hdfs is of note here, although arrow has its own concept of a file-system class hierarchy and a local-files implementation. In any case, such code doesn't really belong in dask and is generally useful: for example, the laziness of an OpenFile, which can give a text interface to remote compressed data. It would also be in dask's interests not to have to write and maintain such code.

All that being said, we are all, I think, pragmatists with limited resources. I can also help to try to leverage pyfilesystem, maybe, if people prefer that route.
I, of course, like my design and am prepared to defend certain decisions, but it is not useful without having all the backends of interest conform to it and, ideally, without also taking on the file-handling code from dask that is not dask-specific. As you can see in the code, I have made an effort to meet multiple standards such as the Python stdlib and POSIX naming schemes, to provide walk and glob for any backend that can do ls, and to support "transactional" operations (files are all moved to the destination, or made concrete, only when the transaction completes, and are discarded if it is cancelled).
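The "walk and glob for any backend that can do ls" point deserves a concrete illustration: a recursive walk can be derived generically from a single ls primitive. This is a hedged sketch under assumed interfaces (the `ls(path) -> [(name, is_dir), ...]` signature and the toy backend are mine for illustration, not fsspec's actual code):

```python
def make_walk(ls):
    """Build a recursive, os.walk-like generator from an ls primitive.
    ls(path) is assumed to return a list of (name, is_dir) pairs."""

    def walk(path):
        dirs, files = [], []
        for name, is_dir in ls(path):
            (dirs if is_dir else files).append(name)
        yield path, dirs, files
        for d in dirs:
            yield from walk(f"{path}/{d}" if path else d)

    return walk


# Toy backend: a flat set of file paths; ls is derived by prefix inspection,
# much like an object store that only knows full keys.
paths = {"a/x.bin", "a/b/y.bin", "z.bin"}


def toy_ls(path):
    prefix = f"{path}/" if path else ""
    files, dirs = set(), set()
    for p in paths:
        if p.startswith(prefix):
            rest = p[len(prefix):]
            if "/" in rest:
                dirs.add(rest.split("/", 1)[0])
            else:
                files.add(rest)
    return [(d, True) for d in sorted(dirs)] + [(f, False) for f in sorted(files)]


walk = make_walk(toy_ls)
```

Glob can be layered on top of walk in the same way, which is why a backend author only has to supply ls to get the richer operations for free.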
@martindurant -- thanks for the clarifications. I misunderstood the thread over in filesystem_spec discussing the relationship with pyfilesystem. I thought they were more similar than they really are, and that compatibility of APIs was on the horizon. (I only just discovered pyfilesystem and clearly do not understand it well.) I have changed the name of this issue to reflect the fact that we are talking generically about some sort of filesystem abstraction. I appreciate all the work you have put into your cloud storage classes. They are excellent and very useful for zarr. It would be great to build on that success and factor more of the filesystem "details" out of zarr itself.
So looking back: the goal of this issue would be roughly equivalent to making FSStore the default?
Closing as stale. In 3.0, we've gone the route of custom store classes (including a