Refactor storage around abstract file system? #301

Closed
rabernat opened this issue Sep 23, 2018 · 6 comments

Comments

@rabernat
Contributor

We have recently been seeing a lot of proposals for new storage classes in zarr (e.g. #299, #294, #293, #252). These are all great ideas. Meanwhile, we have several working storage layers (s3fs, gcsfs) that don't live inside zarr because they already provide a MutableMapping interface that zarr can talk to. The situation is fragmented, and we don't seem to have a clear roadmap for how to handle all these different scenarios. There is some relevant discussion in #290.

I recently learned about pyfilesystem: "PyFilesystem is a Python module that provides a common interface to any filesystem." The index of supported filesystems provides analogs for nearly all of the builtin zarr storage options. Plus there are storage classes for cloud, ftp, dropbox, etc.

Perhaps one path forward would be to refactor zarr's storage to use pyfilesystem objects. We would really only need a single storage class that wraps a pyfilesystem object and provides the MutableMapping interface that zarr uses internally. Then we could remove the roughly 80% of storage.py that deals with listing directories, zip files, etc., since pyfilesystem would handle all of that.
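As a rough illustration of how small such a wrapper could be, here is a minimal sketch (not zarr code; the class name FSMap is made up, and the method names assume PyFilesystem 2.x):

```python
# Minimal sketch, assuming PyFilesystem 2.x (the `fs` package); FSMap is an
# illustrative name, not an existing zarr class.
from collections.abc import MutableMapping

import fs  # PyFilesystem


class FSMap(MutableMapping):
    """Expose any PyFilesystem filesystem as the flat key/value mapping zarr expects."""

    def __init__(self, fs_url):
        # e.g. "osfs://./data", "mem://", "ftp://host/path"
        self.fs = fs.open_fs(fs_url)

    def __getitem__(self, key):
        try:
            return self.fs.readbytes(key)
        except fs.errors.ResourceNotFound:
            raise KeyError(key)

    def __setitem__(self, key, value):
        parent = fs.path.dirname(key)
        if parent:
            self.fs.makedirs(parent, recreate=True)
        self.fs.writebytes(key, bytes(value))

    def __delitem__(self, key):
        try:
            self.fs.remove(key)
        except fs.errors.ResourceNotFound:
            raise KeyError(key)

    def __iter__(self):
        # walk.files() yields absolute paths like "/foo/0.0"
        return (path.lstrip("/") for path in self.fs.walk.files())

    def __len__(self):
        return sum(1 for _ in self.fs.walk.files())
```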

Once we had a generic filesystem, we could then create a Layout layer, which describes how the zarr objects are laid out within the filesystem. For example, today we already have two de facto layouts: DirectoryStore and NestedDirectoryStore. We could consider others, for example one with all the metadata in a single file (e.g. #294). The Layout and the Filesystem could be independent of one another.
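To make the Layout idea concrete, a hypothetical sketch (none of these class names exist in zarr) of how a layout could translate zarr keys into filesystem paths, independently of which filesystem sits underneath:

```python
# Hypothetical sketch of the proposed Layout layer; these classes do not
# exist in zarr, they just illustrate the separation of concerns.
class FlatLayout:
    """DirectoryStore-style: chunk key "foo/0.0" maps to path "foo/0.0"."""

    def key_to_path(self, key):
        return key


class NestedLayout:
    """NestedDirectoryStore-style: chunk key "foo/0.0" maps to "foo/0/0"."""

    def key_to_path(self, key):
        prefix, _, leaf = key.rpartition("/")
        if leaf and all(part.isdigit() for part in leaf.split(".")):
            leaf = leaf.replace(".", "/")
        return f"{prefix}/{leaf}" if prefix else leaf


class LayoutStore:
    """Combine any filesystem-backed mapping with a layout."""

    def __init__(self, fs_map, layout):
        self.fs_map = fs_map      # e.g. the FSMap sketched above
        self.layout = layout

    def __getitem__(self, key):
        return self.fs_map[self.layout.key_to_path(key)]

    def __setitem__(self, key, value):
        self.fs_map[self.layout.key_to_path(key)] = value
```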

For new storage layers like MongoDB, Redis, etc., we would basically just say, "go implement a pyfilesystem for that". This has the advantages of

  • reducing the maintenance burden in zarr
  • providing more general filesystem objects (that can also be used outside of zarr)

The only con I can think of is performance: it is possible that the pyfilesystem implementations could have worse performance than the zarr built-in ones. But this cuts both ways: they could also perform better!

I know this implies a fairly big refactor of zarr. But it could save us lots of headaches in the long run.

@alimanfoo
Member

alimanfoo commented Sep 23, 2018 via email

@rabernat
Contributor Author

Yes, @martindurant's filesystem_spec is what turned me on to pyfilesystem! (They are discussing similarities here: https://github.com/martindurant/filesystem_spec/issues/5)

I don't particularly care which abstract filesystem we pick; the principle is to outsource this functionality to some other, more general software layer. pyfilesystem appears to be pretty mature. But of course I defer to @martindurant's recommendations; he is the real expert on this stuff!

@martindurant
Member

I am, naturally, not an unbiased observer here. Firstly, let me say that my fsspec is an aspirational project without any users as things stand, whereas pyfilesystem is established and used by some people. However, I did consider building within the pyfilesystem framework and discussed some of what I see as its shortfalls with them, but did not arrive at a satisfactory solution. Note that, although their interfaces to things like Dropbox are interesting, I would say the only "cloud" interface they have is S3, and it works by downloading whole files rather than giving you random access to just the chunks you need (please, someone correct me if I am wrong).
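For contrast, the s3fs approach hands you a file-like object with seek/read, so a reader can fetch just the byte range for one chunk instead of the whole object; a small sketch (the bucket and key below are placeholders):

```python
# Placeholder bucket/key; the point is the seek/read random access that
# s3fs file objects provide, as opposed to downloading the whole object.
import s3fs

fs = s3fs.S3FileSystem(anon=True)
with fs.open("example-bucket/array/0.0.0", "rb") as f:
    f.seek(1024)           # jump straight to an offset within the object
    chunk = f.read(4096)   # read only the bytes that are actually needed
```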

The motivation for fsspec came out of the similarity between the projects I had been involved in (s3fs, gcsfs, adlfs, hdfs3) and the need for abstraction across them in the context of dask. It is important, for instance, that file-system objects be serialisable, so that they can be passed between client and workers; I also wrote MutableMapping interfaces and FUSE backends. These projects had similar, but not identical, APIs, and a certain amount of shim code was required, which ended up within dask, as well as code interfacing to arrow's file-systems. For the latter, only JNI hdfs is of note here, although arrow has its own concept of a file-system class hierarchy and a local-files implementation. In any case, such code doesn't really belong in dask and is generally useful: for example, the laziness of an OpenFile, which can give a text interface to remote compressed data. It would also be in dask's interests not to have to write and maintain such code.
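The OpenFile laziness is roughly this pattern (a sketch against the fsspec API; the URL is a placeholder): nothing is fetched until the file is actually entered, and decompression plus text decoding are layered on top of the remote byte stream.

```python
# Sketch of the lazy OpenFile idea using fsspec; the URL is a placeholder.
import fsspec

of = fsspec.open("s3://example-bucket/logs/part-0.csv.gz",
                 mode="rt", compression="gzip", anon=True)
with of as f:              # the connection and reads happen lazily, here
    header = f.readline()  # text interface to remote compressed data
```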

All that being said, I think we are all pragmatists with limited resources. I could also help try to leverage pyfilesystem, if people prefer that route.

I, of course, like my design and am prepared to defend certain decisions, but it is not useful without having all the backends of interest conform to it and, ideally, absorbing the file-handling code from dask that is not dask-specific. As you can see in the code, I have made an effort to follow multiple standards such as the Python stdlib and POSIX naming schemes, to provide walk and glob for any backend that can do ls, and to support "transactional" operations (files are moved to their destination or made concrete only when the transaction completes, or discarded if it is cancelled). A sketch of those last two points follows.
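A short sketch, assuming the fsspec API as described above (paths are placeholders): glob derived from ls, and a transaction scope in which written files only become concrete if the block exits cleanly.

```python
# Sketch assuming fsspec's AbstractFileSystem API; paths are placeholders.
import fsspec

fs = fsspec.filesystem("file")
matches = fs.glob("/tmp/demo/**/*.json")   # glob/walk built on top of ls

with fs.transaction:                       # commit on success, discard on error
    with fs.open("/tmp/demo/output.bin", "wb") as f:
        f.write(b"\x00" * 16)
```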

@rabernat changed the title from "Refactor storage around pyfilesystem?" to "Refactor storage around abstract file system?" on Sep 24, 2018
@rabernat
Contributor Author

@martindurant, thanks for the clarifications. I misunderstood the thread over in filesystem_spec discussing the relationship with pyfilesystem. I thought they were more similar than they really are, and that compatibility of APIs was on the horizon. (I only just discovered pyfilesystem and clearly do not understand it well.) I have changed the name of this issue to reflect the fact that we are talking generically about some sort of filesystem abstraction.

I appreciate all the work you have put into your cloud storage classes. They are excellent and very useful for zarr. It would be great to build on that success and factor more of the filesystem "details" out of zarr itself.

@joshmoore
Member

So looking back: the goal of this issue would be roughly equivalent to making FSStore the default?
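For context, FSStore (zarr v2) already wraps an fsspec filesystem behind zarr's MutableMapping store interface; a minimal usage sketch (the bucket name is a placeholder):

```python
# Minimal usage sketch of zarr v2's FSStore; the bucket name is a placeholder.
import zarr
from zarr.storage import FSStore

store = FSStore("s3://example-bucket/data.zarr", mode="r", anon=True)
root = zarr.open_group(store, mode="r")
```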

@jhamman
Member

jhamman commented Oct 17, 2024

Closing as stale. In 3.0, we've gone the route of custom store classes (including a RemoteStore built on fsspec).

@jhamman closed this as completed on Oct 17, 2024