Wrap open_dataset and open_mfdataset to flexibly open datasets #385
Conversation
Codecov Report

@@ Coverage Diff @@
## main #385 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 14 14
Lines 1266 1293 +27
=========================================
+ Hits 1266 1293 +27
I created this draft PR in part to fine-tune the proposed API and so others could identify (and help fix?) issues with the approach in this PR. I'm wondering if we should just make...

There's not a lot of error handling in the initial PR and I assume there are gaps in the PR logic (and other cases to be considered aside from xml/directory/list/etc). Help identifying these would be useful.

A big assumption in the current PR is that when the user passes in a directory, xcdat should try to open all files in the directory as a multi-file dataset. This gets around having to assume they are all `*.nc` files.
@pochedls I like the idea of having...
@pochedls I confirm...
Thanks for this PR. I'll comment once I take a closer look.
Notes from meetings:
Tom will review PR
I think we can consider...
xcdat/dataset.py
Outdated
```python
def open(
    path: str,
    data_var: Optional[str] = None,
    add_bounds: bool = True,
    decode_times: bool = True,
    center_times: bool = False,
    lon_orient: Optional[Tuple[float, float]] = None,
    **kwargs,
) -> xr.Dataset:
    """
    Documentation to be written when this PR is further along...
    """

    # Inspect path attributes and use either open_dataset or open_mfdataset
    # as appropriate:
    # * list (open list of files with open_mfdataset)
    # * pathlib.Path (open pathlib.Path with open_mfdataset)
    # * deprecated CDAT xml file type
    # * directory (opens with open_dataset or open_mfdataset depending on the
    #   number of files)
    # * wildcard input (opens data with open_mfdataset)
```
From my understanding, the motivation behind adding a new `open` function is to provide an abstracted API that determines whether to use `open_dataset()`/`open_mfdataset()` based on the `path` argument, to make opening datasets hopefully more intuitive. Additionally, `open` would include support for the deprecated CDAT XML format. Let me know if this isn't correct.
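A minimal usage sketch of the wrapper as drafted here, assuming it is exposed as `xcdat.open()` (the file and directory names below are hypothetical, and the prototype was later removed in favor of extending `open_mfdataset()`):

```python
import xcdat as xc

# Single file -> would dispatch to open_dataset() under the hood.
ds_single = xc.open("ts_2000.nc", data_var="ts")

# Directory, wildcard, or list of files -> would dispatch to open_mfdataset().
ds_dir = xc.open("model_output/", data_var="ts")
ds_glob = xc.open("model_output/*.nc", data_var="ts")
ds_list = xc.open(["ts_2000.nc", "ts_2001.nc"], data_var="ts")
```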
xcdat/dataset.py
Outdated
```python
    # or open_mfdataset as appropriate
    # * list (open list of files with open_mfdataset)
    # * pathlib.Path (open pathlib.Path with open_mfdataset)
    # * deprecated CDAT xml file type
```
A few questions I had about supporting XML:
- I noticed in the docstring it says "deprecated CDAT xml file type". Is the structure of the XML formatted specifically by CDAT, and are there plans to eventually move away from it?
- I think if users pass XML files that don't align with a specific format, then `open` will break. We'd have to include documentation that the XML file must follow a specific format (not sure if this is appropriate to support though, since it isn't generalized like a list of paths).
- I'm not really familiar with XML usage. How common is the XML format used for reading datasets in the community?
- The structure of CDAT xml files (climate data markup language) is a dialect of xml files that have a defined set of attributes.
- We only care that the xml file has a `directory` attribute – we can probably just return an error saying "the xml file provided does not have a root directory attribute" or something like that.
- XML probably isn't used that widely by the broad climate community, but it is used pretty widely by former CDAT users.
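For context, a minimal sketch of what such a CDML file might look like (the id and directory path are hypothetical); only the root `directory` attribute matters for the approach discussed here, mirroring the `<dataset directory=...>` root used in the unit tests later in this PR:

```xml
<?xml version="1.0"?>
<dataset id="tas_mon" directory="/p/data/model_output/tas">
  <!-- Other CDML attributes and elements describing axes and variables may be
       present, but only the root "directory" attribute is read by xcdat. -->
</dataset>
```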
Gotcha, this makes sense to me. I think being explicit about these details as you describe is a good idea.
xcdat/dataset.py
Outdated
```python
    # use open_dataset or open_mfdataset depending on provided path
    if mfdataset:
        ds = open_mfdataset(
            path,
            data_var=data_var,
            add_bounds=add_bounds,
            decode_times=decode_times,
            center_times=center_times,
            lon_orient=lon_orient,
            **kwargs,
        )
    else:
        ds = open_dataset(
            path,
            data_var=data_var,
            add_bounds=add_bounds,
            decode_times=decode_times,
            center_times=center_times,
            lon_orient=lon_orient,
            **kwargs,
        )
```
I noticed this logic for determining whether to use `open_dataset()` or `open_mfdataset()`.

Note, `open_mfdataset()` supports opening a single dataset file and multiple dataset files. The main difference (that I'm aware of) with opening a single dataset with `open_dataset()` vs. `open_mfdataset()` is that data variables are automatically Dask arrays with `open_mfdataset()`.

I'm not sure of the pros/cons, implications (e.g. performance), etc. of using `open_dataset()` vs. `open_mfdataset()` for a single dataset file, but I'd probably lean towards always using `open_mfdataset()` if you're looping over models or lots of directories. This would eliminate the need for determining which of these APIs to use.
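As a quick illustration of that difference (the file name is hypothetical), variables returned by `xarray.open_mfdataset()` are Dask-backed, while `xarray.open_dataset()` returns them without Dask chunks by default:

```python
import xarray as xr

ds_eager = xr.open_dataset("ts_2000.nc")
ds_lazy = xr.open_mfdataset("ts_2000.nc")  # a single file is accepted too

print(ds_eager["ts"].chunks)  # None -> not Dask-backed
print(ds_lazy["ts"].chunks)   # e.g. ((12,), (180,), (360,)) -> Dask-backed
```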
When I loop over models I currently use `open_mfdataset` universally: `xc.open_mfdataset(dpath + '*.nc')`. `xc.open_dataset()` used to be faster than `xc.open_mfdataset()`, though this may have been addressed a few months ago (I remember discussing this, perhaps as part of the large PR that changed the logic for dealing with coords/bounds, and I think it may have been addressed).

I just checked for one dataset and they were close (24 versus 21 ms), though I should look to see if both functions are close in speed across a broader range of datasets.
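A rough way to run that comparison across more datasets could look like the sketch below (the path, helper name, and repetition count are hypothetical, not from this PR):

```python
import time

import xcdat as xc


def time_open(open_fn, path, n=5):
    """Return the best-of-n wall-clock time (seconds) for opening `path`."""
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        ds = open_fn(path)
        ds.close()
        best = min(best, time.perf_counter() - start)
    return best


path = "model_output/ts_2000.nc"  # hypothetical single-file dataset
print("open_dataset:  ", time_open(xc.open_dataset, path))
print("open_mfdataset:", time_open(xc.open_mfdataset, path))
```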
As an alternative to a separate `open()` function, what are your thoughts on extending `xcdat.open_mfdataset()` to support CDAT XML?

This seems more intuitive to me because `xarray.open_mfdataset()` supports glob with wildcards, which returns a list of file paths in the directory regardless of whether there is 1 or more files. We can extend `xcdat.open_mfdataset()` to parse the CDAT XML files similar to glob and set the `path` arg to the list of file paths.

As mentioned previously, my general recommendation for users who expect to loop over directories and models is to use `open_mfdataset()` universally.
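For reference, the glob behavior being referred to (the directory below is hypothetical): a wildcard always expands to a list of matching paths, whether that list holds one entry or many.

```python
import glob

paths = sorted(glob.glob("model_output/*.nc"))
print(paths)  # e.g. ['model_output/ts_2000.nc', 'model_output/ts_2001.nc']
```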
I'd be fine with this, but ideally `open_mfdataset()` could also handle a string `directory`. I think this would handle all cases in this PR. I think I personally would then just use `open_mfdataset()`.

The case where `open()` would still be useful is if `open_dataset()` is faster than `open_mfdataset()`; `open()` would be faster / better by intelligently using `open_mfdataset()` only when necessary.
> The case where `open()` would still be useful is if `open_dataset()` is faster than `open_mfdataset()`; `open()` would be faster / better by intelligently using `open_mfdataset()` only when necessary.

I found some xarray GH issues related to the slow performance of `open_mfdataset()`, which are mostly related to coordinate compatibility checks when concatenating files and decoding time coordinates with more than 1 dataset. There are workarounds for users if they find `open_mfdataset()` too slow.

For our specific case involving opening only 1 dataset with `open_dataset()` vs. `open_mfdataset()`, I don't think the performance difference is that significant because coordinate concatenation isn't necessary like it is with multiple datasets. It seems safe to extend `open_mfdataset()`, but some performance metrics would be nice to confirm this.
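One common class of workaround uses standard `xarray.open_mfdataset()` keyword arguments (not anything added by this PR) to relax the compatibility checks and parallelize the per-file opens; the glob below is hypothetical:

```python
import xarray as xr

# Skip most per-variable/coordinate compatibility checks and open files in
# parallel with Dask; appropriate when the files are known to align.
ds = xr.open_mfdataset(
    "model_output/*.nc",
    combine="by_coords",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)
```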
> I'd be fine with this, but ideally `open_mfdataset()` could also handle a string `directory`. I think this would handle all cases in this PR.

I'll take a look at your implementation and suggest some changes to extend `open_mfdataset()`.
xcdat/dataset.py
Outdated
```python
    if not isinstance(paths, list):
        if _is_paths_to_xml(paths):
            paths = _parse_xml_for_nc_glob(paths)  # type: ignore
        elif os.path.isdir(paths):
            paths = _parse_dir_for_nc_glob(paths)  # type: ignore
```
The new conditional in `open_mfdataset()` for the `paths` argument.
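With this conditional in place, the one entry point would cover all of the path types discussed above; a usage sketch with hypothetical paths:

```python
import xcdat as xc

ds_from_xml = xc.open_mfdataset("datasets.xml")        # CDML/XML file
ds_from_dir = xc.open_mfdataset("model_output")        # directory of .nc files
ds_from_glob = xc.open_mfdataset("model_output/*.nc")  # wildcard
ds_from_list = xc.open_mfdataset(
    ["model_output/ts_2000.nc", "model_output/ts_2001.nc"]
)
```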
```python
def _is_paths_to_xml(paths: Union[str, pathlib.Path]) -> bool:
    """Checks if the ``paths`` argument is a path to an XML file.

    Parameters
    ----------
    paths : Union[str, pathlib.Path]
        A string or pathlib.Path representing a file path.

    Returns
    -------
    bool
    """
    if isinstance(paths, str):
        return paths.split(".")[-1] == "xml"
    elif isinstance(paths, pathlib.Path):
        return paths.parts[-1].endswith("xml")
```
We also need to support `pathlib.Path` alongside `str`.
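A quick demonstration of the helper with both input types (the file names are hypothetical, and the import assumes the private helper lives in `xcdat.dataset` as in this PR):

```python
import pathlib

from xcdat.dataset import _is_paths_to_xml

print(_is_paths_to_xml("datasets.xml"))                # True
print(_is_paths_to_xml(pathlib.Path("datasets.xml")))  # True
print(_is_paths_to_xml("model_output/ts_2000.nc"))     # False
```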
xcdat/dataset.py
Outdated
```python
def _parse_xml_for_nc_glob(xml_path: str) -> str:
    """Parses a CDAT XML file for a glob of `*.nc` paths.

    The CDAT "Climate Data Markup Language" (CDML) is a dialect of XML with a
    defined set of attributes. The "directory" attribute pointing to a
    directory of `.nc` files is extracted from the specified XML file. Refer
    to [6]_ for more information on CDML.

    Parameters
    ----------
    xml_path : str
        The CDML XML file path.

    Returns
    -------
    str
        A glob of `*.nc` paths in the directory.

    Notes
    -----
    CDML is a deprecated XML dialect that is still used by current and former
    users in the CDAT community. xCDAT supports CDML to enable these users to
    more easily integrate xCDAT into their existing CDML/CDAT-based workflows.

    References
    ----------
    .. [6] https://cdms.readthedocs.io/en/latest/manual/cdms_6.html
    """
    # `resolve_entities=False` and `no_network=True` guards against XXE attacks.
    # Source: https://rules.sonarsource.com/python/RSPEC-2755
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    tree = etree.parse(xml_path, parser)
    root = tree.getroot()

    dir_attr = root.attrib.get("directory")
    if dir_attr is None:
        raise KeyError(
            f"The XML file ({xml_path}) does not have a 'directory' attribute "
            "that points to a directory of `.nc` dataset files."
        )

    glob_path = dir_attr + "/*.nc"

    return glob_path
```
New function `_parse_xml_for_nc_glob()` for parsing CDML XML files.
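As a usage note (the file name and directory are hypothetical), the helper only resolves the CDML file to a glob string; `open_mfdataset()` then does the actual opening:

```python
from xcdat.dataset import _parse_xml_for_nc_glob  # private helper added in this PR

# If datasets.xml has directory="/p/data/model_output/tas" ...
glob_path = _parse_xml_for_nc_glob("datasets.xml")
print(glob_path)  # "/p/data/model_output/tas/*.nc"
```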
```python
def _parse_dir_for_nc_glob(dir_path: str) -> str:
    """Parses a directory for a glob of `*.nc` paths.

    Parameters
    ----------
    dir_path : str
        The directory.

    Returns
    -------
    str
        A glob of `*.nc` paths in the directory.
    """
    file_list = os.listdir(dir_path)

    if len(file_list) == 0:
        raise ValueError(
            f"The directory {dir_path} has no netcdf (`.nc`) files to open."
        )

    glob_path = dir_path + "/*.nc"

    return glob_path
```
New function `_parse_dir_for_nc_glob()` for parsing a directory.
```python
    # `resolve_entities=False` and `no_network=True` guards against XXE attacks.
    # Source: https://rules.sonarsource.com/python/RSPEC-2755
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    tree = etree.parse(xml_path, parser)
    root = tree.getroot()
```
I'm using `lxml` with `resolve_entities=False` and `no_network=True` to avoid XXE attacks. These settings aren't easily configurable with `ElementTree` from the Python standard lib.

More info on XXE attacks:
- https://rules.sonarsource.com/python/RSPEC-2755
- https://codeql.github.com/codeql-query-help/python/py-xxe/
- Python docs warning that `ElementTree` is not secure
tests/test_dataset.py
Outdated
```python
def test_raises_error_if_xml_does_not_have_root_directory(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = f"{self.dir}/datasets.xml"
    page = etree.Element("dataset")
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    with pytest.raises(KeyError):
        open_mfdataset(xml_path, decode_times=True)

def test_opens_datasets_from_xml_using_str_path(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = f"{self.dir}/datasets.xml"
    page = etree.Element("dataset", directory=str(self.dir))
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    result = open_mfdataset(xml_path, decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

def test_opens_datasets_from_xml_using_pathlib_path(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = self.dir / "datasets.xml"
    page = etree.Element("dataset", directory=str(self.dir))
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    result = open_mfdataset(xml_path, decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

def test_raises_error_if_directory_has_no_netcdf_files(self):
    with pytest.raises(ValueError):
        open_mfdataset(str(self.dir), decode_times=True)

def test_opens_netcdf_files_in_a_directory(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    result = open_mfdataset(str(self.dir), decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)
```
New unit tests covering XML and directory values for the `paths` argument.
This looks good to me. I think we could be more sophisticated in globbing directories. We could see if all files are either `.nc`, `.hdf`, `.grib`, etc. and, if so, open them as a multi-file dataset. If the files are mixed, we can throw an error. This sort of functionality can be left to future development (and worked on when someone identifies this need).
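A rough sketch of that future idea, purely for illustration (the helper name and the set of supported extensions are assumptions, not part of this PR):

```python
import os

SUPPORTED_EXTS = {".nc", ".hdf", ".grib"}  # assumed set of supported extensions


def _parse_dir_for_glob(dir_path: str) -> str:
    """Return a glob only when all files in the directory share one supported
    extension; otherwise raise an error."""
    exts = {os.path.splitext(f)[1] for f in os.listdir(dir_path)}
    exts.discard("")  # ignore extension-less entries such as subdirectories

    if len(exts) == 0:
        raise ValueError(f"The directory {dir_path} has no dataset files to open.")
    if len(exts) > 1 or not exts.issubset(SUPPORTED_EXTS):
        raise ValueError(
            f"The directory {dir_path} contains mixed or unsupported file types "
            f"({sorted(exts)}); pass an explicit list of paths instead."
        )

    (ext,) = exts
    return f"{dir_path}/*{ext}"
```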
- Add `lxml` to conda env yml files
- Add unit tests
- Remove `open()` prototype function
Great idea, I agree with you. I'll go ahead and merge this PR!
Description

This PR is intended to wrap `xc.open_dataset()` and `xc.open_mfdataset()` such that xcdat can flexibly open the dataset path passed in (whether it is a directory, wildcard search, xml file, single file, etc.).

Related issue: `xcdat.open()` to wrap `xcdat.open_dataset()` and `xcdat.open_mfdataset()` #128