Wrap open_dataset and open_mfdataset to flexibly open datasets #385
Conversation
Codecov Report

@@ Coverage Diff @@
## main #385 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 14 14
Lines 1266 1293 +27
=========================================
+ Hits 1266 1293 +27
I created this draft PR in part to fine-tune the proposed API and so others could identify (and help fix?) issues with the approach in this PR. I'm wondering if we should just make...

There's not a lot of error handling in the initial PR and I assume there are gaps in the PR logic (and other cases to be considered aside from xml/directory/list/etc). Help identifying these would be useful.

A big assumption in the current PR is that when the user passes in a directory, xcdat should try to open all files in the directory as a multi-file dataset. This gets around having to assume they are all `*.nc` files.
@pochedls I like the idea of having...
@pochedls I confirm...
Thanks for this PR. I'll comment once I take a closer look.
Notes from meetings:
Tom will review PR
I think we can consider...
xcdat/dataset.py
Outdated
```python
def open(
    path: str,
    data_var: Optional[str] = None,
    add_bounds: bool = True,
    decode_times: bool = True,
    center_times: bool = False,
    lon_orient: Optional[Tuple[float, float]] = None,
    **kwargs,
) -> xr.Dataset:
    """
    Documentation to be written when this PR is further along...
    """

    # Inspect path attributes and use either open_dataset or open_mfdataset
    # as appropriate:
    # * list (open list of files with open_mfdataset)
    # * pathlib.Path (open pathlib.Path with open_mfdataset)
    # * deprecated CDAT xml file type
    # * directory (opens with open_dataset or open_mfdataset depending on the
    #   number of files)
    # * wildcard input (opens data with open_mfdataset)
```
From my understanding, the motivation behind adding a new `open` function is to provide an abstracted API that determines whether to use `open_dataset()`/`open_mfdataset()` based on the `path` argument, to make opening datasets hopefully more intuitive. Additionally, `open` would include support for the deprecated CDAT XML format. Let me know if this isn't correct.
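A minimal usage sketch of the wrapper as drafted here, assuming it is exposed as `xcdat.open()` (the file and directory names below are hypothetical, and the prototype was later removed in favor of extending `open_mfdataset()`):

```python
import xcdat as xc

# Single file -> would dispatch to open_dataset() under the hood.
ds_single = xc.open("ts_2000.nc", data_var="ts")

# Directory, wildcard, or list of files -> would dispatch to open_mfdataset().
ds_dir = xc.open("model_output/", data_var="ts")
ds_glob = xc.open("model_output/*.nc", data_var="ts")
ds_list = xc.open(["ts_2000.nc", "ts_2001.nc"], data_var="ts")
```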
xcdat/dataset.py
Outdated
```python
    # or open_mfdataset as appropriate
    # * list (open list of files with open_mfdataset)
    # * pathlib.Path (open pathlib.Path with open_mfdataset)
    # * deprecated CDAT xml file type
```
A few questions I had about supporting XML:
- I noticed in the docstring it says "deprecated CDAT xml file type". Is the structure of the XML formatted specifically by CDAT, and are there plans to eventually move away from it?
- I think if users pass XML files that don't align with a specific format, then `open` will break. We'd have to include documentation that the XML file must follow a specific format (not sure if this is appropriate to support though, since it isn't generalized like a list of paths).
- I'm not really familiar with XML usage. How common is the XML format used for reading datasets in the community?
- The structure of CDAT xml files (climate data markup language) is a dialect of xml files that have a defined set of attributes.
- We only care that the xml file has a `directory` attribute – we can probably just return an error saying "the xml file provided does not have a root directory attribute" or something like that.
- XML probably isn't used that widely by the broad climate community, but it is used pretty widely by former CDAT users.
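For context, a minimal sketch of what such a CDML file might look like (the id and directory path are hypothetical); only the root `directory` attribute matters for the approach discussed here, mirroring the `<dataset directory=...>` root used in the unit tests later in this PR:

```xml
<?xml version="1.0"?>
<dataset id="tas_mon" directory="/p/data/model_output/tas">
  <!-- Other CDML attributes and elements describing axes and variables may be
       present, but only the root "directory" attribute is read by xcdat. -->
</dataset>
```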
Gotcha, this makes sense to me. I think being explicit about these details as you describe is a good idea.
xcdat/dataset.py
Outdated
```python
    # use open_dataset or open_mfdataset depending on provided path
    if mfdataset:
        ds = open_mfdataset(
            path,
            data_var=data_var,
            add_bounds=add_bounds,
            decode_times=decode_times,
            center_times=center_times,
            lon_orient=lon_orient,
            **kwargs,
        )
    else:
        ds = open_dataset(
            path,
            data_var=data_var,
            add_bounds=add_bounds,
            decode_times=decode_times,
            center_times=center_times,
            lon_orient=lon_orient,
            **kwargs,
        )
```
I noticed this logic for determining whether to use `open_dataset()` or `open_mfdataset()`.

Note, `open_mfdataset()` supports opening a single dataset file and multiple dataset files. The main difference (that I'm aware of) with opening a single dataset with `open_dataset()` vs. `open_mfdataset()` is that data variables are automatically Dask arrays with `open_mfdataset()`.

I'm not sure of the pros/cons, implications (e.g. performance), etc. of using `open_dataset()` vs. `open_mfdataset()` for a single dataset file, but I'd probably lean towards always using `open_mfdataset()` if you're looping over models or lots of directories. This would eliminate the need for determining which of these APIs to use.
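As a quick illustration of that difference (the file name is hypothetical), variables returned by `xarray.open_mfdataset()` are Dask-backed, while `xarray.open_dataset()` returns them without Dask chunks by default:

```python
import xarray as xr

ds_eager = xr.open_dataset("ts_2000.nc")
ds_lazy = xr.open_mfdataset("ts_2000.nc")  # a single file is accepted too

print(ds_eager["ts"].chunks)  # None -> not Dask-backed
print(ds_lazy["ts"].chunks)   # e.g. ((12,), (180,), (360,)) -> Dask-backed
```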
When I loop over models I currently use `open_mfdataset` universally: `xc.open_mfdataset(dpath + '*.nc')`. `xc.open_dataset()` used to be faster than `xc.open_mfdataset()`, though this may have been addressed a few months ago (I remember discussing this, perhaps as part of the large PR that changed the logic for dealing with coords/bounds, and I think it may have been addressed).

I just checked for one dataset and they were close (24 versus 21 ms), though I should look to see if both functions are close in speed across a broader range of datasets.
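A rough way to run that comparison across more datasets could look like the sketch below (the path, helper name, and repetition count are hypothetical, not from this PR):

```python
import time

import xcdat as xc


def time_open(open_fn, path, n=5):
    """Return the best-of-n wall-clock time (seconds) for opening `path`."""
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        ds = open_fn(path)
        ds.close()
        best = min(best, time.perf_counter() - start)
    return best


path = "model_output/ts_2000.nc"  # hypothetical single-file dataset
print("open_dataset:  ", time_open(xc.open_dataset, path))
print("open_mfdataset:", time_open(xc.open_mfdataset, path))
```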
As an alternative to a separate `open()` function, what are your thoughts on extending `xcdat.open_mfdataset()` to support CDAT XML?

This seems more intuitive to me because `xarray.open_mfdataset()` supports glob with wildcards, which returns a list of file paths in the directory regardless of whether there is 1 or more files. We can extend `xcdat.open_mfdataset()` to parse the CDAT XML files similar to glob and set the `path` arg to the list of file paths.

As mentioned previously, my general recommendation for users who expect to loop over directories and models is to use `open_mfdataset()` universally.
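For reference, the glob behavior being referred to (the directory below is hypothetical): a wildcard always expands to a list of matching paths, whether that list holds one entry or many.

```python
import glob

paths = sorted(glob.glob("model_output/*.nc"))
print(paths)  # e.g. ['model_output/ts_2000.nc', 'model_output/ts_2001.nc']
```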
I'd be fine with this, but ideally `open_mfdataset()` could also handle a string `directory`. I think this would handle all cases in this PR. I think I personally would then just use `open_mfdataset()`.

The case where `open()` would still be useful is if `open_dataset()` is faster than `open_mfdataset()`; `open()` would be faster / better by intelligently using `open_mfdataset()` only when necessary.
> The case where `open()` would still be useful is if `open_dataset()` is faster than `open_mfdataset()`; `open()` would be faster / better by intelligently using `open_mfdataset()` only when necessary.

I found some xarray GH issues related to the slow performance of `open_mfdataset()`, which are mostly related to coordinate compatibility checks when concatenating files and decoding time coordinates with more than 1 dataset. There are workarounds for users if they find `open_mfdataset()` too slow.

For our specific case involving opening only 1 dataset with `open_dataset()` vs. `open_mfdataset()`, I don't think the performance difference is that significant because coordinate concatenation isn't necessary like it is with multiple datasets. It seems safe to extend `open_mfdataset()`, but some performance metrics would be nice to confirm this.
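One common class of workaround uses standard `xarray.open_mfdataset()` keyword arguments (not anything added by this PR) to relax the compatibility checks and parallelize the per-file opens; the glob below is hypothetical:

```python
import xarray as xr

# Skip most per-variable/coordinate compatibility checks and open files in
# parallel with Dask; appropriate when the files are known to align.
ds = xr.open_mfdataset(
    "model_output/*.nc",
    combine="by_coords",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)
```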
> I'd be fine with this, but ideally `open_mfdataset()` could also handle a string `directory`. I think this would handle all cases in this PR.

I'll take a look at your implementation and suggest some changes to extend `open_mfdataset()`.
xcdat/dataset.py
Outdated
```python
    if not isinstance(paths, list):
        if _is_paths_to_xml(paths):
            paths = _parse_xml_for_nc_glob(paths)  # type: ignore
        elif os.path.isdir(paths):
            paths = _parse_dir_for_nc_glob(paths)  # type: ignore
```
The new conditional in `open_mfdataset()` for the `paths` argument.
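With this conditional in place, the one entry point would cover all of the path types discussed above; a usage sketch with hypothetical paths:

```python
import xcdat as xc

ds_from_xml = xc.open_mfdataset("datasets.xml")        # CDML/XML file
ds_from_dir = xc.open_mfdataset("model_output")        # directory of .nc files
ds_from_glob = xc.open_mfdataset("model_output/*.nc")  # wildcard
ds_from_list = xc.open_mfdataset(
    ["model_output/ts_2000.nc", "model_output/ts_2001.nc"]
)
```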
```python
def _is_paths_to_xml(paths: Union[str, pathlib.Path]) -> bool:
    """Checks if the ``paths`` argument is a path to an XML file.

    Parameters
    ----------
    paths : Union[str, pathlib.Path]
        A string or pathlib.Path representing a file path.

    Returns
    -------
    bool
    """
    if isinstance(paths, str):
        return paths.split(".")[-1] == "xml"
    elif isinstance(paths, pathlib.Path):
        return paths.parts[-1].endswith("xml")
```
We also need to support `pathlib.Path` alongside `str`.
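A quick demonstration of the helper with both input types (the file names are hypothetical, and the import assumes the private helper lives in `xcdat.dataset` as in this PR):

```python
import pathlib

from xcdat.dataset import _is_paths_to_xml

print(_is_paths_to_xml("datasets.xml"))                # True
print(_is_paths_to_xml(pathlib.Path("datasets.xml")))  # True
print(_is_paths_to_xml("model_output/ts_2000.nc"))     # False
```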
xcdat/dataset.py
Outdated
```python
def _parse_xml_for_nc_glob(xml_path: str) -> str:
    """Parses a CDAT XML file for a glob of `*.nc` paths.

    The CDAT "Climate Data Markup Language" (CDML) is a dialect of XML with a
    defined set of attributes. The "directory" attribute pointing to a
    directory of `.nc` files is extracted from the specified XML file. Refer
    to [6]_ for more information on CDML.

    Parameters
    ----------
    xml_path : str
        The CDML XML file path.

    Returns
    -------
    str
        A glob of `*.nc` paths in the directory.

    Notes
    -----
    CDML is a deprecated XML dialect that is still used by current and former
    users in the CDAT community. xCDAT supports CDML to enable these users to
    more easily integrate xCDAT into their existing CDML/CDAT-based workflows.

    References
    ----------
    .. [6] https://cdms.readthedocs.io/en/latest/manual/cdms_6.html
    """
    # `resolve_entities=False` and `no_network=True` guards against XXE attacks.
    # Source: https://rules.sonarsource.com/python/RSPEC-2755
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    tree = etree.parse(xml_path, parser)
    root = tree.getroot()

    dir_attr = root.attrib.get("directory")
    if dir_attr is None:
        raise KeyError(
            f"The XML file ({xml_path}) does not have a 'directory' attribute "
            "that points to a directory of `.nc` dataset files."
        )

    glob_path = dir_attr + "/*.nc"

    return glob_path
```
New function `_parse_xml_for_nc_glob()` for parsing CDML XML files.
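As a usage note (the file name and directory are hypothetical), the helper only resolves the CDML file to a glob string; `open_mfdataset()` then does the actual opening:

```python
from xcdat.dataset import _parse_xml_for_nc_glob  # private helper added in this PR

# If datasets.xml has directory="/p/data/model_output/tas" ...
glob_path = _parse_xml_for_nc_glob("datasets.xml")
print(glob_path)  # "/p/data/model_output/tas/*.nc"
```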
```python
def _parse_dir_for_nc_glob(dir_path: str) -> str:
    """Parses a directory for a glob of `*.nc` paths.

    Parameters
    ----------
    dir_path : str
        The directory.

    Returns
    -------
    str
        A glob of `*.nc` paths in the directory.
    """
    file_list = os.listdir(dir_path)

    if len(file_list) == 0:
        raise ValueError(
            f"The directory {dir_path} has no netcdf (`.nc`) files to open."
        )

    glob_path = dir_path + "/*.nc"

    return glob_path
```
New function `_parse_dir_for_nc_glob()` for parsing a directory.
```python
    # `resolve_entities=False` and `no_network=True` guards against XXE attacks.
    # Source: https://rules.sonarsource.com/python/RSPEC-2755
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    tree = etree.parse(xml_path, parser)
    root = tree.getroot()
```
I'm using `lxml` with `resolve_entities=False` and `no_network=True` to avoid XXE attacks. These settings aren't easily configurable with `ElementTree` from the Python standard lib.

More info on XXE attacks:
- https://rules.sonarsource.com/python/RSPEC-2755
- https://codeql.github.com/codeql-query-help/python/py-xxe/
- Python docs warning that `ElementTree` is not secure
tests/test_dataset.py
Outdated
```python
def test_raises_error_if_xml_does_not_have_root_directory(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = f"{self.dir}/datasets.xml"
    page = etree.Element("dataset")
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    with pytest.raises(KeyError):
        open_mfdataset(xml_path, decode_times=True)

def test_opens_datasets_from_xml_using_str_path(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = f"{self.dir}/datasets.xml"
    page = etree.Element("dataset", directory=str(self.dir))
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    result = open_mfdataset(xml_path, decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

def test_opens_datasets_from_xml_using_pathlib_path(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    # Create the XML file
    xml_path = self.dir / "datasets.xml"
    page = etree.Element("dataset", directory=str(self.dir))
    doc = etree.ElementTree(page)
    doc.write(xml_path, xml_declaration=True, encoding="utf-16")

    result = open_mfdataset(xml_path, decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)

def test_raises_error_if_directory_has_no_netcdf_files(self):
    with pytest.raises(ValueError):
        open_mfdataset(str(self.dir), decode_times=True)

def test_opens_netcdf_files_in_a_directory(self):
    ds1 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds1.to_netcdf(self.file_path1)
    ds2 = generate_dataset(decode_times=False, cf_compliant=False, has_bounds=True)
    ds2 = ds2.rename_vars({"ts": "tas"})
    ds2.to_netcdf(self.file_path2)

    result = open_mfdataset(str(self.dir), decode_times=True)
    expected = ds1.merge(ds2)

    assert result.identical(expected)
```
New unit tests covering XML and directory values for the `paths` argument.
This looks good to me. I think we could be more sophisticated in globbing directories. We could see if all files are either `.nc`, `.hdf`, `.grib`, etc. and, if so, open them as a multi-file dataset. If the files are mixed, we can throw an error. This sort of functionality can be left to future development (and worked on when someone identifies this need).
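A rough sketch of that future idea, purely for illustration (the helper name and the set of supported extensions are assumptions, not part of this PR):

```python
import os

SUPPORTED_EXTS = {".nc", ".hdf", ".grib"}  # assumed set of supported extensions


def _parse_dir_for_glob(dir_path: str) -> str:
    """Return a glob only when all files in the directory share one supported
    extension; otherwise raise an error."""
    exts = {os.path.splitext(f)[1] for f in os.listdir(dir_path)}
    exts.discard("")  # ignore extension-less entries such as subdirectories

    if len(exts) == 0:
        raise ValueError(f"The directory {dir_path} has no dataset files to open.")
    if len(exts) > 1 or not exts.issubset(SUPPORTED_EXTS):
        raise ValueError(
            f"The directory {dir_path} contains mixed or unsupported file types "
            f"({sorted(exts)}); pass an explicit list of paths instead."
        )

    (ext,) = exts
    return f"{dir_path}/*{ext}"
```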
- Add `lxml` to conda env yml files
- Add unit tests
- Remove `open()` prototype function
Great idea, I agree with you. I'll go ahead and merge this PR!
Description

This PR is intended to wrap `xc.open_dataset()` and `xc.open_mfdataset()` such that xcdat can flexibly open the dataset path passed in (whether it is a directory, wildcard search, xml file, single file, etc.).

Related issue: `xcdat.open()` to wrap `xcdat.open_dataset()` and `xcdat.open_mfdataset()` #128