Failure when ingesting the whole of CMIP6 on JASMIN #175

Open
mwklai opened this issue Mar 13, 2025 · 2 comments
Labels
bug Something isn't working

Comments

mwklai commented Mar 13, 2025

Describe the bug

When I try to ingest the whole of CMIP6, the command fails after about 20 minutes with the error shown in the full log output below.

Failing Test

Run the following command on JASMIN
uv run ref -vv datasets ingest --source-type cmip6 /badc/cmip6/data/CMIP6/

Expected behavior

Expected all the metadata from the CMIP6 archive to be put into .ref/db.

Screenshots

Full log output
[mwklai@sci-vm-04 climate-ref]$ uv run ref -vv datasets ingest --source-type cmip6 /badc/cmip6/data/CMIP6/
2025-03-13 14:28:56.528 | DEBUG    | cmip_ref.config:default:365 - Loading default configuration from /home/users/mwklai/workspace/climate-ref/climate-ref/.ref/ref.toml
2025-03-13 14:28:56.600 | INFO     | cmip_ref.database:__init__:79 - Connecting to database at sqlite:////home/users/mwklai/workspace/climate-ref/climate-ref/.ref/db/cmip_ref.db
2025-03-13 14:28:57.051 | DEBUG    | env_py:<module>:11 - Running alembic env
2025-03-13 14:28:57.066 | INFO     | alembic.runtime.migration:__init__:207 - Context impl SQLiteImpl.
2025-03-13 14:28:57.067 | INFO     | alembic.runtime.migration:__init__:210 - Will assume non-transactional DDL.
2025-03-13 14:28:57.333 | INFO     | cmip_ref.cli.datasets:ingest:98 - ingesting /badc/cmip6/data/CMIP6


╭──────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────╮
│ in pandas._libs.lib.maybe_convert_numeric:2391                                                                                                     │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Unable to parse string "nan"

During handling of the above exception, another exception occurred:

╭──────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────╮
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/cli/datasets.py:112 in ingest                                       │
│                                                                                                                                                    │
│   109 │   │   logger.error(f"File or directory {file_or_directory} does not exist")                                                                │
│   110 │   │   raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), file_or_directo                                                     │
│   111 │                                                                                                                                            │
│ ❱ 112 │   data_catalog = adapter.find_local_datasets(file_or_directory)                                                                            │
│   113 │   data_catalog = adapter.validate_data_catalog(data_catalog, skip_invalid=skip_invalid                                                     │
│   114 │                                                                                                                                            │
│   115 │   logger.info(                                                                                                                             │
│                                                                                                                                                    │
│ ╭──────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────╮                 │
│ │           adapter = <cmip_ref.datasets.cmip6.CMIP6DatasetAdapter object at 0x7effcb0d00d0>                                     │                 │
│ │            config = Config(                                                                                                    │                 │
│ │                     │   paths=PathConfig(                                                                                      │                 │
│ │                     │   │   log=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/log'),                    │                 │
│ │                     │   │   scratch=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/scratch'),            │                 │
│ │                     │   │   software=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/software'),          │                 │
│ │                     │   │   results=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/results')             │                 │
│ │                     │   ),                                                                                                     │                 │
│ │                     │   db=DbConfig(                                                                                           │                 │
│ │                     │   │   database_url='sqlite:////home/users/mwklai/workspace/climate-ref/climate-ref/.ref/db/cmip_ref.'+2, │                 │
│ │                     │   │   run_migrations=True                                                                                │                 │
│ │                     │   ),                                                                                                     │                 │
│ │                     │   executor=ExecutorConfig(executor='cmip_ref.executor.local.LocalExecutor', config={}),                  │                 │
│ │                     │   metric_providers=[                                                                                     │                 │
│ │                     │   │   MetricsProviderConfig(provider='cmip_ref_metrics_esmvaltool.provider', config={}),                 │                 │
│ │                     │   │   MetricsProviderConfig(provider='cmip_ref_metrics_ilamb.provider', config={}),                      │                 │
│ │                     │   │   MetricsProviderConfig(provider='cmip_ref_metrics_pmp.provider', config={})                         │                 │
│ │                     │   ]                                                                                                      │                 │
│ │                     )                                                                                                          │                 │
│ │               ctx = <click.core.Context object at 0x7effcb210280>                                                              │                 │
│ │                db = <cmip_ref.database.Database object at 0x7effcb211690>                                                      │                 │
│ │           dry_run = False                                                                                                      │                 │
│ │ file_or_directory = PosixPath('/badc/cmip6/data/CMIP6')                                                                        │                 │
│ │            kwargs = {}                                                                                                         │                 │
│ │            n_jobs = None                                                                                                       │                 │
│ │      skip_invalid = False                                                                                                      │                 │
│ │             solve = False                                                                                                      │                 │
│ │       source_type = <SourceDatasetType.CMIP6: 'cmip6'>                                                                         │                 │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯                 │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/datasets/cmip6.py:204 in find_local_datasets                        │
│                                                                                                                                                    │
│   201 │   │                                                                                                                                        │
│   202 │   │   # Temporary fix for some datasets                                                                                                    │
│   203 │   │   # TODO: Replace with a standalone package that contains metadata fixes for CMIP6                                                     │
│ ❱ 204 │   │   datasets = _apply_fixes(datasets)                                                                                                    │
│   205 │   │                                                                                                                                        │
│   206 │   │   return datasets                                                                                                                      │
│   207                                                                                                                                              │
│                                                                                                                                                    │
│ ╭──────────────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────────────╮ │
│ │           builder = Builder(                                                                                                                   │ │
│ │                     │   paths=['/badc/cmip6/data/CMIP6'],                                                                                      │ │
│ │                     │   storage_options={},                                                                                                    │ │
│ │                     │   depth=10,                                                                                                              │ │
│ │                     │   exclude_patterns=[],                                                                                                   │ │
│ │                     │   include_patterns=['*.nc'],                                                                                             │ │
│ │                     │   joblib_parallel_kwargs={'n_jobs': 1}                                                                                   │ │
│ │                     )                                                                                                                          │ │
│ │          datasets = │   │      activity_id                                      branch_method  ...    version                                  │ │
│ │                     instance_id                                                                                                                │ │
│ │                     0           HighResMIP  fixed historical forcing from 1950 was applied...  ...         v0                                  │ │
│ │                     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...                                                                          │ │
│ │                     1           HighResMIP  fixed historical forcing from 1950 was applied...  ...  v20210416                                  │ │
│ │                     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...                                                                          │ │
│ │                     2         LS3MIP LUMIP                                          no parent  ...         v0  CMIP6.LS3MIP                    │ │
│ │                     LUMIP.MOHC.HadGEM3-GC31-LL.land-h...                                                                                       │ │
│ │                     3         LS3MIP LUMIP                                          no parent  ...  v20200807  CMIP6.LS3MIP                    │ │
│ │                     LUMIP.MOHC.HadGEM3-GC31-LL.land-h...                                                                                       │ │
│ │                     4         LS3MIP LUMIP                                          no parent  ...         v0  CMIP6.LS3MIP                    │ │
│ │                     LUMIP.MOHC.HadGEM3-GC31-LL.land-h...                                                                                       │ │
│ │                     ...                ...                                                ...  ...        ...                                  │ │
│ │                     ...                                                                                                                        │ │
│ │                     1081  RFMIP AerChemMIP                                           standard  ...  v20200214  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1082  RFMIP AerChemMIP                                           standard  ...         v0  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1083  RFMIP AerChemMIP                                           standard  ...  v20200214  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1084  RFMIP AerChemMIP                                           standard  ...         v0  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1085  RFMIP AerChemMIP                                           standard  ...  v20200214  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                                                                                                                                                │ │
│ │                     [1086 rows x 37 columns]                                                                                                   │ │
│ │         drs_items = [                                                                                                                          │ │
│ │                     │   'activity_id',                                                                                                         │ │
│ │                     │   'institution_id',                                                                                                      │ │
│ │                     │   'source_id',                                                                                                           │ │
│ │                     │   'experiment_id',                                                                                                       │ │
│ │                     │   'member_id',                                                                                                           │ │
│ │                     │   'table_id',                                                                                                            │ │
│ │                     │   'variable_id',                                                                                                         │ │
│ │                     │   'grid_label',                                                                                                          │ │
│ │                     │   'version'                                                                                                              │ │
│ │                     ]                                                                                                                          │ │
│ │ file_or_directory = PosixPath('/badc/cmip6/data/CMIP6')                                                                                        │ │
│ │              self = <cmip_ref.datasets.cmip6.CMIP6DatasetAdapter object at 0x7effcb0d00d0>                                                     │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/datasets/cmip6.py:60 in _apply_fixes                                │
│                                                                                                                                                    │
│    57 │   │   .reset_index(level="instance_id")                                                                                                    │
│    58 │   )                                                                                                                                        │
│    59 │                                                                                                                                            │
│ ❱  60 │   data_catalog["branch_time_in_child"] = _clean_branch_time(data_catalog["branch_time_                                                     │
│    61 │   data_catalog["branch_time_in_parent"] = _clean_branch_time(data_catalog["branch_time                                                     │
│    62 │                                                                                                                                            │
│    63 │   return data_catalog                                                                                                                      │
│                                                                                                                                                    │
│ ╭──────────────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────────────╮ │
│ │ data_catalog = │   │   │   │   │   │   │   │   │   │   │   instance_id       activity_id  ...                                                  │ │
│ │                path    version                                                                                                                 │ │
│ │                0     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...        HighResMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/HighResMIP/MOHC/HadGEM3...         v0                                                                    │ │
│ │                1     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...        HighResMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/HighResMIP/MOHC/HadGEM3...  v20210416                                                                    │ │
│ │                2     CMIP6.LS3MIP LUMIP.MOHC.HadGEM3-GC31-LL.land-h...      LS3MIP LUMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/LS3MIP/MOHC/HadGEM3-GC3...         v0                                                                    │ │
│ │                3     CMIP6.LS3MIP LUMIP.MOHC.HadGEM3-GC31-LL.land-h...      LS3MIP LUMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/LS3MIP/MOHC/HadGEM3-GC3...  v20200807                                                                    │ │
│ │                4     CMIP6.LS3MIP LUMIP.MOHC.HadGEM3-GC31-LL.land-h...      LS3MIP LUMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/LS3MIP/MOHC/HadGEM3-GC3...         v0                                                                    │ │
│ │                ...                                                 ...               ...  ...                                                  │ │
│ │                ...        ...                                                                                                                  │ │
│ │                1081  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...  v20200214                                                                    │ │
│ │                1082  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...         v0                                                                    │ │
│ │                1083  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...  v20200214                                                                    │ │
│ │                1084  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...         v0                                                                    │ │
│ │                1085  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...  v20200214                                                                    │ │
│ │                                                                                                                                                │ │
│ │                [1086 rows x 37 columns]                                                                                                        │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/datasets/cmip6.py:69 in _clean_branch_time                          │
│                                                                                                                                                    │
│    66 def _clean_branch_time(branch_time: pd.Series[str]) -> pd.Series[float]:                                                                     │
│    67 │   # EC-Earth3 uses "D" as a suffix for the branch_time_in_child and branch_time_in_par                                                     │
│    68 │   # Handle missing values (these result in nan values)                                                                                     │
│ ❱  69 │   return pd.to_numeric(branch_time.astype(str).str.replace("D", "").replace("None", ""                                                     │
│    70                                                                                                                                              │
│    71                                                                                                                                              │
│    72 class CMIP6DatasetAdapter(DatasetAdapter):                                                                                                   │
│                                                                                                                                                    │
│ ╭──────────────────────────────── locals ────────────────────────────────╮                                                                         │
│ │ branch_time = 0       0.0                                              │                                                                         │
│ │               1       0.0                                              │                                                                         │
│ │               2       NaN                                              │                                                                         │
│ │               3       NaN                                              │                                                                         │
│ │               4       NaN                                              │                                                                         │
│ │               │      ...                                               │                                                                         │
│ │               1081    0.0                                              │                                                                         │
│ │               1082    0.0                                              │                                                                         │
│ │               1083    0.0                                              │                                                                         │
│ │               1084    0.0                                              │                                                                         │
│ │               1085    0.0                                              │                                                                         │
│ │               Name: branch_time_in_child, Length: 1086, dtype: float64 │                                                                         │
│ ╰────────────────────────────────────────────────────────────────────────╯                                                                         │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/.venv/lib/python3.10/site-packages/pandas/core/tools/numeric.py:232 in to_numeric             │
│                                                                                                                                                    │
│   229 │   │   values = ensure_object(values)                                                                                                       │
│   230 │   │   coerce_numeric = errors not in ("ignore", "raise")                                                                                   │
│   231 │   │   try:                                                                                                                                 │
│ ❱ 232 │   │   │   values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]                                                     │
│   233 │   │   │   │   values,                                                                                                                      │
│   234 │   │   │   │   set(),                                                                                                                       │
│   235 │   │   │   │   coerce_numeric=coerce_numeric,                                                                                               │
│                                                                                                                                                    │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮                                                                       │
│ │            arg = 0       0.0                                             │                                                                       │
│ │                  1       0.0                                             │                                                                       │
│ │                  2       nan                                             │                                                                       │
│ │                  3       nan                                             │                                                                       │
│ │                  4       nan                                             │                                                                       │
│ │                  │      ...                                              │                                                                       │
│ │                  1081    0.0                                             │                                                                       │
│ │                  1082    0.0                                             │                                                                       │
│ │                  1083    0.0                                             │                                                                       │
│ │                  1084    0.0                                             │                                                                       │
│ │                  1085    0.0                                             │                                                                       │
│ │                  Name: branch_time_in_child, Length: 1086, dtype: object │                                                                       │
│ │ coerce_numeric = False                                                   │                                                                       │
│ │       downcast = None                                                    │                                                                       │
│ │  dtype_backend = <no_default>                                            │                                                                       │
│ │         errors = 'raise'                                                 │                                                                       │
│ │       is_index = False                                                   │                                                                       │
│ │     is_scalars = False                                                   │                                                                       │
│ │      is_series = True                                                    │                                                                       │
│ │           mask = None                                                    │                                                                       │
│ │       new_mask = None                                                    │                                                                       │
│ │    orig_values = array(['0.0', '0.0', 'nan', ..., '0.0', '0.0', '0.0'],  │                                                                       │
│ │                  │     shape=(1086,), dtype=object)                      │                                                                       │
│ │         values = array(['0.0', '0.0', 'nan', ..., '0.0', '0.0', '0.0'],  │                                                                       │
│ │                  │     shape=(1086,), dtype=object)                      │                                                                       │
│ │   values_dtype = dtype('O')                                              │                                                                       │
│ ╰──────────────────────────────────────────────────────────────────────────╯                                                                       │
│                                                                                                                                                    │
│ in pandas._libs.lib.maybe_convert_numeric:2433                                                                                                     │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Unable to parse string "nan" at position 2
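
Reading the traceback, the `branch_time_in_child` column already arrives as `float64` with `NaN` entries; `astype(str)` turns those into the literal string "nan", which the visible part of `_clean_branch_time` does not appear to strip before `pd.to_numeric` is called with `errors='raise'`. A minimal per-value sketch of more forgiving cleaning, in plain Python, is below. The function name `clean_branch_time_value` and the `MISSING_MARKERS` set are hypothetical and not part of `cmip_ref`; this is an illustration of the idea, not the project's fix.

```python
import math

# Hypothetical set of spellings treated as "no value" (an assumption, not from cmip_ref).
MISSING_MARKERS = {"", "none", "nan"}


def clean_branch_time_value(value):
    """Convert one raw branch_time value to a float, mapping missing values to NaN."""
    if value is None:
        return math.nan
    # Mirror the original cleaning step: drop the "D" suffix used by EC-Earth3.
    text = str(value).replace("D", "").strip()
    # A float NaN stringifies to "nan", so it is caught here too.
    if text.lower() in MISSING_MARKERS:
        return math.nan
    return float(text)


raw_values = ["0.0", "10D", None, "None", float("nan")]
cleaned = [clean_branch_time_value(v) for v in raw_values]
```

Within pandas itself, passing `errors="coerce"` to `pd.to_numeric` is another option, at the cost of silently turning any malformed value into `NaN`.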

System

On JASMIN

Additional context

mwklai added the bug label on Mar 13, 2025
Contributor

mikapfl commented Mar 14, 2025

Hi,

I played around a little with ingesting lots of files on JASMIN. (I updated the CMIP6 reading code to avoid first walking the entire tree and to add logging; these changes should hit main soonish.) From a quick back-of-the-envelope calculation and some testing on the sci-vm nodes on JASMIN, it looks like naively ingesting all of CMIP6 into the REF database would take ~70 years. So this is not a very useful thing to try.

I think we really need to use intake-esgf or some other service which already has an index of CMIP6 data to first narrow down what to read, and then only read in data which can, in principle, be of interest to the REF (as discussed with @nocollier). Or, if there is no other index to use (e.g. when using the REF within a modelling centre with fresh-out-of-the-pipeline results that aren't on ESGF yet), some bespoke scripts need to be written to pre-filter the NetCDF files which are ingested into the REF (and systems with much better I/O characteristics than the JASMIN sci VMs will probably have to be used).
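
As a rough illustration of what such a pre-filter could look like, here is a minimal sketch that decides from the path alone whether a file is worth opening. It assumes the directories under the archive root follow the same order as the `drs_items` list in the traceback above (`activity_id` through `version`), which I have not verified for every dataset on JASMIN. The helper names `parse_drs` and `is_of_interest`, the `wanted` facets, and the example file path are all hypothetical.

```python
from pathlib import PurePosixPath

# Facet order taken from the drs_items list shown in the traceback.
DRS_ITEMS = (
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_drs(path, root):
    """Map the directories below root onto DRS facet names, or return None."""
    try:
        parts = PurePosixPath(path).relative_to(root).parts
    except ValueError:
        return None  # path is not under root
    if len(parts) != len(DRS_ITEMS) + 1:  # nine directories plus the file name
        return None
    return dict(zip(DRS_ITEMS, parts[:-1]))


def is_of_interest(path, root, wanted):
    """Return True if every facet named in wanted has an allowed value."""
    facets = parse_drs(path, root)
    if facets is None:
        return False
    return all(facets[name] in allowed for name, allowed in wanted.items())


ROOT = "/badc/cmip6/data/CMIP6"
# Illustrative path only; not taken from the log above.
EXAMPLE = (
    ROOT + "/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/tas/gn/"
    "v20190406/tas_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_185001-194912.nc"
)
keep = is_of_interest(EXAMPLE, ROOT, {"table_id": {"Amon"}, "variable_id": {"tas"}})
```

Filtering on path components avoids opening files that could never be used, but it still requires walking the tree, so it only helps with the per-file read cost.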

Cheers

Mika

Contributor

lewisjared commented Mar 14, 2025 via email


3 participants