Failure when ingesting the whole of CMIP6 on JASMIN #175

mwklai · 2025-03-13T14:59:33Z

Describe the bug

When I tried to ingest the whole of CMIP6, it failed after ~20 minutes with the following error message.

Failing Test

Run the following command on JASMIN
uv run ref -vv datasets ingest --source-type cmip6 /badc/cmip6/data/CMIP6/

Expected behavior

Expected all the metadata from the CMIP6 archive to be put into .ref/db.

Screenshots

Full log output

[mwklai@sci-vm-04 climate-ref]$ uv run ref -vv datasets ingest --source-type cmip6 /badc/cmip6/data/CMIP6/
2025-03-13 14:28:56.528 | DEBUG    | cmip_ref.config:default:365 - Loading default configuration from /home/users/mwklai/workspace/climate-ref/climate-ref/.ref/ref.toml
2025-03-13 14:28:56.600 | INFO     | cmip_ref.database:__init__:79 - Connecting to database at sqlite:////home/users/mwklai/workspace/climate-ref/climate-ref/.ref/db/cmip_ref.db
2025-03-13 14:28:57.051 | DEBUG    | env_py:<module>:11 - Running alembic env
2025-03-13 14:28:57.066 | INFO     | alembic.runtime.migration:__init__:207 - Context impl SQLiteImpl.
2025-03-13 14:28:57.067 | INFO     | alembic.runtime.migration:__init__:210 - Will assume non-transactional DDL.
2025-03-13 14:28:57.333 | INFO     | cmip_ref.cli.datasets:ingest:98 - ingesting /badc/cmip6/data/CMIP6


╭──────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────╮
│ in pandas._libs.lib.maybe_convert_numeric:2391                                                                                                     │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Unable to parse string "nan"

During handling of the above exception, another exception occurred:

╭──────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────╮
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/cli/datasets.py:112 in ingest                                       │
│                                                                                                                                                    │
│   109 │   │   logger.error(f"File or directory {file_or_directory} does not exist")                                                                │
│   110 │   │   raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), file_or_directo                                                     │
│   111 │                                                                                                                                            │
│ ❱ 112 │   data_catalog = adapter.find_local_datasets(file_or_directory)                                                                            │
│   113 │   data_catalog = adapter.validate_data_catalog(data_catalog, skip_invalid=skip_invalid                                                     │
│   114 │                                                                                                                                            │
│   115 │   logger.info(                                                                                                                             │
│                                                                                                                                                    │
│ ╭──────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────╮                 │
│ │           adapter = <cmip_ref.datasets.cmip6.CMIP6DatasetAdapter object at 0x7effcb0d00d0>                                     │                 │
│ │            config = Config(                                                                                                    │                 │
│ │                     │   paths=PathConfig(                                                                                      │                 │
│ │                     │   │   log=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/log'),                    │                 │
│ │                     │   │   scratch=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/scratch'),            │                 │
│ │                     │   │   software=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/software'),          │                 │
│ │                     │   │   results=PosixPath('/home/users/mwklai/workspace/climate-ref/climate-ref/.ref/results')             │                 │
│ │                     │   ),                                                                                                     │                 │
│ │                     │   db=DbConfig(                                                                                           │                 │
│ │                     │   │   database_url='sqlite:////home/users/mwklai/workspace/climate-ref/climate-ref/.ref/db/cmip_ref.'+2, │                 │
│ │                     │   │   run_migrations=True                                                                                │                 │
│ │                     │   ),                                                                                                     │                 │
│ │                     │   executor=ExecutorConfig(executor='cmip_ref.executor.local.LocalExecutor', config={}),                  │                 │
│ │                     │   metric_providers=[                                                                                     │                 │
│ │                     │   │   MetricsProviderConfig(provider='cmip_ref_metrics_esmvaltool.provider', config={}),                 │                 │
│ │                     │   │   MetricsProviderConfig(provider='cmip_ref_metrics_ilamb.provider', config={}),                      │                 │
│ │                     │   │   MetricsProviderConfig(provider='cmip_ref_metrics_pmp.provider', config={})                         │                 │
│ │                     │   ]                                                                                                      │                 │
│ │                     )                                                                                                          │                 │
│ │               ctx = <click.core.Context object at 0x7effcb210280>                                                              │                 │
│ │                db = <cmip_ref.database.Database object at 0x7effcb211690>                                                      │                 │
│ │           dry_run = False                                                                                                      │                 │
│ │ file_or_directory = PosixPath('/badc/cmip6/data/CMIP6')                                                                        │                 │
│ │            kwargs = {}                                                                                                         │                 │
│ │            n_jobs = None                                                                                                       │                 │
│ │      skip_invalid = False                                                                                                      │                 │
│ │             solve = False                                                                                                      │                 │
│ │       source_type = <SourceDatasetType.CMIP6: 'cmip6'>                                                                         │                 │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯                 │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/datasets/cmip6.py:204 in find_local_datasets                        │
│                                                                                                                                                    │
│   201 │   │                                                                                                                                        │
│   202 │   │   # Temporary fix for some datasets                                                                                                    │
│   203 │   │   # TODO: Replace with a standalone package that contains metadata fixes for CMIP6                                                     │
│ ❱ 204 │   │   datasets = _apply_fixes(datasets)                                                                                                    │
│   205 │   │                                                                                                                                        │
│   206 │   │   return datasets                                                                                                                      │
│   207                                                                                                                                              │
│                                                                                                                                                    │
│ ╭──────────────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────────────╮ │
│ │           builder = Builder(                                                                                                                   │ │
│ │                     │   paths=['/badc/cmip6/data/CMIP6'],                                                                                      │ │
│ │                     │   storage_options={},                                                                                                    │ │
│ │                     │   depth=10,                                                                                                              │ │
│ │                     │   exclude_patterns=[],                                                                                                   │ │
│ │                     │   include_patterns=['*.nc'],                                                                                             │ │
│ │                     │   joblib_parallel_kwargs={'n_jobs': 1}                                                                                   │ │
│ │                     )                                                                                                                          │ │
│ │          datasets = │   │      activity_id                                      branch_method  ...    version                                  │ │
│ │                     instance_id                                                                                                                │ │
│ │                     0           HighResMIP  fixed historical forcing from 1950 was applied...  ...         v0                                  │ │
│ │                     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...                                                                          │ │
│ │                     1           HighResMIP  fixed historical forcing from 1950 was applied...  ...  v20210416                                  │ │
│ │                     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...                                                                          │ │
│ │                     2         LS3MIP LUMIP                                          no parent  ...         v0  CMIP6.LS3MIP                    │ │
│ │                     LUMIP.MOHC.HadGEM3-GC31-LL.land-h...                                                                                       │ │
│ │                     3         LS3MIP LUMIP                                          no parent  ...  v20200807  CMIP6.LS3MIP                    │ │
│ │                     LUMIP.MOHC.HadGEM3-GC31-LL.land-h...                                                                                       │ │
│ │                     4         LS3MIP LUMIP                                          no parent  ...         v0  CMIP6.LS3MIP                    │ │
│ │                     LUMIP.MOHC.HadGEM3-GC31-LL.land-h...                                                                                       │ │
│ │                     ...                ...                                                ...  ...        ...                                  │ │
│ │                     ...                                                                                                                        │ │
│ │                     1081  RFMIP AerChemMIP                                           standard  ...  v20200214  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1082  RFMIP AerChemMIP                                           standard  ...         v0  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1083  RFMIP AerChemMIP                                           standard  ...  v20200214  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1084  RFMIP AerChemMIP                                           standard  ...         v0  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                     1085  RFMIP AerChemMIP                                           standard  ...  v20200214  CMIP6.RFMIP                     │ │
│ │                     AerChemMIP.MOHC.UKESM1-0-LL.piClim...                                                                                      │ │
│ │                                                                                                                                                │ │
│ │                     [1086 rows x 37 columns]                                                                                                   │ │
│ │         drs_items = [                                                                                                                          │ │
│ │                     │   'activity_id',                                                                                                         │ │
│ │                     │   'institution_id',                                                                                                      │ │
│ │                     │   'source_id',                                                                                                           │ │
│ │                     │   'experiment_id',                                                                                                       │ │
│ │                     │   'member_id',                                                                                                           │ │
│ │                     │   'table_id',                                                                                                            │ │
│ │                     │   'variable_id',                                                                                                         │ │
│ │                     │   'grid_label',                                                                                                          │ │
│ │                     │   'version'                                                                                                              │ │
│ │                     ]                                                                                                                          │ │
│ │ file_or_directory = PosixPath('/badc/cmip6/data/CMIP6')                                                                                        │ │
│ │              self = <cmip_ref.datasets.cmip6.CMIP6DatasetAdapter object at 0x7effcb0d00d0>                                                     │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/datasets/cmip6.py:60 in _apply_fixes                                │
│                                                                                                                                                    │
│    57 │   │   .reset_index(level="instance_id")                                                                                                    │
│    58 │   )                                                                                                                                        │
│    59 │                                                                                                                                            │
│ ❱  60 │   data_catalog["branch_time_in_child"] = _clean_branch_time(data_catalog["branch_time_                                                     │
│    61 │   data_catalog["branch_time_in_parent"] = _clean_branch_time(data_catalog["branch_time                                                     │
│    62 │                                                                                                                                            │
│    63 │   return data_catalog                                                                                                                      │
│                                                                                                                                                    │
│ ╭──────────────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────────────╮ │
│ │ data_catalog = │   │   │   │   │   │   │   │   │   │   │   instance_id       activity_id  ...                                                  │ │
│ │                path    version                                                                                                                 │ │
│ │                0     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...        HighResMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/HighResMIP/MOHC/HadGEM3...         v0                                                                    │ │
│ │                1     CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.hist-195...        HighResMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/HighResMIP/MOHC/HadGEM3...  v20210416                                                                    │ │
│ │                2     CMIP6.LS3MIP LUMIP.MOHC.HadGEM3-GC31-LL.land-h...      LS3MIP LUMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/LS3MIP/MOHC/HadGEM3-GC3...         v0                                                                    │ │
│ │                3     CMIP6.LS3MIP LUMIP.MOHC.HadGEM3-GC31-LL.land-h...      LS3MIP LUMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/LS3MIP/MOHC/HadGEM3-GC3...  v20200807                                                                    │ │
│ │                4     CMIP6.LS3MIP LUMIP.MOHC.HadGEM3-GC31-LL.land-h...      LS3MIP LUMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/LS3MIP/MOHC/HadGEM3-GC3...         v0                                                                    │ │
│ │                ...                                                 ...               ...  ...                                                  │ │
│ │                ...        ...                                                                                                                  │ │
│ │                1081  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...  v20200214                                                                    │ │
│ │                1082  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...         v0                                                                    │ │
│ │                1083  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...  v20200214                                                                    │ │
│ │                1084  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...         v0                                                                    │ │
│ │                1085  CMIP6.RFMIP AerChemMIP.MOHC.UKESM1-0-LL.piClim...  RFMIP AerChemMIP  ...                                                  │ │
│ │                /badc/cmip6/data/CMIP6/RFMIP/MOHC/UKESM1-0-LL/...  v20200214                                                                    │ │
│ │                                                                                                                                                │ │
│ │                [1086 rows x 37 columns]                                                                                                        │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/packages/ref/src/cmip_ref/datasets/cmip6.py:69 in _clean_branch_time                          │
│                                                                                                                                                    │
│    66 def _clean_branch_time(branch_time: pd.Series[str]) -> pd.Series[float]:                                                                     │
│    67 │   # EC-Earth3 uses "D" as a suffix for the branch_time_in_child and branch_time_in_par                                                     │
│    68 │   # Handle missing values (these result in nan values)                                                                                     │
│ ❱  69 │   return pd.to_numeric(branch_time.astype(str).str.replace("D", "").replace("None", ""                                                     │
│    70                                                                                                                                              │
│    71                                                                                                                                              │
│    72 class CMIP6DatasetAdapter(DatasetAdapter):                                                                                                   │
│                                                                                                                                                    │
│ ╭──────────────────────────────── locals ────────────────────────────────╮                                                                         │
│ │ branch_time = 0       0.0                                              │                                                                         │
│ │               1       0.0                                              │                                                                         │
│ │               2       NaN                                              │                                                                         │
│ │               3       NaN                                              │                                                                         │
│ │               4       NaN                                              │                                                                         │
│ │               │      ...                                               │                                                                         │
│ │               1081    0.0                                              │                                                                         │
│ │               1082    0.0                                              │                                                                         │
│ │               1083    0.0                                              │                                                                         │
│ │               1084    0.0                                              │                                                                         │
│ │               1085    0.0                                              │                                                                         │
│ │               Name: branch_time_in_child, Length: 1086, dtype: float64 │                                                                         │
│ ╰────────────────────────────────────────────────────────────────────────╯                                                                         │
│                                                                                                                                                    │
│ /home/users/mwklai/workspace/climate-ref/climate-ref/.venv/lib/python3.10/site-packages/pandas/core/tools/numeric.py:232 in to_numeric             │
│                                                                                                                                                    │
│   229 │   │   values = ensure_object(values)                                                                                                       │
│   230 │   │   coerce_numeric = errors not in ("ignore", "raise")                                                                                   │
│   231 │   │   try:                                                                                                                                 │
│ ❱ 232 │   │   │   values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]                                                     │
│   233 │   │   │   │   values,                                                                                                                      │
│   234 │   │   │   │   set(),                                                                                                                       │
│   235 │   │   │   │   coerce_numeric=coerce_numeric,                                                                                               │
│                                                                                                                                                    │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮                                                                       │
│ │            arg = 0       0.0                                             │                                                                       │
│ │                  1       0.0                                             │                                                                       │
│ │                  2       nan                                             │                                                                       │
│ │                  3       nan                                             │                                                                       │
│ │                  4       nan                                             │                                                                       │
│ │                  │      ...                                              │                                                                       │
│ │                  1081    0.0                                             │                                                                       │
│ │                  1082    0.0                                             │                                                                       │
│ │                  1083    0.0                                             │                                                                       │
│ │                  1084    0.0                                             │                                                                       │
│ │                  1085    0.0                                             │                                                                       │
│ │                  Name: branch_time_in_child, Length: 1086, dtype: object │                                                                       │
│ │ coerce_numeric = False                                                   │                                                                       │
│ │       downcast = None                                                    │                                                                       │
│ │  dtype_backend = <no_default>                                            │                                                                       │
│ │         errors = 'raise'                                                 │                                                                       │
│ │       is_index = False                                                   │                                                                       │
│ │     is_scalars = False                                                   │                                                                       │
│ │      is_series = True                                                    │                                                                       │
│ │           mask = None                                                    │                                                                       │
│ │       new_mask = None                                                    │                                                                       │
│ │    orig_values = array(['0.0', '0.0', 'nan', ..., '0.0', '0.0', '0.0'],  │                                                                       │
│ │                  │     shape=(1086,), dtype=object)                      │                                                                       │
│ │         values = array(['0.0', '0.0', 'nan', ..., '0.0', '0.0', '0.0'],  │                                                                       │
│ │                  │     shape=(1086,), dtype=object)                      │                                                                       │
│ │   values_dtype = dtype('O')                                              │                                                                       │
│ ╰──────────────────────────────────────────────────────────────────────────╯                                                                       │
│                                                                                                                                                    │
│ in pandas._libs.lib.maybe_convert_numeric:2433                                                                                                     │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Unable to parse string "nan" at position 2
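
The locals above show the `branch_time_in_child` column holding the string `'nan'` next to `'0.0'`, converted with `errors='raise'`. A minimal sketch of one possible workaround follows, assuming the conversion goes through `pandas.to_numeric` (as the traceback suggests) and that treating unparseable entries as missing values is acceptable. The sample values are illustrative and not taken from the archive.

```python
import pandas as pd

# Illustrative values mimicking the column in the traceback (not real archive data).
branch_time = pd.Series(
    ["0.0", "0.0", "nan", "0.0"], name="branch_time_in_child", dtype=object
)

# errors="coerce" maps anything that cannot be parsed as a number to NaN
# instead of raising ValueError.
numeric = pd.to_numeric(branch_time, errors="coerce")
```

Whether silently coercing to NaN is the right behaviour for the REF is a design decision; it hides malformed metadata rather than reporting it.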

System

On JASMIN

Additional context


mikapfl · 2025-03-14T09:40:24Z

Hi,

I played around a little with ingesting lots of files on JASMIN. (I updated the CMIP6 reading code to avoid first walking the entire tree and to add logging; these changes should hit main soonish.) From a quick back-of-the-envelope calculation and some testing on the sci-vm nodes on JASMIN, it looks like naively ingesting all of CMIP6 into the REF database would take ~70 years. So this is not a very useful thing to try.

I think we really need to use intake-esgf or some other service which already has an index of CMIP6 data to first narrow down what to read, and then only read in data which can, in principle, be of interest to the REF (as discussed with @nocollier). Or, if there is no other index to use (e.g. when using the REF within a modelling centre with fresh-out-of-the-pipeline results that aren't on ESGF yet), some bespoke scripts need to be written to pre-filter the netCDF files which are ingested into the REF (and systems with much better I/O characteristics than the JASMIN sci VMs will probably be needed).
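
A rough sketch of what such a pre-filtering script could look like, using only the Python standard library. It is not the change made to the REF reading code; `iter_netcdf` and `keep_dir` are hypothetical names chosen for illustration. The generator yields files as they are discovered instead of walking the whole tree first, and it prunes unwanted directories so they are never descended into.

```python
import os


def iter_netcdf(root, keep_dir=lambda dirpath, name: True):
    """Yield .nc file paths lazily, skipping subdirectories rejected by keep_dir."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk (top-down mode) not to
        # descend into the directories that were removed.
        dirnames[:] = sorted(d for d in dirnames if keep_dir(dirpath, d))
        for name in sorted(filenames):
            if name.endswith(".nc"):
                yield os.path.join(dirpath, name)
```

Because this is a generator, ingestion can start on the first file while the rest of the tree is still unexplored, and pruning avoids listing directories that are of no interest.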

Cheers

Mika

lewisjared · 2025-03-14T12:03:56Z

Yikes. Perhaps we can use the DRS in the path to filter. We only want a selected set of tables, so that could greatly reduce the scope.
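
A minimal sketch of that idea, assuming the archive follows the CMIP6 DRS directory template (`activity_id/institution_id/source_id/experiment_id/member_id/table_id/variable_id/grid_label/version`) below the root. The function names and the `WANTED_TABLES` selection are hypothetical; the set of tables the REF actually needs is not listed in this thread.

```python
from pathlib import PurePosixPath

# Directory levels below the CMIP6 root, following the CMIP6 DRS directory template.
DRS_LEVELS = (
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)

# Hypothetical selection, for illustration only.
WANTED_TABLES = {"Amon", "Omon"}


def parse_drs(path, root):
    """Map each DRS level to the matching directory name of path below root."""
    parts = PurePosixPath(path).relative_to(root).parts
    # zip stops at the shorter input, so a trailing file name is ignored and
    # shallow paths simply yield fewer keys.
    return dict(zip(DRS_LEVELS, parts))


def is_wanted(path, root, tables=WANTED_TABLES):
    """True if the path is deep enough to name a table and that table is selected."""
    return parse_drs(path, root).get("table_id") in tables
```

For example, with root `/badc/cmip6/data/CMIP6`, an illustrative path ending in `.../historical/r1i1p1f2/Amon/tas/gn/v20190406/` would be kept, while the same path with `day` in place of `Amon` would be skipped without opening any file.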


mwklai added the bug Something isn't working label Mar 13, 2025
