Failure when ingesting the whole of CMIP6 on JASMIN #175
Comments
Hi,

I played around a little with ingesting lots of files on JASMIN (I updated the CMIP6 reading code to avoid first walking the entire tree and to add logging; these changes should hit main soonish). From a quick back-of-the-envelope calculation and some testing on the sci-vm nodes on JASMIN, it looks like naively ingesting all of CMIP6 into the REF database would take ~70 years, so this is not a very useful thing to try.

I think we really need to use intake-esgf, or some other service which already has an index of CMIP6 data, to first narrow down what to read, and then only read in data which can, in principle, be of interest to the REF (as discussed with @nocollier). If there is no other index to use (e.g. when using the REF within a modelling centre with fresh-out-of-the-pipeline results that aren't on ESGF yet), some bespoke scripts need to be written to pre-filter the NetCDF files which are ingested into the REF (and systems with much better I/O characteristics than the JASMIN sci VMs will probably be needed).

Cheers
Mika
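For anyone following along, here is a rough sketch of what "not walking the entire tree first" can look like: files are yielded as they are discovered, with periodic progress logging, so ingestion can start immediately. This is not the actual REF code; the function name `iter_netcdf_files` and the `log_every` parameter are made up for illustration.

```python
import logging
import os
from collections.abc import Iterator

logger = logging.getLogger(__name__)


def iter_netcdf_files(root: str, log_every: int = 1000) -> Iterator[str]:
    """Lazily yield paths of .nc files under root (hypothetical helper).

    os.walk is itself a generator, so directories are only listed as the
    consumer asks for more files; nothing is collected up front.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".nc"):
                continue
            count += 1
            if count % log_every == 0:
                # Periodic progress report, so a long run is not silent.
                logger.info("found %d NetCDF files so far (in %s)", count, dirpath)
            yield os.path.join(dirpath, name)
```

A consumer can then loop over `iter_netcdf_files(root)` and process each file as it arrives, rather than waiting for a full listing of the archive.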
Yikes. Perhaps we can use the DRS in the path to filter. We only want a selected set of tables, so that could greatly reduce the scope.
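A minimal sketch of that idea, assuming the files sit in the standard CMIP6 DRS directory layout below the root (activity_id/institution_id/source_id/experiment_id/member_id/table_id/variable_id/grid_label/version/file). The helper names and the `WANTED_TABLES` selection are hypothetical; the real set of tables the REF needs would have to come from the metric providers.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Facet order of the CMIP6 DRS directory template below the "CMIP6" root.
DRS_FACETS = (
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)

# Hypothetical selection, for illustration only.
WANTED_TABLES = frozenset({"Amon", "Lmon", "Omon"})


def parse_drs(path: str, root: str) -> dict[str, str] | None:
    """Return the DRS facets encoded in path, or None if the layout does not match."""
    try:
        parts = PurePosixPath(path).relative_to(root).parts
    except ValueError:
        # path is not located under root
        return None
    # Expect nine facet directories followed by the file name.
    if len(parts) != len(DRS_FACETS) + 1:
        return None
    return dict(zip(DRS_FACETS, parts))


def is_wanted(path: str, root: str, tables: frozenset[str] = WANTED_TABLES) -> bool:
    """True if the path follows the DRS layout and its table_id is in tables."""
    facets = parse_drs(path, root)
    return facets is not None and facets["table_id"] in tables
```

Since the check only looks at the path string, a file can be rejected without being opened. Pruning whole directories at the table_id level during the walk would presumably save even more, but that is not shown here.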
(Sent in reply to Mika Pflüger's comment of Fri, 14 Mar 2025, 3:40 am.)
Describe the bug
When I tried to ingest the whole of CMIP6, it failed after ~20 minutes with the following error message.
Failing Test
Run the following command on JASMIN:
uv run ref -vv datasets ingest --source-type cmip6 /badc/cmip6/data/CMIP6/
Expected behavior
Expected all the metadata from the CMIP6 archive to be put into .ref/db.
Screenshots
Full log output
System
On JASMIN
Additional context