Skip to content
This repository was archived by the owner on Sep 1, 2022. It is now read-only.

TDS not properly updating GRIB collections with directory time partitions #857

Closed
lesserwhirls opened this issue Jun 9, 2017 · 16 comments
Closed

Comments

@lesserwhirls
Copy link
Collaborator

Reported in e-support ticket VHE-971307.

For a grib collection that updates using the TDM, the TDS will not update the catalog listing for directory time partitions. The TDM creates new indices for new files, and successfully sends the TDS an update trigger. The TDS receives the trigger and goes through the motions of updating the collection, but the catalog does not reflect the new data. If the TDS is restarted, the catalog reflects the new additions. Therefore, I believe the problem is isolated to the collection update code in the TDS and not the TDM.

Likely the same issue as report by #246

@lesserwhirls
Copy link
Collaborator Author

Some notes:

CollectionUpdateListener <- InvDatsetFeatureCollection <- InvDatasetFcGrib

sendEvent calls update, both of which are in InvDatasetFeatureCollection.

in update, thread is locked, state checked, copy made

Then, updateCollection called (overridden method in InvDatasetFcGrib), makeDatasetTope (also overridden)

thread locked, state updated.

I think updateCollection is doing the right thing, as the indexes created/updated read in just fine once the TDS is restarted. The issue seems to be that the TDS catalog does not update to reflect new index files.

makeDatasetTop calls makeDatasetFromCollection, which seems to have the object used to make catalogs (InvDatasetImpl result;). It is inside makeDatasetFromCollection that the magic is done, likely down in makeDatasetsFromGroups.

@pmspire
Copy link

pmspire commented Dec 5, 2017

Looks like this one didn't make the cut for v4.6.11. Any hope of a fix in the next release?

@lesserwhirls
Copy link
Collaborator Author

Hi @pmspire - I've been digging around with this quite a bit, and while I have not found a singular cause, I have managed to make things work on my end (only through config changes). I had to do two things:

  1. include the <update> element in the catalog, as well a dateFormatMark attribute to extract the date and run time from the path, like this:
<featureCollection featureType="GRIB2"
                   harvest="true"
                   name="Test Paul Setup"
                   path="grib/tps">

    <collection name="tps"
                spec="/Users/sarms/dev/unidata/content/data/grib_collection_46/gfs_0p25/**/.*\.f[0-9][0-9][0-9]$"
                dateFormatMark="yyyyMMdd'/'HH#/gfs\.#"
                timePartition="directory"
                olderThan="15 seconds" />
    <update startup="never" trigger="allow" />
    <tdm rewrite="test" rescan="10 * * * * ? *" />
</featureCollection>

I think I had said to remove the update element, and while the values for startup and trigger should be the default, I don't think it's being picked up correct for directory partitions (still investigating this). So, to be safe, go ahead and include <update...>.

  1. Make sure the spec regular expression was only picking up the grib files, not the grib files plus any index files that may already exists (say, when new files are added and an update is trying to happen).

My directory structure looks like this:

20170527/00:
gfs.t00z.pgrb2.0p25.f000
gfs.t00z.pgrb2.0p25.f003      

20170527/06:
gfs.t06z.pgrb2.0p25.f000
gfs.t06z.pgrb2.0p25.f003      

20170527/12:
gfs.t12z.pgrb2.0p25.f000

20170528/00:
gfs.t00z.pgrb2.0p25.f000
gfs.t00z.pgrb2.0p25.f003

Originally, I used a spec of:

spec="/Users/sarms/dev/unidata/content/data/grib_collection_46/gfs_0p25/**/.*pgrb2.*"

which is what you had in the example you sent me via eSupport (or close to it), but .*pgrib2.* ended up picking up the index files created on startup, which caused things to fail during the update.

@pmspire
Copy link

pmspire commented Jan 26, 2018

Hi Sean. Thank you for following up on this. I can confirm that, after implementing the configuration changes you suggested, our TDS is properly updating after receiving a TDM trigger. This is a big help to us, and we're grateful for your help.

@lesserwhirls
Copy link
Collaborator Author

I apologize that I took so long debugging when it was just a config issue 👎 I'm glad it is up and running for you now!

@pmspire
Copy link

pmspire commented Jan 26, 2018

Since none of us figured out before now that it was a matter of configuration, I suppose you might be able to make a case that there ought to be better error reporting somewhere, since all the signs we got were that things should be working. But I'll leave that to you. Thanks again.

@pmspire
Copy link

pmspire commented Jan 27, 2018

Sean, I'm sorry, I think I spoke too soon. I'm still having trouble. It had been a while since I looked at this, and so had forgotten the actual bad behavior: The first time I request the catalog (I'm using the web interface), I do see all the indexed files. However, when more files become available, and TDM indexes them and sends a trigger to TDS, a new catalog request does not show any change. I still have to reload Tomcat to see the newly indexed files. I'd forgotten that that was the failure mode.

Now, I'm not quite sure I'm doing the right thing with dateFormatMark. Your example doesn't seem to jive with what little documentation I've been able to find so far about it (https://www.unidata.ucar.edu/software/thredds/v4.6/netcdf-java/ncml/Aggregation.html):

The dateFormatMark attribute is used on joinNew aggregation, as well as joinExisting if there is only one time slice in each file, to create date coordinate values out of the filename. It consists of a section of text, a '#' marking character, then a java.text.SimpleDateFormat string. The number of characters before the # is skipped in the filename, then the next part of the filename must match the SimpleDateFormat string, then it ignores any trailing text.

Does it really work spanning path elements? Also, the bit about "The number of characters before the # is skipped" seems problematic: In your example, "yyyyMMdd'/'HH#/gfs\.#", all the important info is before the (first) #. And I didn't see any examples with two #s. Am I looking at the wrong docs?

For example, here's a block from my catalog.xml, where I've attempted to incorporate your suggestions:

  <dataset name="Community Modeling System (CMS)"
           serviceName="all">
    <featureCollection featureType="GRIB2"
                       harvest="true"
                       name="CMS"
                       path="grib/cms">
      <metadata inherited="true">
        <serviceName>all</serviceName>
        <dataFormat>GRIB-2</dataFormat>
      </metadata>
      <collection dateFormatMark="yyyyMMdd'/'HH#/cms\.#"
                  olderThan="15 seconds"
                  name="cms"
                  spec="/global/spire-cache/cms/**/.*\.f[0-9]{3}$"
                  timePartition="directory" />
      <update startup="never" trigger="allow" />
      <tdm rescan="0 0/5 * * * ? *"
           rewrite="test" />
    </featureCollection>
  </dataset>

(I used the regex form [0-9]{3} in place of your [0-9][0-9][0-9] but, presumably, if that were an issue TDM wouldn't be finding the files at all.)

This is meant to match files like

/global/spire-cache/cms/20180127/00/cms.t00z.pgrb2.0p25.f000

But the need to restart Tomcat persists. Can you see a problem with the above vs your recommended changes? Apologies if I missed something.

Can you verify that, for you, TDS actually serves an updated catalog if you have already requested one but new data have subsequently become available and been indexed?

@pmspire
Copy link

pmspire commented Jan 27, 2018

I'll take a stab at interpreting your dateFormatMark: The first # says to ignore the '/' literal text, and the second one says to ignore the /cms. and everything after, so that just the yyyyMMddHH part is left to be interpreted as a datetime, but all of the text is there to get a unique match on the path. Warm?

Also, just to refresh my memory and be sure, I ssh'ed to the system running Tomcat and queried the URL http://localhost:8080/thredds/catalog/grib/cms/cms-20180127/cms-00/catalog.html with wget, just to be sure nginx, browser caching, etc. are not in play. The catalog is never updated after the first request, even though TDM has indexed more files, etc.

I think my last question above is the crucial one: If you can have some files indexed, then request the catalog, then have some more files indexed and actually receive an updated catalog, then there's still hope in a configuration-based fix and I just have to figure out what I'm doing wrong.

@pmspire
Copy link

pmspire commented Jan 27, 2018

A colleague pointed me to better docs on dateFormatMark here: https://www.unidata.ucar.edu/software/thredds/v5.0/tds/reference/collections/FeatureCollections.html#collection. These docs include an example with two #s in the format string, and make it clear that it applies to the full path, not just the filename. The URL says v5.0, but hopefully this applies to v4.6, too.

I'm still testing but haven't yet been able to overcome the TDS update problem.

@pmspire
Copy link

pmspire commented Jan 27, 2018

I just did an experiment using timePartition="file", with this catalog.xml block:

  <dataset name="FOO"
           serviceName="all">
    <featureCollection featureType="GRIB2"
                       name="FOO"
                       path="grib/foo">
      <metadata inherited="true">
        <serviceName>all</serviceName>
        <dataFormat>GRIB-2</dataFormat>
      </metadata>
      <collection name="foo"
                  spec="/global/spire-cache/foo/foo.*$"
                  timePartition="file" />
      <update startup="never" trigger="allow" />
      <tdm rescan="0 0/1 * * * ? *"
           rewrite="test" />
    </featureCollection>
  </dataset>

I cated the first 10 GRIB files from a collection into the file /global/spire-cache/foo/foo-20180127-00.grib, let TDM index it, confirmed that it was available in TDS:

TimeCoverage:
Start: 2018-01-27T00:00:00Z
End: 2018-01-27T09:00:00Z

Then I cated the next ten files, let TDM index it:

2018-01-27T10:52:00.093 -0600 INFO  - updateGribCollection foo changed true took 92 msecs
2018-01-27T10:52:00.101 -0600 INFO  - send trigger to https://localhost:8443/thredds/admin/collection/trigger?trigger=never&collection=foo status = 200

But, querying TDS again, the TimeCoverage did not change. When I restarted Tomcat, though:

TimeCoverage:
Start: 2018-01-27T00:00:00Z
End: 2018-01-27T19:00:00Z

which is correct. It seems to me that this isn't related to the directory-partitioning scheme.

FWIW, the entry in thredds/logs/fc.foo.log apparently corresponding to the trigger event by TDM shown above says

2018-01-27T10:52:00.101 -0600 INFO  - updateGribCollection foo changed false took 0 msecs

It seems like TDS is rejecting the update, or there's something wrong with the update. I tried a number of manual tests using curl, using the various update values in the "update element" section here:

# curl -k -u tdm:$password 'https://localhost:8443/thredds/admin/collection/trigger?collection=foo&trigger=never'
<h3>Collection foo</h3>
 TRIGGER SENT
# curl -k -u tdm:$password 'https://localhost:8443/thredds/admin/collection/trigger?collection=foo&trigger=nocheck'
<h3>Collection foo</h3>
 TRIGGER SENT
# curl -k -u tdm:$password 'https://localhost:8443/thredds/admin/collection/trigger?collection=foo&trigger=testIndexOnly'
<h3>Collection foo</h3>
 TRIGGER SENT
# curl -k -u tdm:$password 'https://localhost:8443/thredds/admin/collection/trigger?collection=foo&trigger=test'
<h3>Collection foo</h3>
 TRIGGER SENT
# curl -k -u tdm:$password 'https://localhost:8443/thredds/admin/collection/trigger?collection=foo&trigger=always'
<h3>Collection foo</h3>
 TRIGGER SENT

with the corresponding log entries in thredds/logs/fc.foo.log:

2018-01-27T11:41:34.062 -0600 INFO  - updateGribCollection foo changed false took 0 msecs
2018-01-27T11:41:41.107 -0600 INFO  - updateGribCollection foo changed false took 0 msecs
2018-01-27T11:41:53.492 -0600 INFO  - updateGribCollection foo changed false took 1 msecs
2018-01-27T11:41:59.768 -0600 INFO  - updateGribCollection foo changed false took 2 msecs
2018-01-27T11:42:15.192 -0600 INFO  - RewriteFilePartition foo took 82 msecs
2018-01-27T11:42:15.193 -0600 INFO  - updateGribCollection foo changed true took 83 msecs

That last one looked promising but, alas, the reported TimeCoverage shows no change. But after restarting Tomcat, I see the correctly updated TimeCoverage.

@lesserwhirls lesserwhirls reopened this Jan 28, 2018
@lesserwhirls
Copy link
Collaborator Author

HI @pmspire - it looks like it does the right thing, but only sometimes. If I add a new date/time directory with a new file, it shows up when I refresh the catalog. If I add a new file to an existing directory, it does not show up unless I restart the TDS.

I've checked out the index files, and everything is being added to those index files correctly, so all of the information about the new files is there. However, it does not look like the TDS recognizes the updated directory index file, and new files added to an existing directory are not exposed. We're getting close! Thanks for working with me on this!

@pmspire
Copy link

pmspire commented Jan 28, 2018

Thanks -- I'm ready to do any testing that would be helpful!

lesserwhirls added a commit to lesserwhirls/thredds that referenced this issue Jan 30, 2018
@lesserwhirls
Copy link
Collaborator Author

So the issue was that the grib collection index files were being cached when users visited a catalog, but that cache was not getting updated when the index files changed, so old collection objects stored in the cache were being used to construct the catalogs. I've made a PR that clears out the cache when an update has been detected, and this fixes the problem. I'll let you know when there is an artifact ready to test on your end. Although the fix is pretty straight forward, finding this issue was a beast.

@pmspire
Copy link

pmspire commented Jan 30, 2018

Nice work, thank you!

@pmspire
Copy link

pmspire commented Mar 2, 2018

Hi Sean -- I saw that the PR was merged. I would be happy to tests an artifact when one is available, as you suggested. I'm still subscribed to this ticket, so if you post something here, I can start testing then. Thanks again.

@lesserwhirls
Copy link
Collaborator Author

Hi @pmspire! This snapshot should have the fix. Just be sure to rename it to thredds##4.6.12-SNAPSHOT-20180301.war or thredds.war before you deploy.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants