TG2 - understanding the status and process of developing tests and assertions #192

ymgan opened this issue Aug 27, 2021 · 4 comments

@ymgan (Collaborator) commented Aug 27, 2021

Hey @tucotuco

As mentioned, these are the questions from the OBIS data quality task team:

  • What is the process the BDQ TG2 went through to develop that spreadsheet of tests and assertions?
  • As far as I understood, the tests and assertions are finalized - does that mean they will no longer be updated?
  • If we encounter a test that was missed should we inform TG2?
  • Can you talk to us about if/how GBIF has integrated these tests into their processes?
  • What is the current status of the BDQ IG? Are there active tasks groups we could join?

Thank you so much!

@Tasilee (Collaborator) commented Aug 30, 2021

@tucotuco is likely far busier than me, and I reckon I can answer these questions.

  • The process that was followed was that I trawled the web looking for existing tests that were being used by agencies such as GBIF, the ALA, CRIA, iDigBio etc. These were compiled into a spreadsheet, and then we classified the tests in multiple ways, such as whether they dealt with NAME (of species etc.), SPACE (e.g., lat/long in the correct country), TIME (e.g., viable ISO dates/times) or OTHER (e.g., a valid license for use). We then refined the tests, filtering out those that we didn't consider CORE (=basic) and those that were hard to implement (as we wanted wide implementation), and continued to add to the classification that you now see on GitHub (e.g., Expected response). We filled the discovered gaps with new tests, and occasionally refined existing tests. A LOT of work has gone into those that remain. Once the tests were finalized, we started on test data. All the details can be found in the paper https://biss.pensoft.net/article/50889/. We have completed the test data for most of the 'tests' and now just need to finish a subset of the amendment 'tests'. (A rough illustrative sketch of what one of these validations looks like in code follows this list.)
  • The tests are finalized as far as TG2 is concerned, as we have been refining them for nearly 5 years now. That is not to say that the team (Arthur, John, Paul, Paula or I) won't find something we need to discuss, but it is very unlikely. Once we finalize the test data, the tests will be submitted as a TDWG standard. Note again, these are the CORE tests. We understand others may be added for domains such as marine. Whether they would become a new section of the 'standard' I can't say, but it would seem a good idea to have some QA/QC, plus the usual benefits of standards.
  • If you find a CORE test that is missing, then please inform me. It may be that we considered it and rejected it, for reasons which will (hopefully) be documented. If it is a genuine GOTCHA, then we would always be open to additions prior to submission.
  • GBIF integration: No, I can't answer that one myself, but @timrobertson100 may be able to tell you what, if anything, is happening. I am aware that the ALA had given a commitment to test integration, and seeing that the 'back-ends' of the main databases are now 'aligned', this would seem an easier task.
  • DQ IG: Probably need to have @ArthurChapman or @saraiva-usp fill you in on this, but the IG will always be open for new members and either of those two could let you know the status of the TGs. I would certainly expect Paula's TG4 is always open for help. We in TG2 would always welcome help on finalizing the test data. It would be fair to say that after years of effort, Arthur, John, Paul, Paula and I are a tad 'burnt out'. New blood would be nice.
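
To make the shape of these 'tests' concrete, here is a rough, non-normative sketch of a TIME-category validation in Python. It is only an illustration of the idea: the eventDate term, the tiny subset of ISO 8601 accepted, and the response values (COMPLIANT, NOT_COMPLIANT, INTERNAL_PREREQUISITES_NOT_MET) are all simplifications of what the TG2 test descriptions actually specify.

```python
# Illustrative sketch only: a TIME-category check in the spirit of the TG2
# validations, not the normative wording or structure of any test in the suite.
# The returned strings loosely mirror the response vocabulary used in the
# test descriptions; the real Expected Response is defined by TG2, not here.
from datetime import datetime


def validate_eventdate_standard(record: dict) -> str:
    """Report whether dwc:eventDate parses as a simple ISO 8601 date.

    Only a small subset of ISO 8601 is accepted here; the real test
    specification covers far more cases (ranges, times, etc.).
    """
    value = (record.get("eventDate") or "").strip()
    if not value:
        return "INTERNAL_PREREQUISITES_NOT_MET"  # nothing to evaluate
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            datetime.strptime(value, fmt)
            return "COMPLIANT"
        except ValueError:
            continue
    return "NOT_COMPLIANT"


if __name__ == "__main__":
    print(validate_eventdate_standard({"eventDate": "2021-08-27"}))  # COMPLIANT
    print(validate_eventdate_standard({"eventDate": "27/08/2021"}))  # NOT_COMPLIANT
    print(validate_eventdate_standard({"eventDate": ""}))            # INTERNAL_PREREQUISITES_NOT_MET
```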

@tucotuco, @ArthurChapman, @chicoreus and @pzermoglio can comment further.

@timrobertson100 (Member) commented Aug 30, 2021

Can you talk to us about if/how GBIF has integrated these tests into their processes?

Thanks, @ymgan. I think the GBIF processing covers what's behind these tests for the most part, flagging records accordingly, but it isn't a strict implementation. There may be some slight differences in the rules, likely arising from the fact that GBIF deals with data in a variety of formats and from the need for long-term API stability. The GBIF validations and enrichments are done in the gbif/pipelines project, which powers GBIF and ALA ingestion, and in the GBIF validator, which will shortly be integrated into the GBIF IPT.
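
For anyone who wants to see those flags in practice, below is a hedged sketch of pulling a few GBIF-interpreted records that carry a particular interpretation flag through the public occurrence search API. The `issue` parameter and the COUNTRY_COORDINATE_MISMATCH flag name reflect my understanding of the current API rather than anything stated above, so check the live documentation before relying on them.

```python
# Hedged sketch: fetch a few GBIF-interpreted occurrence records that carry a
# particular interpretation flag via the public occurrence search API.
# The `issue` filter and the COUNTRY_COORDINATE_MISMATCH flag name are my
# understanding of the current API; verify against the live documentation.
import requests

resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"issue": "COUNTRY_COORDINATE_MISMATCH", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for occ in resp.json().get("results", []):
    # Print the record key alongside all interpretation flags it carries.
    print(occ.get("key"), occ.get("issues"))
```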

@tucotuco (Member) commented Aug 30, 2021 via email

@ymgan (Collaborator, Author) commented Oct 12, 2021

Thank you so much Lee, Tim and John!! I really appreciate it!

@pieterprovoost - This is the issue that I mentioned in our previous data QC task team meeting. Let's see if we can make use of the existing flags developed by the task group and GBIF.

I believe GBIF's flags for the data validator are here.
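
As a starting point for that discussion, here is a small sketch of counting flagged records in a GBIF occurrence download with pandas. The "issue" column name, the semicolon-separated flag format and the local file name are assumptions about the simple download format, not something confirmed in this thread.

```python
# Sketch under assumptions: a GBIF occurrence download (simple format) keeps
# the interpretation flags in an "issue" column as a semicolon-separated
# string. Both the column name and the file name below are assumptions to be
# checked against the actual download.
import pandas as pd

df = pd.read_csv("occurrence.txt", sep="\t", dtype=str)  # hypothetical local file
flag = "RECORDED_DATE_INVALID"  # one of GBIF's interpretation flags
flagged = df[df["issue"].fillna("").str.contains(flag, regex=False)]
print(f"{len(flagged)} of {len(df)} records carry {flag}")
```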
