
Pipelines data validator integration #1635

Open
fmendezh opened this issue Aug 24, 2021 · 1 comment

@fmendezh
Contributor

Integrating the IPT with the Data Validator can help publishers improve their data before publishing it to GBIF. The Data Validator exposes an API consistent with the running data ingestion platform, providing the services needed to validate Occurrence, Checklist, and metadata-only datasets.

Basic functionality

  1. Once a dataset/resource contains the desired metadata and its data has been uploaded or mapped, the user may want to validate it before publishing it to GBIF.
  2. The IPT generates a DwC-A in a staging location accessible as an external URL and, through the Data Validator API, requests that it be validated (see the sketch after this list).
    • This can also be accomplished by using the Validator API to upload a file.
    • The authentication method must follow the procedures already implemented in the IPT.
  3. The Data Validator starts the validation process and returns the validation key for the requested archive, which is used to track its progress.
  4. Upon successful validation, the IPT should allow the user to publish the resource to GBIF.
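
A minimal sketch of how the IPT side of steps 2–3 could look, using Java's built-in HTTP client. The base URL, the `/url` endpoint, the `fileUrl` parameter, and the shape of the JSON response are assumptions for illustration only; the actual Data Validator API contract would need to be confirmed against the running ingestion platform.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: the "/url" endpoint, the "fileUrl" parameter and the
// response format are assumptions about the Data Validator API, not its documented contract.
public class ValidationRequestSketch {

  private static final String VALIDATOR_API = "https://api.gbif.org/v1/validation"; // assumed base URL

  private final HttpClient http = HttpClient.newHttpClient();

  /**
   * Asks the validator to fetch and validate a DwC-A the IPT has staged at an externally
   * reachable URL, reusing the credentials the IPT already holds for GBIF services.
   * Returns the raw JSON response, expected to contain the validation key.
   */
  public String requestValidation(String stagedArchiveUrl, String authorizationHeader) throws Exception {
    String encodedUrl = URLEncoder.encode(stagedArchiveUrl, StandardCharsets.UTF_8);
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(VALIDATOR_API + "/url?fileUrl=" + encodedUrl)) // hypothetical endpoint
        .header("Authorization", authorizationHeader)
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() / 100 != 2) {
      throw new IllegalStateException("Validation request failed: HTTP " + response.statusCode());
    }
    return response.body(); // in practice, parse out the validation key with a JSON library
  }
}
```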

Additional considerations

  1. The IPT must provide a way to track the validation progress of an individual resource (see the sketch after this list).
  2. Multiple validation requests for the same resource must be prevented by allowing only one running validation at a time per resource. The Data Validator already imposes a suggested maximum number of validations a single user can run in parallel.
  3. Once validation has finished, the IPT must delete all temporary files and elements created.
  4. The IPT shouldn't need to store any information other than the identifiers of the validations executed for each resource; a dedicated endpoint for IPT validations could also be considered to relieve the IPT of storing additional data.
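
A minimal sketch of how the IPT could track progress while enforcing a single running validation per resource, keeping only the validation key per resource as suggested in point 4. The status endpoint and the FINISHED/FAILED status values are assumptions, not the documented Data Validator API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: the GET /validation/{key} endpoint and the FINISHED/FAILED
// status values are assumptions about the Data Validator API.
public class ValidationTrackerSketch {

  private static final String VALIDATOR_API = "https://api.gbif.org/v1/validation"; // assumed base URL

  private final HttpClient http = HttpClient.newHttpClient();

  // The only state kept per resource: the key of its currently running validation.
  private final Map<String, String> runningValidations = new ConcurrentHashMap<>();

  /** Registers a new validation, rejecting it if one is already running for the resource. */
  public void register(String resourceShortname, String validationKey) {
    String previous = runningValidations.putIfAbsent(resourceShortname, validationKey);
    if (previous != null) {
      throw new IllegalStateException("A validation is already running for " + resourceShortname);
    }
  }

  /** Polls the validator for the current status of a resource's running validation. */
  public String pollStatus(String resourceShortname) throws Exception {
    String key = runningValidations.get(resourceShortname);
    if (key == null) {
      return "NO_VALIDATION_RUNNING";
    }
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(VALIDATOR_API + "/" + key)) // hypothetical status endpoint
        .GET()
        .build();
    String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    // Assumed terminal states: once reached, temporary files can be cleaned up and,
    // on success, publishing to GBIF can be enabled for the resource.
    if (body.contains("FINISHED") || body.contains("FAILED")) {
      runningValidations.remove(resourceShortname);
    }
    return body;
  }
}
```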
@spalp
Contributor

spalp commented Aug 30, 2024

Wow, thanks to @ckotwn, I just became aware of this incredibly useful feature. Cannot wait to see it in production.
Meanwhile, I added a step for publishers in the documentation suggesting that they manually check their data using the IPT. Here's the commit: master...spalp:ipt:patch-2. I hope it makes sense.

@mike-podolskiy90 mike-podolskiy90 added this to the 3.1.x milestone Sep 3, 2024
@mike-podolskiy90 mike-podolskiy90 modified the milestones: 3.1.x, 3.2 Oct 11, 2024