ProjectIDs on individual records, rather than a dataset as a whole #836

Closed
ahahn-gbif opened this issue Nov 2, 2022 · 12 comments

@ahahn-gbif

Idea/wish captured from feedback of the regional support contractors (BID) to GBIFS:

"It is being defined with SiB Colombia how to identify in each record of a dataset its link with the BID project, within the framework of the publication of data from partner organizations/collections in the Colombian BID-CA2020 projects. The use of DwC fields such as datasetID or datasetName has been proposed by the Regional Support, but in some cases that could create conflict when the field was filled with previous data. GBIF is encouraged in building its new data model to look for a more effective mechanism to accomplish this and clarify it for the BID projects (and project partners)."

There are two main reasons for this request:

  • being able to adequately report what has been delivered in the context of a project where records are added or amended in an already pre-existing dataset (including, but not limited to, eBird, iNaturalist), and
  • being able to show such records within the project context, without either having to omit the dataset completely (as above), or alternatively overstating the dataset’s contribution by co-reporting all already existing records

Unfortunately, this is not easy: individual records would have to carry the project ID right from the point they are captured at record level, and our transfer schema does not really allow for that. We are presently getting around the delivery-reporting requirement, e.g. in cases where records are published through eBird or iNaturalist, by requesting an explicit report on the data published in the project context. This is only for internal evaluation, though. The second part is not easily possible, since there is no “project ID” field at record level.

Open question: do the benefits outweigh the added requirements, including internal data management and UI needs for surfacing this information?

@timrobertson100
Member

Adding a multivalue gbif:projectID* field to the records that is supported in the IPT, ingestion, search and download on GBIF.org is not particularly difficult. We have done this before, for example for recordedByID before DwC accepted the term.

I'd suggest we support any project ID but give clear guidance on how people should refer to GBIF-issued project IDs (e.g. gbif:projectID=gbif:BID-PA2020-010-REG), so that we can clearly link them using a search such as /search?projectID=gbif:BID-PA2020-010-REG. It may be that we don't need to prefix them if we are confident they are likely to be globally unique.

Would that be desirable? If so, we should move this request into gbif/pipelines.
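
As an illustration of the prefixing idea above, here is a minimal sketch in Python. The helper names `qualify_project_id` and `search_path` and the `GBIF_PREFIX` constant are hypothetical; only the `gbif:BID-PA2020-010-REG` value and the `/search?projectID=...` path come from this thread, and none of this is the actual GBIF implementation.

```python
from urllib.parse import urlencode

# Assumption: GBIF-issued project IDs carry a "gbif:" prefix, as in the
# example value "gbif:BID-PA2020-010-REG" given in this thread.
GBIF_PREFIX = "gbif:"


def qualify_project_id(project_id: str, gbif_issued: bool) -> str:
    """Prefix GBIF-issued IDs; leave any other project ID untouched."""
    project_id = project_id.strip()
    if gbif_issued and not project_id.startswith(GBIF_PREFIX):
        return GBIF_PREFIX + project_id
    return project_id


def search_path(project_id: str) -> str:
    """Build the relative search path used as the example in this thread."""
    # Keep ":" unescaped so the path reads like the example above.
    return "/search?" + urlencode({"projectID": project_id}, safe=":")


if __name__ == "__main__":
    pid = qualify_project_id("BID-PA2020-010-REG", gbif_issued=True)
    print(search_path(pid))
```

Whether the prefix is needed at all depends on the open point above: if GBIF-issued IDs are already globally unique, the qualification step could simply be dropped.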

* Note: gbif: here indicates the namespace of the term, not that it is a GBIF-issued ID.

@ahahn-gbif
Author

I would think it desirable, thanks - and agree about a prescribed syntax for the record level.

Some follow-up considerations, just off the top of my head:

  • for a dataset that has a ProjectID at metadata level but none at record level - would we need to auto-populate all records from the metadata, or does that not make sense?
  • for a GBIF (BID, BIFA, CESP) project page, record level ProjectID filters would need to be included alongside dataset level ones to document the project's data contribution (the use case that started this request in the first place)
  • inclusion in documentation/training materials needed
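
To make the second consideration concrete, the sketch below shows how a project page might collect records attributable to a project at either level. It is only an illustration: the function name, the dictionary keys (`id`, `datasetKey`, `projectIds`) and the sample values are made up for this example and do not reflect the real GBIF data model.

```python
def project_contribution(records, dataset_project_ids, project_id):
    """Return the IDs of records linked to project_id at record or dataset level.

    records: list of dicts with hypothetical keys "id", "datasetKey", "projectIds".
    dataset_project_ids: dict mapping a dataset key to its metadata-level project IDs.
    """
    matched = []
    for rec in records:
        at_record_level = project_id in rec.get("projectIds", [])
        at_dataset_level = project_id in dataset_project_ids.get(rec["datasetKey"], [])
        if at_record_level or at_dataset_level:
            matched.append(rec["id"])
    return matched


if __name__ == "__main__":
    # Made-up sample: one large pre-existing dataset, one project-only dataset.
    records = [
        {"id": 1, "datasetKey": "big-aggregator", "projectIds": ["gbif:BID-PA2020-010-REG"]},
        {"id": 2, "datasetKey": "big-aggregator", "projectIds": []},
        {"id": 3, "datasetKey": "project-dataset"},
    ]
    dataset_project_ids = {"project-dataset": ["gbif:BID-PA2020-010-REG"]}
    print(project_contribution(records, dataset_project_ids, "gbif:BID-PA2020-010-REG"))
```

In this sample only records 1 and 3 are counted, so the pre-existing record 2 in the large dataset does not inflate the project's contribution.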

@ManonGros

@dagendresen FYI

@timrobertson100
Member

Moving this into pipelines then.

@timrobertson100 transferred this issue from gbif/portal-feedback Dec 8, 2022

@camiplata

Great @ahahn-gbif, we are highly interested in adopting this solution, as with the BID project we had to create many metadata-only datasets to fulfill the BID reporting needs.

@marcos-lg
Contributor

I was exploring what we need to do on the development side in pipelines.

We already have a projectId term in the GBIF namespace and we populate it with the dataset projectId. So we have to do the following:

  • Make the projectId a multivalue field
  • Populate it with the projectId of the record if it exists. Otherwise we take the projectId from the dataset. If both exist and they are different, we take both values.
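
The population rule above could be sketched as follows. This is a hedged illustration in Python, not the pipelines code: the function name `resolve_project_ids` is hypothetical, and treating the record-level value as a list is an assumption based on the multivalue proposal in this thread.

```python
def resolve_project_ids(record_ids, dataset_id):
    """Combine record-level and dataset-level project IDs into one multivalue list.

    - record IDs present, no dataset ID  -> record IDs
    - no record IDs, dataset ID present  -> dataset ID
    - both present and different         -> both, record-level values first
    - both present and identical         -> a single value
    """
    candidates = list(record_ids or [])
    if dataset_id:
        candidates.append(dataset_id)

    resolved = []
    for pid in candidates:
        pid = pid.strip()
        # Skip blanks and avoid repeating an ID that is already in the list.
        if pid and pid not in resolved:
            resolved.append(pid)
    return resolved


if __name__ == "__main__":
    print(resolve_project_ids(["gbif:BID-PA2020-010-REG"], "another-project"))
```

Keeping the record-level values first is an arbitrary choice made for this sketch; the thread does not specify an ordering.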

Then we need to adapt the IPT, search, downloads, portal, etc. and the field will be used as the other multivalue fields that we already have.

Is there anything that we are missing or has to be done differently?

@camiplata

@marcos-lg Does this mean that the projectID from the metadata will also become a multivalue field? That would also be useful, as collections or monitoring programs will have multiple funding sources across the years.

@marcos-lg
Contributor

@camiplata we can make the projectID from the metadata multivalue too, but it has more implications, so we need to plan it more carefully. I created an issue in the IPT so we can track it: gbif/ipt#1927

@MBLaursen

I concur that allowing a projectID to be added to individual occurrences would be very useful to monitor/acknowledge the contribution of various projects to bigger datasets. Right now, projects can only refer to metadata-only datasets, which is not really representative of their data mobilization work.

@timhirsch

timhirsch commented Aug 21, 2023

While this may be over-interpreting the current suggestion, I can see this approach being very useful in a number of contexts for GBIF, e.g.

  • as mentioned by @MBLaursen, a means of demonstrating a project's contribution to very large existing datasets, e.g. for the African Bird Atlas project, where a lot of disambiguation was required to avoid over-counting of mobilized records
  • (possibly) a means of tagging records contributed indirectly to GBIF by means of another aggregator such as eBird, e.g. in the case of an early BIFA project in India where we were not able to reflect the huge mobilization effort made by the national eBird partner Bird Count India. Probably several steps down the road, but would record-level IDs make this kind of attribution within large datasets more feasible? Also potentially for iNat platforms, projects?

@ymgan

ymgan commented Aug 21, 2023

@debpaul this reminds me of your question at tdwg/dwc-qa#199

@marcos-lg
Contributor

marcos-lg commented Aug 31, 2023

Deployed to PROD.
