[GSoC Project Proposal]: Combined DwCA/Croissant JSON-LD examples and tools #70

Open
csbrown-noaa opened this issue Jan 24, 2025 · 9 comments
Labels
GSoC25 project idea Designates a proposed project idea

Comments

@csbrown-noaa

csbrown-noaa commented Jan 24, 2025

Project Description

The Croissant ML Dataset specification and tooling [1] from MLCommons is a metadata and vocabulary standard for AI/ML datasets. This standard provides a uniform language for describing datasets used in common ML tasks, such as bounding-box image annotation. The Darwin Core (DwC) specification and tooling [2] from Biodiversity Information Standards (TDWG) is a metadata and vocabulary standard for ecological/biodiversity data. NOAA Fisheries has large quantities of data that span this divide: ecological images with concomitant annotations. As such, it would be convenient to have metadata and tooling that employ both of these standards simultaneously. The project would primarily consist of Python tools to convert existing DwC metadata (in XML format) into JSON-LD format, example metadata for datasets that employ both standards, and a proof-of-concept pipeline that uses a combined DwC/Croissant-described dataset to train an AI model that predicts DwC categories from images.

[1] https://docs.mlcommons.org/croissant/docs/croissant-spec.html
[2] https://dwc.tdwg.org/terms/

Expected Outcomes

  1. Skeleton libraries to convert Darwin Core Archive (DwCA) format XML metadata into JSON-LD.
  2. Example metadata file employing Croissant and DwC vocabularies to encode a NOAA Fisheries image annotation dataset.
  3. Proof-of-concept pipeline to ingest this example metadata file, and use it to feed the accompanying data through an off-the-shelf AI computer vision model training/test workflow.
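
As a rough illustration of outcome 1, here is a minimal sketch of reading a DwCA `meta.xml` descriptor and emitting a JSON-LD-shaped dictionary. The `http://rs.tdwg.org/dwc/text/` namespace and the `core`/`field`/`files`/`location` elements come from the Darwin Core text guide; everything else is an assumption made for illustration: the sample descriptor is a toy, and the `ex:` keys and the overall output shape are hypothetical, since no agreed JSON-LD form of `meta.xml` exists yet.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptor files (meta.xml).
DWCA_NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}
# Namespace of the Darwin Core terms themselves.
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"

# Toy descriptor for illustration only; not a real NOAA dataset.
SAMPLE_META_XML = """
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>
"""


def compact(iri):
    """Shorten a Darwin Core term IRI to a dwc: compact IRI when possible."""
    if iri.startswith(DWC_TERMS):
        return "dwc:" + iri[len(DWC_TERMS):]
    return iri


def meta_xml_to_jsonld(xml_text):
    """Convert the <core> section of a meta.xml string into a JSON-LD-shaped dict.

    The "ex:" keys are hypothetical placeholders, not part of any standard.
    """
    root = ET.fromstring(xml_text)
    core = root.find("dwca:core", DWCA_NS)
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    fields = [
        {"ex:index": int(f.get("index")), "ex:term": {"@id": compact(f.get("term"))}}
        for f in core.findall("dwca:field", DWCA_NS)
        # Fields that carry only a default value have no index; skipped in this sketch.
        if f.get("index") is not None
    ]
    return {
        "@context": {"dwc": DWC_TERMS, "ex": "https://example.org/dwca-sketch#"},
        "@type": compact(core.get("rowType")),
        "ex:location": core.findtext("dwca:files/dwca:location", namespaces=DWCA_NS),
        "ex:fields": fields,
    }


if __name__ == "__main__":
    print(json.dumps(meta_xml_to_jsonld(SAMPLE_META_XML), indent=2))
```

A fuller tool would also need to handle `<extension>` elements, default values, and the delimiter attributes, which this sketch ignores.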

Skills Required

Python, JSON, XML, interest in ecology

Additional Background/Issues

No response

Mentor(s)

scott.brown@noaa.gov

Expected Project Size

175 hours

Project Difficulty

Intermediate

csbrown-noaa added the "GSoC25 project idea" label (designates a proposed project idea) on Jan 24, 2025
@AryanPrakhar

AryanPrakhar commented Jan 24, 2025

Hey! I’d love to work on this project. I’ve built datasets and benchmarks during internships and worked on data-driven scientific research, including a publication at ICLR. I’m confident I can help with building a modular XML parser for metadata mapping and a vision pipeline.

I had a few questions:

  1. How will fields without direct Croissant equivalents be handled?
  2. What evaluation metrics will be used for metadata conversion?
  3. What kind of vision model are we planning to test with (e.g., ResNet, Vision Transformers)?

Looking forward to collaborating!

@csbrown-noaa
Author

@AryanPrakhar great questions.

  1. JSON-LD [1] is flexible enough that we can reference multiple vocabularies. In this case, we are interested in Croissant [2] and Darwin Core [3] vocabularies. Croissant is already JSON-LD native. Darwin Core is not, but the vocabulary reference could easily admit a JSON-LD format (just nobody has done it yet [4]). While it is the case that some of our datasets have additional variables (like depth, salinity, etc), we'll just leave those without metadata until the standards come up to date. Right now our datasets have little or no metadata, so getting just the annotations and the biodiversity data encoded would be 💯

  2. There are no metrics, because the expected output (Darwin Core + Croissant JSON-LD) does not exist. Creating this expected output would be part of the work in figuring out the XML<->JSON-LD translation, and I am happy to mentor/advise/assist w.r.t. the vocabulary here. Depending on the outcome, we should present our derived example (with or without the automated tooling) at TDWG 2025 [5] to illustrate 1) what a Darwin Core Archive [6] meta.xml should look like in JSON-LD and 2) how JSON-LD empowers interoperability with other standards already using JSON-LD (with Croissant as an example)

  3. This depends on what is easy and/or interesting to you. The Ultralytics YOLO pipeline is pretty straightforward, and I can help in this regard. If you have other suggestions, let's do it. 😄

[1] https://json-ld.org/
[2] https://docs.mlcommons.org/croissant/docs/croissant-spec.html
[3] https://dwc.tdwg.org/terms/
[4] tdwg/dwc#447
[5] https://www.tdwg.org/conferences/2024/
[6] https://ipt.gbif.org/manual/en/ipt/latest/dwca-guide
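
To make point 1 concrete, here is a small sketch of a single JSON-LD document that references both vocabularies through prefixes in its `@context`. The three namespace IRIs are the published ones for schema.org, Croissant, and Darwin Core. The dataset name, the field names, and in particular the idea of attaching `dwc:scientificName` as a second `cr:dataType` value are assumptions for illustration; none of this has been validated against the Croissant tooling. The helper functions do a simplified prefix expansion by hand and are not a full JSON-LD processor.

```python
import json

# Prefixes for the vocabularies we want to mix in one document.
CONTEXT = {
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

# Hypothetical combined description; names and structure are illustrative only.
RECORD = {
    "@context": CONTEXT,
    "@type": "sc:Dataset",
    "sc:name": "example-fish-annotations",
    "cr:recordSet": [
        {
            "@type": "cr:RecordSet",
            "cr:field": [
                {
                    "@type": "cr:Field",
                    "sc:name": "bbox",
                    "cr:dataType": {"@id": "cr:BoundingBox"},
                },
                {
                    "@type": "cr:Field",
                    "sc:name": "species",
                    # Assumption: a DwC term used as an extra semantic type.
                    "cr:dataType": [{"@id": "sc:Text"}, {"@id": "dwc:scientificName"}],
                },
            ],
        }
    ],
}


def expand_iri(value, context):
    """Expand a compact IRI such as 'dwc:scientificName' using the context."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return value


def used_prefixes(node, context):
    """Collect every context prefix referenced by keys or string values."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@context":
                continue  # the context defines prefixes, it does not use them
            found |= used_prefixes(key, context)
            found |= used_prefixes(value, context)
    elif isinstance(node, list):
        for item in node:
            found |= used_prefixes(item, context)
    elif isinstance(node, str):
        prefix, sep, _ = node.partition(":")
        if sep and prefix in context:
            found.add(prefix)
    return found


if __name__ == "__main__":
    print(json.dumps(RECORD, indent=2))
    print(sorted(used_prefixes(RECORD, CONTEXT)))
    print(expand_iri("dwc:scientificName", CONTEXT))
```

Variables with no term in either vocabulary (depth, salinity, and so on) would simply be left out of such a document for now, as described above.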

@AryanPrakhar

AryanPrakhar commented Jan 25, 2025

Thanks for the details! I’m already diving into the resources—super excited to work on this and take it to the conference! Ultralytics YOLO is a good option. Let’s make this happen.
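
For reference, a minimal sketch of the data-preparation step such a pipeline would likely need: turning annotations that carry a species name and a pixel bounding box into the plain-text label lines that YOLO-style detection training expects (`class x_center y_center width height`, normalized to the image size). The input layout (corner coordinates in pixels, keyed by a `dwc:scientificName` value) and the species names are assumptions for illustration, not the actual NOAA annotation format.

```python
def build_class_index(scientific_names):
    """Map each distinct scientific name to a stable integer class id."""
    return {name: i for i, name in enumerate(sorted(set(scientific_names)))}


def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # Hypothetical annotations for one 400x200 image.
    annotations = [
        {"dwc:scientificName": "Sebastes ruberrimus", "box": (100, 50, 300, 150)},
        {"dwc:scientificName": "Gadus macrocephalus", "box": (0, 0, 40, 20)},
    ]
    classes = build_class_index(a["dwc:scientificName"] for a in annotations)
    for a in annotations:
        print(to_yolo_line(classes[a["dwc:scientificName"]], a["box"], 400, 200))
```

Keeping the class index derived from the Darwin Core names means the model's predicted class ids can be mapped straight back to `dwc:scientificName` values.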

@MathewBiddle
Contributor

This is an exciting looking project @csbrown-noaa! I'm looping in a few other folks who might have some additional background to help this project.

cc: @sformel-usgs, @7yl4r

@7yl4r
Contributor

7yl4r commented Jan 30, 2025

I found this related GBIF discussion, so perhaps GBIF is already publishing JSON-LD from DwC archives. I haven't found the files yet. Some more discussion with technical teams at GBIF+OBIS is needed at this stage.

If there is already a pipeline in place, this project could focus on using the JSON-LD.

@csbrown-noaa
Author

@7yl4r nice catch. It looks like that issue relates to the metadata in eml.xml, which holds organization info, etc., rather than to the actual Darwin Core vocabulary.

@pieterprovoost

@7yl4r Yes, both GBIF and OBIS are embedding JSON-LD in their dataset pages based on EML. Just view the page source and look for application/ld+json. I'm not aware of a full-fledged EML to JSON-LD solution. Our implementation is rather bespoke, but maybe GBIF has one. We do have an interest in a better implementation, as JSON-LD will be the main entry point for dataset metadata into the next version of the GOOS BioEco portal.
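
The "view the page source" step can also be done programmatically. Below is a minimal sketch that pulls every `application/ld+json` script block out of an HTML string using only the standard library. The sample HTML and its contents are made up for illustration and do not reproduce an actual GBIF or OBIS page.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html_text):
    """Return a list of JSON-LD documents embedded in the given HTML."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    return parser.documents


# Hypothetical page content for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for doc in extract_jsonld(SAMPLE_HTML):
        print(doc["@type"], "-", doc["name"])
```

To try this against a live dataset page, the HTML would first have to be fetched (for example with `urllib.request`), which is left out here.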

@AryanPrakhar

Thanks for the updates. If there’s already an existing pipeline, that could definitely be helpful. I’m happy to dive deeper into any specific tasks you think would be most valuable. Let me know what’s next!

@csbrown-noaa
Author

@AryanPrakhar organizing the GSoC projects will require some discussions and paperwork on our end, which may take a little while. There's no rush just now, and we're glad that you're so excited. :)

5 participants