[GSoC Project Proposal]: Combined DwCA/Croissant JSON-LD examples and tools #70

csbrown-noaa · 2025-01-24T19:13:51Z

Project Description

The Croissant ML Dataset specification and tooling [1] from MLCommons is a metadata and vocabulary standard for AI/ML datasets. It provides a uniform language for describing datasets used in common ML tasks, such as images with bounding-box annotations. The Darwin Core (DwC) specification and tooling [2] from Biodiversity Information Standards (TDWG) is a metadata and vocabulary standard for ecological and biodiversity data. NOAA Fisheries has large quantities of data that span this divide: ecological images with accompanying annotations. As such, it would be convenient to have metadata and tooling that employ both of these standards simultaneously. The project would primarily consist of Python tools to convert existing DwC metadata (in XML format) into JSON-LD format, example metadata for datasets that employ both standards, and a proof-of-concept pipeline that uses a combined DwC/Croissant-described dataset to train an AI model that predicts DwC categories from images.

[1] https://docs.mlcommons.org/croissant/docs/croissant-spec.html
[2] https://dwc.tdwg.org/terms/

Expected Outcomes

Skeleton libraries to convert Darwin Core Archive (DwCA) format XML metadata into JSON-LD.
Example metadata file employing Croissant and DwC vocabularies to encode a NOAA Fisheries image annotation dataset.
Proof-of-concept pipeline to ingest this example metadata file, and use it to feed the accompanying data through an off-the-shelf AI computer vision model training/test workflow.
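
To make the first outcome more concrete, here is a minimal sketch of what a DwCA meta.xml to JSON-LD converter could look like, using only the Python standard library. The sample meta.xml, the function names, and especially the output layout are assumptions for illustration: no agreed JSON-LD form for meta.xml exists yet, so the `ex:` prefix below is a hypothetical placeholder vocabulary rather than anything defined by TDWG or MLCommons.

```python
import json
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"   # namespace of DwCA meta.xml elements
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"    # namespace of Darwin Core terms

# Hypothetical, hand-written meta.xml used only for this sketch.
SAMPLE_META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>"""


def compact(iri):
    """Shorten a Darwin Core term IRI to the 'dwc:' prefixed form."""
    if iri.startswith(DWC_TERMS):
        return "dwc:" + iri[len(DWC_TERMS):]
    return iri  # leave terms from other vocabularies untouched


def meta_xml_to_jsonld(xml_text):
    """Convert the <core> element of a meta.xml string into a JSON-LD-style dict."""
    ns = {"t": DWC_TEXT_NS}
    root = ET.fromstring(xml_text)
    core = root.find("t:core", ns)
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    return {
        # 'ex' is a placeholder vocabulary (assumption), not a published standard.
        "@context": {"dwc": DWC_TERMS, "ex": "https://example.org/dwca#"},
        "@type": compact(core.get("rowType", "")),
        "ex:location": [loc.text for loc in core.findall("t:files/t:location", ns)],
        "ex:field": [
            {"ex:index": int(f.get("index")), "ex:term": {"@id": compact(f.get("term"))}}
            for f in core.findall("t:field", ns)
        ],
    }


if __name__ == "__main__":
    print(json.dumps(meta_xml_to_jsonld(SAMPLE_META_XML), indent=2))
```

This only covers the core table; extensions, defaults, and delimiter attributes would need the same treatment once the target vocabulary is settled.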

Skills Required

Python, JSON, XML, interest in ecology

Additional Background/Issues

No response

Mentor(s)

scott.brown@noaa.gov

Expected Project Size

175 hours

Project Difficulty

Intermediate

AryanPrakhar · 2025-01-24T20:14:17Z

Hey! I’d love to work on this project. I’ve built datasets and benchmarks during internships and worked on data-driven scientific research, including a publication at ICLR. I’m confident I can help with building a modular XML parser for metadata mapping and a vision pipeline.

I had a few questions:

How will fields without direct Croissant equivalents be handled?
What evaluation metrics will be used for metadata conversion?
What kind of vision model are we planning to test with (e.g., ResNet, Vision Transformers)?

Looking forward to collaborating!

csbrown-noaa · 2025-01-24T20:39:34Z

@AryanPrakhar great questions.

JSON-LD [1] is flexible enough that we can reference multiple vocabularies. In this case, we are interested in Croissant [2] and Darwin Core [3] vocabularies. Croissant is already JSON-LD native. Darwin Core is not, but the vocabulary reference could easily admit a JSON-LD format (just nobody has done it yet [4]). While it is the case that some of our datasets have additional variables (like depth, salinity, etc), we'll just leave those without metadata until the standards come up to date. Right now our datasets have little or no metadata, so getting just the annotations and the biodiversity data encoded would be 💯
There are no metrics, because the expected output (Darwin Core + Croissant JSON-LD) does not exist. Creating this expected output would be part of the work in figuring out the XML<->JSON-LD translation, and I am happy to mentor/advise/assist w.r.t. the vocabulary here. Depending on the outcome, we should present our derived example (with or without the automated tooling) at TDWG 2025 [5] to illustrate 1) what a Darwin Core Archive [6] meta.xml should look like in JSON-LD and 2) how JSON-LD empowers interoperability with other standards already using JSON-LD (with Croissant as an example)
This depends on what is easy and/or interesting to you. The Ultralytics YOLO pipeline is pretty straightforward, and I can help in this regard. If you have other suggestions, let's do it. 😄

[1] https://json-ld.org/
[2] https://docs.mlcommons.org/croissant/docs/croissant-spec.html
[3] https://dwc.tdwg.org/terms/
[4] tdwg/dwc#447
[5] https://www.tdwg.org/conferences/2024/
[6] https://ipt.gbif.org/manual/en/ipt/latest/dwca-guide
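
The point about referencing multiple vocabularies can be sketched in a few lines of Python. The `cr` and `sc` prefixes follow the Croissant spec and `dwc` is the Darwin Core terms namespace; everything else (the dataset content, the helper names, and in particular the choice to list a DwC term under `cr:dataType`) is an assumption for illustration, since how DwC terms should attach to Croissant fields is exactly the open question here. The expansion helper is a simplified prefix lookup, not a full JSON-LD processor.

```python
# Prefixes shared by one JSON-LD document that mixes both vocabularies.
CONTEXT = {
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

# Hypothetical dataset description; not validated against the Croissant spec.
DATASET = {
    "@context": CONTEXT,
    "@type": "sc:Dataset",
    "sc:name": "example-fish-annotations",
    "cr:recordSet": [{
        "@type": "cr:RecordSet",
        "cr:field": [{
            "@type": "cr:Field",
            "sc:name": "scientific_name",
            # Assumption: pairing an atomic type with a DwC term on one field.
            "cr:dataType": [{"@id": "sc:Text"}, {"@id": "dwc:scientificName"}],
        }],
    }],
}


def expand_iri(term, context):
    """Expand a compact IRI such as 'dwc:eventDate' using the context prefixes."""
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return term  # keywords and absolute IRIs pass through unchanged


def used_prefixes(node, context):
    """Collect the context prefixes referenced by keys and by @type/@id values."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@context":
                continue
            candidates = [key]
            if key in ("@type", "@id") and isinstance(value, str):
                candidates.append(value)
            for candidate in candidates:
                prefix, sep, _ = candidate.partition(":")
                if sep and prefix in context:
                    found.add(prefix)
            found |= used_prefixes(value, context)
    elif isinstance(node, list):
        for item in node:
            found |= used_prefixes(item, context)
    return found


if __name__ == "__main__":
    print(sorted(used_prefixes(DATASET, CONTEXT)))
    print(expand_iri("dwc:scientificName", CONTEXT))
```

A check like `used_prefixes` is a cheap way to confirm that an example file really draws on both standards before it goes into a talk or a test suite.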

AryanPrakhar · 2025-01-25T06:59:05Z

Thanks for the details! I’m already diving into the resources—super excited to work on this and take it to the conference! Ultralytics YOLO is a good option. Let’s make this happen.
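
For the YOLO side, annotations would need to end up in YOLO's plain-text label format: one line per box, with a class index followed by the normalized center x, center y, width, and height. A minimal sketch, assuming the source annotations are pixel-space (x_min, y_min, width, height) boxes labelled with a scientific name; that input layout and both function names are assumptions, not something the NOAA datasets are confirmed to use.

```python
def build_class_map(scientific_names):
    """Map each distinct scientific name to a stable integer class index."""
    return {name: index for index, name in enumerate(sorted(set(scientific_names)))}


def to_yolo_line(class_id, box, image_width, image_height):
    """Format one pixel-space box as a YOLO label line.

    box is (x_min, y_min, width, height) in pixels (assumed input layout).
    """
    x_min, y_min, width, height = box
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return (
        f"{class_id} {x_center:.6f} {y_center:.6f} "
        f"{width / image_width:.6f} {height / image_height:.6f}"
    )


if __name__ == "__main__":
    classes = build_class_map(["Gadus morhua", "Clupea harengus", "Gadus morhua"])
    print(to_yolo_line(classes["Gadus morhua"], (100, 50, 200, 100), 400, 200))
```

Sorting the names before indexing keeps class IDs reproducible across runs, which matters once label files and the class list are written separately.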

MathewBiddle · 2025-01-30T15:39:24Z

This is an exciting looking project @csbrown-noaa! I'm looping in a few other folks who might have some additional background to help this project.

cc: @sformel-usgs, @7yl4r

7yl4r · 2025-01-30T16:41:59Z

I found this related GBIF discussion, so perhaps GBIF is already publishing json-ld from DwC archives. I haven't found the files yet. Some more discussion with technical teams at GBIF+OBIS is needed at this stage.

If there is already a pipeline in place, this project could focus on using the json-ld.

csbrown-noaa · 2025-01-30T17:05:56Z

@7yl4r nice catch. It looks like this issue is maybe related to the metadata in eml.xml, which has organization info, etc., not the actual Darwin Core vocab.

pieterprovoost · 2025-01-30T17:19:38Z

@7yl4r Yes, both GBIF and OBIS are embedding JSON-LD in their dataset pages based on EML. Just view the page source and look for application/ld+json. I'm not aware of a full-fledged EML to JSON-LD solution. Our implementation is rather bespoke, but maybe GBIF has one. We do have an interest in a better implementation, as JSON-LD will be the main entry point for dataset metadata into the next version of the GOOS BioEco portal.
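
For anyone who wants to inspect that embedded JSON-LD programmatically rather than through view-source, here is a minimal sketch using only the Python standard library. The sample HTML is a made-up stand-in, not a real GBIF or OBIS page, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)  # script text may arrive in several pieces

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return every JSON-LD document embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Made-up page used only to exercise the extractor.
SAMPLE_HTML = """<html><head>
<script type="text/javascript">var unrelated = 1;</script>
<script type="application/ld+json">{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}</script>
</head><body></body></html>"""

if __name__ == "__main__":
    for document in extract_jsonld(SAMPLE_HTML):
        print(document["@type"], "-", document["name"])
```

Fetching the page itself is left out on purpose; any HTTP client can supply the HTML string.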

AryanPrakhar · 2025-02-02T10:22:40Z

Thanks for the updates. If there’s already an existing pipeline, that could definitely be helpful. I’m happy to dive deeper into any specific tasks you think would be most valuable. Let me know what’s next!

csbrown-noaa · 2025-02-03T15:40:36Z

@AryanPrakhar there will need to be discussions and paperwork and such on our end to organize the GSoC projects, which may take a little while. There's no rush just now, and we're glad that you're so excited. :)

csbrown-noaa added GSoC25 project idea labels Jan 24, 2025
