[GSoC Project Proposal]: Combined DwCA/Croissant JSON-LD examples and tools #70
Comments
Hey! I’d love to work on this project. I’ve built datasets and benchmarks during internships and worked on data-driven scientific research, including a publication at ICLR. I’m confident I can help with building a modular XML parser for metadata mapping and a vision pipeline. I had a few questions:
Looking forward to collaborating!
@AryanPrakhar great questions.
[1] https://json-ld.org/
Thanks for the details! I’m already diving into the resources and am super excited to work on this and take it to the conference! Ultralytics YOLO is a good option. Let’s make this happen.
This is an exciting-looking project @csbrown-noaa! I'm looping in a few other folks who might have some additional background to help this project. cc: @sformel-usgs, @7yl4r
I found this related GBIF discussion, so perhaps GBIF is already publishing JSON-LD from DwC archives. I haven't found the files yet. Some more discussion with the technical teams at GBIF and OBIS is needed at this stage. If there is already a pipeline in place, this project could focus on using the JSON-LD.
@7yl4r nice catch. It looks like this issue is maybe related to the metadata in
@7yl4r Yes, both GBIF and OBIS are embedding JSON-LD in their dataset pages based on EML, just view the page source and look for
Thanks for the updates. If there’s already an existing pipeline, that could definitely be helpful. I’m happy to dive deeper into any specific tasks you think would be most valuable. Let me know what’s next!
@AryanPrakhar there will need to be discussions and paperwork and such on our end to organize the GSoC projects, and that may take a little while. There's no rush just now, and we're glad that you're so excited. :)
Project Description
The Croissant ML Dataset specification and tooling [1] from MLCommons is a metadata and vocabulary standard for AI/ML datasets. It provides a uniform language for describing datasets used in common ML tasks, such as image annotation with bounding boxes. The Darwin Core (DwC) specification and tooling [2] from Biodiversity Information Standards (TDWG) is a metadata and vocabulary standard for ecological and biodiversity data. NOAA Fisheries has large quantities of data that span this divide: ecological images with accompanying annotations. It would therefore be convenient to have metadata and tooling that employ both standards simultaneously. The project would primarily consist of Python tools to convert existing DwC metadata (in XML format) into JSON-LD format, example metadata for datasets that employ both standards, and a proof-of-concept pipeline that uses a combined DwC/Croissant-described dataset to train an AI model that predicts DwC categories from images.
[1] https://docs.mlcommons.org/croissant/docs/croissant-spec.html
[2] https://dwc.tdwg.org/terms/
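To make the XML-to-JSON-LD step in the description more concrete, here is a minimal sketch of what such a converter might look like. It assumes the XML in question is a Darwin Core Archive `meta.xml` descriptor (the proposal does not say which XML file is meant). The function names, the inline sample, and the `ex:` prefix are hypothetical and exist only for illustration. The output is generic JSON-LD that borrows a few schema.org terms. It is not a validated Croissant record.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by the Darwin Core Archive descriptor (meta.xml).
DWCA_NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Prefixes for the output document. "ex" is a hypothetical placeholder
# vocabulary for this sketch; this is not a validated Croissant context.
CONTEXT = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "dcterms": "http://purl.org/dc/terms/",
    "sc": "https://schema.org/",
    "ex": "https://example.org/dwca-sketch#",
}

# Tiny made-up descriptor so the example runs without any files.
SAMPLE_META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://purl.org/dc/terms/modified"/>
  </core>
</archive>
"""


def compact_iri(iri):
    """Shorten a full term IRI to prefix:name when a known prefix matches."""
    for prefix, base in CONTEXT.items():
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    return iri


def meta_xml_to_jsonld(xml_text):
    """Convert the <core> block of a meta.xml string into a JSON-LD dict."""
    root = ET.fromstring(xml_text)
    core = root.find("dwca:core", DWCA_NS)
    if core is None:
        raise ValueError("meta.xml has no <core> element")

    location = core.findtext(
        "dwca:files/dwca:location", default="", namespaces=DWCA_NS
    )
    # One entry per mapped column; fields without an index or term are skipped.
    fields = [
        {
            "@type": "sc:PropertyValue",
            "sc:propertyID": compact_iri(field.get("term")),
            "ex:columnIndex": int(field.get("index")),
        }
        for field in core.findall("dwca:field", DWCA_NS)
        if field.get("index") is not None and field.get("term") is not None
    ]
    return {
        "@context": CONTEXT,
        "@type": "sc:Dataset",
        "ex:rowType": {"@id": compact_iri(core.get("rowType", ""))},
        "sc:distribution": {
            "@type": "sc:DataDownload",
            "sc:contentUrl": location,
        },
        "sc:variableMeasured": fields,
    }


if __name__ == "__main__":
    print(json.dumps(meta_xml_to_jsonld(SAMPLE_META_XML), indent=2))
```

A real tool would also need to handle extension tables, the accompanying EML file, and the Croissant-specific vocabulary, none of which this sketch attempts.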
Expected Outcomes
Skills Required
Python, JSON, XML, interest in ecology
Additional Background/Issues
No response
Mentor(s)
scott.brown@noaa.gov
Expected Project Size
175 hours
Project Difficulty
Intermediate