Google Dataset Search - Getting dataset metadata as JSON-LD Markup (schema.org definition) #1669
Comments
We should also do this for downloads, using |
Agreed. I actually meant datasets in the wider context–so anything for which we mint DOIs, including downloads... |
I will gladly help map data against the schema.org definitions when/if this issue is prioritized. (Our own DOI metadata mapping could probably use a reworking too, plus an upgrade to version 4.x of the schema.) |
It is likely trivial to add, so yes please :) If you write up what you believe is the appropriate mapping, then we can likely add it easily |
Ok, here's my first attempt at a definition for datasets. I apologize for the pseudo-code, but I'm sure you understand what I'm referring to :) (we might want to add translations for the hardcoded values in the provider section–your call)
|
That logo has a lot of white space around it. Otherwise, can we try this in Dev, @MortenHofft? Also, is there any issue with it appearing on UAT? The UAT registry includes many (older) production downloads, and will be crawled, so it might be necessary to hardcode www.gbif.org rather than let this vary based on the dev/uat/prod site. |
Personally, I'd like to see this in action before we allow Google to crawl it. Why is UAT indexed by Google anyway? |
So the download pages are already not supposed to be crawled; I think Google's data is just what's in DataCite. We can update that with
once it's ready. |
Because the old site was and I was told that was a deliberate decision. I'd be more than happy to change that. |
We have had sitemaps for datasets all along. The reference to them was just listed in robots.txt |
Re the logo file, it looks like Google are scaling down to 50 pixels in height. This one will work fine, I think: https://gbif.box.com/shared/static/dxxlqeikavxw4zadqrryh0hd7ad0gc5q.png |
Any chance we could move forward with this one? If anything further is needed from my side, please let me know :) |
I did a mockup of marked-up metadata for a dataset—this time adding a bit more detail for the contacts. Should be pretty self-explanatory...
This tests ok with the Structured Data Testing Tool |
Here's a mock-up of what markup for a download could look like:
Obviously |
Issue with isBasedOn
@dnoesgaard If we add a list of all datasets, then this will significantly increase the size of those pages: 2.7 MB in the example I tried. I'm not keen to do that. What alternatives do we have? |
Issue with description prose filter
Generating a readable text string is not a simple task. Queries can be nested arbitrarily deep and have NOT parts etc. |
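To illustrate why nesting and NOT parts make this harder than it looks, here is a minimal sketch of a recursive serializer. The predicate shape (`type`, `key`, `value`, `predicates`, `predicate`) and the helper name `describePredicate` are assumptions for illustration, not the portal's actual code, and only a few predicate types are handled.

```javascript
// Hypothetical helper: turn a nested download predicate into a readable string.
// The predicate shape used here is an assumption, not taken from the portal code.
function describePredicate(p) {
  switch (p.type) {
    case 'and':
    case 'or':
      // Parenthesize every group so arbitrarily deep nesting stays unambiguous.
      return '(' + p.predicates.map(describePredicate).join(' ' + p.type.toUpperCase() + ' ') + ')';
    case 'not':
      return 'NOT ' + describePredicate(p.predicate);
    case 'equals':
      return p.key + ' = ' + p.value;
    default:
      // Unknown predicate types fall back to raw JSON rather than guessing.
      return JSON.stringify(p);
  }
}

const examplePredicate = {
  type: 'and',
  predicates: [
    { type: 'equals', key: 'COUNTRY', value: 'DK' },
    { type: 'not', predicate: { type: 'equals', key: 'BASIS_OF_RECORD', value: 'FOSSIL_SPECIMEN' } }
  ]
};

console.log(describePredicate(examplePredicate));
// prints "(COUNTRY = DK AND NOT BASIS_OF_RECORD = FOSSIL_SPECIMEN)"
```

Even this small version shows the problem: the output is unambiguous but not really prose, and every additional predicate type needs its own wording.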
Datasets now have the described data attached. Downloads need more discussion/consideration I find (see comments above) |
@dnoesgaard could you please check to see if it is as you imagine it on UAT or staging? Notice that we do not model people but roles. That means that the same first name might appear many times under different roles. I chose to only list contacts with role ORIGINATOR as authors. For this dataset this looks like:
UPDATE |
Yeah, I'm just going over this now, also noticing that only type ORIGINATOR was being mapped. I'll get back to you asap... |
kk - as said if we choose more roles, then contacts are likely to be duplicated. And the same person will have slightly different contact info for different roles as publishers understandably do not care to fill in the same data 4 times for the same person just because that person has 4 roles. That makes it quite fragile. The code is here btw: https://github.com/gbif/portal16/blob/master/app/controllers/dataset/key/datasetKey.ctrl.js#L73 |
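A minimal sketch of the ORIGINATOR-only mapping described above. This is not the code in the linked controller; the contact field names (`type`, `firstName`, `lastName`, `email`, `organization`) and the helper name `toAuthors` are assumptions for illustration.

```javascript
// Hypothetical helper: map registry contacts to schema.org Person objects.
// Only contacts with type ORIGINATOR are kept, so a person who also holds
// other roles is not listed several times with slightly different details.
function toAuthors(contacts) {
  return (contacts || [])
    .filter((c) => c.type === 'ORIGINATOR')
    .map((c) => {
      const person = { '@type': 'Person' };
      if (c.firstName) person.givenName = c.firstName;
      if (c.lastName) person.familyName = c.lastName;
      // Assumes email is an array; only the first entry is used.
      if (Array.isArray(c.email) && c.email.length > 0) person.email = c.email[0];
      if (c.organization) {
        person.affiliation = { '@type': 'Organization', name: c.organization };
      }
      return person;
    });
}

const exampleContacts = [
  { type: 'ORIGINATOR', firstName: 'Jane', lastName: 'Doe', email: ['jane@example.org'] },
  { type: 'METADATA_AUTHOR', firstName: 'Jane', lastName: 'Doe' },
  { type: 'ADMINISTRATIVE_POINT_OF_CONTACT', firstName: 'Jane', lastName: 'Doe' }
];

console.log(JSON.stringify(toAuthors(exampleContacts), null, 2));
```

Filtering on a single role sidesteps the duplication problem, at the cost of dropping contacts who are only listed under other roles.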
@dnoesgaard regarding the downloads - is it possible to refer to an external document instead of adding 23K |
Can we please add that?
"address": contact.address + contact.city etc.... |
We generate a string like this for the DOI metadata, perhaps the same method could be employed? Example: |
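As a rough sketch of the concatenation being asked for (not the method used for the DOI metadata, which is not shown in this thread), the address parts could be joined like this. The field names `address`, `city`, `province`, `postalCode` and `country`, and the helper name `formatAddress`, are assumptions.

```javascript
// Hypothetical helper: build a single address string from a contact.
// "address" may be an array of lines or a single string; empty parts are skipped.
function formatAddress(contact) {
  const parts = [].concat(
    contact.address || [],
    contact.city,
    contact.province,
    contact.postalCode,
    contact.country
  );
  return parts.filter(Boolean).join(', ');
}

console.log(formatAddress({
  address: ['Universitetsparken 15'],
  city: 'Copenhagen',
  country: 'DK'
}));
// prints "Universitetsparken 15, Copenhagen, DK"
```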
I did a quick check of 500 datasets and found ORIGINATOR to be the most common contact type used. We can't really do a one-to-one mapping of contacts, so I think your approach is sensible. |
A few additional improvements to the dataset metadata schema - based on http://api.gbif.org/v1/dataset/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d as an example:
That is, temporalCoverages.start and .end separated by a forward slash.
with values from geographicCoverages.boundingBox—can also just otherwise be a string: "spatialCoverage": "France" I realize that we probably allow several instances of these, but I'll leave that to you to decide how to handle... |
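The two coverage mappings suggested above could be sketched roughly as follows. The input field names (`start`, `end`, `boundingBox.minLatitude` and so on, `description`) are assumptions about the API response, and both helper names are hypothetical. The `box` value follows the schema.org GeoShape convention of two space-separated points, written here in latitude, longitude order.

```javascript
// Hypothetical helper: "start/end" interval from one temporal coverage entry.
// Assumes ISO 8601 timestamps, of which only the date part is kept.
function toTemporalCoverage(coverage) {
  if (!coverage || !coverage.start) return undefined;
  const day = (s) => String(s).slice(0, 10);
  return coverage.end ? day(coverage.start) + '/' + day(coverage.end) : day(coverage.start);
}

// Hypothetical helper: schema.org Place from one geographic coverage entry.
// Falls back to the plain description string when there is no bounding box.
function toSpatialCoverage(geo) {
  if (!geo) return undefined;
  const b = geo.boundingBox;
  if (!b) return geo.description || undefined;
  return {
    '@type': 'Place',
    geo: {
      '@type': 'GeoShape',
      // Lower corner first, then upper corner.
      box: [b.minLatitude, b.minLongitude, b.maxLatitude, b.maxLongitude].join(' ')
    }
  };
}

console.log(toTemporalCoverage({
  start: '1999-01-01T00:00:00.000+0000',
  end: '2005-12-31T00:00:00.000+0000'
}));
console.log(JSON.stringify(toSpatialCoverage({
  boundingBox: { minLatitude: 41, maxLatitude: 51, minLongitude: -5, maxLongitude: 10 }
})));
```

When a dataset has several coverages, each helper could be mapped over the array; whether to emit all of them or only the first is left open here, as in the comment above.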
@dnoesgaard The result does not look correct to me. For 10.15468/dl.cq5skd this returns We can of course do a somewhat readable serialization if it is deemed important. But the above solution doesn't look correct to me. If all of this already is in that endpoint (https://api.datacite.org/dois/10.15468/dl.2ohxaa) perhaps we could use that somehow? |
I for one doubt the usefulness of that description in the first place. We could even simply do something like
Or we could include a minified version of the query predicates, like what we show for complex queries, e.g.
A great deal of escaping required though. Or even:
:) |
Great to see this initiative! I can't fault your mapping, but you might consider this...
Also... |
Hey @qgroom, thanks for the input. Can you please clarify which changes/additions to |
@MortenHofft - it looks like we have "version": datasetKey.dataset.version, Make sense? |
I would add the name as the full name, rather than as the given and family names. I think this is more usual, so I suppose it would make the name more findable. |
Currently, there is a I think we can make a strong argument for this being added to dataset. This should perhaps be raised as an issue for schema.org. @stylesm might be interested in this thread |
It looks like we should be doing either
Re. identifier: absolutely agree on its importance. We should add:
For contacts, I believe we're already using the organization of the contact rather than the publisher org, see https://github.com/gbif/portal16/blob/5f07c9d37b5a0937181aa511bca2800d3a2e08b1/app/controllers/dataset/key/datasetKey.ctrl.js#L92 |
@MortenHofft - if you're losing track of this, I'm happy to compile all input for you for later... ;) |
With that being said, I'd also like to see a version of this live soon. We can always make additions later on... |
I think the issue with adding The example we were talking about with the Sample type and the reason for renaming the Bioschemas However, the So in code this would look a bit like:
Thoughts? |
Summarizing outstanding issues that I'd like to see done:
(or simply e.g.
After testing these, I suggest we do the first prod release. |
(as always, apologies for the pseudocode) |
Indeed, most of the properties of
Although taxonomy is very bio, it is also quite fundamental for many datasets. We could perhaps make more use of identifiers and |
@qgroom but if you use |
We have the first version of this live now and would like to close this issue. Please consider raising new issues with additional improvements to the markup. |
Looks like GDS is starting to pick up the metadata now: |
Does this include JSON-LD for the info in |
This issue has been discussed in various places, including Twitter, Discourse, etc. but I'm not sure we ever had a real issue for it. At least I couldn't find one...
Background: Google has recently launched Dataset Search, and through DataCite all GBIF datasets for which DOIs have been minted are exposed. However, due to a lack of structured markup on the GBIF pages, the search engine falls back to the DataCite version (which has a DataCite logo and links to a DataCite search), e.g.
Bernice P. Bishop Museum
This could be improved if the GBIF dataset page included JSON-LD Markup, e.g.
Google reference: https://developers.google.com/search/docs/data-types/dataset
Schema.org dataset definition: http://schema.org/Dataset
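As a minimal sketch of what such markup could involve (not the mapping that was eventually deployed), a dataset record can be turned into a schema.org Dataset object and embedded in the page as a JSON-LD script tag. The input field names (`key`, `doi`, `title`, `description`) and both helper names are assumptions. The host is hardcoded to www.gbif.org so that the markup does not vary between dev, UAT and production.

```javascript
// Hypothetical helper: minimal schema.org Dataset object for a dataset record.
function toJsonLd(dataset) {
  const doiUrl = 'https://doi.org/' + dataset.doi;
  return {
    '@context': 'https://schema.org/',
    '@type': 'Dataset',
    '@id': doiUrl,
    identifier: doiUrl,
    name: dataset.title,
    description: dataset.description,
    // Hardcoded production host, independent of the environment serving the page.
    url: 'https://www.gbif.org/dataset/' + dataset.key
  };
}

// Hypothetical helper: serialize the object into a script tag for the page head.
function toScriptTag(jsonLd) {
  // Escape "<" so a "</script>" inside a description cannot end the tag early.
  const body = JSON.stringify(jsonLd).replace(/</g, '\\u003c');
  return '<script type="application/ld+json">' + body + '</script>';
}

const exampleDataset = {
  key: '75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d',
  doi: '10.15468/example', // placeholder DOI, not a real one
  title: 'Example dataset',
  description: 'A short description.'
};

console.log(toScriptTag(toJsonLd(exampleDataset)));
```

Further properties (creator, license, temporal and spatial coverage, publisher) can be added to the same object once their mapping is agreed.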