Google Dataset Search - Getting dataset metadata as JSON-LD Markup (schema.org definition) #1669

Closed
dnoesgaard opened this issue Nov 28, 2018 · 43 comments

Comments

@dnoesgaard
Member

This issue has been discussed in various places, including Twitter, Discourse, etc. but I'm not sure we ever had a real issue for it. At least I couldn't find one...

Background: Google recently launched Dataset Search, and all GBIF datasets with minted DOIs are exposed to it through DataCite. However, because the GBIF dataset pages lack structured markup, the search engine falls back to the DataCite version (which has a DataCite logo and links to a DataCite search), e.g.

Bernice P. Bishop Museum

[Screenshot: 2018-11-28 at 17:37:32]

This could be improved if the GBIF dataset page included JSON-LD Markup, e.g.

<script type="application/ld+json">
{
  "@context" : "http://schema.org",
  "@type" : "Dataset",
  "name" : "Bernice P. Bishop Museum",
  "description" : "The Bernice Pauahi Bishop Museum, designated the Hawaiʻi State Museum of Natural and Cultural History, is a museum of history and science located in the Kalihi district of Honolulu on the Hawaiian island of O’ahu. Founded in 1889, it is the largest museum in Hawai’i and is home to one of the world’s largest collections of natural history material from the Pacific region, with approximately 21 million specimens. The main collections include Entomology, Malacology, Botany, Ichthyology, Vertebrate…",
  "spatialCoverage" : "Primarily focused on the tropical Indo-Pacific region, with an emphasis on Oceania",
  "identifier" : "10.15468/s6ctus",
  "license" : "CC0 1.0",
  "distribution" : {
    "@type" : "DataDownload",
    "contentUrl" : "https://www.gbif.org/dataset/b929f23d-290f-4e85-8f17-764c55b3b284"
  },
  "sourceOrganization" : "Bernice Pauahi Bishop Museum",
  "datePublished" : "2012-10-23"
}
</script>

Google reference: https://developers.google.com/search/docs/data-types/dataset
Schema.org dataset definition: http://schema.org/Dataset

@MattBlissett
Member

We should also do this for downloads, using isBasedOn, and add (restore?) the sitemap for datasets

@dnoesgaard
Member Author

Agreed. I actually meant datasets in the wider context–so anything for which we mint DOIs, including downloads...

@dnoesgaard
Member Author

I will gladly help map data against the schema.org definitions–when/if this issue is prioritized.

(our own DOI metadata mapping could probably use a reworking too + upgrade to vers 4.x schema)

@MortenHofft
Member

@dnoesgaard : I will gladly help map data against the schema.org definitions–when/if this issue is prioritized.

It is likely trivial to add, so yes please :) If you write up what you believe is the appropriate mapping, then we can likely add it easily

@dnoesgaard
Member Author

dnoesgaard commented Dec 3, 2018

Ok, here's my first attempt at a definition for datasets. I apologize for the pseudo-code, but I'm sure you understand what I'm referring to :)

(we might want to add translations for the hardcoded values in the provider section–your call)

<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/" + datasetKey.dataset.doi,
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/" + datasetKey.dataset.doi
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": datasetKey
    }
  ],
  "url": "https://www.gbif.org/dataset/" + datasetKey,
  "name": datasetKey.dataset.title,
  // in the author section, we need an entry for each of the dataset contacts
  "author": [
    // repeat the next element for each of the dataset contacts
    {
      "@type": "Person",
      "givenName": contact.firstName,
      "familyName": contact.lastName,
      "email": contact.email,
      "identifier" : contact.userId,
      "telephone": contact.phone,
      "url": contact.homepage
    },
  	... 
  ],
  "description": datasetKey.dataset.description,
  "license": datasetKey.dataset.license,
  "inLanguage": datasetKey.dataset.dataLanguage,
  "datePublished": datasetKey.dataset.created,
  "dateModified": datasetKey.dataset.modified,
  "publisher": {
      "@type": "Organization",
      "name": publisherKey.publisher.title,
      "url": publisherKey.publisher.homepage[0],
      "logo": publisherKey.publisher.logoUrl,
      "email": publisherKey.publisher.email[0],
      "telephone": publisherKey.publisher.phone[0]
    },
  "provider": {
    "@type": "Organization",
      "name": "GBIF",
      "url": "https://www.gbif.org",
      "logo": "https://www.gbif.org/img/logo/GBIF-2015.png",
      "email": "info@gbif.org",
      "telephone": "+45 35 32 14 70"
  }
}</script>
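The pseudo-code mapping above could be sketched in Python roughly as follows. This is only an illustration, not the portal's implementation: the function names `dataset_to_jsonld` and `to_script_tag` are hypothetical, and the input keys (`key`, `doi`, `title`, `description`, `license`, `created`, `modified`) are assumptions about the shape of a dataset record. The example values are taken from the first comment in this thread.

```python
import json

def dataset_to_jsonld(dataset, publisher):
    """Build a schema.org Dataset dict; the input keys are assumptions."""
    doi_url = "https://doi.org/" + dataset["doi"]
    doc = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": doi_url,
        "identifier": [
            {"@type": "PropertyValue", "propertyID": "doi", "value": doi_url},
            {"@type": "PropertyValue", "propertyID": "UUID", "value": dataset["key"]},
        ],
        "url": "https://www.gbif.org/dataset/" + dataset["key"],
        "name": dataset["title"],
        "description": dataset.get("description"),
        "license": dataset.get("license"),
        "datePublished": dataset.get("created"),
        "dateModified": dataset.get("modified"),
        "publisher": {"@type": "Organization", "name": publisher["title"]},
    }
    # Leave out properties the record has no value for.
    return {k: v for k, v in doc.items() if v is not None}

def to_script_tag(doc):
    # Escape "</" so the JSON cannot terminate the surrounding <script> element.
    body = json.dumps(doc, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">' + body + "</script>"

example = dataset_to_jsonld(
    {"key": "b929f23d-290f-4e85-8f17-764c55b3b284",
     "doi": "10.15468/s6ctus",
     "title": "Bernice P. Bishop Museum"},
    {"title": "Bernice Pauahi Bishop Museum"},
)
markup = to_script_tag(example)
```

Contacts, logos and the GBIF `provider` block are left out here to keep the sketch short.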

@MortenHofft MortenHofft self-assigned this Dec 5, 2018
@MattBlissett
Member

That logo has a lot of white space around it.

Otherwise, can we try this in Dev, @MortenHofft? Also, is there any issue with it appearing on UAT? The UAT registry includes many (older) production downloads, and will be crawled, so it might be necessary to hardcode www.gbif.org rather than let this vary based on the dev/uat/prod site.

@dnoesgaard
Member Author

Personally, I'd like to see this in action before we allow Google to crawl it.

Why is UAT indexed by Google anyway?

@MattBlissett
Member

MattBlissett commented Jan 28, 2019

User-agent: *
...
Disallow: /occurrence/

So the download pages are already not supposed to be crawled; I think Google's data is just what's in DataCite.

We can update that with

Disallow: /occurrence/1
Disallow: /occurrence/2
Disallow: /occurrence/3
...

once it's ready.
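The effect of the two rule sets above can be checked with Python's standard `urllib.robotparser`. This is a sketch only: it uses just the three `Disallow` lines spelled out above (the elided ones are not guessed), and the record URL is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Current rule: everything under /occurrence/ is blocked.
current = parse_rules("User-agent: *\nDisallow: /occurrence/\n")

# Proposed rules: only paths that continue with the listed digits are blocked.
proposed = parse_rules(
    "User-agent: *\n"
    "Disallow: /occurrence/1\n"
    "Disallow: /occurrence/2\n"
    "Disallow: /occurrence/3\n"
)

download_page = "https://www.gbif.org/occurrence/download/0029115-180131172636756"
record_page = "https://www.gbif.org/occurrence/1234567"  # hypothetical record URL

print(current.can_fetch("*", download_page))   # blocked by the current rule
print(proposed.can_fetch("*", download_page))  # crawlable under the proposed rules
print(proposed.can_fetch("*", record_page))    # still blocked
```

Because matching is by path prefix, `/occurrence/download/...` no longer matches any rule once the blanket `/occurrence/` line is replaced by digit prefixes.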

@MortenHofft
Member

MortenHofft commented Jan 28, 2019

@dnoesgaard: Why is UAT indexed by Google anyway?

Because the old site was and I was told that was a deliberate decision. I'd be more than happy to change that.

@MortenHofft
Member

@MattBlissett: We should also [...] add (restore?) the sitemap for datasets

We have had sitemaps for datasets all along. The reference to them was just listed in robots.txt

@dnoesgaard
Member Author

Re the logo file, it looks like Google are scaling down to 50 pixels in height. This one will work fine, I think: https://gbif.box.com/shared/static/dxxlqeikavxw4zadqrryh0hd7ad0gc5q.png

Mock-up:
[Screenshot: 2019-01-28 at 10:15:16]

@dnoesgaard
Member Author

Any chance we could move forward with this one? If anything further is needed from my side, please let me know :)

@dnoesgaard
Member Author

I did a mockup of marked-up metadata for a dataset—this time adding a bit more detail for the contacts. Should be pretty self-explanatory...

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.15468/6q5vuc",
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/10.15468/6q5vuc"
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": "005eb8d8-ed94-41be-89cf-e3115a9058e4"
    }
  ],
  "url": "https://www.gbif.org/dataset/005eb8d8-ed94-41be-89cf-e3115a9058e4",
  "name": "Field Museum of Natural History (Zoology) Invertebrate Collection",
  "author": [
    {
      "@type": "Person",
      "givenName": "Sharon",
      "familyName": "Grant",
      "email": "sgrant@fieldmuseum.org",
      "telephone": "3126657203",
      "jobTitle": "Technology Liaison to Science",
      "affiliation": {
        "@type": "Organization",
        "name": "Field Museum of Natural History"
      },
      "address": {
        "@type": "PostalAddress",
        "streetAddress": "1400 S Lake Shore Drive",
        "addressLocality": "Chicago",
        "postalCode": "60605",
        "addressRegion": "IL",
        "addressCountry": "US"
      }
    },
    {
      "@type": "Person",
      "givenName": "Janeen",
      "familyName": "Jones",
      "email": "jjones@fieldmuseum.org",
      "telephone": "",
      "jobTitle": "Assistant Collections Manager",
      "affiliation": {
        "@type": "Organization",
        "name": "Field Museum of Natural History"
      },
      "address": {
        "@type": "PostalAddress",
        "streetAddress": "1400 S Lake Shore Drive",
        "addressLocality": "Chicago",
        "postalCode": "60605",
        "addressRegion": "IL",
        "addressCountry": "US"
      }
    }
  ],
  "description": "Established in 1938, the Division of Invertebrates is in charge of all invertebrate groups except insects and other non-marine arthropods. The first curator of this Division was Fritz Haas, formerly of the Senckenberg Museum in Frankfurt, Germany. Haas (1938 - 1969) and his successor Alan Solem (1957 - 1990) built massive mollusk collections, particularly strong in unionid bivalves and terrestrial snails, reflecting their respective research interests. Current curators Rüdiger Bieler (1990 -) and Janet Voight (1990 -) focus their research and collection-building on marine molluscan groups. The varied curatorial research interests, the collecting efforts of past and present collections managers (e.g., John Slapcinsky and Jochen Gerber), and acquisitions of private collections and “orphan collections”",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "inLanguage": "eng",
  "datePublished": "2012-07-11",
  "dateModified": "2019-02-05",
  "publisher": {
      "@type": "Organization",
      "name": "Field Museum",
      "url": "http://www.fieldmuseum.org/",
      "email": "sgrant@fieldmuseum.org",
      "telephone": "312-665-7957"
    },
  "provider": {
    "@type": "Organization",
      "name": "GBIF",
      "url": "https://www.gbif.org",
      "logo": "https://www.gbif.org/img/logo/GBIF-2015.png",
      "email": "info@gbif.org",
      "telephone": "+45 35 32 14 70"
  } 
}  
</script>

This tests ok with the Structured Data Testing Tool

@dnoesgaard
Member Author

Here's a mock-up of what markup for a download could look like:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.15468/dl.2ohxaa",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "http://api.gbif.org/v1/occurrence/download/request/0029115-180131172636756.zip",
    "contentSize": "450889",
    "encodingFormat": "text/csv",
    "expires": "2020-03-13"
  },
  "isBasedOn": [
    {
      "@type": "Dataset",
      "@id": "https://doi.org/10.15468/pdlhty"
    },
    {
      "@type": "Dataset",
      "@id": "https://doi.org/10.15468/xezr5g"
    }
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/10.15468/dl.2ohxaa"
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": "0029115-180131172636756"
    }
  ],
  "url": "https://www.gbif.org/occurrence/download/0029115-180131172636756",
  "name": "GBIF Occurrence Download",
  "description": "A dataset containing 8285 species occurrences available in GBIF matching the query: TaxonKey: Macrolepiota procera (Scop.) Singer, 1948. The dataset includes 8285 records from 146 constituent datasets: ... Data from some individual datasets included in this download may be licensed under less restrictive terms.",
  "license": "http://creativecommons.org/licenses/by/4.0/legalcode",
  "inLanguage": "eng",
  "datePublished": "2018-04-05",
  "provider": {
    "@type": "Organization",
      "name": "GBIF",
      "url": "https://www.gbif.org",
      "logo": "https://www.gbif.org/img/logo/GBIF-2015.png",
      "email": "info@gbif.org"
  } 
}  
</script>

Obviously isBasedOn needs entries for each contributing dataset, and description should/could be expanded to include the full description.
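Generating one `isBasedOn` entry per contributing dataset could be sketched like this in Python. The function name `download_to_jsonld` and its parameters are hypothetical; the example values come from the mock-up above.

```python
def download_to_jsonld(download_key, download_doi, dataset_dois):
    """Build a minimal schema.org Dataset dict for a download (sketch only)."""
    doi_url = "https://doi.org/" + download_doi
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": doi_url,
        "url": "https://www.gbif.org/occurrence/download/" + download_key,
        # One reference per contributing dataset, identified by its DOI.
        "isBasedOn": [
            {"@type": "Dataset", "@id": "https://doi.org/" + doi}
            for doi in dataset_dois
        ],
    }

download = download_to_jsonld(
    "0029115-180131172636756",
    "10.15468/dl.2ohxaa",
    ["10.15468/pdlhty", "10.15468/xezr5g"],
)
```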

@MortenHofft
Member

MortenHofft commented May 1, 2019

Issue with isBasedOn

Obviously isBasedOn needs entries for each contributing dataset, and description should/could be expanded to include the full description.

@dnoesgaard If we add a list of all datasets, this will significantly increase the size of those pages: 2.7 MB in the example I tried. I'm not keen to do that. What alternatives do we have?
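A rough way to see how the markup grows with the number of contributing datasets is to serialize a list of DOI-only references and measure it. This sketch assumes the most compact form (no whitespace, only `@type` and `@id` per entry), so it gives a lower bound and does not attempt to reproduce the 2.7 MB figure; the function name is hypothetical.

```python
import json

def is_based_on_size(n_datasets, doi="10.15468/pdlhty"):
    """Bytes needed for n DOI-only isBasedOn entries, compactly serialized."""
    entries = [{"@type": "Dataset", "@id": "https://doi.org/" + doi}] * n_datasets
    return len(json.dumps(entries, separators=(",", ":")).encode("utf-8"))

# Cost of each additional entry, including the separating comma.
per_entry = is_based_on_size(2) - is_based_on_size(1)
print(per_entry, is_based_on_size(1000))
```

The size grows linearly, so any fixed per-entry cost is multiplied by the number of datasets in the download.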

@MortenHofft
Member

MortenHofft commented May 1, 2019

Issue with description prose filter

matching the query: TaxonKey: Macrolepiota procera (Scop.) Singer, 1948.

Generating a readable text string is not a simple task. Queries can be nested arbitrarily deep and have NOT parts etc.

MortenHofft added a commit to gbif/portal16 that referenced this issue May 1, 2019
@MortenHofft
Member

Datasets now have the described data attached. Downloads need more discussion/consideration I find (see comments above)

@MortenHofft MortenHofft assigned dnoesgaard and unassigned MortenHofft May 1, 2019
@MortenHofft
Member

MortenHofft commented May 2, 2019

@dnoesgaard could you please check to see if it is as you imagine it on UAT or staging? Notice that we do not model people but roles. That means that the same first name might appear many times under different roles. I chose to only list contacts with role ORIGINATOR as authors.

For this dataset this looks like:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.15468/uuvlm6",
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/10.15468/uuvlm6"
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": "6fd297bb-888a-46a7-a870-1f5af1ab1616"
    }
  ],
  "url": "https://www.gbif.org/dataset/6fd297bb-888a-46a7-a870-1f5af1ab1616",
  "name": "Field Museum of Natural History (Geology) Paleobotany Collection",
  "author": [
    {
      "@type": "Person",
      "givenName": "Kate",
      "familyName": "Webbink",
      "email": "kwebbink@fieldmuseum.org",
      "jobTitle": [
        "Information Systems Specialist"
      ],
      "address": {
        "@type": "PostalAddress",
        "streetAddress": []
      },
      "affiliation": {
        "@type": "Organization",
        "name": "Field Museum of Natural History"
      }
    }
  ],
  "description": "The Paleobotany Collection spans 3.8 billion years of history but has its major strengths in the Late Paleozoic and Cretaceous-Paleogene.",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "inLanguage": "eng",
  "datePublished": "2017-01-19T21:05:59.717+0000",
  "dateModified": "2017-01-27T20:49:41.278+0000",
  "publisher": {
    "@type": "Organization",
    "name": "Field Museum",
    "url": "http://www.fieldmuseum.org/"
  },
  "provider": {
    "@type": "Organization",
    "name": "GBIF",
    "url": "https://www.gbif.org",
    "logo": "https://www.gbif.org/img/logo/GBIF50.png",
    "email": "info@gbif.org",
    "telephone": "+45 35 32 14 70"
  }
}
</script>
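Empty values such as the `"streetAddress": []` in the markup above could be pruned before serialization. A minimal Python sketch, assuming that an object left with nothing but its `@type` should be dropped as well (`prune_empty` is a hypothetical helper, not part of the portal code):

```python
EMPTY = (None, "", [], {})

def prune_empty(value):
    """Recursively drop None, empty strings, empty lists and empty objects."""
    if isinstance(value, dict):
        cleaned = {k: prune_empty(v) for k, v in value.items()}
        cleaned = {k: v for k, v in cleaned.items() if v not in EMPTY}
        # An object left with only its "@type" carries no information.
        return {} if set(cleaned) <= {"@type"} else cleaned
    if isinstance(value, list):
        items = [prune_empty(v) for v in value]
        return [v for v in items if v not in EMPTY]
    return value

author = {
    "@type": "Person",
    "givenName": "Kate",
    "familyName": "Webbink",
    "address": {"@type": "PostalAddress", "streetAddress": []},
}
print(prune_empty(author))
```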

UPDATE
"streetAddress": [] looks wrong

@dnoesgaard
Member Author

Yeah, I'm just going over this now, also noticing that only type ORIGINATOR was being mapped. I'll get back to you asap...

@MortenHofft
Member

MortenHofft commented May 2, 2019

kk - as I said, if we choose more roles, contacts are likely to be duplicated. The same person will also have slightly different contact info for different roles, as publishers understandably do not care to fill in the same data 4 times for the same person just because that person has 4 roles. That makes it quite fragile.

The code is here btw: https://github.com/gbif/portal16/blob/master/app/controllers/dataset/key/datasetKey.ctrl.js#L73
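The role filtering described above could look roughly like this in Python. This is a sketch, not a port of the linked controller: the contact keys (`type`, `firstName`, `lastName`) are assumptions, and de-duplicating on the name pair is a deliberately naive choice that would merge two different people sharing a name.

```python
def originators(contacts):
    """Return contacts with role ORIGINATOR, de-duplicated by name."""
    seen = set()
    result = []
    for contact in contacts:
        if contact.get("type") != "ORIGINATOR":
            continue
        key = (contact.get("firstName"), contact.get("lastName"))
        if key in seen:
            continue  # same person listed again
        seen.add(key)
        result.append(contact)
    return result

contacts = [
    {"type": "ORIGINATOR", "firstName": "Kate", "lastName": "Webbink"},
    {"type": "METADATA_AUTHOR", "firstName": "Kate", "lastName": "Webbink"},
    {"type": "ORIGINATOR", "firstName": "Kate", "lastName": "Webbink"},
]
authors = originators(contacts)
```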

@MortenHofft
Member

@dnoesgaard regarding the downloads - is it possible to refer to an external document instead of adding 23K isBasedOn to the JSON? Other ideas?

@dnoesgaard
Member Author

  1. I want to check with Datacite before going further with downloads.

  2. I realized that I left out a small but important part for the publisher:
    "logo": dataset.publisher.logo

Can we please add that?

  3. Re addresses: it was an attempt to make it more granular. If you think it makes more sense, it can also just be a concatenated string with commas or something, e.g.

"address": contact.address + contact.city etc....

@dnoesgaard
Member Author

Issue with description prose filter

matching the query: TaxonKey: Macrolepiota procera (Scop.) Singer, 1948.

Generating a readable text string is not a simple task. Queries can be nested arbitrarily deep and have NOT parts etc.

We generate a string like this for the DOI metadata; perhaps the same method could be employed?

Example: curl https://api.datacite.org/dois/10.15468/dl.2ohxaa | jq '.data.attributes.descriptions'
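Once the response from that endpoint has been fetched and parsed, the descriptions could be pulled out along the same path as the `jq` filter. In this Python sketch the `data.attributes.descriptions` path comes from the command above, while the `description` key inside each entry and the unescaping of HTML entities are assumptions; `datacite_descriptions` is a hypothetical name.

```python
import html

def datacite_descriptions(payload):
    """Return description strings from a parsed DataCite DOI response."""
    entries = payload.get("data", {}).get("attributes", {}).get("descriptions") or []
    # Unescape in case the stored text contains HTML entities.
    return [html.unescape(e["description"]) for e in entries if e.get("description")]

sample = {
    "data": {
        "attributes": {
            "descriptions": [
                {"description": "Year: &gt;= 1467", "descriptionType": "Abstract"}
            ]
        }
    }
}
print(datacite_descriptions(sample))
```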

@dnoesgaard
Member Author

Notice that we do not model people but roles. That means that the same first name might appear many times under different roles. I chose to only list contacts with role ORIGINATOR as authors.

I did a quick check of 500 datasets and found ORIGINATOR to be the most common contact type used. We can't really do a one-to-one mapping of contacts, so I think your approach is sensible.

@dnoesgaard
Member Author

A few additional improvements to the dataset metadata schema - based on http://api.gbif.org/v1/dataset/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d as an example:

  1. "temporalCoverage" : "1499-12-23T00:00:00.000+0000/2014-06-12T00:00:00.000+0000"

That is, temporalCoverages.start and .end separated by a forward slash.

  2. "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": "41.271, 51.174, -5.266, 9.341" } }

with values from geographicCoverages.boundingBox—can also just otherwise be a string:

"spatialCoverage": "France"

I realize that we probably allow several instances of these, but I'll leave that to you to decide how to handle...
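Both coverage values could be built with small helpers like the ones below. Several things here are assumptions to verify: the bounding box keys (`minLatitude`, `maxLatitude`, `minLongitude`, `maxLongitude`), and the ordering of the `box` string. schema.org describes a box as two points separated by a space, lower corner first, and the latitude-before-longitude order used here follows Google's dataset guidelines as far as I understand them, so the comma-separated example above may need reordering. It is worth validating the result with a structured data testing tool.

```python
def temporal_coverage(start, end=None):
    """ISO 8601 interval 'start/end'; '..' marks an open-ended range."""
    return f"{start}/{end if end else '..'}"

def spatial_coverage(bbox):
    """Place with a GeoShape box written as 'south west north east'."""
    box = "{minLatitude} {minLongitude} {maxLatitude} {maxLongitude}".format(**bbox)
    return {"@type": "Place", "geo": {"@type": "GeoShape", "box": box}}

coverage = spatial_coverage(
    {"minLatitude": 41.271, "maxLatitude": 51.174,
     "minLongitude": -5.266, "maxLongitude": 9.341}
)
period = temporal_coverage("1499-12-23T00:00:00.000+0000", "2014-06-12T00:00:00.000+0000")
```

When a dataset has several coverages, each could be mapped separately and emitted as an array.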

@MortenHofft
Member

MortenHofft commented May 3, 2019

We generate a string like this for the DOI metadata, perhaps the same method could be employed?

@dnoesgaard The result does not look correct to me. For 10.15468/dl.cq5skd this returns A dataset containing 30255 species occurrences available in GBIF matching the query: Year: &gt;= 1467 or &lt;= 2017 or 1000 or 1000 or 1855.

[Screenshot: 2019-05-03 at 08:54:40]

We can of course do a somewhat readable serialization if it is deemed important, but the above solution doesn't look correct to me. If all of this is already in that endpoint (https://api.datacite.org/dois/10.15468/dl.2ohxaa), perhaps we could use that somehow?

@dnoesgaard
Member Author

I for one doubt the usefulness of that description in the first place. We could even simply do something like

A dataset containing <n> species occurrences available in GBIF. The details of the query used to generate this dataset are available at <DOI>.

Or we could include a minified version of the query predicates, like what we show for complex queries, e.g.

"description": "A dataset containing 30255 species occurrences available in GBIF matching the query:\n{\"type\":\"or\",\"predicates\":[{\"type\":\"and\",\"predicates\":[{\"type\":\"greaterThanOrEquals\",\"key\":\"YEAR\",\"value\":\"1467\"},{\"type\":\"lessThanOrEquals\",\"key\":\"YEAR\",\"value\":\"2017\"},{\"type\":\"equals\",\"key\":\"YEAR\",\"value\":\"1000\"}]},{\"type\":\"equals\",\"key\":\"YEAR\",\"value\":\"1000\"},{\"type\":\"equals\",\"key\":\"YEAR\",\"value\":\"1855\"}]}"

A great deal of escaping required though.

Or even:

"description": "A dataset containing 30255 species occurrences available in GBIF matching the query: {\r\n \"type\": \"or\",\r\n \"predicates\": [\r\n {\r\n \"type\": \"and\",\r\n \"predicates\": [\r\n {\r\n \"type\": \"greaterThanOrEquals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1467\"\r\n },\r\n {\r\n \"type\": \"lessThanOrEquals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"2017\"\r\n },\r\n {\r\n \"type\": \"equals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1000\"\r\n }\r\n ]\r\n },\r\n {\r\n \"type\": \"equals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1000\"\r\n },\r\n {\r\n \"type\": \"equals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1855\"\r\n }\r\n ]\r\n}"

:)
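The escaping does not have to be done by hand: serializing the predicate to a string first and then serializing the whole JSON-LD object takes care of it. A Python sketch covering both description variants above (`download_description` is a hypothetical helper, and the single-clause predicate is a simplified example):

```python
import json

def download_description(count, doi, predicate=None):
    """Describe a download, either pointing to the DOI or embedding the query."""
    text = f"A dataset containing {count} species occurrences available in GBIF"
    if predicate is None:
        return (f"{text}. The details of the query used to generate "
                f"this dataset are available at {doi}.")
    # Compact form of the predicate; quotes are still unescaped at this stage.
    return f"{text} matching the query:\n" + json.dumps(predicate, separators=(",", ":"))

predicate = {"type": "equals", "key": "YEAR", "value": "1855"}
description = download_description(30255, "https://doi.org/10.15468/dl.cq5skd", predicate)

# The outer json.dumps escapes the inner quotes and the newline.
markup = json.dumps({"description": description})
print(markup)
```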

@qgroom

qgroom commented May 10, 2019

Great to see this initiative! I can't fault your mapping, but you might consider this...

{
  "@type": "Person",
  "name": [FULL NAME],
  "affiliation": [ORGANIZATION],
  "givenName": contact.firstName,
  "familyName": contact.lastName,
  "email": contact.email,
  "identifier" : contact.userId,
  "telephone": contact.phone,
  "url": contact.homepage
},

Also...
In the IPT there are keywords and version numbers of the dataset, but I don't see them on GBIF.
If they are available it would be good to add them.

@dnoesgaard
Member Author

Hey @qgroom, thanks for the input. Can you please clarify which changes/additions to person you're suggesting?

@dnoesgaard
Member Author

@MortenHofft - it looks like we have version and keywordCollections.keywords[] (especially for IPT-hosted datasets) that could be added top-level as:

"version": datasetKey.dataset.version,
"keywords": (concatenate datasetKey.dataset.keywordCollections.keywords[] using commas)

Make sense?
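The keyword concatenation could be sketched as follows in Python, assuming `keywordCollections` is a list of objects that each carry a `keywords` list (as the field path above suggests). Dropping blanks and repeated keywords is an extra step not mentioned above, and `keywords_string` is a hypothetical name.

```python
def keywords_string(dataset):
    """Flatten all keyword collections into one comma-separated string."""
    seen = []
    for collection in dataset.get("keywordCollections") or []:
        for keyword in collection.get("keywords") or []:
            keyword = keyword.strip()
            if keyword and keyword not in seen:
                seen.append(keyword)  # keep first occurrence, preserve order
    return ", ".join(seen)

example = {
    "version": "1.2",
    "keywordCollections": [
        {"keywords": ["Occurrence", "Specimen"]},
        {"keywords": ["Specimen", " Mollusca "]},
    ],
}
print(keywords_string(example))
```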

@qgroom

qgroom commented May 10, 2019

The suggestion was to add the name as the full name, rather than the given and family name. I think this is more usual, so I suppose it would make the name more findable.
Of course the ORCID ID under identifier is most important.
Also, to add the affiliation for the person. This is often not the same as the organization associated with a dataset.

@qgroom

qgroom commented May 10, 2019

Currently, there is a taxonomicRange property in bioschemas, but this is not available in the schema for dataset.

I think we can make a strong argument for this being added to dataset.
As a temporary measure the taxonomic scope could be added under about, but it is a bit ugly.

This should perhaps be raised as an issue for schema.org.

@stylesm might be interested in this thread

@dnoesgaard
Member Author

It looks like we should be doing either name or givenName and familyName - but not both. I'm happy to stick with

"name": contact.firstName + " " + contact.lastName

Re. identifier, absolutely agree on its importance. We should add:

"identifier": contact.userId

For contacts, I believe we're already using the organization of the contact rather than the publisher org, see https://github.com/gbif/portal16/blob/5f07c9d37b5a0937181aa511bca2800d3a2e08b1/app/controllers/dataset/key/datasetKey.ctrl.js#L92
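Put together, a person entry with a full name, an identifier and the contact's own organization could be built like this. It is a sketch under assumptions: the contact keys are the ones used in the pseudocode above plus `organization`, `userId` is accepted either as a single string or as a list (only the first value is used), and the identifier in the example is a placeholder, not a real ORCID.

```python
def person(contact):
    """Build a schema.org Person from a contact dict (keys are assumptions)."""
    parts = (contact.get("firstName"), contact.get("lastName"))
    entry = {"@type": "Person", "name": " ".join(p for p in parts if p)}

    user_ids = contact.get("userId") or []
    if isinstance(user_ids, str):
        user_ids = [user_ids]
    if user_ids:
        entry["identifier"] = user_ids[0]

    # Use the contact's own organization, not the dataset publisher.
    if contact.get("organization"):
        entry["affiliation"] = {"@type": "Organization", "name": contact["organization"]}
    return entry

example = person({
    "firstName": "Kate",
    "lastName": "Webbink",
    "userId": ["https://orcid.org/0000-0000-0000-0000"],  # placeholder value
    "organization": "Field Museum of Natural History",
})
```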

@dnoesgaard
Member Author

@MortenHofft - if you're losing track of this, I'm happy to compile all input for you for later... ;)

@dnoesgaard
Member Author

With that being said, I'd also like to see a version of this live soon. We can always make additions later on...

@stylesm

stylesm commented May 10, 2019

I think the issue with adding taxonomicRange to the schema.org Dataset type is that taxonomies are a very bio specific thing.

The example we were talking about with the Sample type and the reason for renaming the Bioschemas Sample to BioSample is that you could have for example a carpet sample, and so the properties we were talking about on Wednesday (e.g. gender, disease status, etc) wouldn't apply to carpet samples.

However, the taxonomicRange property actually comes from BioChemEntity type (https://bioschemas.org/types/BioChemEntity/). So one of the ways of dealing with this might be to reference BioChemEntity as an additionalType (https://schema.org/additionalType).

So in code this would look a bit like:

{
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.15468/uuvlm6",
    "additionalType": "https://bioschemas.org/types/BioChemEntity/",
    "identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "doi",
        "value": "https://doi.org/10.15468/uuvlm6"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "UUID",
        "value": "6fd297bb-888a-46a7-a870-1f5af1ab1616"
      }
    ],
    "url": "https://www.gbif.org/dataset/6fd297bb-888a-46a7-a870-1f5af1ab1616",
    "taxonomicRange": "whatever the Text value is.. or this could be a Taxon type value, or array of either of those.. etc",
    "_comment": "rest of JSON truncated!"
}

Thoughts?

@dnoesgaard
Member Author

Summarizing outstanding issues that I'd like to see done:

  • add logo to publisher, i.e.:

"logo": dataset.publisher.logo

  •  add temporalCoverage, spatialCoverage, version and keywords to dataset:

"temporalCoverage": dataset.temporalCoverages.start/dataset.temporalCoverages.end, e.g.

"temporalCoverage" : "1499-12-23T00:00:00.000+0000/2014-06-12T00:00:00.000+0000"

"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": geographicCoverages.boundingBox } }, e.g.

"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": "41.271, 51.174, -5.266, 9.341" } }

(or simply, e.g., "spatialCoverage": "France")

"version": datasetKey.dataset.version
"keywords": (concatenate datasetKey.dataset.keywordCollections.keywords[] using commas)

  • change person names to full name and add identifier

"name": contact.firstName + " " + contact.lastName
"identifier": contact.userId

After testing these, I suggest we do the first prod release.

@dnoesgaard
Member Author

(as always, apologies for the pseudocode)

@qgroom

qgroom commented May 10, 2019

Indeed, most of the properties of dataset are properties of the dataset and not of its contents. I suppose spatialCoverage and temporalCoverage are special cases, because they are so widely applicable.

about could be used, but wouldn't we lose the ability to search taxonomy hierarchically? Say I wanted to find all the data about the family Asteraceae in the 19th century in Africa. I can restrict by location and date, but I would then have to try every taxon name of the family (>34,000).

Although taxonomy is very bio, it is also quite fundamental for many datasets.

We could perhaps make more use of identifiers and sameAs, but that is not an easy thing to suggest in the biological world.

@stylesm

stylesm commented May 10, 2019

@qgroom but if you use the additionalType property, you get the taxonomicRange property (values of type Taxon for the hierarchy, or simply Text or URL) from the BioChemEntity, and the other properties from the Recordset type.

@dnoesgaard
Member Author

We have the first version of this live now and would like to close this issue. Please consider raising new issues with additional improvements to the markup.

@dnoesgaard
Member Author

Looks like GDS is starting to pick up the metadata now:
https://toolbox.google.com/datasetsearch/search?query=site%3Agbif.org

@csbrown-noaa

Does this include JSON-LD for the info in meta.xml? Is anyone working on JSON-LD for the DwC vocab?

6 participants