Google Dataset Search - Getting dataset metadata as JSON-LD Markup (schema.org definition) #1669

dnoesgaard · 2018-11-28T16:48:39Z

This issue has been discussed in various places, including Twitter, Discourse, etc. but I'm not sure we ever had a real issue for it. At least I couldn't find one...

Background: Google has recently launched Dataset Search, and through DataCite all GBIF datasets for which DOIs have been minted are exposed. However, because the GBIF dataset pages lack structured markup, the search engine falls back to the DataCite version (which has a DataCite logo and links to a DataCite search), e.g.

Bernice P. Bishop Museum

This could be improved if the GBIF dataset page included JSON-LD Markup, e.g.

<script type="application/ld+json">
{
  "@context" : "http://schema.org",
  "@type" : "Dataset",
  "name" : "Bernice P. Bishop Museum",
  "description" : "The Bernice Pauahi Bishop Museum, designated the Hawaiʻi State Museum of Natural and Cultural History, is a museum of history and science located in the Kalihi district of Honolulu on the Hawaiian island of O’ahu. Founded in 1889, it is the largest museum in Hawai’i and is home to one of the world’s largest collections of natural history material from the Pacific region, with approximately 21 million specimens. The main collections include Entomology, Malacology, Botany, Ichthyology, Vertebrate…",
  "spatialCoverage" : "Primarily focused on the tropical Indo-Pacific region, with an emphasis on Oceania",
  "identifier" : "10.15468/s6ctus",
  "license" : "CC0 1.0",
  "distribution" : {
    "@type" : "DataDownload",
    "contentUrl" : "https://www.gbif.org/dataset/b929f23d-290f-4e85-8f17-764c55b3b284"
  },
  "sourceOrganization" : "Bernice Pauahi Bishop Museum",
  "datePublished" : "2012-10-23"
}
</script>

Google reference: https://developers.google.com/search/docs/data-types/dataset
Schema.org dataset definition: http://schema.org/Dataset
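
As a rough illustration of how such markup could be emitted from a metadata object, here is a minimal sketch. The helper name `toJsonLdScriptTag` is hypothetical and not taken from any GBIF code; the escaping of `<` is a common precaution so that a description containing `</script>` cannot end the tag early.

```javascript
// Serialize a metadata object into a JSON-LD <script> tag.
// `toJsonLdScriptTag` is a hypothetical helper name, not part of any GBIF code.
function toJsonLdScriptTag(metadata) {
  // Escape "<" so a value containing "</script>" cannot close the tag early.
  // "\u003c" is still valid JSON and decodes back to "<".
  const json = JSON.stringify(metadata, null, 2).replace(/</g, '\\u003c');
  return '<script type="application/ld+json">\n' + json + '\n</script>';
}

const tag = toJsonLdScriptTag({
  '@context': 'http://schema.org',
  '@type': 'Dataset',
  name: 'Bernice P. Bishop Museum',
  identifier: '10.15468/s6ctus'
});
console.log(tag);
```

Building the object first and serializing it in one step also avoids hand-escaping quotes in long descriptions.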


MattBlissett · 2018-11-28T16:56:27Z

We should also do this for downloads, using isBasedOn, and add (restore?) the sitemap for datasets

dnoesgaard · 2018-11-28T16:58:03Z

Agreed. I actually meant datasets in the wider context–so anything for which we mint DOIs, including downloads...

dnoesgaard · 2018-11-29T15:04:40Z

I will gladly help map data against the schema.org definitions–when/if this issue is prioritized.

(our own DOI metadata mapping could probably use a reworking too, plus an upgrade to version 4.x of the schema)

MortenHofft · 2018-11-30T13:17:30Z

@dnoesgaard : I will gladly help map data against the schema.org definitions–when/if this issue is prioritized.

It is likely trivial to add, so yes please :) If you write up what you believe is the appropriate mapping, then we can likely add it easily

dnoesgaard · 2018-12-03T15:11:07Z

Ok, here's my first attempt at a definition for datasets. I apologize for the pseudo-code, but I'm sure you understand what I'm referring to :)

(we might want to add translations for the hardcoded values in the provider section–your call)

<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/" + datasetKey.dataset.doi,
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/" + datasetKey.dataset.doi
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": datasetKey
    }
  ],
  "url": "https://www.gbif.org/dataset/" + datasetKey,
  "name": datasetKey.dataset.title,
  // in the author section, we need an entry for each of the dataset contacts
  "author": [
    // repeat the next element for each of the dataset contacts
    {
      "@type": "Person",
      "givenName": contact.firstName,
      "familyName": contact.lastName,
      "email": contact.email,
      "identifier" : contact.userId,
      "telephone": contact.phone,
      "url": contact.homepage
    },
  	... 
  ],
  "description": datasetKey.dataset.description,
  "license": datasetKey.dataset.license,
  "inLanguage": datasetKey.dataset.dataLanguage,
  "datePublished": datasetKey.dataset.created,
  "dateModified": datasetKey.dataset.modified,
  "publisher": {
      "@type": "Organization",
      "name": publisherKey.publisher.title,
      "url": publisherKey.publisher.homepage[0],
      "logo": publisherKey.publisher.logoUrl,
      "email": publisherKey.publisher.email[0],
      "telephone": publisherKey.publisher.phone[0]
    },
  "provider": {
    "@type": "Organization",
      "name": "GBIF",
      "url": "https://www.gbif.org",
      "logo": "https://www.gbif.org/img/logo/GBIF-2015.png",
      "email": "info@gbif.org",
      "telephone": "+45 35 32 14 70"
  }
}</script>
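
The pseudo-code mapping above could be turned into a runnable function along these lines. This is only a sketch: the field names (`dataset.key`, `dataset.contacts`, `publisher.logoUrl`, and so on) follow the pseudo-code and are assumptions about the shape of the records, not a confirmed API contract.

```javascript
// Build schema.org Dataset markup from a dataset record and its publisher.
// Field names follow the pseudo-code above; the exact record shape is assumed.
function buildDatasetJsonLd(dataset, publisher) {
  const doiUrl = 'https://doi.org/' + dataset.doi;
  return {
    '@context': 'http://schema.org',
    '@type': 'Dataset',
    '@id': doiUrl,
    identifier: [
      { '@type': 'PropertyValue', propertyID: 'doi', value: doiUrl },
      { '@type': 'PropertyValue', propertyID: 'UUID', value: dataset.key }
    ],
    url: 'https://www.gbif.org/dataset/' + dataset.key,
    name: dataset.title,
    // One Person entry per dataset contact.
    author: (dataset.contacts || []).map(function (contact) {
      return {
        '@type': 'Person',
        givenName: contact.firstName,
        familyName: contact.lastName,
        email: contact.email,
        identifier: contact.userId,
        telephone: contact.phone,
        url: contact.homepage
      };
    }),
    description: dataset.description,
    license: dataset.license,
    inLanguage: dataset.dataLanguage,
    datePublished: dataset.created,
    dateModified: dataset.modified,
    publisher: {
      '@type': 'Organization',
      name: publisher.title,
      url: (publisher.homepage || [])[0],
      logo: publisher.logoUrl,
      email: (publisher.email || [])[0],
      telephone: (publisher.phone || [])[0]
    },
    provider: {
      '@type': 'Organization',
      name: 'GBIF',
      url: 'https://www.gbif.org',
      logo: 'https://www.gbif.org/img/logo/GBIF-2015.png',
      email: 'info@gbif.org',
      telephone: '+45 35 32 14 70'
    }
  };
}

const example = buildDatasetJsonLd(
  { key: '005eb8d8-ed94-41be-89cf-e3115a9058e4', doi: '10.15468/6q5vuc',
    title: 'Field Museum of Natural History (Zoology) Invertebrate Collection',
    contacts: [{ firstName: 'Sharon', lastName: 'Grant' }] },
  { title: 'Field Museum', homepage: ['http://www.fieldmuseum.org/'] }
);
console.log(JSON.stringify(example, null, 2));
```

Properties whose value is `undefined` are omitted by `JSON.stringify`, so missing contact details simply do not appear in the output.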

MattBlissett · 2019-01-25T15:45:02Z

That logo has a lot of white space around it.

Otherwise, can we try this in Dev, @MortenHofft? Also, is there any issue with it appearing on UAT? The UAT registry includes many (older) production downloads, and will be crawled, so it might be necessary to hardcode www.gbif.org rather than let this vary based on the dev/uat/prod site.

dnoesgaard · 2019-01-28T07:09:19Z

Personally, I'd like to see this in action before we allow Google to crawl it.

Why is UAT indexed by Google anyway?

MattBlissett · 2019-01-28T07:29:19Z

User-agent: *
...
Disallow: /occurrence/

So the download pages are already not supposed to be crawled; I think Google's data is just what's in DataCite.

We can update that with

Disallow: /occurrence/1
Disallow: /occurrence/2
Disallow: /occurrence/3
...

once it's ready.
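
One reading of the proposal above is that narrowing the rule from `/occurrence/` to digit-prefixed paths keeps individual occurrence pages blocked while opening up `/occurrence/download/...`. The sketch below illustrates that with plain prefix matching. It is a simplification of how crawlers evaluate robots.txt, and the occurrence key used is made up.

```javascript
// Simplified robots.txt check: a path counts as blocked when it starts with
// any Disallow prefix. Real crawlers also honour Allow rules and wildcards,
// which this sketch deliberately ignores.
function isDisallowed(path, disallowPrefixes) {
  return disallowPrefixes.some(function (prefix) {
    return path.startsWith(prefix);
  });
}

const currentRules = ['/occurrence/'];
// Only the three prefixes shown above; the elided remainder is left out.
const proposedRules = ['/occurrence/1', '/occurrence/2', '/occurrence/3'];

const downloadPage = '/occurrence/download/0029115-180131172636756';
const occurrencePage = '/occurrence/1234567'; // hypothetical occurrence key

console.log(isDisallowed(downloadPage, currentRules));    // true
console.log(isDisallowed(downloadPage, proposedRules));   // false
console.log(isDisallowed(occurrencePage, proposedRules)); // true
```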

MortenHofft · 2019-01-28T09:04:28Z

@dnoesgaard: Why is UAT indexed by Google anyway?

Because the old site was and I was told that was a deliberate decision. I'd be more than happy to change that.

MortenHofft · 2019-01-28T09:16:33Z

@MattBlissett: We should also [...] add (restore?) the sitemap for datasets

We have had sitemaps for datasets all along. The reference to them was just listed in robots.txt

dnoesgaard · 2019-01-28T09:16:54Z

Re the logo file, it looks like Google are scaling down to 50 pixels in height. This one will work fine, I think: https://gbif.box.com/shared/static/dxxlqeikavxw4zadqrryh0hd7ad0gc5q.png

Mock-up:

dnoesgaard · 2019-04-25T09:13:21Z

Any chance we could move forward with this one? If anything further is needed from my side, please let me know :)

dnoesgaard · 2019-04-25T10:07:52Z

I did a mockup of marked-up metadata for a dataset—this time adding a bit more detail for the contacts. Should be pretty self-explanatory...

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.15468/6q5vuc",
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/10.15468/6q5vuc"
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": "005eb8d8-ed94-41be-89cf-e3115a9058e4"
    }
  ],
  "url": "https://www.gbif.org/dataset/005eb8d8-ed94-41be-89cf-e3115a9058e4",
  "name": "Field Museum of Natural History (Zoology) Invertebrate Collection",
  "author": [
    {
      "@type": "Person",
      "givenName": "Sharon",
      "familyName": "Grant",
      "email": "sgrant@fieldmuseum.org",
      "telephone": "3126657203",
      "jobTitle": "Technology Liaison to Science",
      "affiliation": {
        "@type": "Organization",
        "name": "Field Museum of Natural History"
      },
      "address": {
        "@type": "PostalAddress",
        "streetAddress": "1400 S Lake Shore Drive",
        "addressLocality": "Chicago",
        "postalCode": "60605",
        "addressRegion": "IL",
        "addressCountry": "US"
      }
    },
    {
      "@type": "Person",
      "givenName": "Janeen",
      "familyName": "Jones",
      "email": "jjones@fieldmuseum.org",
      "telephone": "",
      "jobTitle": "Assistant Collections Manager",
      "affiliation": {
        "@type": "Organization",
        "name": "Field Museum of Natural History"
      },
      "address": {
        "@type": "PostalAddress",
        "streetAddress": "1400 S Lake Shore Drive",
        "addressLocality": "Chicago",
        "postalCode": "60605",
        "addressRegion": "IL",
        "addressCountry": "US"
      }
    } 
  ],
  "description": "Established in 1938, the Division of Invertebrates is in charge of all invertebrate groups except insects and other non-marine arthropods. The first curator of this Division was Fritz Haas, formerly of the Senckenberg Museum in Frankfurt, Germany. Haas (1938 - 1969) and his successor Alan Solem (1957 - 1990) built massive mollusk collections, particularly strong in unionid bivalves and terrestrial snails, reflecting their respective research interests. Current curators Rüdiger Bieler (1990 -) and Janet Voight (1990 -) focus their research and collection-building on marine molluscan groups. The varied curatorial research interests, the collecting efforts of past and present collections managers (e.g., John Slapcinsky and Jochen Gerber), and acquisitions of private collections and “orphan collections”",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "inLanguage": "eng",
  "datePublished": "2012-07-11",
  "dateModified": "2019-02-05",
  "publisher": {
      "@type": "Organization",
      "name": "Field Museum",
      "url": "http://www.fieldmuseum.org/",
      "email": "sgrant@fieldmuseum.org",
      "telephone": "312-665-7957"
    },
  "provider": {
    "@type": "Organization",
      "name": "GBIF",
      "url": "https://www.gbif.org",
      "logo": "https://www.gbif.org/img/logo/GBIF-2015.png",
      "email": "info@gbif.org",
      "telephone": "+45 35 32 14 70"
  } 
}  
</script>

This tests ok with the Structured Data Testing Tool

dnoesgaard · 2019-04-25T12:00:34Z

Here's a mock-up of what markup for a download could look like:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.15468/dl.2ohxaa",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "http://api.gbif.org/v1/occurrence/download/request/0029115-180131172636756.zip",
    "contentSize": "450889",
    "encodingFormat": "text/csv",
    "expires": "2020-03-13"
  },
  "isBasedOn": [
    {
      "@type": "Dataset",
      "@id": "https://doi.org/10.15468/pdlhty"
    },
    {
      "@type": "Dataset",
      "@id": "https://doi.org/10.15468/xezr5g"
    }
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/10.15468/dl.2ohxaa"
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": "0029115-180131172636756"
    }
  ],
  "url": "https://www.gbif.org/occurrence/download/0029115-180131172636756",
  "name": "GBIF Occurrence Download",
  "description": "A dataset containing 8285 species occurrences available in GBIF matching the query: TaxonKey: Macrolepiota procera (Scop.) Singer, 1948. The dataset includes 8285 records from 146 constituent datasets: ... Data from some individual datasets included in this download may be licensed under less restrictive terms.",
  "license": "http://creativecommons.org/licenses/by/4.0/legalcode",
  "inLanguage": "eng",
  "datePublished": "2018-04-05",
  "provider": {
    "@type": "Organization",
      "name": "GBIF",
      "url": "https://www.gbif.org",
      "logo": "https://www.gbif.org/img/logo/GBIF-2015.png",
      "email": "info@gbif.org"
  } 
}  
</script>

Obviously isBasedOn needs entries for each contributing dataset, and description should/could be expanded to include the full description.

MortenHofft · 2019-05-01T09:44:35Z

Issue with isBasedOn

Obviously isBasedOn needs entries for each contributing dataset, and description should/could be expanded to include the full description.

@dnoesgaard If we add a list of all datasets, this will significantly increase the size of those pages: 2.7 MB in the example I tried. I'm not keen to do that. What alternatives do we have?

MortenHofft · 2019-05-01T09:47:15Z

Issue with description prose filter

matching the query: TaxonKey: Macrolepiota procera (Scop.) Singer, 1948.

Generating a readable text string is not a simple task. Queries can be nested arbitrarily deep and have NOT parts etc.

MortenHofft · 2019-05-01T09:52:34Z

Datasets now have the described data attached. Downloads need more discussion/consideration I find (see comments above)

MortenHofft · 2019-05-02T08:02:21Z

@dnoesgaard could you please check to see if it is as you imagine it on UAT or staging? Notice that we do not model people but roles. That means that the same first name might appear many times under different roles. I chose to only list contacts with role ORIGINATOR as authors.

For this dataset this looks like:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.15468/uuvlm6",
  "identifier": [
    {
      "@type": "PropertyValue",
      "propertyID": "doi",
      "value": "https://doi.org/10.15468/uuvlm6"
    },
    {
      "@type": "PropertyValue",
      "propertyID": "UUID",
      "value": "6fd297bb-888a-46a7-a870-1f5af1ab1616"
    }
  ],
  "url": "https://www.gbif.org/dataset/6fd297bb-888a-46a7-a870-1f5af1ab1616",
  "name": "Field Museum of Natural History (Geology) Paleobotany Collection",
  "author": [
    {
      "@type": "Person",
      "givenName": "Kate",
      "familyName": "Webbink",
      "email": "kwebbink@fieldmuseum.org",
      "jobTitle": [
        "Information Systems Specialist"
      ],
      "address": {
        "@type": "PostalAddress",
        "streetAddress": []
      },
      "affiliation": {
        "@type": "Organization",
        "name": "Field Museum of Natural History"
      }
    }
  ],
  "description": "The Paleobotany Collection spans 3.8 billion years of history but has its major strengths in the Late Paleozoic and Cretaceous-Paleogene.",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "inLanguage": "eng",
  "datePublished": "2017-01-19T21:05:59.717+0000",
  "dateModified": "2017-01-27T20:49:41.278+0000",
  "publisher": {
    "@type": "Organization",
    "name": "Field Museum",
    "url": "http://www.fieldmuseum.org/"
  },
  "provider": {
    "@type": "Organization",
    "name": "GBIF",
    "url": "https://www.gbif.org",
    "logo": "https://www.gbif.org/img/logo/GBIF50.png",
    "email": "info@gbif.org",
    "telephone": "+45 35 32 14 70"
  }
}
</script>

UPDATE
"streetAddress": [] looks wrong

dnoesgaard · 2019-05-02T08:03:27Z

Yeah, I'm just going over this now, also noticing that only type ORIGINATOR was being mapped. I'll get back to you asap...
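
The empty `"streetAddress": []` flagged above could be avoided by pruning empty values before serializing. The following is a minimal sketch of that idea; `pruneEmpty` is a hypothetical helper and not the code used in portal16.

```javascript
// Recursively drop null/undefined values, empty strings and empty arrays, then
// remove objects left with nothing but an "@type". `pruneEmpty` is a
// hypothetical helper, not the code used in portal16.
function pruneEmpty(value) {
  if (Array.isArray(value)) {
    const items = value.map(pruneEmpty).filter(function (v) { return v !== undefined; });
    return items.length > 0 ? items : undefined;
  }
  if (value !== null && typeof value === 'object') {
    const result = {};
    Object.keys(value).forEach(function (key) {
      const pruned = pruneEmpty(value[key]);
      if (pruned !== undefined) result[key] = pruned;
    });
    const meaningful = Object.keys(result).filter(function (k) { return k !== '@type'; });
    return meaningful.length > 0 ? result : undefined;
  }
  if (value === null || value === undefined || value === '') return undefined;
  return value;
}

const author = pruneEmpty({
  '@type': 'Person',
  givenName: 'Kate',
  familyName: 'Webbink',
  jobTitle: ['Information Systems Specialist'],
  address: { '@type': 'PostalAddress', streetAddress: [] }
});
console.log(JSON.stringify(author));
// {"@type":"Person","givenName":"Kate","familyName":"Webbink","jobTitle":["Information Systems Specialist"]}
```

Because the address object is left with only its `@type`, it is dropped entirely rather than emitted as an empty `PostalAddress`.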

MortenHofft · 2019-05-02T08:06:05Z

kk - as said if we choose more roles, then contacts are likely to be duplicated. And the same person will have slightly different contact info for different roles as publishers understandably do not care to fill in the same data 4 times for the same person just because that person has 4 roles. That makes it quite fragile.

The code is here btw: https://github.com/gbif/portal16/blob/master/app/controllers/dataset/key/datasetKey.ctrl.js#L73
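
For illustration only (this is not the linked controller code), selecting authors by role might look like the sketch below. The `type` field name, the array-valued `email`, and the sample records are assumptions. The name-based de-duplication shows one way to collapse repeats if more roles were included, with the first matching record winning.

```javascript
// Keep only contacts with the given roles and map them to schema.org Person
// entries. The `type` field and array-valued `email` are assumptions about
// the contact records.
function authorsFromContacts(contacts, roles) {
  const wanted = roles || ['ORIGINATOR'];
  const seen = new Set();
  return contacts
    .filter(function (contact) { return wanted.indexOf(contact.type) !== -1; })
    .filter(function (contact) {
      // Collapse repeats of the same person, keyed by name; first one wins.
      const key = [contact.firstName, contact.lastName].join('|').toLowerCase();
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .map(function (contact) {
      return {
        '@type': 'Person',
        givenName: contact.firstName,
        familyName: contact.lastName,
        email: (contact.email || [])[0]
      };
    });
}

// Illustrative records, not actual registry data.
const contacts = [
  { type: 'ORIGINATOR', firstName: 'Kate', lastName: 'Webbink', email: ['kwebbink@fieldmuseum.org'] },
  { type: 'METADATA_AUTHOR', firstName: 'Kate', lastName: 'Webbink', email: [] },
  { type: 'ADMINISTRATIVE_POINT_OF_CONTACT', firstName: 'Kate', lastName: 'Webbink' }
];
console.log(authorsFromContacts(contacts));
```

Keying on the name is itself fragile, for the reason given above: the same person may be entered slightly differently under each role.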

MortenHofft · 2019-05-02T08:10:22Z

@dnoesgaard regarding the downloads - is it possible to refer to an external document instead of adding 23K isBasedOn to the JSON? Other ideas?
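
To get a feel for the size concern, the sketch below builds one `isBasedOn` entry per contributing DOI and measures the serialized result. The 23,000 placeholder DOIs are synthetic and only stand in for a large download; `Buffer` assumes a Node.js runtime.

```javascript
// Build one isBasedOn entry per contributing dataset DOI.
function buildIsBasedOn(dois) {
  return dois.map(function (doi) {
    return { '@type': 'Dataset', '@id': 'https://doi.org/' + doi };
  });
}

// Size of the serialized JSON in bytes (Node.js Buffer).
function jsonSizeInBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), 'utf8');
}

const two = buildIsBasedOn(['10.15468/pdlhty', '10.15468/xezr5g']);
console.log(jsonSizeInBytes(two));

// Synthetic placeholder DOIs, only to show how the size scales with the count.
const many = buildIsBasedOn(Array.from({ length: 23000 }, function (_, i) {
  return '10.15468/placeholder' + i;
}));
console.log(jsonSizeInBytes(many));
```

The size grows linearly with the number of contributing datasets, which is why a full list becomes costly for large downloads.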

dnoesgaard · 2019-05-02T08:20:41Z

I want to check with DataCite before going further with downloads.
I realized that I left out a small but important part for the publisher:
"logo": dataset.publisher.logo

Can we please add that?

Re addresses: it was an attempt to make them more granular. If you think it makes more sense, the address can also just be a concatenated string with commas or something, e.g.

"address": contact.address + contact.city etc....

dnoesgaard · 2019-05-02T08:49:34Z

Issue with description prose filter

matching the query: TaxonKey: Macrolepiota procera (Scop.) Singer, 1948.

Generating a readable text string is not a simple task. Queries can be nested arbitrarily deep and have NOT parts etc.

We generate a string like this for the DOI metadata, perhaps the same method could be employed?

Example: curl https://api.datacite.org/dois/10.15468/dl.2ohxaa | jq '.data.attributes.descriptions'

dnoesgaard · 2019-05-02T09:19:28Z

Notice that we do not model people but roles. That means that the same first name might appear many times under different roles. I chose to only list contacts with role ORIGINATOR as authors.

I did a quick check of 500 datasets and found ORIGINATOR to be the most common contact type used. We can't really do a one-to-one mapping of contacts, so I think your approach is sensible.

dnoesgaard · 2019-05-02T14:14:26Z

A few additional improvements to the dataset metadata schema - based on http://api.gbif.org/v1/dataset/75956ee6-1a2b-4fa3-b3e8-ccda64ce6c2d as an example:

"temporalCoverage" : "1499-12-23T00:00:00.000+0000/2014-06-12T00:00:00.000+0000"

That is, temporalCoverages.start and .end separated by a forward slash.

"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": "41.271, 51.174, -5.266, 9.341" } }

with values from geographicCoverages.boundingBox. Alternatively, it can also just be a string:

"spatialCoverage": "France"

I realize that we probably allow several instances of these, but I'll leave that to you to decide how to handle...
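
A sketch of how both values could be derived is shown below. Two assumptions are worth flagging. First, the `boundingBox` field names (`minLatitude`, `maxLatitude`, `minLongitude`, `maxLongitude`) are assumed. Second, schema.org describes `box` as two space-separated points, lower corner first, each as latitude then longitude, so the output here is ordered differently from the comma-separated example above (read here as min lat, max lat, min lon, max lon).

```javascript
// Join start and end into an ISO 8601 interval for temporalCoverage.
function toTemporalCoverage(start, end) {
  return start + '/' + end;
}

// Build a Place with a GeoShape box. schema.org describes `box` as two
// space-separated points (lower corner, then upper corner), each written as
// latitude then longitude. The boundingBox field names are assumed.
function toSpatialCoverage(boundingBox) {
  const box = [
    boundingBox.minLatitude, boundingBox.minLongitude,
    boundingBox.maxLatitude, boundingBox.maxLongitude
  ].join(' ');
  return { '@type': 'Place', geo: { '@type': 'GeoShape', box: box } };
}

console.log(toTemporalCoverage('1499-12-23T00:00:00.000+0000', '2014-06-12T00:00:00.000+0000'));
console.log(JSON.stringify(toSpatialCoverage({
  minLatitude: 41.271, maxLatitude: 51.174,
  minLongitude: -5.266, maxLongitude: 9.341
})));
```

For datasets with several coverages, the same functions could be mapped over the list to produce an array of values.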

MortenHofft · 2019-05-03T06:56:30Z

We generate a string like this for the DOI metadata, perhaps the same method could be employed?

@dnoesgaard The result does not look correct to me. For 10.15468/dl.cq5skd this returns A dataset containing 30255 species occurrences available in GBIF matching the query: Year: >= 1467 or <= 2017 or 1000 or 1000 or 1855.

We can of course do a somewhat readable serialization if it is deemed important, but the above solution doesn't look correct to me. If all of this is already in that endpoint (https://api.datacite.org/dois/10.15468/dl.2ohxaa), perhaps we could use that somehow?

dnoesgaard · 2019-05-03T07:23:26Z

I for one doubt the usefulness of that description in the first place. We could even simply do something like

A dataset containing <n> species occurrences available in GBIF. The details of the query used to generate this dataset are available at <DOI>.

Or we could include a minified version of the query predicates, like what we show for complex queries, e.g.

"description": "A dataset containing 30255 species occurrences available in GBIF matching the query:\n{\"type\":\"or\",\"predicates\":[{\"type\":\"and\",\"predicates\":[{\"type\":\"greaterThanOrEquals\",\"key\":\"YEAR\",\"value\":\"1467\"},{\"type\":\"lessThanOrEquals\",\"key\":\"YEAR\",\"value\":\"2017\"},{\"type\":\"equals\",\"key\":\"YEAR\",\"value\":\"1000\"}]},{\"type\":\"equals\",\"key\":\"YEAR\",\"value\":\"1000\"},{\"type\":\"equals\",\"key\":\"YEAR\",\"value\":\"1855\"}]}"

A great deal of escaping required though.

Or even:

"description": "A dataset containing 30255 species occurrences available in GBIF matching the query: {\r\n \"type\": \"or\",\r\n \"predicates\": [\r\n {\r\n \"type\": \"and\",\r\n \"predicates\": [\r\n {\r\n \"type\": \"greaterThanOrEquals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1467\"\r\n },\r\n {\r\n \"type\": \"lessThanOrEquals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"2017\"\r\n },\r\n {\r\n \"type\": \"equals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1000\"\r\n }\r\n ]\r\n },\r\n {\r\n \"type\": \"equals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1000\"\r\n },\r\n {\r\n \"type\": \"equals\",\r\n \"key\": \"YEAR\",\r\n \"value\": \"1855\"\r\n }\r\n ]\r\n}"

:)
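
On the escaping point: if the description is assembled as a plain string and the whole markup object is serialized with `JSON.stringify`, the nested quotes are escaped automatically. The sketch below shows both options discussed above, the short fallback text and the embedded predicate. The function name and signature are hypothetical.

```javascript
// Build a download description. JSON.stringify on the outer object escapes the
// embedded predicate automatically, so no manual escaping is needed.
function downloadDescription(totalRecords, doi, predicate) {
  const base = 'A dataset containing ' + totalRecords +
    ' species occurrences available in GBIF';
  if (!predicate) {
    // Fallback: point readers at the DOI instead of describing the query.
    return base + '. The details of the query used to generate this dataset ' +
      'are available at https://doi.org/' + doi + '.';
  }
  return base + ' matching the query:\n' + JSON.stringify(predicate);
}

// A single predicate taken from the larger example above.
const predicate = { type: 'equals', key: 'YEAR', value: '1855' };
const description = downloadDescription(30255, '10.15468/dl.cq5skd', predicate);
console.log(JSON.stringify({ description: description }));
console.log(downloadDescription(30255, '10.15468/dl.cq5skd'));
```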

qgroom · 2019-05-10T06:42:04Z

Great to see this initiative! I can't fault your mapping, but you might consider this...

{
  "@type": "Person",
  "name": [FULL NAME],
  "affiliation": [ORGANIZATION],
  "givenName": contact.firstName,
  "familyName": contact.lastName,
  "email": contact.email,
  "identifier" : contact.userId,
  "telephone": contact.phone,
  "url": contact.homepage
},

Also...
In the IPT there are keywords and version numbers of the dataset, but I don't see them on GBIF.
If they are available it would be good to add them.

dnoesgaard · 2019-05-10T07:35:03Z

Hey @qgroom, thanks for the input. Can you please clarify which changes/additions to person you're suggesting?

dnoesgaard · 2019-05-10T07:36:22Z

@MortenHofft - it looks like we have version and keywordCollections.keywords[] (especially for IPT-hosted datasets) that could be added top-level as:

"version": datasetKey.dataset.version,
"keywords": (concatenate datasetKey.dataset.keywordCollections.keywords[] using commas)

Make sense?
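
The keyword concatenation could be sketched as follows, assuming `keywordCollections` is an array of objects that each carry a `keywords` array. The sample keywords are illustrative, and dropping duplicates is an extra not asked for above.

```javascript
// Flatten keywordCollections[].keywords[] into one comma-separated string,
// trimming whitespace and skipping blanks and duplicates.
function toKeywords(keywordCollections) {
  const all = [];
  (keywordCollections || []).forEach(function (collection) {
    (collection.keywords || []).forEach(function (keyword) {
      const trimmed = String(keyword).trim();
      if (trimmed && all.indexOf(trimmed) === -1) all.push(trimmed);
    });
  });
  return all.join(', ');
}

// Illustrative input, not taken from a real dataset.
console.log(toKeywords([
  { keywords: ['Occurrence', 'Specimen'] },
  { keywords: ['Specimen', ' Mollusca '] }
]));
// Occurrence, Specimen, Mollusca
```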

qgroom · 2019-05-10T08:08:51Z

I meant to add the name as the full name, rather than the given and family name. I think this is more usual, so I suppose it would make the name more findable.
Of course the ORCID ID under identifier is most important.
Also, to add the affiliation for the person. This is often not the same as the organization associated with a dataset.

qgroom · 2019-05-10T08:19:44Z

Currently, there is a taxonomicRange property in bioschemas, but this is not available in the schema for dataset.

I think we can make a strong argument for this being added to dataset.
As a temporary measure the taxonomic scope could be added under about, but it is a bit ugly.

This should perhaps be raised as an issue for schema.org.

@stylesm might be interested in this thread

dnoesgaard · 2019-05-10T08:32:51Z

It looks like we should be doing either name or givenName and familyName - but not both. I'm happy to stick with

"name": contact.firstName contact.lastName

Re. identifier: absolutely agree on its importance. We should add:

"identifier": contact.userId

For contacts, I believe we're already using the organization of the contact rather than the publisher org, see https://github.com/gbif/portal16/blob/5f07c9d37b5a0937181aa511bca2800d3a2e08b1/app/controllers/dataset/key/datasetKey.ctrl.js#L92
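
Joining the name parts is simple, but contacts with only one of the two fields filled in are worth handling. A small sketch, with `fullName` as a hypothetical helper:

```javascript
// Join first and last name with a single space, skipping missing or blank parts.
function fullName(contact) {
  return [contact.firstName, contact.lastName]
    .filter(function (part) { return typeof part === 'string' && part.trim() !== ''; })
    .map(function (part) { return part.trim(); })
    .join(' ');
}

console.log(fullName({ firstName: 'Kate', lastName: 'Webbink' })); // Kate Webbink
console.log(fullName({ lastName: 'Webbink' }));                    // Webbink
```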

dnoesgaard · 2019-05-10T08:34:03Z

@MortenHofft - if you're losing track of this, I'm happy to compile all input for you for later... ;)

dnoesgaard · 2019-05-10T08:35:02Z

With that being said, I'd also like to see a version of this live soon. We can always make additions later on...

stylesm · 2019-05-10T10:13:17Z

I think the issue with adding taxonomicRange to the schema.org Dataset type is that taxonomies are a very bio specific thing.

The example we were talking about with the Sample type and the reason for renaming the Bioschemas Sample to BioSample is that you could have for example a carpet sample, and so the properties we were talking about on Wednesday (e.g. gender, disease status, etc) wouldn't apply to carpet samples.

However, the taxonomicRange property actually comes from BioChemEntity type (https://bioschemas.org/types/BioChemEntity/). So one of the ways of dealing with this might be to reference BioChemEntity as an additionalType (https://schema.org/additionalType).

So in code this would look a bit like:

{
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.15468/uuvlm6",
    "additionalType": "https://bioschemas.org/types/BioChemEntity/",
    "identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "doi",
        "value": "https://doi.org/10.15468/uuvlm6"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "UUID",
        "value": "6fd297bb-888a-46a7-a870-1f5af1ab1616"
      }
    ],
    "url": "https://www.gbif.org/dataset/6fd297bb-888a-46a7-a870-1f5af1ab1616",
    "taxonomicRange": "whatever the Text value is.. or this could be a Taxon type value, or array of either of those.. etc",
    "_comment": "rest of JSON truncated!"
}

Thoughts?

dnoesgaard · 2019-05-10T11:37:38Z

Summarizing outstanding issues that I'd like to see done:

add logo to publisher, i.e.:

"logo": dataset.publisher.logo

add temporalCoverage, spatialCoverage, version and keywords to dataset:

"temporalCoverage": dataset.temporalCoverages.start/dataset.temporalCoverages.end, e.g.

"temporalCoverage" : "1499-12-23T00:00:00.000+0000/2014-06-12T00:00:00.000+0000"

"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": geographicCoverages.boundingBox } }, e.g.

"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": "41.271, 51.174, -5.266, 9.341" } }

(or simply, e.g., "spatialCoverage": "France")

"version": datasetKey.dataset.version
"keywords": (concatenate datasetKey.dataset.keywordCollections.keywords[] using commas)

change person names to full name and add identifier

"name": contact.firstName + " " + contact.lastName
"identifier": contact.userId

After testing these, I suggest we do the first prod release.

dnoesgaard · 2019-05-10T11:38:39Z

(as always, apologies for the pseudocode)

qgroom · 2019-05-10T11:53:37Z

Indeed, most of the properties of dataset are properties of the dataset and not of its contents. I suppose spatialCoverage and temporalCoverage are special consideration, because they are so widely applicable.

about could be used, but wouldn't we lose the ability to search taxonomy hierarchically? Say I wanted to find all the data about the family Asteraceae in the 19th century in Africa. I can restrict by location and date, but I would then have to try every taxon name of the family (>34,000).

Although taxonomy is very bio, it is also quite fundamental for many datasets.

We could perhaps make more use of identifiers and sameAs, but that is not an easy thing to suggest in the biological world.

stylesm · 2019-05-10T13:51:41Z

@qgroom but if you use the additionalType property, you get the taxonomicRange property (with values of type Taxon for the hierarchy, or simply Text or URL) from the BioChemEntity and the other properties from the Recordset type.

dnoesgaard · 2019-05-14T09:34:36Z

We have the first version of this live now and would like to close this issue. Please consider raising new issues with additional improvements to the markup.

dnoesgaard · 2019-05-16T14:53:00Z

Looks like GDS is starting to pick up the metadata now:
https://toolbox.google.com/datasetsearch/search?query=site%3Agbif.org

csbrown-noaa · 2025-01-30T17:41:35Z

Does this include JSON-LD for the info in meta.xml? Is anyone working on JSON-LD for the DwC vocab?

MortenHofft added portal Secretariat idea labels Nov 30, 2018

MortenHofft self-assigned this Dec 5, 2018

MortenHofft added a commit to gbif/portal16 that referenced this issue May 1, 2019

relates to gbif/portal-feedback#1669

6ad606e

MortenHofft assigned dnoesgaard and unassigned MortenHofft May 1, 2019

dnoesgaard mentioned this issue May 14, 2019

Suggested improvements to dataset metadata markup #1949

Closed

3 tasks

dnoesgaard closed this as completed May 14, 2019

albenson-usgs mentioned this issue Aug 31, 2022

Document which DCAT profile and version is used gbif/ipt#1817

Open

albenson-usgs mentioned this issue Oct 27, 2022

Keywords section of the metadata #4390

Closed

7yl4r mentioned this issue Jan 30, 2025

[GSoC Project Proposal]: Combined DwCA/Croissant JSON-LD examples and tools ioos/gsoc#70

Closed

MortenHofft mentioned this issue Jan 31, 2025

Metadata (opengraph + JsonLD etc) gbif/gbif-web#913

Open
