New Term - digitalSpecimenID #530

wouteraddink opened this issue Dec 17, 2024 · 14 comments

Comments

@wouteraddink

New DwC term: digitalSpecimenID

  • Submitter: Wouter Addink (Naturalis), Sharif Islam (Naturalis), Claus Weiland (Senckenberg), Kessy Abarenkov (UTartu), Maxime Griveau (Muséum national d’histoire naturelle), Sam Leeflang (Naturalis), Anton Güntsch (BGBM), Ana Casino (CETAF), Gil Nelson (iDigBio), Jutta Buschbom (NHM UK, Statistical Genetics DE), Dag Endresen (University of Oslo)

  • Efficacy Justification (why is this term necessary?):
    Darwin Core is often used to exchange specimen data, but it lacks a term dedicated to digital surrogates of physical specimens. These digital surrogates are known as Digital Specimens and enable the vision of a Digital Extended Specimen Network (see: https://doi.org/10.1093/biosci/biac060) in which all specimen-related information is connected. International partners united in IPDES (https://des-international.github.io/) have been working together for a few years to enable an extensible technical and social infrastructure of data, tools, and working practices to establish Digital Extended Specimens. DiSSCo has been working on an open-source technical implementation, which is planned to go into production this year. The first Digital Specimens have already been created, and in the next decade the majority of digitized specimens in the world should have a Digital Specimen on the internet that can be extended to become a Digital Extended Specimen (DES). The first and crucial step towards DES is to give these specimens a digital specimen identifier.

A digital specimen identifier is a persistent identifier that identifies a Digital Specimen, a FAIR digital object on the internet that acts as a surrogate for its physical counterpart. It is NOT a collection catalogue record but a digital object that can contain a collection catalogue record but also a field notebook record, accession record, laboratory information system record, (links to) other records, measurements, models, supplementary media and even software to act as a digital surrogate of a specimen. The digital specimen is a new object on the internet that links to all information that is available on the internet about that specimen including related and derived data such as, but not limited to, DNA sequences, chemical analyses, trait information, agent information, scholarly publications and supporting documentation about history and origin.

The digital specimen identifier has a different purpose from DwC:occurrenceID, DwC:catalogNumber, DwC:materialSampleID or DwC:materialEntityID, which are used to identify the organism occurrence, the physical object or its catalogue record.

The digital specimen identifier serves a few purposes:
  • It enables the vision of a Digital Extended Specimen Network, since this identifier can link the specimen with other specimens and with related and derived information on the internet.
  • It can be used as a globally unique, resolvable and persistent identifier to refer to a specimen, since a 1:1 relationship between the digital and physical object is kept, and the physical specimen identifier and catalogue number are included in the metadata of the digital specimen identifier record. This solves the problem of persistently referring to a specimen, since physical specimen identifiers are often not globally unique, resolvable or persistent, are not machine actionable, and cannot be changed to meet these requirements for practical reasons.
  • It is a versioned identifier that can be used to refer either to the latest version or to a specific version of the digital specimen. This allows the digital specimen to be a mutable, community-curated object, making more efficient use of scarce human resources and machine-assisted curation solutions.

Inclusion of the digital specimen identifier in DwC is required to refer to a digital specimen, but is also useful to include for uniquely identifying the specimen in cases where a physical specimen identifier cannot fulfill this purpose. The inclusion of this new term in DwC will make it possible to share this identifier with infrastructures and journal systems that support DwC for material citations.

Who guarantees the persistence of these identifiers and who can mint them? The use of DOI-based Persistent Identifiers (PIDs) in this context is DiSSCo's approach to achieving compliance with FAIR principles and the FAIR Digital Object (FDO) framework. Other PID systems may also meet FAIR and FDO requirements. Digital Specimen identifiers are special in that they include metadata in the PID record itself, so even if the registration agency no longer exists, this metadata will still be available. This also makes it worthwhile for institutions to have these DOIs for their specimens even if they cannot provide digital specimens yet. The registration agency for Digital Specimen DOIs is DataCite, and in principle every DataCite member can create them. However, because of the required FDO capabilities, currently only DiSSCo has infrastructure ready to create them.

  • Demand Justification (name at least two organizations that independently need this term):
    DiSSCo aims to provide Digital Specimen DOIs as a free service for specimen hosting institutions and will create the first millions by 2024/early 2025. All DiSSCo facilities (over 200 institutions in Europe including e.g. BGBM Berlin, MNHN France, UTartu, Senckenberg, CETAF) will use these identifiers and need this term to share their source data with DiSSCo through DwC. Pensoft needs them for material citations made in their Arpha platform, iDigBio will use this term in pursuit of the Digital Extended Specimen and clarifying DwC. Plazi and the European Journal of Taxonomy need them for material citations generated through their automated workflows and platforms for manual user interaction.

  • Stability Justification (what concerns are there that this might affect existing implementations?):
    This is an extra term so it will not affect existing implementations.

  • Implications for dwciri: namespace (does this change affect a dwciri term version)?:
    Because this term refers to a persistent identifier itself, there is no dwciri term needed for it.

Proposed attributes of the new term:

  • Term name (in lowerCamelCase for properties, UpperCamelCase for classes):
    digitalSpecimenID

  • Term label (English, not normative):
    Digital Specimen Identifier

  • Organized in Class (e.g., Occurrence, Event, Location, Taxon):
    We think that this belongs in the Record-level class, since this is an identifier at the record level. However, we think the same about catalogNumber, which is in the Occurrence class, and there is no Specimen class. Since people currently use the Occurrence class to exchange specimen data, the term could also be put in the Occurrence class, but that seems wrong since Digital Specimens are Information Artefacts and not physical things.

  • Definition of the term (normative):
    A persistent, FAIR Digital Object compliant, identifier for a Digital Specimen digital object on the internet.

  • Usage comments (recommendations regarding content, etc., not normative):
    Use this term to uniquely and persistently reference a specimen through its Digital Specimen identifier. Do NOT use this term for identifiers that identify the physical specimen, any material entity or its collection catalogue record, such as an IGSN (International Generic Sample Number) or a CETAF stable identifier. Either the latest version of the Digital Specimen can be referenced (default) or a specific version if the digital object is versioned.

  • Examples (not normative):
    https://doi.org/10.3535/M42-Z4P-DRD,
    https://doi.org/10.3535/M42-Z4P-DRD?urlappend=/1,
    doi:10.3535/M42-Z4P-DRD

  • Refines (identifier of the broader term this term refines; normative):

  • Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative):
  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative):
    In ABCD 2.06 there is the generic term /DataSets/DataSet/Units/Unit/UnitGUID that could be used, but we propose a new term /DataSets/DataSet/Units/Unit/SpecimenUnit/digitalSpecimenID to make it distinguishable from URIs that identify the physical object or catalogue record.
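
As an illustration of the three example values listed under "Examples" above, the following minimal Python sketch splits a digitalSpecimenID into a bare DOI and an optional version. It is only a sketch under stated assumptions: the names `DigitalSpecimenRef` and `parse_digital_specimen_id` are hypothetical, and reading `?urlappend=/1` as "version 1" is an assumption based on the statement that the identifier can refer to a specific version. The proposal itself does not prescribe any parsing rules.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class DigitalSpecimenRef:
    doi: str                # bare DOI name, e.g. "10.3535/M42-Z4P-DRD"
    version: Optional[str]  # None means "latest version" (the default)


def parse_digital_specimen_id(value: str) -> DigitalSpecimenRef:
    """Split a digitalSpecimenID into a bare DOI and an optional version."""
    value = value.strip()
    # Form 3: "doi:10.3535/M42-Z4P-DRD"
    if value.lower().startswith("doi:"):
        return DigitalSpecimenRef(doi=value[4:], version=None)
    parts = urlsplit(value)
    # Forms 1 and 2: "https://doi.org/<doi>" with an optional query string
    if parts.scheme in ("http", "https") and parts.netloc == "doi.org":
        doi = parts.path.lstrip("/")
        # Assumption: "?urlappend=/<n>" selects version <n> of the object.
        urlappend = parse_qs(parts.query).get("urlappend", [""])[0]
        version = urlappend.lstrip("/") or None
        return DigitalSpecimenRef(doi=doi, version=version)
    raise ValueError(f"unrecognised digitalSpecimenID form: {value!r}")


latest = parse_digital_specimen_id("https://doi.org/10.3535/M42-Z4P-DRD")
pinned = parse_digital_specimen_id("https://doi.org/10.3535/M42-Z4P-DRD?urlappend=/1")
short = parse_digital_specimen_id("doi:10.3535/M42-Z4P-DRD")
```

Under these assumptions, all three example values reduce to the same bare DOI, which is what a consumer would compare when deciding whether two records refer to the same Digital Specimen.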
@Maxime-Griveau

Maxime-Griveau commented Feb 26, 2025

This is a really important term that should be added. Maxime GRIVEAU (MNHN)

@Jegelewicz

@mkoo @dustymc you might find this interesting.

@tucotuco
Member

Please bear with me as I try to frame my understanding of this proposal and its context. My hope is that it will help others to understand it as well.

I understand the abstract concept of the Digital Extended Specimen—a nexus for linking information related to a specimen. I understand the importance of resolvable global unique identifiers that enable that linking, and the services that resolve those identifiers to provide access to the extended specimen-related data from potentially multiple sources—those all make sense.

I think it is important to understand how this term would function within a structured Darwin Core where "record-level" no longer has any meaning. In other words, the record-level terms would need to be attached to specific terms and have semantics based on the context. What would a digitalSpecimenID be attached to? Should there be a DigitalSpecimen class? If so, what additional properties might it have beyond digitalSpecimenID? How would DigitalSpecimen relate to other Darwin Core classes? It seems that a DigitalSpecimen would only connect to dwc:MaterialEntity, and that the only relationship would be 'digital representation of'. If that is its entire purpose, a separate class seems an unnecessary abstraction. And if that is the case, then digitalSpecimenID would make sense as a property of dwc:MaterialEntity.

I have tried to figure out why dwc:materialEntityID couldn't be populated with a globally unique, resolvable identifier from a Digital Specimen Service? I can't think of a reason why it couldn't.

I have also tried to figure out why a separate term for digitalSpecimenID would actually be needed. I think the answer is that a dwc:MaterialEntity might already have a dwc:materialEntityID that globally uniquely identifies the material itself and that needs to persist, whatever the reason might be.

Have I understood this correctly?

@Jegelewicz

I see this as finally understanding that the physical object and the data about it are two different things....

@wouteraddink
Author

@tucotuco I think creating a separate class would add unnecessary complexity, we had a similar discussion in openDS. Catching the relationship 'digital representation of' is what is needed. digitalSpecimenID as a property of dwc:MaterialEntity seems fine to me, however for processing a DwC archive it would be beneficial if we could put a Unique constraint on a digitalSpecimenID in the dataset hence my preference to put it at the record level. That way we can simply check there are no duplicates in the dataset without having to rely on physical specimen identifiers which may not be unique nor persistent. One DwC record maps to one specimen record, while a materialEntity can map to the specimen but also to a part of a specimen (ods:SpecimenPart). So there can be multiple material entities in a DwC specimen record with the same digitalSpecimenID.
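
The duplicate check described here could look roughly like the following Python sketch. It is an implementation-level check, not part of the term definition, and it rests on assumptions: the function name `duplicate_digital_specimen_ids` is hypothetical, the dataset is assumed to be a comma-separated core file with a `digitalSpecimenID` column, and the catalogue numbers in the sample are invented. The DOIs are the example values quoted in this thread.

```python
import csv
import io
from collections import Counter


def duplicate_digital_specimen_ids(rows, column="digitalSpecimenID"):
    """Return {identifier: count} for identifiers used by more than one row."""
    counts = Counter()
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:  # rows without a digitalSpecimenID are ignored, not flagged
            counts[value] += 1
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical core file: the catalogue numbers are invented for illustration.
SAMPLE = """catalogNumber,digitalSpecimenID
A-001,https://doi.org/10.3535/M42-Z4P-DRD
A-002,https://doi.org/10.3535/N75-CR4-0SM
A-003,https://doi.org/10.3535/M42-Z4P-DRD
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
duplicates = duplicate_digital_specimen_ids(rows)
print(duplicates)
```

In this sample the first DOI appears on two rows, so it is the only entry reported. Whether such a result counts as an error depends on the implementation, since the check never touches the physical specimen identifiers.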

@Cycloderes

I think the digitalSpecimenID is needed to make sense of the huge amount of information on the Internet, and to avoid losing sight of the physical collection specimens as the source and nexus of that information.

@tucotuco
Member

tucotuco commented Mar 3, 2025

> @tucotuco I think creating a separate class would add unnecessary complexity, we had a similar discussion in openDS. Catching the relationship 'digital representation of' is what is needed.

Yay! I agree that adding a DigitalSpecimen class would add unnecessary complexity.

> digitalSpecimenID as a property of dwc:MaterialEntity seems fine to me, however for processing a DwC archive it would be beneficial if we could put a Unique constraint on a digitalSpecimenID in the dataset hence my preference to put it at the record level. That way we can simply check there are no duplicates in the dataset without having to rely on physical specimen identifiers which may not be unique nor persistent.

A unique constraint on digitalSpecimenID is problematic for multiple reasons. The foremost is that Darwin Core term definitions do not include constraints other than recommendations. The term definitions are expected to be independent of integrity constraints, which belong in implementations.

It is easy to imagine one implementation in which the term is used with a uniqueness constraint on a specimen (whatever a specimen is), while in another implementation the term is used as a foreign key with no such uniqueness constraint. I can't imagine a uniqueness constraint being viable without the digitalSpecimenID uniquely identifying something. Without a DigitalSpecimen class or a Specimen class (neither of which exists or is being proposed for Darwin Core) a uniqueness constraint seems inappropriate. In fact, it would be counter to what is trying to be achieved with the DigitalSpecimen concept. You want to be able to say that two "records" are associated with a Digital Specimen. With a unique constraint on digitalSpecimenID, you couldn't. I think this is exactly what you are trying to get at below. Uniqueness will not allow "multiple entities in a DwC specimen record with the same digitalSpecimenID".

> One DwC record maps to one specimen record, while a materialEntity can map to the specimen but also to a part of a specimen (ods:SpecimenPart). So there can be multiple material entities in a DwC specimen record with the same digitalSpecimenID.

Anticipating structured Darwin Core, there will be no such thing as a "record-level". Terms will have to be used in specific contexts. If there is no plan for a DigitalSpecimen class or a Specimen class (both of which seem superfluous), the only remaining class that could have digitalSpecimenID as a property is MaterialEntity. I would put it there directly in anticipation, so as not to have to change it later.

@tucotuco
Member

tucotuco commented Mar 3, 2025

I am concerned about the proposed definition, "A persistent, FAIR Digital Object compliant, identifier for a Digital Specimen digital object on the internet."

What is a 'Digital Specimen'? It looks like that is a proper noun. If there is a formal definition of one, it could be in the definition, but perhaps it would be better in the usage comments in order to keep the definition succinct. From just the definition, it seems that a 'Digital Specimen' is a type of 'Digital Object'. What is a 'Digital Object'? In the first instance, it looks like it is a proper noun. In the second instance in the definition, it looks descriptive rather than a proper noun. If the missing reference to what a 'Digital Specimen' is defines the relationship to a 'Digital Object', it wouldn't be necessary to mention 'Digital Object' here at all - neither in the definition nor in the usage comments. The one reference to 'Digital Specimen' would be sufficient. If instead 'Digital Object' is meant to indicate that the content is expected to be a Digital Object Identifier (DOI), then I would be explicit about that in the recommendations in the usage comments, and not put it in the definition.

To the extent possible, I would model the term on materialEntityID, which already contains the reference back to the concept of a digital specimen.

Definition: An identifier for a particular instance of a Digital Specimen.

Usage Comments: A Digital Specimen is [definition and/or reference to definition]. A dwc:digitalSpecimenID is intended to uniquely and persistently identify a Digital Specimen. Recommended best practice is to use a persistent, globally unique identifier. The identifier is for a digital information artifact (the Digital Specimen) as opposed to an identifier for a specific instance of a dwc:MaterialEntity.

@wouteraddink
Author

@tucotuco yes, a uniqueness constraint would be in an implementation, not in the standard. In DiSSCo we would add such a constraint, similar to the constraint GBIF puts on occurrenceID, since we aim to maintain a 1:1 relationship between the physical specimen and its digital counterpart and require one record per specimen in a DwC dataset.

@wouteraddink
Author

wouteraddink commented Mar 7, 2025

@tucotuco with structured DwC in mind, putting the term in MaterialEntity seems the best option. There is a potential issue in that the MaterialEntity is not necessarily a specimen; it can also be a specimen part (or sample), and there is currently no term to make the distinction. In the GBIF draft model there is material_entity_type, which would solve that, and there is the objectType proposal #517.

@wouteraddink
Author

Definition: An identifier for a particular instance of a Digital Specimen.

Usage Comments: A Digital Specimen as defined in https://doi.org/10.3897/rio.7.e67379. A dwc:digitalSpecimenID is intended to uniquely and persistently identify a Digital Specimen. Recommended best practice is to use a DOI with machine readable metadata in the DOI record that uses a community agreed metadata profile (also known as FDO profile) for a Digital Specimen. For an example see: https://doi.org/10.3535/N75-CR4-0SM?noredirect. The identifier is for a digital information artifact (the Digital Specimen) as opposed to an identifier for a specific instance of a dwc:MaterialEntity.

@tucotuco
Member

@wouteraddink This last proposal covers all of my concerns. Thank you.

@ben-norton
Member

@wouteraddink I think a couple use cases would be very helpful, especially those that answer the following questions:
When would someone use dwc:digitalSpecimenID and not dwc:materialEntityID?
When is a dwc:digitalSpecimenID not also a dwc:materialEntityID?
Provided that dwc:digitalSpecimenID belongs to the dwc:MaterialEntity class, its use should not obstruct or hinder the distinction between a compound object and its constituent parts. A primary motivation for developing a compound model is to ensure that extended specimen network connections may be established with the individual parts of a compound object. Incorporating dwc:digitalSpecimenID into the dwc:MaterialEntity class enables assignment to the parent compound object or its components, and therefore enhances the connectivity of extended specimen networks to the parts of a compound object.
With all that said, I can't help but be a bit hesitant about adding another identifier field. Answers to my questions above would help data providers to better understand the distinction between dwc:materialEntityID, dwc:digitalSpecimenID, and dwc:materialSampleID. I doubt many source data management systems will include the level of granularity needed to distinguish between these three identifiers. This means it will be a human curated activity and therefore challenging to implement.

@tucotuco
Member

@ben-norton I know you asked @wouteraddink specifically, but I would like to try to answer the questions to see if I understand correctly, and let @wouteraddink correct me if I get something wrong.

> @wouteraddink I think a couple use cases would be very helpful, especially those that answer the following questions: When is a dwc:digitalSpecimenID not also a dwc:materialEntityID?

Always. That is, a digitalSpecimenID is never a dwc:materialEntityID. The latter is for a dwc:MaterialEntity, the former is for an abstract concept that binds information about a Specimen (a concept that is not instantiated nor proposed to be instantiated as a class).

> When would someone use dwc:digitalSpecimenID and not dwc:materialEntityID?

For an Occurrence record in Simple Darwin Core, a digitalSpecimenID would enable the capabilities inherent to a Digital Specimen. There would be a Digital Specimen instance around which anyone could add specimen information without even needing to provide a dwc:materialEntityID for any dwc:MaterialEntity associated with the specimen.

In structured Darwin Core, with the digitalSpecimenID as a property of a dwc:MaterialEntity as proposed, one would never use a digitalSpecimenID without a dwc:materialEntityID, as the latter is needed to instantiate a MaterialEntity so that it can have a digitalSpecimenID as a property.

> Provided that dwc:digitalSpecimenID belongs to the dwc:MaterialEntity class, its use should not obstruct or hinder the distinction between a compound object and its constituent parts.

I think that statement could be read in multiple ways. If it means, "With digitalSpecimenID as a property of a dwc:MaterialEntity, there will be no hindrance to distinguishing a compound object and its constituent parts," then this is true as long as the compound object is the "specimen" the Digital Specimen refers to. Indeed, there will also be no hindrance to distinguishing an object from other objects derived from (physically separated from) it, as long as the source and the parts taken from it are all considered as belonging to the same Digital Specimen. If they become distinct Digital Specimens by separating them physically, one or both of the resulting MaterialEntities would have to change its digitalSpecimenID, depending on what was meant by "specimen".

> A primary motivation for developing a compound model is to ensure that extended specimen network connections may be established with the individual parts of a compound object. Incorporating dwc:digitalSpecimenID into the dwc:MaterialEntity class enables assignment to the parent compound object or its components, and therefore enhances the connectivity of extended specimen networks to the parts of a compound object.

Yes, it would make it possible in Simple Darwin Core to link records about the same "specimen". It would also make it easier in structured Darwin Core to find all the dwc:MaterialEntities associated with a "specimen", assuming the dwc:MaterialEntities had their digitalSpecimenIDs populated, of course.
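
To make the lookup concrete, here is a minimal Python sketch of finding all dwc:MaterialEntities that share a digitalSpecimenID. It treats digitalSpecimenID as a non-unique key on each MaterialEntity, as discussed above. The function name `group_by_digital_specimen` and the `urn:example:` materialEntityID values are hypothetical; the DOIs are the example values quoted in this thread.

```python
from collections import defaultdict


def group_by_digital_specimen(material_entities):
    """Map each digitalSpecimenID to the materialEntityIDs that carry it."""
    groups = defaultdict(list)
    for entity in material_entities:
        ds_id = entity.get("digitalSpecimenID")
        if ds_id:  # entities without a populated digitalSpecimenID are skipped
            groups[ds_id].append(entity["materialEntityID"])
    return dict(groups)


# Hypothetical records: a whole specimen and a part taken from it share one
# Digital Specimen; a third entity belongs to another; a fourth has none.
entities = [
    {"materialEntityID": "urn:example:whole-specimen",
     "digitalSpecimenID": "https://doi.org/10.3535/M42-Z4P-DRD"},
    {"materialEntityID": "urn:example:tissue-part",
     "digitalSpecimenID": "https://doi.org/10.3535/M42-Z4P-DRD"},
    {"materialEntityID": "urn:example:other-specimen",
     "digitalSpecimenID": "https://doi.org/10.3535/N75-CR4-0SM"},
    {"materialEntityID": "urn:example:unlinked"},
]

groups = group_by_digital_specimen(entities)
```

The grouping only works because no uniqueness constraint is applied to digitalSpecimenID at the MaterialEntity level. A unique constraint would reject the second record in this sample.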

> With all that said, I can't help but be a bit hesitant about adding another identifier field. Answers to my questions above would help data providers to better understand the distinction between dwc:materialEntityID, dwc:digitalSpecimenID, and dwc:materialSampleID. I doubt many source data management systems will include the level of granularity needed to distinguish between these three identifiers. This means it will be a human curated activity and therefore challenging to implement.

A dwc:MaterialSample is a special case of dwc:MaterialEntity and the distinction is already not particularly useful. In structured Darwin Core, dwc:MaterialSample will not be used. That leaves us with just the two properties to consider, digitalSpecimenID and dwc:materialEntityID, and the distinctions are what was discussed above.

The human curation would consist of acquiring and assigning a digitalSpecimenID to all of the "things" in their data management system that correspond to that Digital Specimen and propagating that with the information about the MaterialEntities when sharing via Darwin Core. They would do it if the benefits outweighed the costs, as with any commitment to curatorial effort.

Sorry @wouteraddink if I created more work for you than if I hadn't tried to answer @ben-norton's questions.
