taxonomy summary? #1817

dustymc · 2018-11-26T15:51:31Z

There are now lots of recent Issues+comments to change the fundamental nature of the taxonomy model in Arctos. Many of those issues seem to be self-conflicting to me - A can't exist if we do B, so I'm not anxious to tackle A while B is still under discussion.

The current taxonomy "metadata" model is ultra-normalized; it will hold about anything, but it's computationally expensive to - well, everything. That is required to simultaneously normalize names and accept anything as classifications of those names. If we're giving up on all or part of that goal, perhaps we can do whatever it is that we want to do in a simpler model.

The idea of allowing author data (=="whatever someone wants to type") in names is fairly central to many of those discussions. That isn't compatible with the current link between taxa and specimens, and I think it's going to alter the nature of taxa-based links with other sources (such as GenBank and GlobalNames).

I don't think that leaves anything of the current model behind, so we can treat this as a clean-slate exercise. This seems like it would change how everyone uses Arctos in fairly fundamental ways. I can't follow the Issues, and I doubt the people who need to make this decision can either. Can someone please summarize the issues, or present a proposal in lieu of the Issues, or otherwise shape this into a form that can be discussed?

Jegelewicz · 2018-11-30T02:53:48Z

My "simple" solution

"Namestrings" are (more or less) formal taxa produced by publication. "Sorex cinereus" is a namestring. "Sorex sp." and "Sorex sp. nov. 41" are not. Namestrings are rankless - "Animalia" is acceptable. Namestrings are not tied to singular classifications - namestring "Diptera" refers to insects and plants and no duplication is necessary.

I think that we need to rethink this part of that statement: "Namestrings are not tied to singular classifications". If we DID tie namestrings to singular classifications (or at least offer a method for doing so) we could solve some problems. These "namestrings" are our equivalent of "scientific names", so here is the Darwin Core definition of scientific name:

scientificName

Identifier | http://rs.tdwg.org/dwc/terms/scientificName

Definition | The full scientific name, with authorship and date information if known. When forming part of an Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the IdentificationQualifier term.

Examples
Coleoptera (order)
Vespertilionidae (family)
Manis (genus)
Ctenomys sociabilis (genus + specificEpithet)
Ambystoma tigrinum diaboli (genus + specificEpithet + infraspecificEpithet)
Roptrocerus typographi (Györfi, 1952) (genus + specificEpithet + scientificNameAuthorship)
Quercus agrifolia var. oxyadenia (Torr.) J.T. Howell (genus + specificEpithet + taxonRank + infraspecificEpithet + scientificNameAuthorship).

If we would simply relax the rules for what can be in the "namestring" to include:

more than one capital letter
the following symbols ( ) ,
numbers 0-9

then I would be able to create these "namestrings", which are all allowed under the Darwin Core definition of scientific name:

Cepolidae
Cepolidae Rafinesque, 1810
Cepolidae Ihering, 1909
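A minimal sketch of what such a relaxed check might look like. The baseline character set (letters, spaces, periods, hyphens) is an assumption for illustration, not the actual Arctos rule, and `is_relaxed_namestring` is a hypothetical name. It only checks characters; it cannot tell whether a string is a published name.

```python
import re

# Assumed baseline: Unicode letters, spaces, periods, hyphens.
# Proposed additions from the list above: ( ) , and digits 0-9.
ALLOWED_CHAR = re.compile(r"[^\W\d_]|[0-9 .,()\-]")


def is_relaxed_namestring(candidate: str) -> bool:
    """Return True if every character is allowed and the string starts with a capital.

    Any number of capital letters is accepted, which is the first proposed relaxation.
    """
    if not candidate or not candidate[0].isupper():
        return False
    return all(ALLOWED_CHAR.fullmatch(ch) for ch in candidate)


if __name__ == "__main__":
    for name in ("Cepolidae", "Cepolidae Rafinesque, 1810", "Cepolidae Ihering, 1909"):
        print(name, is_relaxed_namestring(name))
```

Under these assumptions all three Cepolidae strings pass, as do the Darwin Core examples quoted above.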

A specimen with the identification of Cepolidae would be a crapshoot - you might get the classification for a fish, you might get the classification for a snail, you might get both depending upon what is assigned to that name. If you select Cepolidae Rafinesque, 1810 as your identification, then your fish specimen will be classified as a fish and won't get caught up in searches of Gastropoda.

Cepolidae Rafinesque, 1810
Classification:
Animalia (kingdom)
Chordata (phylum)
Actinopterygii (class)
Perciformes (order)
Cepolidae (family)

If you select Cepolidae Ihering, 1909 as your identification, then your snail specimen will be classified as a snail and won't get caught up in searches of Chordata.

Cepolidae Ihering, 1909
Classification:
Animalia (kingdom)
Mollusca (phylum)
Gastropoda (class)
Stylommatophora (order)
Cepolidae (family)
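To make the search behavior concrete, here is a rough sketch. The dictionary layout and the `found_by_search` helper are made up for illustration and are not how Arctos stores classifications; the lineages are the two shown above.

```python
FISH = ["Animalia", "Chordata", "Actinopterygii", "Perciformes", "Cepolidae"]
SNAIL = ["Animalia", "Mollusca", "Gastropoda", "Stylommatophora", "Cepolidae"]

# Hypothetical store: each namestring maps to the classifications assigned to it.
CLASSIFICATIONS = {
    "Cepolidae": [FISH, SNAIL],                # bare name: both apply
    "Cepolidae Rafinesque, 1810": [FISH],      # author-qualified: fish only
    "Cepolidae Ihering, 1909": [SNAIL],        # author-qualified: snail only
}


def found_by_search(identification: str, term: str) -> bool:
    """True if a specimen with this identification would match a search on `term`."""
    lineages = CLASSIFICATIONS.get(identification, [])
    return any(term in lineage for lineage in lineages)


if __name__ == "__main__":
    print(found_by_search("Cepolidae", "Gastropoda"))
    print(found_by_search("Cepolidae Rafinesque, 1810", "Gastropoda"))
```

The bare name matches a Gastropoda search because the snail classification is attached to it; the Rafinesque name does not.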

Note that this is the method the Field Museum is using to pass identifications to iDigBio.
https://www.idigbio.org/portal/records/bef2121c-f062-412d-9766-b0114f557e79

As does Naturalis
https://www.idigbio.org/portal/records/fa4799fe-9e74-4c5d-a595-7303c7ba7ece

What would this break?

We could strip the scientificNameAuthorship when data goes out to aggregators (not that we need to as shown above). We could also be on the lookout for near duplicates (just as we do with Agents) and resolve them with similar relationships ("not the same as", along with the other relationships we already use between names).

Now for idea number two....

Jegelewicz · 2018-11-30T03:18:50Z

Mariel and I spent an afternoon discussing this and she has a more complicated, but probably more elegant solution. Here goes.

Right now we have two valid classifications assigned to the namestring Cepolidae.

When making an identification with the name Cepolidae, it would be useful if Arctos could recognize that there are two classifications and offer the option of one or the other by the author text.

We are guessing this would mean that we will need an additional "identification" field (perhaps Taxon_Author) so that we can create the link to the appropriate classification. We don't suggest this should be required, except in cases where there are multiple classifications related to a name.

As for the bulkloader, if a name has more than one classification, a record in a bulkload using that name could throw an error like "author name is required" if one is not provided.

Mariel has also asked if each classification could have a unique number (as collecting events do) so that the number could be used in a bulkloader to eliminate spelling errors and such.
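Roughly, the bulkloader check we have in mind could behave like the sketch below. Everything here is hypothetical (the `Classification` record, the `pick_classification` function, the in-memory list); it only illustrates "author is required when a name has more than one classification".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Classification:
    name: str
    author: Optional[str]  # author text, if recorded
    phylum: str


# Illustrative data only; not pulled from Arctos.
CLASSIFICATIONS = [
    Classification("Cepolidae", "Rafinesque, 1810", "Chordata"),
    Classification("Cepolidae", "Ihering, 1909", "Mollusca"),
    Classification("Manis", None, "Chordata"),
]


def pick_classification(taxon_name: str, taxon_author: Optional[str] = None) -> Classification:
    """Resolve a bulkloader row to one classification, asking for an author only when needed."""
    candidates = [c for c in CLASSIFICATIONS if c.name == taxon_name]
    if not candidates:
        raise LookupError(f"no classification for {taxon_name!r}")
    if taxon_author is None:
        if len(candidates) == 1:
            return candidates[0]  # unambiguous: author stays optional
        raise ValueError("author name is required")
    matches = [c for c in candidates if c.author == taxon_author]
    if len(matches) != 1:
        raise LookupError(f"no unique classification for {taxon_name!r} {taxon_author!r}")
    return matches[0]
```

A row with just "Manis" loads as it does now; a row with just "Cepolidae" is rejected until an author is supplied.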

So, is this even possible? What would it break?

dustymc · 2018-11-30T18:19:56Z

The first option is isolating, denormalizing, and I think functionally similar to free-text. It immediately un-does all the work we've just invested to talk to WoRMS, for example. I think it would be a major step in the wrong direction.

If the goal is to share and display "display_name," that's fairly trivial without model changes.

I think it also confounds taxonomy with identification, and management systems with exchange standards.

"two valid classifications"

Even if the administrative stuff hasn't yet been resolved and even if they're in widespread usage, homonyms within a Code cannot all be "valid". Those got that way because fish-people and snail-people (or whatever those are) don't much talk to each other. They don't have to now either. Simply splitting the classification source to better reflect how the folks who create these names and manage the specimens divide themselves up solves most of these problems. It's not a perfect solution, but it is practical, it is what we had in mind when we designed this, and it would eliminate a huge amount of the "problems" that come up. I can't quite grasp why we seem so resistant to that while simultaneously struggling with the problems it was designed to address. Is this an entirely theoretical situation, or are we trying to solve problems which currently exist?

The second option has come up a few times, and I think it has some merit - with "adjustments."

First, the author-text thing is limiting. With it, at best you can get at one cruddy approximation of a "taxon concept" (the original description, if you can somehow turn the author-year string into a publication - and that's often non-trivial). With the internal identifier of a classification instead, you can get to the author-year taxon concept, or any other that you can create/define or link to. That would make Arctos capable of fully embracing taxon concepts (and we have a bit of $$ from a project to look at this) without forcing us into the eternal "what's a taxon concept?" debate or trying to build a taxon concept "homonym" resolver or any of that fun stuff. This makes us more connected (if anyone ever builds a functional taxon concept service, anyway), and provides the ability to link identifications to very specific interpretations of taxa. (Note that we currently have some of the same functionality, albeit down a very different pathway with different benefits and limitations, in "ID sensu". Specimens which were cited in a publication that defines a taxon concept serve as a sort of "concept types," so there's a bit of a feedback loop between these two models as well.)

Second, and the problem I haven't quite been able to get around (I think it's what led to collections pre-selecting classification sources when we moved into the current model), is the bulkloader (or more generally string-based input). With "internal" things, a user could specify the second Cepolidae/Arctos classification (eg, by clicking it in a pick window) and then just pass around 6A6D65CA-E4AB-7B67-4466BB24A47A0985 (that classification's unique ID). For string-based things like the bulkloader, "Cepolidae" isn't adequate - it resolves to two classifications. (In this case they're very different taxa; in a taxon concepts model the distinction could be much more subtle - perhaps revisions of a field guide.) Perhaps 90% of our specimen records go through the bulkloader. I don't think "throw an error like 'author name is required'" is an acceptable approach, and I think that would strongly discourage new collections from considering Arctos.

There are lots of details to work out and I'm very receptive to wildly different alternatives, but a hybrid solution of doing what we do now with bulkloader.taxon_name OR accepting classification_id (which can be found by anything that can talk to Arctos - like the data entry screen) could possibly work, at least in the short term. We do something similar with localities - one can provide higher_geog and spec_locality and such, or just locality_name. With the current data, taxon_name + classification source leads to a single "concept" most of the time, with a few outliers like Cepolidae in which case the selection is ambiguous and something arbitrary happens. If we really embrace taxon concepts Cepolidae is likely to accumulate dozens of "concepts" and a string-input "Cepolidae" is likely to become much more ambiguous, or explicitly ambiguous anyway. Perhaps that won't matter and we can just expect to never see that level of precision in string-based input (which generally isn't capable of carrying that level of detail), or perhaps we'd have to address that at some point in the future.
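A very rough sketch of that hybrid lookup, just to pin down the behavior. The IDs, the dictionary layout, and the `resolve` function are placeholders, not the real bulkloader; only `taxon_name` and `classification_id` come from the discussion above.

```python
# Placeholder IDs and records; real classification IDs look like the GUID quoted earlier.
CLASSIFICATIONS_BY_ID = {
    "id-fish": {"name": "Cepolidae", "source": "Arctos", "phylum": "Chordata"},
    "id-snail": {"name": "Cepolidae", "source": "Arctos", "phylum": "Mollusca"},
    "id-manis": {"name": "Manis", "source": "Arctos", "phylum": "Chordata"},
}


def resolve(row: dict, preferred_source: str) -> dict:
    """Resolve a bulkloader row by classification_id if present, else by taxon_name."""
    classification_id = row.get("classification_id")
    if classification_id:
        # Explicit ID: no ambiguity, the name string is not consulted.
        return CLASSIFICATIONS_BY_ID[classification_id]
    candidates = [
        c for c in CLASSIFICATIONS_BY_ID.values()
        if c["name"] == row["taxon_name"] and c["source"] == preferred_source
    ]
    if not candidates:
        raise LookupError(f"no classification for {row['taxon_name']!r} in {preferred_source!r}")
    # Usually one candidate. For outliers like Cepolidae the pick is arbitrary
    # (here: first found), which mirrors the current "something arbitrary happens".
    return candidates[0]
```

String-only rows never fail on ambiguity here; rows that carry `classification_id` get exactly the classification they asked for.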

Perhaps this should even be modeled more like we've done with parts to events, so that all identifications link to taxa (refined to the level of a collection's preferred source) and one can optionally and additionally link to concepts (specific classifications) within that taxon. That would avoid the bulkloader issues altogether ("Cepolidae" would still mean "... as defined in the collection's preferred source's classification(s)") and change nothing in the core model, although the redundancy aspect bothers me for reasons I can't quite articulate.
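As a sketch of that two-level link (hypothetical names throughout; `Identification`, `concept_id`, and `classifications_for` are not existing Arctos structures): every identification carries a taxon name, and may additionally carry a pointer to one specific concept within that taxon.

```python
from dataclasses import dataclass
from typing import List, Optional

# Placeholder concept records keyed by made-up IDs.
CONCEPTS = {
    "concept-fish": {"taxon": "Cepolidae", "source": "Arctos", "phylum": "Chordata"},
    "concept-snail": {"taxon": "Cepolidae", "source": "Arctos", "phylum": "Mollusca"},
}


@dataclass
class Identification:
    taxon_name: str                   # required link to the taxon
    concept_id: Optional[str] = None  # optional, additional link to one concept


def classifications_for(ident: Identification, preferred_source: str) -> List[dict]:
    """Return the single linked concept if there is one, else all concepts
    for the taxon in the collection's preferred source."""
    if ident.concept_id is not None:
        concept = CONCEPTS[ident.concept_id]
        if concept["taxon"] != ident.taxon_name:
            raise ValueError("concept does not belong to the identified taxon")
        return [concept]
    return [
        c for c in CONCEPTS.values()
        if c["taxon"] == ident.taxon_name and c["source"] == preferred_source
    ]
```

The consistency check in the middle is where the redundancy shows up: the taxon is stored on the identification and is also implied by the concept.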

In any hybrid model, the "default" concept-from-name handling would still rely on a collection's preferred classification source, so partitioning classification sources to avoid Cepolidae referring to multiple very different taxa would still have some practical benefit. ("Cepolidae" would still be ambiguous if there are multiple concepts, but the choices would - in theory anyway - all be subsets of, or similar to, one concept, ie the original description.)

campmlc · 2018-12-01T00:07:55Z

Simply splitting the classification source is problematic because we have collections (Host, Para, Paleo, Inv) with multiple phyla. Host can and will have both Osteichthyes and Mollusca. I would like to find a solution that can be applied in the shared classification, if at all possible. Or, make it possible for collections to use different classifications for different taxa.


dustymc added the "Help wanted" label (I have a question on how to use Arctos) Nov 26, 2018

dustymc added this to the Need More Information milestone Nov 26, 2018

This was referenced Nov 26, 2018

Actinopterygii in Hierarchical Classification Editor #1809

Closed

Add dagger to fossil taxa #1810

Closed

Add describer, year, & taxon status to taxonomy search results #1821

Closed

Jegelewicz mentioned this issue Dec 13, 2018

Taxon Concepts as a data model #1852

Closed

dustymc closed this as completed Jan 28, 2019
