taxonomy summary? #1817
Comments
My "simple" solution
I think that we need to rethink this part of that statement, "Namestrings are not tied to singular classifications". If we DID tie namestrings to singular classifications (or at least offer a method for doing so) we could solve some problems. These "namestrings" are our equivalent of "scientific names" and the Darwin Core definition of scientific name: scientificName
If we would simply relax the rules for what can be in the "namestring" to include:
then I would be able to create these "namestrings", which are all allowed under the Darwin Core definition of scientific name:

Cepolidae

A specimen with the identification of Cepolidae would be a crapshoot - you might get the classification for a fish, you might get the classification for a snail, you might get both depending upon what is assigned to that name.

If you select Cepolidae Rafinesque, 1810 as your identification, then your fish specimen will be classified as a fish and won't get caught up in searches of Gastropoda.

Cepolidae Rafinesque, 1810

If you select Cepolidae Ihering, 1909 as your identification, then your snail specimen will be classified as a snail and won't get caught up in searches of Chordata.

Cepolidae Ihering, 1909

Note that this is the method the Field Museum is using to pass identifications to iDigBio. Naturalis does the same.

What would this break? We could strip the scientificNameAuthorship when data goes out to aggregators (not that we need to, as shown above). We could also be on the lookout for near duplicates (just as we do with Agents) and resolve them with similar relationships ("not the same as", along with the other relationships we already use between names).

Now for idea number two....
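To make the first idea concrete, here is a minimal sketch, assuming each namestring (name plus authorship) is tied to exactly one classification. The `Name` class, the `search` and `export_scientific_name` helpers, and the one-rank classifications are hypothetical illustrations, not Arctos structures; only the two Cepolidae namestrings and the Chordata/Gastropoda distinction come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    canonical: str      # the bare name, e.g. "Cepolidae"
    authorship: str     # e.g. "Rafinesque, 1810"
    higher_taxa: tuple  # the single classification tied to this namestring

    @property
    def namestring(self):
        # The namestring carries the authorship, so homonyms stay distinct.
        return f"{self.canonical} {self.authorship}"


# Hypothetical data; the classifications are reduced to one rank each.
NAMES = [
    Name("Cepolidae", "Rafinesque, 1810", ("Chordata",)),
    Name("Cepolidae", "Ihering, 1909", ("Gastropoda",)),
]
BY_NAMESTRING = {n.namestring: n for n in NAMES}


def search(higher_taxon):
    """Return the namestrings whose classification includes higher_taxon."""
    return sorted(n.namestring for n in NAMES if higher_taxon in n.higher_taxa)


def export_scientific_name(namestring, strip_authorship=False):
    """Optionally drop the authorship when sending data to an aggregator."""
    name = BY_NAMESTRING[namestring]
    return name.canonical if strip_authorship else name.namestring
```

Because the authorship is stored separately from the bare name in this sketch, stripping it on export is a field choice rather than string parsing. Whether Arctos would store it that way is an open assumption.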
Mariel and I spent an afternoon discussing this and she has a more complicated, but probably more elegant solution. Here goes.

Right now we have two valid classifications assigned to the namestring Cepolidae. When making an identification with the name Cepolidae, it would be useful if Arctos could recognize that there are two classifications and offer the option of one or the other by the author text. We are guessing this would mean that we will need an additional "identification" field (perhaps Taxon_Author) so that we can create the link to the appropriate classification. We don't suggest this should be required, except in cases where there are multiple classifications related to a name.

As for the bulkloader, if a name has more than one classification, a record in a bulkload using that name could throw an error like "author name is required" if one is not provided. Mariel has also asked if each classification could have a unique number (as collecting events do) so that the number could be used in a bulkloader to eliminate spelling errors and such.

So, is this even possible? What would it break?
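A rough sketch of that bulkloader rule, assuming a flat list of classifications with a numeric ID each. The `resolve` function, the `BulkloadError` class, the ID values, and the "Examplidae" row are all made up for illustration; nothing here reflects how the Arctos bulkloader is actually written.

```python
class BulkloadError(ValueError):
    """Raised when a bulkload record cannot be tied to one classification."""


# Hypothetical rows; the ids and the "Examplidae" entry are invented.
CLASSIFICATIONS = [
    {"id": 1, "name": "Cepolidae", "author": "Rafinesque, 1810"},
    {"id": 2, "name": "Cepolidae", "author": "Ihering, 1909"},
    {"id": 3, "name": "Examplidae", "author": "Author, 1900"},
]


def resolve(taxon_name, taxon_author=None, classification_number=None):
    """Pick the single classification a bulkload record refers to."""
    # A classification number, when given, bypasses all string matching.
    if classification_number is not None:
        for row in CLASSIFICATIONS:
            if row["id"] == classification_number:
                return row
        raise BulkloadError(f"unknown classification number {classification_number}")

    candidates = [row for row in CLASSIFICATIONS if row["name"] == taxon_name]
    if not candidates:
        raise BulkloadError(f"unknown name {taxon_name!r}")
    if len(candidates) == 1:
        # The author is optional when the name has only one classification.
        return candidates[0]
    if taxon_author is None:
        raise BulkloadError("author name is required")
    matches = [row for row in candidates if row["author"] == taxon_author]
    if len(matches) != 1:
        raise BulkloadError(f"no classification of {taxon_name!r} by {taxon_author!r}")
    return matches[0]
```

With this data, `resolve("Examplidae")` succeeds without an author, while `resolve("Cepolidae")` raises "author name is required" unless an author or a classification number is supplied.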
The first option is isolating, denormalizing, and I think functionally similar to free-text. It immediately un-does all the work we've just invested to talk to WoRMS, for example. I think it would be a major step in the wrong direction. If the goal is to share and display "display_name," that's fairly trivial without model changes. I think it also confounds taxonomy with identification, and management systems with exchange standards.
On "two valid classifications": even if the administrative stuff hasn't yet been resolved and even if they're in widespread usage, homonyms within a Code cannot all be "valid". Those got that way because fish-people and snail-people (or whatever those are) don't much talk to each other. They don't have to now either. Simply splitting the classification source to better reflect how the folks who create these names and manage the specimens divide themselves up solves most of these problems. It's not a perfect solution, but it is practical, it is what we had in mind when we designed this, and it would eliminate a huge amount of the "problems" that come up. I can't quite grasp why we seem so resistant to that while simultaneously struggling with the problems it was designed to address. Is this an entirely theoretical situation, or are we trying to solve problems which currently exist?

The second option has come up a few times, and I think it has some merit - with "adjustments."

First, the author-text thing is limiting. With it, at best you can get at one cruddy approximation of a "taxon concept" (the original description, if you can somehow turn the author-year string into a publication - and that's often non-trivial). With the internal identifier of a classification instead, you can get to the author-year taxon concept, or any other that you can create/define or link to. That would make Arctos capable of fully embracing taxon concepts (and we have a bit of $$ from a project to look at this) without forcing us into the eternal "what's a taxon concept?" debate or trying to build a taxon concept "homonym" resolver or any of that fun stuff. This makes us more connected (if anyone ever builds a functional taxon concept service, anyway), and provides the ability to link identifications to very specific interpretations of taxa. (Note that we currently have some of the same functionality, albeit down a very different pathway with different benefits and limitations, in "ID sensu". Specimens which were cited in a publication that defines a taxon concept serve as a sort of "concept types," so there's a bit of a feedback loop between these two models as well.)

Second, and the problem I haven't quite been able to get around (I think it's what led to collections pre-selecting classification sources when we moved into the current model), is the bulkloader (or more generally string-based input). With "internal" things, a user could specify the second Cepolidae/Arctos classification (e.g., by clicking it in a pick window) and then just pass around 6A6D65CA-E4AB-7B67-4466BB24A47A0985 (that classification's unique ID). For string-based things like the bulkloader, "Cepolidae" isn't adequate - it resolves to two classifications. (In this case they're very different taxa; in a taxon concepts model the distinction could be much more subtle - perhaps revisions of a field guide.) Perhaps 90% of our specimen records go through the bulkloader. I don't think throwing an error like "author name is required" is an acceptable approach, and I think that would strongly discourage new collections from considering Arctos.

There are lots of details to work out and I'm very receptive to wildly different alternatives, but a hybrid solution of doing what we do now with bulkloader.taxon_name OR accepting classification_id (which can be found by anything that can talk to Arctos - like the data entry screen) could possibly work, at least in the short term. We do something similar with localities - one can provide higher_geog and spec_locality and such, or just locality_name. With the current data, taxon_name + classification source leads to a single "concept" most of the time, with a few outliers like Cepolidae, in which case the selection is ambiguous and something arbitrary happens. If we really embrace taxon concepts, Cepolidae is likely to accumulate dozens of "concepts" and a string-input "Cepolidae" is likely to become much more ambiguous, or explicitly ambiguous anyway. Perhaps that won't matter and we can just expect to never see that level of precision in string-based input (which generally isn't capable of carrying that level of detail), or perhaps we'd have to address that at some point in the future.

Perhaps this should even be modeled more like we've done with parts to events, so that all identifications link to taxa (refined to the level of a collection's preferred source) and one can optionally and additionally link to concepts (specific classifications) within that taxon. That would avoid the bulkloader issues altogether ("Cepolidae" would still mean "... as defined in the collection's preferred source's classification(s)") and change nothing in the core model, although the redundancy aspect bothers me for reasons I can't quite articulate.

In any hybrid model, the "default" concept-from-name handling would still rely on a collection's preferred classification source, so partitioning classification sources to avoid Cepolidae referring to multiple very different taxa would still have some practical benefit. ("Cepolidae" would still be ambiguous if there are multiple concepts, but the choices would - in theory anyway - all be subsets of, or similar to, one concept, i.e. the original description.)
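The hybrid input described above (taxon_name OR classification_id) could look roughly like the sketch below. It assumes a lookup table keyed by classification ID. The `resolve_identification` function and the `HYPOTHETICAL-ID-0001` key are invented for illustration; the other ID is the one quoted in the comment. Rather than letting "something arbitrary" happen, this version returns every candidate and flags the ambiguity, which is a design choice of the sketch and not a description of current Arctos behavior.

```python
# Hypothetical lookup table keyed by classification ID.
CLASSIFICATIONS = {
    "HYPOTHETICAL-ID-0001": {"taxon_name": "Cepolidae", "source": "Arctos"},
    "6A6D65CA-E4AB-7B67-4466BB24A47A0985": {"taxon_name": "Cepolidae", "source": "Arctos"},
}


def resolve_identification(record, preferred_source):
    """Return (classification_ids, is_ambiguous) for one bulkloader record.

    An explicit classification_id wins; otherwise fall back to
    taxon_name within the collection's preferred classification source.
    """
    classification_id = record.get("classification_id")
    if classification_id:
        if classification_id not in CLASSIFICATIONS:
            raise KeyError(f"unknown classification_id {classification_id}")
        return [classification_id], False

    taxon_name = record["taxon_name"]
    ids = sorted(
        cid
        for cid, row in CLASSIFICATIONS.items()
        if row["taxon_name"] == taxon_name and row["source"] == preferred_source
    )
    # More than one hit means the string alone cannot pick a concept.
    return ids, len(ids) > 1
```

A record carrying only `taxon_name` still loads, so existing bulkload files keep working; a record carrying `classification_id` is unambiguous by construction.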
"Simply splitting the classification source" is problematic because we have collections (Host, Para, Paleo, Inv) with multiple phyla. Host can and will have both Osteichthyes and Mollusca. I would like to find a solution that can be applied in the shared classification, if at all possible. Or, make it possible for collections to use different classifications for different taxa.
There are now lots of recent Issues+comments to change the fundamental nature of the taxonomy model in Arctos. Many of those issues seem to be self-conflicting to me - A can't exist if we do B, so I'm not anxious to tackle A while B is still under discussion.
The current taxonomy "metadata" model is ultra-normalized; it will hold about anything, but it's computationally expensive to - well, everything. That is required to simultaneously normalize names and accept anything as classifications of those names. If we're giving up on all or part of that goal, perhaps we can do whatever it is that we want to do in a simpler model.
The idea of allowing author data (=="whatever someone wants to type") in names is fairly central to many of those discussions. That isn't compatible with the current link between taxa and specimens, and I think it's going to alter the nature of taxa-based links with other sources (such as GenBank and GlobalNames).
I don't think that leaves anything of the current model behind, so we can treat this as a clean-slate exercise. This seems like it would change how everyone uses Arctos in fairly fundamental ways. I can't follow the Issues, and I doubt the people who need to make this decision can either. Can someone please summarize the issues, or present a proposal in lieu of the Issues, or otherwise shape this into a form that can be discussed?