Taxon Concepts as a data model #1852

Closed
Jegelewicz opened this issue Dec 13, 2018 · 51 comments
Labels
Function-Taxonomy/Identification Priority-Normal (Not urgent) Normal because this needs to get done but not immediately.

Comments

@Jegelewicz
Member

Jegelewicz commented Dec 13, 2018

A thread for exploring the idea.

Related issues/comments include
#1136
#1809
#1817
#735 (comment)
#912 (comment)
#1803 (comment)
#1805 (comment)
#983 (comment)
#1609 (comment)

@Jegelewicz
Member Author

@dustymc Could you give us your ideas about how this type of model would look in Arctos?

@dustymc
Contributor

dustymc commented Jan 12, 2019

The model I envision is pretty simple - identification_taxonomy (links identifications to names/taxa, which are refined to a source by a collection's preferences) becomes identification_classification - identifications would link to specific classifications/concepts rather than names.

We'd have to preserve classification_id and maybe limit how things can change and such, but that's details.

Using that with something that can talk to Arctos is pretty simple, if perhaps a lot more labor-intensive. (Think #1136, but with maybe-hundreds of options for every name, in addition to the "which name should we use?" thing that Issue's focused on.) Right now, you type "Echidna" into a pick, get one match (and maybe some species and stuff that you don't have enough information to care about, related names, etc.), select it, and the details work themselves out from the collection's source preference. In a concept model, you'd type "Echidna" and get (along with all of the species-and-junk, including those in Bitis and Tachyglossus and whatever other synonyms might exist) something that looks very much like http://arctos.database.museum/name/Echidna - I think the full classification in the context of all other classifications is the minimal amount of information you'd need to pick one. Scroll around, find the thing you want, click "use this one," voilà. There are ~14 "concepts" that include Muraenidae on there at the moment, so working out which ones to pick under which situations would be left to y'all. I'd expect that to grow rapidly (if anyone decides to really embrace this idea, anyway) - any source might include the original publication, then 231 years of publications refining circumscriptions, and publications rejecting publications that tried to refine circumscriptions, and groups of those, and groups excluding certain publications, and field guides, and all the other normal noise. (And for homonyms like Echidna, perhaps the same sort of information for viruses and moths and monotremes and such.) I think that's easily hundreds of "concepts," and some - or most - of them may be different ways of saying the same thing (https://academic.oup.com/sysbio/article/65/4/561/1753624).

Accessing that level of complexity with something that cannot talk to Arctos is a great mystery to me. Most specimen data come from spreadsheets and such - things that cannot talk to Arctos. Much/most/all of that is by people who don't even KIND OF have the resources to figure out what definition of eel whoever entered the data a few decades ago might have had in mind. That's "just" a usability problem and there's certainly a way around it. I suspect finding that pathway will involve having the right people in the same room for a few days.

Maybe it's as simple as having some sort of default concept (eg, the original description), although I'm not sure how the details of that could work. I think the vast majority of the time we don't think in taxon concepts, so some sort of "just use the name" default might be necessary anyway. ("It's a moose, obviously" - that's all we know or really care about most of the time.)

Note that this completely avoids the issue of defining taxon concepts. If you can stuff it into an optionally-ordered key-value array, or stuff some sort of summary into that structure and link to the "real" concept (what we do with WoRMS), you could use it in identifications.
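To make the "optionally-ordered key-value array" idea concrete, here is a minimal sketch. It assumes a classification is nothing more than a list of (term_type, term) pairs plus an optional link to an externally held concept. The class name, field names, and the `demo-*` identifiers are hypothetical and are not Arctos structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Classification:
    """One classification/concept stored as ordered (term_type, term) pairs."""
    classification_id: str               # hypothetical identifier
    source: str                          # e.g. "Arctos" or "WoRMS (via Arctos)"
    terms: Tuple[Tuple[Optional[str], str], ...]  # order is kept but not required to mean anything
    concept_link: Optional[str] = None   # optional pointer to the "real" concept held elsewhere

    def term(self, term_type: str) -> Optional[str]:
        """Return the first term stored under term_type, or None if absent."""
        for key, value in self.terms:
            if key == term_type:
                return value
        return None


# Illustrative data only; the identifiers are made up.
eel = Classification(
    classification_id="demo-1",
    source="Arctos",
    terms=(("family", "Muraenidae"), ("genus", "Echidna")),
)
conus = Classification(
    classification_id="demo-2",
    source="WoRMS (via Arctos)",
    terms=(("genus", "Conus"),),
    concept_link="http://www.marinespecies.org/aphia.php?p=taxdetails&id=137813",
)
```

Nothing in the structure says what a concept is; it only gives an identification something specific to point at.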

@DerekSikes

DerekSikes commented Jan 14, 2019 via email

@dustymc
Contributor

dustymc commented Jan 14, 2019

@DerekSikes I agree, but that involves tracking down ~3 million "most recently published concepts."

Maybe "it's just a name, we're not asserting anything" is a better default - although that might require a NULL classification or something equally weird.

I think this comes down to what y'all are willing to do. I'm operating on the assumption that most data entry is going to involve a label/spreadsheet/whatever that just says "Somegenus somespecies" and Curators who are OK with that level of information (eg because they don't have the resources to do anything else). I'd love to require something more specific, I just don't see how we can pull it off.

@campmlc

campmlc commented Jan 14, 2019 via email

@dustymc
Contributor

dustymc commented Jan 14, 2019

Implemented is easy. Used - maybe not so much...

There is no "preferred" in the model. Maybe we can figure out some default or something as above, but the "core" would be explicitly picking a concept (unless someone comes up with something clever...).

I think it would be generally finer-grained than you've described. You'd have M. gapperi (limited to WHATEVER because SOME PUBLICATION or something - maybe something about range or morphology or DNA or karyotype or song or parasites or ...), and M. gapperi (potentially with exactly the same hierarchy) limited to SOMETHING ELSE because SOME OTHER PUBLICATION (and maybe hundreds of other concepts, potentially all with exactly the same hierarchy). Knowing phylum or class won't help very much - that'll get rid of Myodes-the-bug, but still (probably) won't get you to the single classification/concept you need to create an ID.

You're probably right that we do think in "fuzzy taxon concepts," and Derek is probably right in that those aren't really THAT fuzzy, it's just that we don't record anything useful so someone in the future has to guess whose idea of M. rutilus was used, and how closely that concept was adhered to, when the ID was applied.

@Jegelewicz
Member Author

tracking down ~3 million "most recently published concepts."

Wouldn't each concept have a publication date associated with it? Even just a year? Then you just pick the date closest to today's date?

@Jegelewicz
Member Author

It sounds like we agree that this is the way to go, but we need to get together in a room and work out the details. I would suggest we do this at SPNHC, but I think I would rather make this a meeting about just one thing without the distractions of other presentations and ideas. Thoughts? Who really wants to be included in the in-person meeting?

@dustymc
Contributor

dustymc commented Jan 16, 2019

Wouldn't each concept have a publication date associated with it?

No, that would take us into defining taxon concepts. And the "closest to today" 'concept' might be a list of publications all refuting your publications or something....

@DerekSikes

DerekSikes commented Jan 16, 2019 via email

@dustymc
Contributor

dustymc commented Jan 17, 2019

need a massive publications database

Only if you want to base your concepts off of publications. "Whatever WoRMS does with aphiaid=12345" is a perfectly valid (if mostly useless...) "concept" in the model I'm proposing, for example.

concept = 'unknown'

It wouldn't necessarily be completely unknown - the default would (hopefully) be something like we do now, eg http://arctos.database.museum/name/Sorex%20vagrans#Arctos. We're definitely not talking about Sorex vagrans (the jellyfish) because we have class and such in there, but there's not much detail either.

most recent publication

Or group of publications, or most recent not including THAT publication, or ...

confidence of the association

That's identification - you're not qualifying the taxon in any way, just how the specimen fits in to it.

user is certain they don't know what concept was applied

I don't think that's quite accurate. They're certain (or not) that it's a moose, they're just applying some waffly definition/circumscription/whatever - concept - of "moose."

outstanding questions

I think just how we use it, if nobody can find holes in the idea of replacing...

UAM@ARCTOS> desc identification_taxonomy
 Name               Null?    Type
 ------------------ -------- --------
 IDENTIFICATION_ID  NOT NULL NUMBER
 TAXON_NAME_ID      NOT NULL NUMBER
 VARIABLE           NOT NULL CHAR(1)

taxon_name_id (foreign key-->taxon_name.taxon_name_id) in that with a foreign key to taxon_term.classification_id (and we can figure out how to preserve that identifier and still use the hierarchical editor and have enough processors to do stuff with this and all that jazz).
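A rough sketch of that swap, using SQLite through Python so it can be run anywhere. It is an assumption-laden illustration, not the Arctos schema: the `classification` table is a hypothetical stand-in for whatever ends up holding a preserved `classification_id`, and the text key type is a guess.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
-- Hypothetical stand-in for the place a stable classification_id would live.
CREATE TABLE classification (
    classification_id TEXT PRIMARY KEY,
    taxon_name_id     INTEGER NOT NULL
);
-- identification_taxonomy with taxon_name_id replaced by classification_id.
CREATE TABLE identification_classification (
    identification_id INTEGER NOT NULL,
    classification_id TEXT NOT NULL REFERENCES classification (classification_id),
    variable          TEXT NOT NULL CHECK (length(variable) = 1)
);
""")

# Illustrative rows only.
conn.execute("INSERT INTO classification VALUES ('demo-class-1', 1)")
conn.execute("INSERT INTO identification_classification VALUES (100, 'demo-class-1', 'A')")


def link_is_rejected(classification_id):
    """Return True if linking an identification to this classification fails."""
    try:
        conn.execute(
            "INSERT INTO identification_classification VALUES (101, ?, 'A')",
            (classification_id,),
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

With the key in place, an identification can only point at a classification that exists, which is the part that forces `classification_id` to be preserved.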

@mbprondzinski

https://www.loc.gov/standards/sourcelist/index.html
Is this of any help? I doubt I have anything to offer.

@Jegelewicz
Member Author

HMMMMMMM...this is interesting. Perhaps we should have "arctos" in https://www.loc.gov/standards/sourcelist/taxonomic.html

How many of these could provide WoRMS-like data to our taxonomy?

@dustymc
Contributor

dustymc commented Jan 18, 2019

How many of these could provide WoRMS-like data to our taxonomy?

One purpose of GlobalNames is to make the answer, "who cares?" With GN, we write to one API and get everything they have. Without that abstraction, we'd need to write to 182ish (https://resolver.globalnames.org/data_sources) APIs. There's no obvious standardization among those sources - I doubt much code could be reused. Re-creating what GN provides would require a tremendous amount of resources.

For whatever reason, GN doesn't contain all of the information from WoRMS and isn't updated very often. We couldn't get them (or WoRMS, or something) to fix that, and there's some additional complexity (we need specific classifications in some cases), so we wrote code to WoRMS. We can do that for other sources too if there's some compelling reason (eg, someone's going to use it to catalog), but for most things I think the best path is to encourage the source and GN to deal with it.

@dustymc
Contributor

dustymc commented Jan 18, 2019

Perhaps we should have "arctos" in https://www.loc.gov/standards/sourcelist/taxonomic.html

I have always seen Arctos more as a consumer than an authority, even though that's not entirely where we've found ourselves. I would like to see more WoRMS-like connections, and less local editing/"authority building." Ideally Curators (and/or their representatives) who want to could put their taxonomist hat on, log in to something like WoRMS, make changes there, and see them magically appear in Arctos. Curators who don't want to wear that hat could just passively use data from some source or, if we go to some more concept-like model, from any source by picking individual "classifications" from the local cache. Even more ideally, GN would become more active in keeping things complete and current and we'd just maintain one API to do that. I don't know how realistic any of that is, but I do think it's the best model for everyone involved.

@Jegelewicz
Member Author

Jegelewicz commented Jan 29, 2019

for review and discussion: http://ubio.org/ @dustymc promising?

@dustymc
Contributor

dustymc commented Jan 29, 2019

uBio initially intended to implement the Ballew thesaurus, and we spent a lot of time talking to them before they started writing code. If they'd done what they set out to do, we'd likely just use them. (Or they'd have killed us all by sucking up every electron on the planet trying to build a transitive closure table....)

What they actually implemented is a "curated view" much like ITIS and everyone else, which is not very useful to Arctos as a whole. It could be useful to individual collections. I don't think there's anything that could remotely be described as taxon concepts in the data they have, but I haven't looked closely in a long time and I could be missing something.

@Jegelewicz
Member Author

@dustymc
Contributor

dustymc commented Mar 20, 2019

thoughts on usability

  • keep collection source preference
  • add optional metadata taxon_term taxon_concept_id

Default action from the specimen bulkloader (and other string-based tools) would be to use the classification/concept within the collection's preferred source which does not have a taxon_concept_id. (And nothing does now, so this would essentially be no changes that a user would notice.)

The specimen bulkloader could be modified to somehow accept taxon_concept_id - eg, Some species (someID) would use the concept under Some species with an ID of someID, and error if that doesn't exist or if there are multiple matches.

taxon_concept_id would be a string so "Conus Linnaeus, 1758" or "http://www.marinespecies.org/aphia.php?p=taxdetails&id=137813" would be acceptable values.

There is not a 1:1 name-concept relationship, so a unique key would be difficult or impossible - it would likely be possible to create many concepts "named" Conus Linnaeus, 1758 under one taxon, in which case references using the name+ID would be ambiguous so would error. I think this requires careful users and good documentation.
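The string-based lookup described above could be sketched roughly as follows. The `Name (someID)` syntax, the row layout, and the `c1`..`c3` identifiers are all hypothetical; only the rules come from the proposal: default to the preferred source's classification that has no `taxon_concept_id`, and error on zero or multiple matches.

```python
import re

# Hypothetical cache rows: (name, source, taxon_concept_id or None, classification_id)
CLASSIFICATIONS = [
    ("Conus", "Arctos", None, "c1"),
    ("Conus", "Arctos", "Conus Linnaeus, 1758", "c2"),
    ("Conus", "WoRMS (via Arctos)",
     "http://www.marinespecies.org/aphia.php?p=taxdetails&id=137813", "c3"),
]

# Assumed input syntax: "Some species (someID)"; a bare name has no parentheses.
NAME_WITH_ID = re.compile(r"^(?P<name>[^()]+?)\s*\((?P<concept>.+)\)\s*$")


def parse_taxon_string(value):
    """Split 'Name (conceptID)' into (name, concept_id); concept_id may be None."""
    match = NAME_WITH_ID.match(value.strip())
    if match:
        return match.group("name").strip(), match.group("concept").strip()
    return value.strip(), None


def resolve(value, preferred_source, rows=CLASSIFICATIONS):
    """Return the single matching classification_id, or raise ValueError."""
    name, concept_id = parse_taxon_string(value)
    if concept_id is None:
        # Default: the preferred source's classification with no taxon_concept_id.
        matches = [r for r in rows
                   if r[0] == name and r[1] == preferred_source and r[2] is None]
    else:
        # Explicit concept: name + ID, from any source.
        matches = [r for r in rows if r[0] == name and r[2] == concept_id]
    if len(matches) != 1:
        raise ValueError(f"{value!r}: expected exactly 1 match, found {len(matches)}")
    return matches[0][3]
```

Two concepts "named" Conus Linnaeus, 1758 under the same taxon would make the second branch return two rows, so the load fails rather than guessing.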

For internal forms, we can pass around data objects and none of the above is a concern. Would anyone want to load specimens with concepts? (Seems inevitable...)

It would be possible to link identifications to concepts outside the collection's preferred source - it would be possible to eg, use a concept under WoRMS (via Arctos) (or anything else) for a collection which generally prefers the Arctos Plants (or whatever) source.

Perhaps concepts should even be managed in their own source(s), which would keep "normal" sources cleaner/easier to manage. E.g., that could be exploited to disallow someone adding taxon_concept_id to a "default" classification.

Used concepts cannot change; "changes" create new concepts.
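One way to read that rule, as a small sketch: a concept that any identification points at is frozen, and an attempted edit produces a new concept instead. The store, its method names, and the placeholder definitions are hypothetical.

```python
from dataclasses import dataclass, replace
from itertools import count


@dataclass(frozen=True)
class Concept:
    concept_id: int
    name: str
    definition: str  # free text here; what a concept "is" stays out of scope


class ConceptStore:
    """Hypothetical store in which used concepts are never edited in place."""

    def __init__(self):
        self._ids = count(1)
        self._concepts = {}
        self._used = set()

    def add(self, name, definition):
        concept = Concept(next(self._ids), name, definition)
        self._concepts[concept.concept_id] = concept
        return concept

    def get(self, concept_id):
        return self._concepts[concept_id]

    def mark_used(self, concept_id):
        """Record that at least one identification points at this concept."""
        self._used.add(concept_id)

    def revise(self, concept_id, definition):
        """Edit an unused concept in place; for a used one, create a new concept."""
        current = self._concepts[concept_id]
        if concept_id in self._used:
            return self.add(current.name, definition)
        revised = replace(current, definition=definition)
        self._concepts[concept_id] = revised
        return revised
```

Existing identifications keep pointing at the old concept, so revising a definition never rewrites what an earlier ID asserted.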

It would be exceptionally useful to have a consistent "backbone" in support of search. This doesn't have to be a Source (or sources - perhaps it's easier to manage at phylum/kingdom/etc.) that anyone uses, it would just facilitate search to compensate for (perhaps purposefully) inconsistent data with "concepts." This could be managed as a hierarchy and excluded from the single-record editor.

Concepts may be purposefully inconsistent and will be managed singly; it's likely safe to proceed with change requests to the single-record editor.

Example: http://dx.doi.org/10.1093/zoolinnean/zlx040 split a taxon/created a new concept and could serve as a most-basic test case.

This was referenced Apr 3, 2019
@Jegelewicz
Member Author

I assume that this way we could have taxon concepts that would be preferred by different institutions - e.g. Myodes gapperi in a classification that prefers this version of the genus and the family Cricetidae subfamily Arvicolinae could be preferred by MSB, and Clethrionomys gapperi in a classification that prefers the genus Clethrionomys in the family Muridae would be preferred by MVZ, and Myodes as a beetle could be preferred by an insect collection. If we could display the taxon concept preferences for each institution when more than one is possible, then we could filter these so students doing data entry choose the right one.

This and other comments keep bringing me back to the "preferred by" solution. Perhaps the easiest thing would be that any time there is more than one classification in a source, collections using that source are notified and can select the classification they wish to use for all identifications in their collection. We could report this in the Low Quality Data section where collections could find a list of all the taxa for which there is more than one classification but for which they have not selected a preference. This would take the decision about taxonomy out of the hands of students entering data.

I think we would want to start off making whatever classification a collection is using right now their preference, so that when Derek comes along and starts adding classifications no one will suddenly need to choose preferred classifications for 100's of names. Otherwise, collections who don't choose a preference will end up with what we have now - a mash-up of all classifications associated with the name. This means their stuff will be found (although sometimes in error).

The main challenge I see to this is how do we record preference and ensure that someone can't go around changing the preferences for someone else's collection? I don't think throwing it into the Classification Metadata would work, but maybe Dusty could fix it so that I can only add/delete preference for collections to which I have access.
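A minimal sketch of how such preferences might be recorded and guarded, assuming a simple user-to-collections access map. The class, the collection names, and the classification identifiers are hypothetical; the sketch only shows the access check and the "no preference chosen yet" report.

```python
class ClassificationPreferences:
    """Hypothetical per-collection preferences among competing classifications."""

    def __init__(self, classifications_by_name, collection_access):
        self.classifications_by_name = classifications_by_name  # name -> set of classification ids
        self.collection_access = collection_access              # user -> set of collections
        self.preferred = {}                                      # (collection, name) -> classification id

    def set_preference(self, user, collection, name, classification_id):
        """Record a preference, but only for a collection the user can access."""
        if collection not in self.collection_access.get(user, set()):
            raise PermissionError(f"{user} has no access to {collection}")
        if classification_id not in self.classifications_by_name.get(name, set()):
            raise ValueError(f"{classification_id} is not a classification of {name}")
        self.preferred[(collection, name)] = classification_id

    def needs_choice(self, collection, names_in_use):
        """Names with more than one classification and no preference for this collection."""
        return sorted(
            name for name in names_in_use
            if len(self.classifications_by_name.get(name, ())) > 1
            and (collection, name) not in self.preferred
        )


# Illustrative data only.
prefs = ClassificationPreferences(
    classifications_by_name={
        "Diplura": {"diplura-a", "diplura-b"},
        "Alces alces": {"alces-1"},
    },
    collection_access={"curator_a": {"CollectionA"}},
)
```

`needs_choice` is the kind of list that could feed a Low Quality Data report. Changing a preference here touches nothing that already exists, so tracking such changes on specimen records would need something separate.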

Also, there will be the need to track changes in the preferred classification. Somehow, when I decide to change from one preference to another, I would like to create a notation in the identification section of the specimen record.

@dustymc
Contributor

dustymc commented Apr 15, 2019

select the classification they wish to use for all identifications in their collection

If that's the only objective, the model we are currently in accomplishes it much better than taxon concepts could. If you're trying to be more precise than taxonomy allows in identifications, you might NEED taxon concepts. If you're trying to sort beetles from mice, you're almost certainly going to absolutely hate the extra workload and never see the benefits.

This would take the decision about taxonomy out of the hands of students entering data.

This also seems to suggest that we don't need a taxon concept model. What's the point if the people who do most of the work can't access the complexity??

Also, there will be the need to track changes in the preferred classification. Somehow, when I decide to change from one preference to another, I would like to create a notation in the identification section of the specimen record.

I don't think taxon concepts can change and remain anything recognizable as a taxon concept. I think we have to find some sort of "default" or "preferred" to be able to use this, but that's just help in selecting a concept when you have no preference - it's procedure, not data. Changing the preference could not change existing data.

@dustymc
Contributor

dustymc commented Apr 16, 2019

@Jegelewicz
Member Author

This also seems to suggest that we don't need a taxon concept model. What's the point if the people who do most of the work can't access the complexity??

Exactly.

What I am suggesting is that we keep the model we have. This way, data entry is easy - students (or anyone entering data) just have to pick the name.

The change I suggest is that as long as only one classification exists in any taxonomy "source", everyone uses it. Let's take Diplura, which in GBIF includes:

[three images of the Diplura classifications in GBIF; not reproduced here]

If all of these classifications were included in source Arctos, then anything identified as Diplura would get a crazy mash-up of higher taxa. I propose that COLLECTIONS should be able to tag one of these as preferred so that only that classification will be applied to Diplura in the associated collection.

Again, this will not solve Derek's issue and isn't really using "taxon concepts" BUT it does allow us to maintain homonyms in a single taxonomy source. I see it as a baby step.

@dustymc
Contributor

dustymc commented Apr 16, 2019

If you're suggesting that CollectionA can pick Diplura-the-spider for one specimen and Diplura-the-butterfly for another, then the only way I see to do that is #1852 (comment) - move the pointer from taxonomy to classifications. It's a taxon concept model, even if the concepts are flaky. (And no model excludes the possibility of flaky data.)

If you're suggesting CollectionA has to preemptively go say "All our Diplura are spiders" then this just looks like a really complicated way to split classification sources. It also obfuscates the path between names and "my" classification, unless the "this is mine" bit lives in the classification itself or something.

I don't think I'm quite understanding something.

@DerekSikes

DerekSikes commented Apr 16, 2019 via email

@dustymc
Contributor

dustymc commented Apr 16, 2019

the lowest name in the classification is the name in taxonomy

That's not a requirement, although I can't think of a reason it shouldn't be true.

merge the two

Explain please.

@DerekSikes

DerekSikes commented Apr 16, 2019 via email

@dustymc
Contributor

dustymc commented Apr 16, 2019

So basically taxon concepts (identifications<-->classifications), but without the authoritative "anchor" tying related stuff together?

And I found a classification (1067358 of them, actually...) that doesn't end with the name: http://arctos.database.museum/name/Poecilophis#ArctosRelationships

@sharpphyl

Could we have the option of searching on (or entering) the display name or the name string which includes the author instead of just the taxon name?

display_name: Poecilophis Kaup, 1856
display_name: Echidna Forster, 1788

Or add to the taxon status "hemihomonym" and "homonym" and for those allow the addition of author to differentiate? Wouldn't deal with everything but would take care of most of what I see.

@dustymc
Contributor

dustymc commented Apr 17, 2019

searching on

Yea, if it's in a classification somewhere I can search it.

entering

You're going to have to be a LOT more specific before I can answer that.

taxon status "hemihomonym" and "homonym"

I don't have any objections, but that sounds like a lot of work to make redundant data.

allow the addition of author

If you mean to the namestring, #1803 with a built-in extra-randomizer does not sound like fun to me.

take care of

PLEASE, elaborate. What is it that we're trying to take care of?

@Jegelewicz
Member Author

Jegelewicz commented Apr 17, 2019

If you're suggesting that Collection A can pick Diplura-the-spider for one specimen and Diplura-the-butterfly for another

Nope - suggesting collection A picks one or the other and sticks with it.

If you're suggesting Collection A has to preemptively go say "All our Diplura are spiders" then this just looks like a really complicated way to split classification sources. It also obfuscates the path between names and "my" classification, unless the "this is mine" bit lives in the classification itself or something.

I am trying to avoid every single collection having its own source, which defeats the purpose of a collaborative system. And yes, I assumed the "this is mine" thing would live in the classification metadata. I thought I knew what we were doing with "taxon concepts" but every time I think I know, someone says something that makes me think it will not work. I have been looking for something to help us deal with multiple classifications related to a single name when they occur. We have talked in circles about this for over a year and are getting nowhere - perhaps we need some fresh viewpoints.

@Jegelewicz
Member Author

@Jegelewicz @campmlc to pick a collection and work with Dusty to create a new taxonomy source that pulls from Arctos.

@dustymc
Contributor

dustymc commented Apr 18, 2019

Avoiding Taxon Concepts

There are significant usability issues surrounding taxon concepts - it's just more complex data, so it's more difficult to use in most every way. The model generally seems like overkill for the kinds of problems we're trying to solve.

Most of those problems involve homonyms, and there is a reluctance to split classifications in order to share data/updates.

Potential not-concepts solution: create "dynamic" sources which are based on collection-defined criteria and auto-refresh themselves periodically. Selection could cross sources, include things like taxon_status or various ranks, etc. Data would be managed in the shared (eg, "Arctos") Source(s) and the dynamic source would be refreshed from updates.
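As a sketch of the "dynamic source" idea: a collection supplies a selection rule, and the dynamic source is rebuilt from the shared records whenever it is refreshed. The record fields, class names, and sample rows are hypothetical, and the periodic scheduling is left to whatever would call `refresh`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical row from a shared source such as "Arctos"."""
    name: str
    source: str
    taxon_status: str
    terms: Tuple[Tuple[str, str], ...]  # (rank, term) pairs

    def term(self, rank):
        return dict(self.terms).get(rank)


class DynamicSource:
    """A read-only view rebuilt from shared records by a collection-defined rule."""

    def __init__(self, criteria):
        self.criteria = criteria  # callable: Record -> bool
        self.records = []

    def refresh(self, shared_records):
        """Re-select from the shared data; returns the number of records kept."""
        self.records = [r for r in shared_records if self.criteria(r)]
        return len(self.records)


# Illustrative data only: a homonym managed in one shared source.
shared = [
    Record("Myodes", "Arctos", "valid", (("class", "Mammalia"), ("genus", "Myodes"))),
    Record("Myodes", "Arctos", "valid", (("class", "Insecta"), ("genus", "Myodes"))),
]

mammal_source = DynamicSource(
    lambda r: r.term("class") == "Mammalia" and r.taxon_status == "valid"
)
mammal_source.refresh(shared)
```

Edits still happen only in the shared source; the mammal collection never sees Myodes-the-bug, and a later refresh picks up whatever was added or corrected upstream.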

Outstanding questions and concerns:

I still don't have an example of an actual problem. I can think of two potentials:

  1. Cataloging two different type specimens which share a name in the same collection. That seems exceedingly remote, and all other homonyms can (in theory...) be dealt with by following the Codes.
  2. Cataloging hemihomonyms in the same collection. This seems more likely (e.g., should someone catalog 'stuff found in bird nests' at sufficient detail), but I don't think we have any collections which might actually do that.

https://arctos.database.museum/name/Diplura comes up from time to time, but at least the "Arctos" data are likely just wrong - surely the term isn't actually both a class and order for the same individuals?

#1936 (and similar) - we are aggressively pushing things that are very likely to cause problems into shared classifications. I don't think there's anything to share between eg, taxa used by a bird collection and taxa used by a nautiloid collection; those taxa are created by very different user groups, and are probably best managed by different user groups. There is much more to share between eg a modern mammal collection and a paleo collection cataloging lots of Pleistocene material. I'm not sure where to draw any lines, but I suspect there are some in there and we're pushing them in directions that create unnecessary work.

Would anyone use taxon concepts as a way to disambiguate taxa (not names) in a way that can't be accomplished with "ID sensu"? If so, perhaps we should figure out how to mitigate the usability issues. If not, perhaps an alternative approach makes sense until we're forced into the more complex/precise model.

@Jegelewicz
Member Author

I work with the U Alaska herbarium, and Steffi Ickert-Bond and I got an
NSF grant to work on Taxon Concepts for Alaska plants (see
http://alaskaflora.org/). Included in the grant are some funds to offer
to Dusty to implement a taxon concept data model in Arctos; we’ll be
generating Taxon Concept data, and would like to be able to feed it back
into Arctos.

We talked extensively with Dusty about this in mid-2017, and feel now is
the time to start spending that money, if he, and the larger Arctos
community, are willing. I’ve just emailed Dusty and hope to chat with
him soon, but I thought it would be good to contact you directly too,
since you are obviously interested in this. If you have time for a chat
next week, please let me know.

Best,

Cam Webb

@Jegelewicz
Member Author

Notes from meeting: https://docs.google.com/document/d/19cbpGwfQJ52mt89fCag5VU-kh2Q6wuEddzpD_zw1BvE/edit#

Cam's plan is to have an additional identification field linked to taxon concepts. He will send Dusty some data and Dusty will evaluate for implementation.

Adding to Taxonomy Committee Agenda.

@Jegelewicz
Member Author

Jegelewicz commented Sep 18, 2019

Cam: Grant - taxon concepts and concept mappings for AK flora. Taxon concepts and mapping relate the intersection between a name and a publication. Cam currently has a stand-alone DB for this that we could bring into Arctos.
Dusty: Enhance the link between taxon names and concepts/publications. Add an ID field and a management tool for taxon concepts/maps (two tables)
Derek: How does this handle differences of opinion on validity? Add to taxonomy metadata?
Dusty: Add like the relationships source. Use in addition to the sensu field.
Derek: A pick would be useful - by pub author
Dusty: baby steps - will get it set up then work toward the pick
Derek: There isn't enough money or people or time to do it all!
Cam: This will just be the facility to store the data if available. TDWG group has a test DwC plug-in to capture this stuff - we could be the test case!
Derek: Use the sensu field?
Cam: Use current sensu to populate taxon concepts when the tables are there.
Teresa: Who has data in sensu fields? @dustymc will open an issue to look at it.
Cam to send graphical stuff to new issue.

Let's do this! Committee says let's set it up.

@mbprondzinski

mbprondzinski commented Sep 18, 2019 via email

@Jegelewicz
Member Author

Closing this as dupe of #2267
