Linkage of Parts to Collecting Events #1545

Closed
campmlc opened this issue May 30, 2018 · 81 comments

Labels
Enhancement; Priority-Critical (Arctos is broken)

Comments

@campmlc

campmlc commented May 30, 2018

Per conversation with John Wieczorek, we need to seek funding to implement changes to the event model so that parts = material samples are linked to events = occurrences. This would resolve a deficiency in our current data model and would alleviate the problems Arctos has with serving data to GGBN.
Perhaps this can be accomplished with GGBN funding.

campmlc added the Enhancement label May 30, 2018
@dustymc
Contributor

dustymc commented May 30, 2018

Sorry I missed that call!

If I'm understanding, this should work for multiple Occurrences resulting from multiple Encounters. I don't see how it would work for multiple Occurrences resulting from multiple equally-valid opinions. It absolutely will not work for a single part having been involved in lots of Occurrences (eg, most everything in any cultural collection).

What exactly are we trying to fix?

@campmlc
Author

campmlc commented May 31, 2018

In the Arctos model, there is no way to associate a specific material sample with an occurrence. We don't have a way to associate a part in a record that has multiple collecting events with a specific collecting event. We do this manually by adding in accession number and tissue number to Collecting Event remarks and to part attributes. But this is awkward and artificial. There needs to be a way to associate each part = material sample with a specific date and locality of collection.
This is related to the inability to track more than one accession within a single cataloged record. When there are multiple collecting events, there can be/must be multiple accessions as the only means of tracking which parts were collected when and where.

@dustymc
Contributor

dustymc commented May 31, 2018

This is mostly a duplicate of #602.

Specimen events are NOT Occurrences, they're just events that link to specimens. Some of them may eventually be mapped to DWC:Occurrences, but that can't/shouldn't drive our model.

must be multiple accessions as the only means of tracking which parts were collected

That does not sound like a stable (or particularly useful) pathway. You could name the events and use that or something - not elegant, but it is stable.

Option One: We could flip cataloged item and events. That will very likely drastically change how we see events, and I suspect it will increase the workload significantly. Eg, now when you get a georeference back from some external source (or add a corrected Event but want to keep the old or etc.) you just dump it in as another Event, and perhaps eventually make it accepted (or unaccepted) as time allows. Under this, I think you'd have to preemptively make that determination and move all parts, collectors, accns, otherIDs, attributes, etc. to the new event (or duplicate them or something weird). I'm not sure how you'd get at "this event, which is no longer accepted, was once accepted for THOSE parts-and-such" (esp. when there are multiple accepted and unaccepted part-producing events under a specimen).

Cultural items (parts) commonly go through a bunch of events - the one part is manufactured, used, etc. (And maybe so are the subjects of mark-recapture and similar, depending how you want to look at things - I think we often focus on the bit of blood, but the wolf is the real item of interest and was present at all of the events.) I think there would be a split of some sort between "this thing has been through multiple events" and "this thing is comprised of parts which originated at multiple events."

That still seems more or less CORRECT to me, but I don't think it's something we can approach without significant funding, and I'm not sure how usable we can make it even with funding.

Option Two: we could go all @tucotuco on this thing, embrace an actual event-based model, and make EVERYTHING into an event. (ID a specimen? Event. Add a part? Event. Agent relationship? Event.) I think this is the most powerful option - I certainly can't think of anything it's not capable of doing "correctly." AFAIK nothing remotely like this exists; development would require serious resources (I think this is a new-everything approach; eg, we'd want a couple months and/or a good consultant just to explore DB options).

Option Three: We could back up and reexamine how we're using cataloged item. Cataloged items are explicitly arbitrary, so I'm not sure there's anything inherently wrong with cataloging item-at-events rather than individuals. I think that's what every other system does, including some specimens in Arctos (eg, we can't PREVENT this even if there's another pathway - think tissues and bones in different collections), and there is comfort in numbers. There's lots of flexibility in how we present data (eg, a "one of many" cataloged item could pull from its relatives and look a lot like it does now), although I'm not sure how far we can push that in downloads and DWC and such. This probably still requires some funding, but I think it is almost certainly the lowest-impact option. It may also be the best reflection of how the data (field notes and such) are arranged.

I think Option Three may be the most approachable at the moment. It shouldn't require any back-end changes, just UI. The split is clear: "this thing has been through multiple events" is one cataloged item with multiple events, "this thing is comprised of parts which originated at multiple events" is multiple related cataloged items (perhaps with multiple events eg to reflect uncertainty). It's a complete reversal of our current approach - this would "require" (ish) un-merging all of those wolves you merged at MSB (sorry!), but should require no changes for "normal" specimens. The most concerning aspect is probably the chance of bits of one animal being treated as independent samples, but perhaps there's something we can do to clarify that (REQUIRE relationships in downloads is probably a good first step).

I'm sure there are more options - I'm not trying to constrain this, those are just all I can think of at the moment.

#1357 should probably go on the back burner until there's some commitment to something, so I'm flagging this critical.

dustymc added this to the Needs Discussion milestone May 31, 2018
dustymc added the Priority-Critical (Arctos is broken) label May 31, 2018
@Jegelewicz
Member

I agree that option 3 is the best at this time and I like the idea of "same as" relationships as a required part of downloads. Does this information get translated properly at the GBIF/iDigBio aggregator level?

@AJLinn

AJLinn commented Jun 14, 2018

As long as the changes don't impact how we're tracking the events associated with the cultural objects, I'll defer to the affected collections. It's vitally important that we are able to continue to clearly document that cultural objects are sometimes made/used/collected in 3 different places and times (or time ranges) or sometimes it's all the same; the object has a life (like a biological creature) and is involved in historical events sometimes as a result. It is that life we document thru these multiple events.

@campmlc
Author

campmlc commented Jun 14, 2018 via email

@KyndallH

I also do not like option 3.

What if we did option 2ish if it means "event" as in adding a date. Pretty much when you add the occurrence, it is tied to a date. What if we added a date of collection onto parts? For most, it would simply be the date of death which is super simple in our current system. Yet, when you add to a record you can add the occurrence that has a date and it will be tied to the part via the date on the part. This would be nice too for loans - subsample a tissue, date recorded (and ideally who did it/modified that part of the record - but that may just be a personal preference). Also, subsampled this insect on this date but the DNA extraction I'm adding wasn't done until this date. This may also help the cultural collections too. The object made on this date and location. Object modified on this date and in X location.

All occurrences tied to the catalog number. Different occurrences tied together by date within the catalog number.

Would this help solve the initial problem? In GGBN, you could have Cat. No. with date. So UAM:XXX 05-05-2018, UAM:XXX 02-02-2017, etc.

@campmlc
Author

campmlc commented Jun 14, 2018 via email

@campmlc
Author

campmlc commented Jun 14, 2018 via email

@dustymc
Contributor

dustymc commented Jun 14, 2018

I can't say I'm thrilled with it either, but I don't see a better approach within our reach.

What others do is easy: They ignore it, or maybe put something in some remarks field.

Nothing much should change at aggregators - we provide them Occurrences now, and we would with any other model that does what we need to do. They would be able to link those Occurrences to specific parts/attributes/etc. in any model we might end up in, and can't now.

The citation guidance is "cite the object of scientific interest." You don't really have one here - some folks are looking at the wolf, some are looking at the wolf when it was THERE, THEN. I think that's a wash.

We'll always have to deal with this, I think - someone looks at a skull and cites https://arctos.database.museum/guid/DMNS:Mamm:12344 instead of http://arctos.database.museum/guid/MSB:Mamm:233616. What I'm proposing is basically making this...

screen shot 2018-06-14 at 2 22 33 pm

... look more like this....

screen shot 2018-06-14 at 2 22 53 pm

eg, treat those two (or 50 or whatever) records more like part of the same thing (scientific viewpoint) rather than as distinct things that share some stuff (an administrative viewpoint).

date of collection onto parts?

The issue includes parts, attributes, otherIDs, collectors, media, encumbrances, and probably some other stuff. That's a lot of digital duct tape - I'd rather find a structurally-defensible solution.

I also don't see how we'd maintain referential integrity there. Event date=X, use that to link all that other junk, oops, actually the event date was Y - now what? Yes it's a "soft" linkage - it's not enforced (eg, there are not shared keys, just strings) - you can break it with a typo or by changing something or etc. (Think same data in MSB and DGR collections.)
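The difference between a "soft" string linkage and a keyed linkage can be sketched with a toy example. This is only an illustration in Python with an in-memory SQLite database; the `event` and `part` tables, their columns, and the function name `demo_links` are hypothetical and are not the Arctos schema.

```python
import sqlite3

def demo_links():
    """Return (parts still joined by key, parts still joined by date string)
    after the event date is corrected in one place."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Hypothetical, simplified tables; not the Arctos schema.
        CREATE TABLE event (event_id INTEGER PRIMARY KEY, event_date TEXT);
        CREATE TABLE part (
            part_id INTEGER PRIMARY KEY,
            event_id INTEGER REFERENCES event(event_id),  -- keyed link
            collected_on TEXT                             -- soft link: a copied string
        );
        INSERT INTO event VALUES (1, '2018-05-05');
        INSERT INTO part VALUES (10, 1, '2018-05-05');
        INSERT INTO part VALUES (11, 1, '2018-05-05');
    """)
    # The event date turns out to be wrong and is corrected on the event only.
    db.execute("UPDATE event SET event_date = '2018-05-06' WHERE event_id = 1")
    by_key = db.execute(
        "SELECT COUNT(*) FROM part JOIN event USING (event_id)"
    ).fetchone()[0]
    by_string = db.execute(
        "SELECT COUNT(*) FROM part JOIN event ON part.collected_on = event.event_date"
    ).fetchone()[0]
    return by_key, by_string

print(demo_links())  # (2, 0)
```

Both parts stay attached through the key, while the date-string match silently finds nothing once the date changes, which is the failure mode described above.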

@dustymc
Contributor

dustymc commented Jun 14, 2018

subsample a tissue, date recorded (and ideally who did it/modified that part of the record - but that may just be a personal preference). Also, subsampled this insect on this date but the DNA extraction I'm adding wasn't done until this date.

That's always been in the model, but it's not very exposed. Changes in the last round of GGBN brought it up a bit - you can explicitly create "subsamples" now. Who and when are pulled from user environment - that's easy enough to change, if ya'll are willing to provide those data when you create/modify parts.

UAM@ARCTOS> desc specimen_part
 Name                       Null?    Type
 -------------------------- -------- --------------
 COLLECTION_OBJECT_ID       NOT NULL NUMBER
 PART_NAME                  NOT NULL VARCHAR2(255)
 SAMPLED_FROM_OBJ_ID                 NUMBER
 DERIVED_FROM_CAT_ITEM      NOT NULL NUMBER

UAM@ARCTOS> desc coll_object
 Name                       Null?    Type
 -------------------------- -------- --------------
 COLLECTION_OBJECT_ID       NOT NULL NUMBER
 COLL_OBJECT_TYPE           NOT NULL CHAR(2)
 ENTERED_PERSON_ID          NOT NULL NUMBER
 COLL_OBJECT_ENTERED_DATE   NOT NULL DATE
 LAST_EDITED_PERSON_ID               NUMBER
 LAST_EDIT_DATE                      DATE
 COLL_OBJ_DISPOSITION       NOT NULL VARCHAR2(20)
 LOT_COUNT                  NOT NULL NUMBER
 CONDITION                  NOT NULL VARCHAR2(4000)
 FLAGS                               VARCHAR2(20)

UAM@ARCTOS> 

Extractions and bits you lop off into new parts and such should have a SAMPLED_FROM_OBJ_ID pointing to the part from which they were removed. All parts (which are collection objects) have entered and edited metadata.
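As a rough illustration of how a SAMPLED_FROM_OBJ_ID pointer lets you trace a subsample back to the part it came from, here is a small Python sketch. The part IDs and the function name `sample_lineage` are made up for the example; the dictionary simply stands in for rows of specimen_part (COLLECTION_OBJECT_ID mapped to SAMPLED_FROM_OBJ_ID).

```python
def sample_lineage(part_id, sampled_from):
    """Follow sampled-from pointers from a part back to the original part.

    sampled_from maps a part ID to the ID of the part it was removed from,
    or to None for a part that is not a subsample.
    """
    lineage = [part_id]
    seen = {part_id}
    while sampled_from.get(lineage[-1]) is not None:
        parent = sampled_from[lineage[-1]]
        if parent in seen:
            # Guard against bad data that points back at itself.
            raise ValueError(f"cycle in sampled-from chain at part {parent}")
        lineage.append(parent)
        seen.add(parent)
    return lineage

# Hypothetical IDs: tissue 101, a subsample 102 cut from it, an extraction 103 from that.
sampled_from = {101: None, 102: 101, 103: 102}
print(sample_lineage(103, sampled_from))  # [103, 102, 101]
```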

@dustymc
Contributor

dustymc commented Jun 14, 2018

The co-cataloged thing actually presents another problem - we need to distinguish between "Occurrences" created by admin decisions (eg, where to catalog stuff) and those created by repeated sampling. Date is probably close enough most of the time, but I think we should be explicit via a new relationship. "Same individual as" should be split into two terms, one which means "same critter, same event" and one which means "same critter, distinct event."

@dustymc
Contributor

dustymc commented Jun 15, 2018

There is a VERY quick-n-dirty demo of Option Three in test - http://arctos-test.tacc.utexas.edu/SpecimenDetail_MultiOccurrence.cfm?collection_object_id=12

i_am_tester should have access to all of the collections involved. You can create more of these in the normal way, but you'll need the collection_object_id (from flat, or I can help with that) to see them in the demo page.

I created a new relationship "occurrence of" (that can change, I just think it needs to be more explicit than "same individual as") and related some (unrelated) records to each other through it.

At the top of the (potential replacement) specimendetail page, Arctos checks for related occurrences - if it finds any it tries to pull data (from filtered_flat - this can't expose potentially-restricted data without explicit AWG approval), provides a link if it can't do that, and just displays the relationship/otherID data if it can't do that. (If we go here, maybe we should only allow occurrence of to link to things with resolvable identifiers - or not, zoo animals probably have bits cataloged in Arctos and in things that we can't talk to).

Formatting and details are obviously very preliminary, this is just a demonstration testing if we can create a useful representation of individuals from cataloged Occurrences. The data could be displayed differently in the "occurrence area" or mixed in with data from the cataloged item you're on, or WHATEVER. The only point of this is to make a big-picture decision regarding repeated sampling of individuals.

screen shot 2018-06-15 at 11 51 29 am

@campmlc
Author

campmlc commented Jun 18, 2018 via email

@campmlc
Author

campmlc commented Jun 18, 2018 via email

@dustymc
Contributor

dustymc commented Jun 18, 2018

what has changed on the display

The new table - screenshot in #1545 (comment) - vs. just the current red "there's another specimen that's sorta the same as this one" thing.

pop-up that only shows if the link is clicked

That's basically what the current links do. The new table thing gets the data on the same page. A popup is super-easy (eg, just stuff the related specimendetail page in a popup).

GBBN

It's basically what we're providing them now - a bit simpler to query perhaps. I don't think it would help us line up with GBIF (not-GGBN) OccurrenceIDs since we'd still have to have the "part" component for (only) GGBN.

@ccicero

ccicero commented Jun 19, 2018

I, like others, don't really like option 3. That seems like it's taking us backwards. Folks have spent a lot of time recataloging and organizing so that every part of a specimen has a single catalog number. We don't want to go back to recataloging them separately, and then down the line with more funding move to an event-based model which is really what we need (and what we've been discussing for at least 10+ years). We should think about how to fund option #2 without sacrificing what we have.

I'm not really sure what I'm looking at in the test example, but it's confusing and I don't think that the general user will understand this.

We have two distinct challenges here: (1) Internal - how to make Arctos work for these kinds of data - linking parts, attributes, identifiers, etc. to a collecting event. (2) External - publishing occurrence data to aggregators including GGBN. For #1, we should work on what will be the most robust solution which seems to be pointing to an event based model that will require more funding. For #2, what if we exported the multiple occurrence data from one cataloged item as one occurrence record, with localities and dates concatenated in a way that's similar to how we concatenate parts and attributes? It won't be a clear 1:1 relationship, but anyone who comes across such a record and wants more information should be encouraged to contact the original source for clarification.

The number of records in Arctos with multiple occurrences that represent actual multiple accepted events is probably relatively small? I'm curious what % of records have multiple accepted events across all collections.

@dustymc
Contributor

dustymc commented Jun 19, 2018

taking us backwards

Or we've been taking us away from the "correct" model. Nobody has ever done anything quite like this. It's not really surprising that we had to throw some real-world data at it to see a problem.

lot of time recataloging and organizing so that every part of a specimen has a single catalog number.

2 angles:

  1. They've failed (and will continue to do so as long as administrators are involved) - https://arctos.database.museum/guid/DMNS:Mamm:12344 vs http://arctos.database.museum/guid/MSB:Mamm:233616. We WILL have "same specimen" data with multiple catalog numbers. If we have to deal with it anyway, maybe we can use that to do something clever here.
  2. "Specimen"="cataloged item" (http://handbook.arctosdb.org/documentation/catalog.html) and "cataloged item"="whatever someone felt like cataloging, preferably the item of scientific interest." I can see nothing "wrong" about cataloging Occurrences - that's what everyone else does (from necessity, but still...), and they are the "item of scientific interest" from some perspectives/for some users (like GBIF).

The only "sacrifice" I see involves citations - given two records and identifiers, researchers WILL do weird things (including crappy science). MAYBE we can do more with relationships or something, but this is a real problem "in the wild." It's not much of a problem locally - we can force them to see the related data (demo above), but we can't force GBIF to give them that perspective or not let them delete that big messy column full of links, whatever it was, from their downloads.

In any case, I won't intentionally do anything that loses data; nothing will prevent us from funding a better model. (And I'm actually not so sure how it would fix this - ya'll are still going to want to catalog things, and you'll still pick either the wolf or the occurrence of the wolf to catalog, and here we'll be again. An event-based model might not be constrained by structure, but that won't necessarily help with "tradition.")

Under the current model, you find a wolf (with a single Occurrence/Event), cite it, then someone dumps 20 more events in and what was a good citation when you made it is now ambiguous - I'm not convinced that we're doing such a great job with citations now.

but it's confusing and I don't think that the general user will understand this.

Everything else they've ever seen, including Arctos samples in GBIF, treats every "Occurrence" as a separate THING. If we're trying to reduce confusion, joining the herd probably makes some sense. (But see above...)

Anything that works for Arctos will work for DWC. The problem is that these data are being provided - entered and stored - in a nonrecoverable format.

concatenate parts and attributes

Those are things we make up - GBIF can't tell a concatenated part from our data (neither can I!), and DWC apparently never anticipated the possibility of multiple parts. GBIF (and I) can absolutely tell a list of states from a state, and DWC dictates that the item of scientific interest is the Occurrence, not individual. I think this would be more confusing, not less.

We do have a "JSON Locality" option - I'm not sure how GBIF et al. would respond to that though.

screen shot 2018-06-19 at 8 37 55 am

anyone who comes across such a record and wants more information should be encouraged to contact the original source

We don't share most of our data via DWC - this is and always will be true for everything. (And since this is mostly about citations, has ANYONE successfully traced a GBIF citation back to an Arctos specimen? I can't. I'm not sure that what we catalog is our most pressing citation-related problem!)

number of records in Arctos with multiple occurrences

Around 50000.

that represent actual multiple accepted events

Who knows - this is why we need "occurrence of" and "same individual as" (whatever we call them). "Dozens, maybe hundreds" probably.

And FWIW that still doesn't leave us with Occurrences - https://arctos.database.museum/guid/DMNS:Mamm:12344 and http://arctos.database.museum/guid/MSB:Mamm:233616 are one Occurrence, http://arctos.database.museum/guid/MVZ:Egg:10972 is at least two, etc. "Occurrences" are something we map to when we can, not really something we natively have.

THANKS!! - this is very useful.

I still don't much like Option One - I think it would have serious usability issues.

I do like Option Two, but it's a major project and I'm not sure it solves anything all by itself. What WOULD you catalog in a pure event-based model anyway?

If anyone has an Option Four, this would be a really great time to throw it out there!

If we do go with Option Three, perhaps we can do more with https://arctos.database.museum/info/ctDocumentation.cfm?table=CTCATALOGED_ITEM_TYPE. The existing data are useless - we have "observation" which basically means "something that didn't get cataloged in the main collection" but contains things that would be cataloged in SOME real collections, and "specimen" which we've explicitly defined as "something someone felt like cataloging."

https://arctos.database.museum/info/ctDocumentation.cfm?table=CTCOLL_OBJECT_TYPE also exists, although I'm not really sure what any of that stuff is or what it's supposed to do - I just use CI (cataloged item) and SP (specimen part) because it's there, not because it does anything for me.

@dustymc
Contributor

dustymc commented Jun 19, 2018

Also re:

Folks have spent a lot of time recataloging and organizing

I can magic some/most of that, if we do end up unraveling it. I'm looking at http://arctos.database.museum/guid/MSB:Mamm:292063. The dates in part remarks in an unambiguous format would be REALLY helpful - I can probably get from "Coll: 15 June 2014" to "Event Date: 2014-06-15" but it's also somewhat likely to be messy.

Those accession numbers can probably be resolved to transactions.

comma-space-NK-space-integer seems to lead to an otherID.

The data in part attributes is better, but seems very inconsistent (eg, doesn't exist for most parts). In general, I think I'd rather deal with messy data in one place instead of less-messy data in a bunch of places - I'm not sure this is useful unless it's consistently applied.

If I can turn strings into data objects, I should be able to handle the conversion. If I can't do that, it's because the data don't support it - eg, it's evidence that this model is insufficient.
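A minimal sketch of that kind of string-to-data conversion, in Python. It assumes remarks carry a date as "Coll: 15 June 2014" and an NK number as comma-space-NK-space-integer, as described above; the combined sample remark, the function name `parse_part_remark`, and the output keys are hypothetical, and real remarks are likely to be messier. Month names are parsed with `%B`, which assumes an English locale.

```python
import re
from datetime import datetime

# Patterns taken from the two conventions mentioned in this comment.
DATE_RE = re.compile(r"Coll:\s*(\d{1,2} [A-Za-z]+ \d{4})")
NK_RE = re.compile(r", NK (\d+)")

def parse_part_remark(remark):
    """Pull an ISO event date and an NK number out of a part remark, if present."""
    result = {"event_date": None, "nk": None}
    date_match = DATE_RE.search(remark)
    if date_match:
        try:
            parsed = datetime.strptime(date_match.group(1), "%d %B %Y")
            result["event_date"] = parsed.date().isoformat()
        except ValueError:
            # Leave the date empty rather than guess at an unparseable string.
            pass
    nk_match = NK_RE.search(remark)
    if nk_match:
        result["nk"] = int(nk_match.group(1))
    return result

# Hypothetical remark combining both conventions.
print(parse_part_remark("Coll: 15 June 2014, NK 12345"))
# {'event_date': '2014-06-15', 'nk': 12345}
```

Remarks that match neither pattern come back with both values empty, which gives a simple way to count how much of the existing data is recoverable.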

I don't really see this component as much of a problem, or at least not as a fatal problem.

Having a half-dozen primary identifiers (catalog numbers) which all mean "that wolf" looks to me like the biggest problem with that approach. (It's also what we have in GBIF now, assuming folks generally cite Occurrences and not IndividualID, and it's significantly less confusing than what we're giving to GGBN.)

To be clear, I'm not really advocating anything at this point, I'm just trying to understand the possibilities and what they mean for everything else.

@campmlc
Author

campmlc commented Jun 19, 2018 via email

@campmlc
Author

campmlc commented Jun 19, 2018 via email

@campmlc
Author

campmlc commented Jun 19, 2018 via email

@dustymc
Contributor

dustymc commented Jun 19, 2018

Again, I'm not advocating for anything, I'm just trying to respond to your needs as I see them.

Big-picture, there are three possibilities:

  1. Do nothing-ish. What we have now sorta works, and it works just fine for most of what's in Arctos. I don't really see how you could avoid using strings as keys, which as you say comes with transcription errors and such. I don't see this as a defensible long-term solution, but that doesn't mean we HAVE to stop (yet, anyway).
  2. Do something with our current resources - eg, we KNOW Option Three "works" because it's what we've done in the past (and what we'd continue to do with the "first cataloged Occurrence" and etc.), we just need to figure out if SOMEHOW pulling related "records" together - by creating relationships, displaying things differently, creating new kinds of data objects (eg, new classes of cataloged items and/or collection objects) to carry that information, or something of the sort - is a useful approach.
  3. Do something big. Event-based models might do something useful, but as above I'm not sure it's a silver bullet either.

I don't think what I'm "proposing" (such a strong word for this stage!) undoes anything - if the data are clean it's just a more stable representation of them, and if they're not they're not. Eg,

This method was standardly applied, and could theoretically be undone.

is true - IF those data can be pulled apart into "Occurrences" then what I'm "proposing" just adds value (eg, explicit links between parts and place-time and identifications and etc. for individuals). If the standards were less standard than hoped, then the data aren't very accessible now and won't be very accessible in any other model.

We came up with a hypothesis and I think ya'll are telling me it's becoming unsatisfactory as we throw enough data at it for details to emerge - science!

the ability to link all derivative data together in a single record[VIEW] to compare data over time and examine for patterns and errors

"A record" is an arbitrary thing in Arctos (and any other deeply-relational structure). There are many thousands of "records" in Arctos that have multiple IDs. Those cover

  • something's funky - we're not so good at identifying or transcribing or interpreting our marks or something
  • second opinion - looks like A to you and B to me.
  • taxonomic weirdness - "let's all call this Myodes Clethrionomys Myodes Clethrionomys Myodes Clethrionomys Myodes."
  • Probably some other stuff.

I sort of think explicitly separating the "something's funky" case (eg, linking the IDs to two data objects - cataloged items perhaps) is a useful approach - I think it's probably easier to bring those together than it is to try to separate out two "views" from one "record."

@campmlc
Author

campmlc commented Aug 27, 2018

Some comments on the test interface:

  1. Specimens that only have one event should automatically be linked to that event, correct? This one is not: http://arctos-test.tacc.utexas.edu/guid/MSB:Para:20609

  2. Specimens with multiple events that have the same info in part remarks and in event remarks should be linkable with a script at some point, so we don't have to do this manually? e.g.
    http://arctos-test.tacc.utexas.edu/guid/MSB:Mamm:157068

  3. For the multi-event record above, I have the following recommendations:
    a. Please make the font smaller or adjust the popup that appears when clicking "pick event" so that all of the event info is visible in one screen without having to scroll far to the right to access the remarks and link button.
    b. It is going to be really painful to do this one at a time - again, we need a good script for when it is possible to use one.
    c. Please change the highlight color on highlight linked components - the current highlight is not obvious.
    d. Any way we can choose to change how our parts are displayed so that we have all parts from a single event together, rather than by part type?

@dustymc
Contributor

dustymc commented Aug 27, 2018

Specimens that only have one event should automatically be linked to that event, correct? This one is not: http://arctos-test.tacc.utexas.edu/guid/MSB:Para:20609

We can discuss, but I don't think so - it adds some unnecessary complexity and doesn't DO anything. This is a new EXTRA link, not a new pathway. We may make the link implicitly for various reasons (exporting DWC).

Specimens with multiple events that have the same info in part remarks and in event remarks should be linkable with a script at some point, so we don't have to do this manually? e.g.
http://arctos-test.tacc.utexas.edu/guid/MSB:Mamm:157068

#1545 (comment)

I may be able to make more links, but that should be sufficient for testing.

following recommendations

#1545 (comment)

I think we're looking for "functional" rather than polished at this point.

I think the order of priorities should be about...

  1. Does this DO what you need; is it big-picture useful? (There should be enough to answer that now.)
  2. Can we use it (eg, for GGBN)? (I think so, but I'd like to play with that a bit more before we get too deep.)
  3. refinements - recover existing links from remarks-and-such, shuffle things around on the screen, etc.
  4. new data - how's this work from anywhere except specimendetail?

@campmlc
Author

campmlc commented Aug 27, 2018

To answer #1) and 2) does this do what we need and can we use it for GGBN - can we see a demo of what one of these linked occurrences would look like in the GGBN export, with http://arctos-test.tacc.utexas.edu/guid/MSB:Mamm:157068 as the example? This one has some linked parts to event 1, some linked parts to event 2, and some unlinked parts in test.

@dustymc
Contributor

dustymc commented Aug 27, 2018

I still think this needs to be more sequential - does it do whatever you're trying to do with the part attributes and remarks and accn and whatever else?

Assuming that answer is yes....

If that all works, linked parts will (eventually, somehow) go out with their "occurrence." Unlinked parts will, lacking better ideas, go out with the "priority" specimen event - https://github.com/ArctosDB/DDL/blob/master/functions/getPrioritySpecimenEvent.sql. http://arctos-test.tacc.utexas.edu/guid/MSB:Mamm:157068 would go out (to GBIF-and-such) as two Occurrences:

These are the asserted links:


UAM@ARCTOSTE> select specimen_event_id, part_id  from specimen_event_links where collection_object_id=2791376;

SPECIMEN_EVENT_ID    PART_ID
----------------- ----------
	  3232421   26094176
	  3232421   26094250
	  3232421   26094124
	  3232421   26094334
	    54243   21967548
	    54243    2791380

and unlinked parts...


UAM@ARCTOSTE> select specimen_part.COLLECTION_OBJECT_ID from specimen_part where derived_from_cat_item=2791376 and COLLECTION_OBJECT_ID not in (select part_id from specimen_event_links where COLLECTION_OBJECT_ID=2791376) ;

COLLECTION_OBJECT_ID
--------------------
	    26094514
	    21967547
	    26094000
	    26093729
	    26093858
	    26094559
	    26094387
	    26094459
	     2791381
	    25901180
	     2791379
	    26093938
	     2791378
	     2791377
	    26093786
	    26094750
	    26094693
	    26094598
	    26094641

19 rows selected.


would get lumped in with the "priority" event, which is


UAM@ARCTOSTE> select getPrioritySpecimenEvent(2791376) from dual;

GETPRIORITYSPECIMENEVENT(2791376)
---------------------------------
			  3232421

I think that'll work the same way it does now, where parts are smooshed into a string.


UAM@ARCTOSTE> select parts from flat where collection_object_id=2791376;

PARTS
------------------------------------------------------------------------------------------------------------------------
postcranial skeleton; skull; skin; blood (EDTA); blood (EDTA); blood (EDTA); blood (EDTA); blood (EDTA); kidney (frozen)
; blood (EDTA); blood (EDTA); blood (EDTA); blood (EDTA); blood (frozen); blood (frozen); blood (frozen); blood serum (f
rozen); blood serum (frozen); muscle (frozen); heart (frozen); blood (EDTA); blood (EDTA); liver (frozen); blood (EDTA)

In this case one "parts" value will be short (2 parts) and the other long (23 parts).
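
As a rough illustration of that split (a sketch only, not the Arctos implementation), the grouping can be reproduced in Python from the part and event IDs shown in the queries above. The function name `parts_by_event` is made up for this example.

```python
# Part IDs and event IDs are copied from the query output above.
LINKED = {
    26094176: 3232421, 26094250: 3232421, 26094124: 3232421,
    26094334: 3232421, 21967548: 54243, 2791380: 54243,
}
UNLINKED = [
    26094514, 21967547, 26094000, 26093729, 26093858, 26094559, 26094387,
    26094459, 2791381, 25901180, 2791379, 26093938, 2791378, 2791377,
    26093786, 26094750, 26094693, 26094598, 26094641,
]
PRIORITY_EVENT = 3232421  # value returned by getPrioritySpecimenEvent(2791376)


def parts_by_event(linked, unlinked, priority_event):
    """Group part IDs by event; unlinked parts fall back to the priority event."""
    grouped = {}
    for part_id, event_id in linked.items():
        grouped.setdefault(event_id, []).append(part_id)
    for part_id in unlinked:
        grouped.setdefault(priority_event, []).append(part_id)
    return grouped


grouped = parts_by_event(LINKED, UNLINKED, PRIORITY_EVENT)
for event_id, part_ids in grouped.items():
    print(event_id, len(part_ids))
# prints:
# 3232421 23
# 54243 2
```

Each group would then be concatenated into one "parts" string for its Occurrence.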

For GGBN, which...

GGBN expects to have one MaterialSample per record in its Darwin Core Occurrence archives

each part will define an "Occurrence" (which won't be an Occurrence at all, but we're stuck with the vocabulary).

select 
	specimen_part.COLLECTION_OBJECT_ID,
	nvl(
		specimen_event_links.specimen_event_id,
		getPrioritySpecimenEvent(specimen_part.derived_from_cat_item)
	) pieceOfTheOccurrenceID
from 
	specimen_part,
	specimen_event_links 
where 
	specimen_part.COLLECTION_OBJECT_ID=specimen_event_links.part_ID (+) and
	 specimen_part.derived_from_cat_item=2791376
;
COLLECTION_OBJECT_ID PIECEOFTHEOCCURRENCEID
-------------------- ----------------------
	    26094176		    3232421
	    26094250		    3232421
	    26094124		    3232421
	    26094334		    3232421
	    21967548		      54243
	     2791380		      54243
	    26094514		    3232421
	    21967547		    3232421
	    26094000		    3232421
	    26093729		    3232421
	    26093858		    3232421
	    26094559		    3232421
	    26094387		    3232421
	    26094459		    3232421
	     2791381		    3232421
	    25901180		    3232421
	     2791379		    3232421
	    26093938		    3232421
	     2791378		    3232421
	     2791377		    3232421
	    26093786		    3232421
	    26094750		    3232421
	    26094693		    3232421
	    26094598		    3232421
	    26094641		    3232421

25 rows selected.

so that specimen would have 25 "Occurrences" each with one part until GGBN can support one Occurrence having multiple MaterialSample "children." (Or I suppose we could send out a single part - "kidney (frozen) + liver (frozen) + ..." - or something, but that would be contrary to GGBN's part-centric outlook. In any case SOMETHING has to be denormalized to support their model.)
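
For anyone not used to Oracle's `(+)` and `nvl`, the query above is a left outer join from parts to links with a fallback value. Here is a minimal sketch of the same shape using Python's built-in sqlite3, with only a handful of the part IDs above. The `priority_event` table is a hypothetical stand-in for `getPrioritySpecimenEvent()`, which is an Oracle function and is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specimen_part (
        collection_object_id INTEGER PRIMARY KEY,
        derived_from_cat_item INTEGER
    );
    CREATE TABLE specimen_event_links (
        specimen_event_id INTEGER,
        part_id INTEGER
    );
    -- Hypothetical stand-in for getPrioritySpecimenEvent()
    CREATE TABLE priority_event (
        collection_object_id INTEGER PRIMARY KEY,
        specimen_event_id INTEGER
    );
""")
# Subset of the parts of specimen 2791376: three linked, two unlinked.
conn.executemany(
    "INSERT INTO specimen_part VALUES (?, 2791376)",
    [(26094176,), (21967548,), (2791380,), (26094514,), (2791377,)],
)
conn.executemany(
    "INSERT INTO specimen_event_links VALUES (?, ?)",
    [(3232421, 26094176), (54243, 21967548), (54243, 2791380)],
)
conn.execute("INSERT INTO priority_event VALUES (2791376, 3232421)")

# LEFT JOIN keeps unlinked parts; COALESCE plays the role of nvl().
rows = conn.execute("""
    SELECT p.collection_object_id,
           COALESCE(l.specimen_event_id, pe.specimen_event_id) AS event_id
    FROM specimen_part p
    LEFT JOIN specimen_event_links l ON l.part_id = p.collection_object_id
    JOIN priority_event pe ON pe.collection_object_id = p.derived_from_cat_item
    WHERE p.derived_from_cat_item = ?
    ORDER BY p.collection_object_id
""", (2791376,)).fetchall()

for part_id, event_id in rows:
    print(part_id, event_id)
```

Every part comes back exactly once, which is the one-row-per-MaterialSample shape GGBN asks for.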

@campmlc
Author

campmlc commented Aug 27, 2018 via email

@dustymc
Contributor

dustymc commented Aug 28, 2018

send a list

create table temp_multi_evt as select collection_object_id from specimen_event where verificationstatus!='unaccepted' having count(*) > 1 group by collection_object_id;
alter table temp_multi_evt add num_parts number;
update temp_multi_evt set num_parts=(select count(*) from specimen_part where specimen_part.derived_from_cat_item=temp_multi_evt.collection_object_id);
alter table temp_multi_evt add num_linked_parts number;
update temp_multi_evt set num_linked_parts=(select count(*) from specimen_event_links where specimen_event_links.collection_object_id=temp_multi_evt.collection_object_id);
alter table temp_multi_evt add guid varchar2(255);
update temp_multi_evt set guid=(select guid from flat where flat.collection_object_id=temp_multi_evt.collection_object_id);
-- get rid of some stuff we know 
delete from temp_multi_evt where guid like 'UAM:EH%';
delete from temp_multi_evt where guid like 'UAMb:Herb:%';
select substr(guid,1,instr(guid,':',1,2)) || ' @ ' || count(*) from temp_multi_evt group by  substr(guid,1,instr(guid,':',1,2));
SUBSTR(GUID,1,INSTR(GUID,':',1,2))||'@'||COUNT(*)
------------------------------------------------------------------------------------------------------------------------
MSB:Mamm: @ 6244
UTEP:Herb: @ 123
CHAS:Bird: @ 11024
DMNS:Inv: @ 7
UTEP:Herp: @ 1318
UWBM:Herp: @ 21
MVZ:Mamm: @ 5
UAMObs:Ento: @ 1
KWP:Ento: @ 3
UAM:Ento: @ 2
UCM:Fish: @ 1279
UAM:Mamm: @ 1
MSB:Bird: @ 2
MVZ:Bird: @ 5
CHAS:Egg: @ 474
MVZ:Herp: @ 1
UTEP:HerpOS: @ 139
DMNS:Bird: @ 8
CHAS:Mamm: @ 1
DMNS:Mamm: @ 4
UTEP:Inv: @ 1121

temp_multi_evt.csv.zip
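
The `substr(guid,1,instr(guid,':',1,2))` expression in the summary query keeps everything up to and including the second colon of the GUID. A small Python sketch of the same per-collection count, in case someone wants to redo it from the CSV: the first three GUIDs are ones mentioned in this thread, the last one is made up, and `collection_prefix` is a name invented for this example.

```python
from collections import Counter


def collection_prefix(guid):
    """Return the GUID up to and including its second colon."""
    # Raises ValueError if the GUID has fewer than two colons.
    institution, collection, _catalog_number = guid.split(":", 2)
    return f"{institution}:{collection}:"


guids = [
    "MSB:Mamm:157068",
    "MSB:Mamm:292063",
    "MSB:Mamm:306166",
    "UTEP:Herp:1",  # hypothetical GUID, for illustration only
]

counts = Counter(collection_prefix(g) for g in guids)
for prefix, n in counts.items():
    print(f"{prefix} @ {n}")
# prints:
# MSB:Mamm: @ 3
# UTEP:Herp: @ 1
```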

scripts

I think I've recovered what I can from dates in part remarks. Let me know (examples are useful) what else might be used as a link and I'll script what I can.

iDigBio and GBIF

MAYBE we'll eventually send them slightly more accurate part concatenations, but nothing else will change.

Nothing about the format of what we're sending to GGBN will change, the event data will just be a bit better-targeted for those parts which have explicit links.

@campmlc
Copy link
Author

campmlc commented Aug 28, 2018 via email

@dustymc
Contributor

dustymc commented Aug 28, 2018

NK

Example?

@dustymc
Contributor

dustymc commented Aug 29, 2018

There is a new table digir_query.msb_mamm_ggbn_tissue_tbl containing MSB:Mamm tissue not-really-Occurrences for GGBN.

DDL is https://github.com/ArctosDB/DDL/blob/master/flat/ggbn_flat_tissue.sql


UAM@ARCTOS> select count(*) from digir_query.msb_mamm_ggbn_tissue_tbl;

  COUNT(*)
----------
    482101

1 row selected.

Elapsed: 00:00:00.79
UAM@ARCTOS> select occurrenceID from digir_query.msb_mamm_ggbn_tissue_tbl having count(*) > 1 group by occurrenceID;

no rows selected

Elapsed: 00:00:01.07
UAM@ARCTOS> select occurrenceID from digir_query.msb_mamm_ggbn_tissue_tbl where rownum<5;

OCCURRENCEID
------------------------------------------------------------------------------------------------------------------------
http://arctos.database.museum/guid/MSB:Mamm:69674?pid=27256939
http://arctos.database.museum/guid/MSB:Mamm:196496?pid=21375444
http://arctos.database.museum/guid/MSB:Mamm:85721?pid=2185666
http://arctos.database.museum/guid/MSB:Mamm:53231?pid=2140369

4 rows selected.


UAM@ARCTOS> select occurrenceID,preparationType,eventDate from digir_query.msb_mamm_ggbn_tissue_tbl where references='http://arctos.database.museum/guid/MSB:Mamm:292063' order by eventDate
  2  ;

OCCURRENCEID
------------------------------------------------------------------------------------------------------------------------
PREPARATIONTYPE
------------------------------------------------------------------------------------------------------------------------
EVENTDATE
------------------------------------------------------------------------------------------------------------------------
http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994537
blood (EDTA)
2014-06-15

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994538
blood serum (frozen)
2014-06-15

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994545
blood (EDTA)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994541
blood (EDTA)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994542
blood (EDTA)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=27923980
blood (EDTA)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994546
blood serum (frozen)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994544
blood (EDTA)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994543
blood (EDTA)
2014-06-30

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994547
blood (EDTA)
2014-07-21

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994550
blood (EDTA)
2014-07-21

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994549
blood (EDTA)
2014-07-21

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994551
blood (EDTA)
2014-07-21

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994548
blood (EDTA)
2014-07-21

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25994552
blood serum (frozen)
2014-07-21

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25988964
blood serum (frozen)
2014-10-19

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=26966404
blood (EDTA)
2014-10-19

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25988963
blood (EDTA)
2014-10-19

http://arctos.database.museum/guid/MSB:Mamm:292063?pid=25988965
blood serum (frozen)
2014-10-19


19 rows selected.
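
Judging only from the output above, each occurrenceID looks like the specimen URL plus a `?pid=` query string carrying the part ID. Assuming that pattern holds, a consumer could split the IDs apart and regroup parts by event date along these lines. This is a sketch using a few of the rows above; `split_occurrence_id` is a name invented here.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def split_occurrence_id(occurrence_id):
    """Split an occurrenceID into (specimen GUID, part ID)."""
    parts = urlsplit(occurrence_id)
    guid = parts.path.rsplit("/", 1)[-1]
    pid = int(parse_qs(parts.query)["pid"][0])
    return guid, pid


BASE = "http://arctos.database.museum/guid/MSB:Mamm:292063"
# (occurrenceID, preparationType, eventDate), copied from the output above
records = [
    (f"{BASE}?pid=25994537", "blood (EDTA)", "2014-06-15"),
    (f"{BASE}?pid=25994538", "blood serum (frozen)", "2014-06-15"),
    (f"{BASE}?pid=25994545", "blood (EDTA)", "2014-06-30"),
    (f"{BASE}?pid=25988964", "blood serum (frozen)", "2014-10-19"),
]

# Same check as the "having count(*) > 1" query: occurrenceIDs are unique.
ids = [occurrence_id for occurrence_id, _, _ in records]
assert len(ids) == len(set(ids))

by_date = defaultdict(list)
for occurrence_id, _prep, event_date in records:
    _guid, pid = split_occurrence_id(occurrence_id)
    by_date[event_date].append(pid)

for event_date in sorted(by_date):
    print(event_date, by_date[event_date])
```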


@dustymc
Contributor

dustymc commented Aug 29, 2018

I found a bunch more by NK, but I also found a bunch with conflicting data, attached.

PID: partID
BARCODE: what it sounds like
PART_REMARK: what it sounds like
EID1: linked event (from dates)
EID2: event that NK suggests should be linked
BDn: began_date from event
SE_REMARKn: event remarks

temp_multi_link.csv.zip

@campmlc
Author

campmlc commented Aug 29, 2018 via email

@dustymc
Contributor

dustymc commented Sep 6, 2018

Can this be closed?

@tucotuco

tucotuco commented Sep 6, 2018

Maybe let me make a GGBN resource using that view first before closing?

@campmlc
Author

campmlc commented Sep 6, 2018 via email

@tucotuco

tucotuco commented Sep 6, 2018 via email

@campmlc
Author

campmlc commented Sep 6, 2018 via email

@dustymc
Contributor

dustymc commented Sep 6, 2018

Wilco.

This is for GGBN and only GGBN. It's vastly different than what we're sending to everyone else.

@campmlc
Author

campmlc commented Nov 16, 2018

The process of linking parts to specimen events is working, but it is fairly complicated and needs some help to be more functional and user friendly. Specifically, in the specimen record, we need the specimen events to display in order by date, and we need the parts to display in the same order, at least within part type. In the current system, events seem to be in order of when they were created (?), which for this type of legacy data where multiple records are being consolidated, results in random order by date. Then, linking to parts that also appear to end up, within part type, in random order, becomes very difficult. Can we have events and parts present in some consistent ordering scheme? I would prefer most recent event first, with the associated parts in the same order.
This would make the process of finding and linking associated parts and events much easier and more intelligible.
As a corollary to this, identifiers also appear to get added in some random order. For these multiple events, parts from each event have a shared unique identifier (NK). Currently, these display in somewhat random order - it would be helpful to have them display in numeric order, smallest to largest, in the identifiers box of the specimen record.
This also applies to all Arctos pages and displays in general, including object tracking. How exactly does Oracle decide to order things, and can we choose to have a consistent order throughout?

http://arctos.database.museum/guid/MSB:Mamm:306166

@campmlc
Author

campmlc commented Nov 16, 2018

To make things even more interesting, it turns out that specimens display in a different order when you click "edit" than they do on the main display page. This is problematic for entering any edits to the correct event.

@dustymc
Contributor

dustymc commented Feb 13, 2019

@campmlc it's 2 clicks; please elaborate on "complicated."

There is no date associated with parts. I can't sort by things that don't exist.

Events are sorted by a complicated thing that considers verificationstatus and type and such. I'm not sure if there's enough information to replicate that in the pick, but I'll check.

Sorting identifiers could be an Issue, but likely needs linguistic indexes.

Oracle does not order things. We can request sorting anywhere, but it's not free. That is also a new Issue.
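
To make the sorting point concrete, here is a small Python sketch (not Arctos code). Identifiers stored as text sort character by character unless a numeric key is requested, and "most recent event first" is just a descending sort on ISO dates. The NK values are made up; the dates are eventDate values from the MSB:Mamm:292063 output earlier in this thread.

```python
# Hypothetical NK numbers stored as text, as identifiers usually are.
nk_numbers = ["1000", "25", "300"]

print(sorted(nk_numbers))           # ['1000', '25', '300'] (text order)
print(sorted(nk_numbers, key=int))  # ['25', '300', '1000'] (numeric order)

# ISO 8601 dates sort correctly as plain strings.
event_dates = ["2014-06-30", "2014-10-19", "2014-06-15", "2014-07-21"]
newest_first = sorted(event_dates, reverse=True)
print(newest_first)
# ['2014-10-19', '2014-07-21', '2014-06-30', '2014-06-15']
```

In SQL terms both amount to an explicit ORDER BY on every query that feeds a display. The numeric key only works where the identifier values really are all digits.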

Can this be closed?

@campmlc
Author

campmlc commented Feb 13, 2019 via email

@tucotuco

tucotuco commented Mar 11, 2019 via email

@tucotuco

Test resource now ready for testing at http://ipt.vertnet.org:8080/ipt/resource.do?r=msbmammalggbntest

@campmlc
Author

campmlc commented Mar 13, 2019 via email

@tucotuco

tucotuco commented Mar 14, 2019 via email

@dustymc
Contributor

dustymc commented Apr 2, 2019

Can this be closed?

@tucotuco

tucotuco commented Jul 10, 2019 via email

@dustymc dustymc closed this as completed Jul 10, 2019
@Jegelewicz Jegelewicz removed this from the Active Development milestone Nov 21, 2020