
TG1 - Develop a Vocabulary based on the Framework for a TDWG Data Standard #164

Open
ArthurChapman opened this issue Sep 4, 2018 · 25 comments

@ArthurChapman
Collaborator

Discussion under #152 has strongly suggested that a Vocabulary for the terms used in the Tests and Assertions should be developed separately from the Vocabulary being developed under the Framework on Data Quality. Relevant terms in the Tests Vocabulary will refer to the definitions used in the Framework Vocabulary. See the discussion under #152.

@ArthurChapman ArthurChapman changed the title TG1 - Develop a Vocabulary based on the Framework for TDWG Data Standard TG1 - Develop a Vocabulary based on the Framework for a TDWG Data Standard Sep 4, 2018
@ArthurChapman
Collaborator Author

Some of the definitions in the Framework Vocabulary will need re-evaluating in the light of TG2 use and circumscription.

@chicoreus chicoreus added the TG1 label Sep 6, 2018
@chicoreus
Collaborator

We've got at least three candidates for vocabularies:

  1. The data quality DIMENSION values (Completeness, Conformance, Reliability, Consistency, Likelihood, Resolution).
  2. The controlled vocabulary values for Response.Result in a Data Quality Report (COMPLETE, NOT_COMPLETE, COMPLIANT, NOT_COMPLIANT, PROBLEM, NOT_PROBLEM).
  3. The controlled vocabulary values for Response.Status in a Data Quality Report (e.g. RUN_HAS_RESULT, DATA_PREREQUISITES_NOT_MET, EXTERNAL_PREREQUISITES_NOT_MET, FILLED_IN, AMENDED). It is possible that some other values from TG2's analysis go here (e.g. AMBIGUOUS), but that may need further analysis.
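The three candidate vocabularies above could be captured as enumerations; here is a minimal sketch in hypothetical Python (the class names are illustrative, not part of any published framework API):

```python
from enum import Enum

class Dimension(Enum):
    """Candidate data quality DIMENSION vocabulary."""
    COMPLETENESS = "Completeness"
    CONFORMANCE = "Conformance"
    RELIABILITY = "Reliability"
    CONSISTENCY = "Consistency"
    LIKELIHOOD = "Likelihood"
    RESOLUTION = "Resolution"

class Result(Enum):
    """Candidate controlled vocabulary for Response.Result in a Data Quality Report."""
    COMPLETE = "COMPLETE"
    NOT_COMPLETE = "NOT_COMPLETE"
    COMPLIANT = "COMPLIANT"
    NOT_COMPLIANT = "NOT_COMPLIANT"
    PROBLEM = "PROBLEM"
    NOT_PROBLEM = "NOT_PROBLEM"

class Status(Enum):
    """Candidate controlled vocabulary for Response.Status in a Data Quality Report."""
    RUN_HAS_RESULT = "RUN_HAS_RESULT"
    DATA_PREREQUISITES_NOT_MET = "DATA_PREREQUISITES_NOT_MET"
    EXTERNAL_PREREQUISITES_NOT_MET = "EXTERNAL_PREREQUISITES_NOT_MET"
    FILLED_IN = "FILLED_IN"
    AMENDED = "AMENDED"
```

Keeping each vocabulary as a closed enumeration makes it straightforward for implementations to reject values outside the controlled list.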

@ArthurChapman
Collaborator Author

We seem to have several different places where we have the Glossary definitions for the Framework. The one on GitHub (https://tdwg.github.io/bdq/tg1/site/glossary.html) doesn't include Likelihood, which we are using in the TG2 Tests and believe should be included. What document are you using, @chicoreus, as you have Likelihood mentioned above? We need to have one place where we are getting the definitions, for consistency. We also need these published sooner rather than later.

We find the definition of COMPLETENESS in the Glossary too specific, and I suggest something like "The extent to which data are present and are sufficiently comprehensive for use". As is, the definition can't be used for the Tests and we would need to redefine it - something I would prefer to avoid.

There are also a number of terms that are instances (i.e. forms of Non-Compliance) of the DQ Dimensions - I believe it would be good to include these in the Framework Glossary if possible. The ones we use in the Tests are:

  • Ambiguous
  • Incomplete
  • Inconsistent
  • Invalid
  • Unlikely
  • Amendment

See definitions in #152

@ArthurChapman
Collaborator Author

ArthurChapman commented Oct 1, 2018

BTW - in discussions within Tests and Assertions, we prefer Likeliness rather than Likelihood, and that is what we have used throughout the tests.

@chicoreus
Collaborator

@ArthurChapman COMPLETENESS is fundamental to measures, and the definition in the framework likely can't and shouldn't change. Adding the phrase "are sufficiently comprehensive for use" adds substantial ambiguity, and would make implementation of measures effectively untenable and indistinguishable from a subset of the validations.

@chicoreus
Collaborator

@ArthurChapman I am not using a document which contains definitions of the glossary terms. We do have the OWL representation of the framework, with definitions of framework concepts, at: https://github.com/kurator-org/kurator-ffdq/blob/master/competencyquestions/rdf/ffdq.owl

@ArthurChapman
Collaborator Author

@chicoreus The definition of COMPLETENESS as given in the Glossary is "Measure the extent to which every meaningful and necessary data are present and sufficient for use in a specific Use Case". This is not very clear and is not a good definition. My suggestion (based on definitions of data completeness found in an online search) does not, I believe, add any ambiguity that is not already there. Perhaps you can suggest some wording that is useful for both purposes. All the other definitions there are satisfactory, but I don't believe we can use this one for either purpose as is.

@ArthurChapman
Collaborator Author

@chicoreus the OWL link you mention doesn't include a definition of either Completeness or Likelihood.

@chicoreus
Collaborator

@ArthurChapman That's right, the OWL is primarily a representation of classes and relationships; it isn't a comprehensive treatment of all of the vocabularies of values in the framework.

@chicoreus
Collaborator

@ArthurChapman @Tasilee @allankv Yes, definitely an issue with multiple copies, in multiple versions. Need to converge.

If a definition of Completeness used for measures that return complete/incomplete includes "sufficient for use", then measures become hard to distinguish from validations. However, if that is the definition of Completeness as a data quality dimension, then there isn't a problem with either the definition in the glossary or the definition you propose. The issue is probably confusion between a Measure Result Value of COMPLETE and a Data Quality Dimension of Completeness.

@ArthurChapman
Collaborator Author

Thanks @chicoreus - that makes sense. As I understand it, and the way the Glossary is laid out by @allankv, the DQ Dimension of Completeness refers to more than just the MEASURES, and I think my suggested definition doesn't change the overall thrust of the definition in that way at all. I have just removed the reference to Use Case and changed "sufficient for use" to "sufficiently comprehensive" to come more into line with definitions used elsewhere.

@allankv
Collaborator

allankv commented Oct 7, 2018

Here is a first draft version of a controlled vocabulary for discussing and improving.

| #REF | Vocabulary | Term | Definition |
|------|------------|------|------------|
| 1 | DQ_DIMENSION | COMPLETENESS | The extent to which data are present and are sufficiently comprehensive for use. |
| 2 | DQ_DIMENSION | CONFORMANCE | Conforms to a format, syntax, type, range, standard, or to the nature of the information element itself. |
| 3 | DQ_DIMENSION | CONSISTENCY | Agreement among related information elements in the data. |
| 4 | DQ_DIMENSION | RELIABILITY | A measure of how well the data values agree with an identified source of truth; the degree to which data correctly describe the truth (an object, event, or any abstract or real "thing"). |
| 5 | DQ_DIMENSION | RESOLUTION | Refers to whether the data have sufficient detail; a measure of the granularity of the data; the smallest measurable increment. |
| 6 | DQ_DIMENSION | LIKELINESS | The probability of data having the expected value; the likelihood of data having true values rather than false values. |
| 7 | RESULT | COMPLIANT | Refers to data that were validated as compliant with a DQ Criterion. |
| 8 | RESULT | NOT_COMPLIANT | Refers to data that were validated as not compliant with a DQ Criterion. |
| 9 | RESULT | PROBLEM | Refers to data that have some specific DQ Problem. |
| 10 | RESULT | NOT_PROBLEM | Refers to data that do not have some specific DQ Problem. |
| 11 | RESULT_STATUS | RUN_HAS_RESULT | The result was correctly generated. |
| 12 | RESULT_STATUS | INTERNAL_PREREQUISITES_NOT_MET | A Response was not generated because an internal prerequisite was not met (e.g. a field required to run a test is missing or empty). |
| 13 | RESULT_STATUS | EXTERNAL_PREREQUISITES_NOT_MET | A Response was not generated because an external prerequisite was not met (e.g. a targeted source authority could not be found). |
| 14 | RESULT_STATUS | FILLED_IN | Data were altered by filling in value(s). |
| 15 | RESULT_STATUS | AMENDED | Data were amended by the modification or addition of a value or values following defined criteria. |

@ArthurChapman
Collaborator Author

Is there a need for PROBLEM and NOT_PROBLEM - aren't these just cases of COMPLIANT and NOT_COMPLIANT, or am I misreading something? We have used COMPLIANT and NOT_COMPLIANT within the Tests.

@ArthurChapman
Collaborator Author

Should DATA_PREREQUISITES_NOT_MET be INTERNAL_PREREQUISITES_NOT_MET, for consistency?

@allankv
Collaborator

allankv commented Oct 14, 2018

> Is there a need for PROBLEM and NOT_PROBLEM - aren't these just cases of COMPLIANT and NOT_COMPLIANT or am I misreading something? We have used COMPLIANT and NOT_COMPLIANT within the Tests.

There is a need for PROBLEM and NOT_PROBLEM when we have the concept of DQ Problem (the opposite sense of Data Quality).

> Should DATA_PREREQUISITES_NOT_MET be INTERNAL_PREREQUISITES_NOT_MET? for consistency?

Maybe, I think INTERNAL_PREREQUISITES could be better than DATA_PREREQUISITES in this context.
What do you think @chicoreus?

chicoreus added a commit to kurator-org/kurator-ffdq that referenced this issue Oct 19, 2018
…DESCRIPTION: Adding Dimensions Conformance, Consistency, Likelyhood, Resolution (used in BDQ TG2 test descriptions) to the ffdq model.
@ArthurChapman
Collaborator Author

Where are we with this, and are we likely to get it published before too long, @allankv? We need to be able to refer to these in the Tests and not have to duplicate the definitions. Everyone, please comment on the above if necessary - or, if happy, give a thumbs up to @allankv's post above. Are there any terms, @Tasilee, that you see are missing? Any non-framework terms that are missing can be added to the TG2 Vocabulary at #152.

@ArthurChapman
Collaborator Author

I am not sure we are using the concept of DQ Problem - I think we have written most of the tests to get around that, so we now have just COMPLIANT or NOT_COMPLIANT. I think PROBLEM and NOT_PROBLEM can be deleted.

@ArthurChapman
Collaborator Author

Looking at these again, and at the definitions above:
PROBLEM is a synonym of NOT_COMPLIANT, and NOT_PROBLEM is a synonym of COMPLIANT. Perhaps they could be defined as synonyms under COMPLIANT and NOT_COMPLIANT.

@chicoreus
Collaborator

@ArthurChapman the distinction that @allankv makes between DQ Criterion and DQ Problem is key. PROBLEM and NOT_COMPLIANT have similar meanings, but in distinctly different contexts.

@ArthurChapman
Collaborator Author

@chicoreus - then the definitions need to refer to the context; the definitions in this issue (#164) don't.

@chicoreus
Collaborator

The context of PROBLEM/NOT_PROBLEM/POSSIBLE_PROBLEM is a DQ Problem (think Issue). The context of COMPLIANT and NOT_COMPLIANT is a DQ Criterion (ValidationPolicy) (think Validation). There are also COMPLETE/NOT_COMPLETE as Response.result values for Measures (DQ MeasurementPolicy).

@ArthurChapman
Collaborator Author

Thanks @chicoreus. I am not sure that they all fit together under RESULT then - it seems you need another term (ISSUE) at the level of RESULT in the above table of definitions. Now explain to me what a POSSIBLE_PROBLEM is, please (with an example). I have been thinking on this for several days, and for all intents and purposes, if something was identified as a POSSIBLE_PROBLEM (or ISSUE etc.), then wouldn't it be a PROBLEM that needed to be identified as such? I think somewhere you also used the term POTENTIAL_PROBLEM (it may have been in error), but I can't think of an example that wouldn't already be identified as a PROBLEM or ISSUE. That would sound like a temporal issue: it is NOT_PROBLEM now when I look or run a test, etc., but the next time I run it, it is a PROBLEM. So at any one time it is either a PROBLEM or it is NOT a PROBLEM.

@chicoreus
Collaborator

@ArthurChapman There are 4 test types in the framework (ignoring the formal differences in terminology between the DQ Needs layer and the DQ Report layer): Validation, Measure, Amendment, and Issue. Issue was @allankv's addition to handle tests phrased in the negative sense; an Issue is a Validation with its specification phrased to find things that fail rather than things that pass.

A data quality Report, at the DQ Report layer, consists of a set of what we seem to have settled on calling Responses. A Response consists of three parts, which we have been calling Response.status, Response.result, and Response.comment; there is also a proposal on the table for a fourth, optional element, something along the lines of Response.qualifier. At various times we have called the Response a result, and the Response.result a response.value or value, but we seem to be settling on Response.status, Response.result, and Response.comment. Formally, in the framework, Responses are typed, so that the response for a validation has a different set of allowed values for Response.status and Response.result than the response for an amendment.

Validation
Response.status (RUN_HAS_RESULT, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result (COMPLIANT, NOT_COMPLIANT)
Response.comment {any human readable explanation of the response}

Measure
Response.status (RUN_HAS_RESULT, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result ({a numeric value}, COMPLETE, NOT_COMPLETE)
Response.comment {any human readable explanation of the response}

Amendment
Response.status (AMENDED, NOT_AMENDED, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result {a list of key:value pairs specifying for which terms changes are proposed, and what the proposed new values are}
Response.comment {any human readable explanation of the response}

Issue
Response.status (RUN_HAS_RESULT, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result (PROBLEM, NOT_PROBLEM)
Response.comment {any human readable explanation of the response}
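The typed Responses described above can be sketched as follows (hypothetical Python; the names `ALLOWED` and `make_response` are invented for illustration and are not part of the framework or kurator-ffdq):

```python
# Per-test-type controlled vocabularies for Response.status and Response.result.
# A result set of None means the result is free-form (key:value pairs for Amendments).
ALLOWED = {
    "Validation": {
        "status": {"RUN_HAS_RESULT", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": {"COMPLIANT", "NOT_COMPLIANT"},
    },
    "Measure": {
        "status": {"RUN_HAS_RESULT", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": {"COMPLETE", "NOT_COMPLETE"},  # a numeric value is also allowed
    },
    "Amendment": {
        "status": {"AMENDED", "NOT_AMENDED", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": None,  # key:value pairs of proposed changes
    },
    "Issue": {
        "status": {"RUN_HAS_RESULT", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": {"PROBLEM", "NOT_PROBLEM"},
    },
}

def make_response(test_type, status, result, comment=""):
    """Build a Response, enforcing the per-type controlled vocabularies."""
    allowed = ALLOWED[test_type]
    if status not in allowed["status"]:
        raise ValueError(f"{status!r} is not a valid Response.status for a {test_type}")
    numeric_ok = test_type == "Measure" and isinstance(result, (int, float))
    if allowed["result"] is not None and not numeric_ok and result not in allowed["result"]:
        raise ValueError(f"{result!r} is not a valid Response.result for a {test_type}")
    return {"status": status, "result": result, "comment": comment}
```

For example, `make_response("Issue", "RUN_HAS_RESULT", "COMPLIANT")` would be rejected, since COMPLIANT belongs to the Validation vocabulary, not the Issue vocabulary.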

This gives the context for POSSIBLE_PROBLEM/POTENTIAL_PROBLEM: some third allowed value for Response.result for an Issue.

In the tests we have described "Notifications". There has seemed to be no place in the framework for these, and I haven't seen any way to fit them into the framework. POSSIBLE_PROBLEM gives us a way to fit them in: we phrase the Notifications as Issues, and phrase their specifications (the expected response in the GitHub issue tables) so as to return POSSIBLE_PROBLEM when the issue they are looking for (e.g. dwc:dataGeneralizations is not EMPTY) is found.

Under quality assurance, consumers of the data quality report could choose to filter out, or not to filter out, the SingleRecords that had a POSSIBLE_PROBLEM, understanding this as distinctly different from filtering out NOT_COMPLIANT records and PROBLEM records, where the presence of that Response.result explicitly excludes the data as unfit for their uses. POSSIBLE_PROBLEM gives a way to note things that may or may not be fit for use under CORE data quality needs.

Under quality control, POSSIBLE_PROBLEM flags a distinct set of potential problems for review; for example, in combination with other tests, a non-EMPTY dwc:dataGeneralizations may flag distinct sets of NOT_COMPLIANT issues in spatial data. This is what Notifications are intended for: flagging potential, but not certain, issues in the data. Adding some variant of POSSIBLE_PROBLEM and phrasing all of the Notifications as Issues seems a logical way to work the things we've been calling Notifications into the framework.
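The quality assurance filtering described here might look like the following sketch (hypothetical Python; the record and response shapes are invented for illustration):

```python
def quality_assure(records, responses, exclude_possible=False):
    """Keep records whose responses contain no PROBLEM or NOT_COMPLIANT result.

    POSSIBLE_PROBLEM is treated separately: the consumer chooses whether
    SingleRecords flagged this way are filtered out or retained.
    """
    reject = {"PROBLEM", "NOT_COMPLIANT"}
    if exclude_possible:
        reject = reject | {"POSSIBLE_PROBLEM"}
    kept = []
    for record_id, record in records.items():
        # Collect all Response.result values asserted for this record.
        results = {r["result"] for r in responses.get(record_id, [])}
        if not (results & reject):
            kept.append(record)
    return kept
```

The design point is that POSSIBLE_PROBLEM never forces a record out of the data under CORE needs; it only gives the consumer the option to exclude it.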

@chicoreus
Collaborator

For a specific example:

We could rephrase #72 from:

"72","13d5a10e-188e-40fd-a22c-dbaa87b91df2","NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY","Space, Time, Name","Record_level Terms","dwc:dataGeneralizations","","REPORT if dwc:dataGeneralizations is not EMPTY; otherwise NOT_REPORTED","#72 Notification SingleRecord Resolution: datageneralizations notempty","Notification","SingleRecord","Resolution"

To:

"72","13d5a10e-188e-40fd-a22c-dbaa87b91df2","ISSUE_DATAGENERALIZATIONS_NOTEMPTY","Space, Time, Name","Record_level Terms","dwc:dataGeneralizations","","POSSIBLE_PROBLEM if dwc:dataGeneralizations is not EMPTY; otherwise NOT_PROBLEM","#72 Issue SingleRecord Resolution: datageneralizations notempty","Issue","SingleRecord","Resolution"

@chicoreus
Collaborator

Formally, the top level class in the report is Assertion; Issue, Measure, Validation, and Amendment are subclasses of Assertion, and Assertions have properties. So we should probably call the thing we've been calling a Response an Assertion, with an Assertion having hasStatus, hasDatatypeValue, hasObjectPropertyValue, and hasComment properties.
