
TG1 - Develop a Vocabulary based on the Framework for a TDWG Data Standard #164

Open
ArthurChapman opened this issue Sep 4, 2018 · 25 comments

@ArthurChapman
Collaborator

Discussion under #152 has strongly suggested that a Vocabulary for the terms used in the Tests and Assertions should be developed separately from the Vocabulary being developed under the Framework on Data Quality. Relevant terms in the Tests Vocabulary will refer to the definitions used in the Framework Vocabulary. See the discussion under #152.

@ArthurChapman ArthurChapman changed the title TG1 - Develop a Vocabulary based on the Framework for TDWG Data Standard TG1 - Develop a Vocabulary based on the Framework for a TDWG Data Standard Sep 4, 2018
@ArthurChapman
Collaborator Author

Some of the definitions in the Framework Vocabulary will need re-evaluating in the light of TG2 use and circumscription.

@chicoreus chicoreus added the TG1 label Sep 6, 2018
@chicoreus
Collaborator

We've got at least three candidates for vocabularies:

  1. The data quality DIMENSION values (Completeness, Conformance, Reliability, Consistency, Likelihood, Resolution).
  2. The controlled vocabulary values for Response.Result in a Data Quality Report (COMPLETE, NOT_COMPLETE, COMPLIANT, NOT_COMPLIANT, PROBLEM, NOT_PROBLEM).
  3. The controlled vocabulary values for Response.Status in a Data Quality Report (e.g. RUN_HAS_RESULT, DATA_PREREQUISITES_NOT_MET, EXTERNAL_PREREQUISITES_NOT_MET, FILLED_IN, AMENDED). It is possible that some other values from TG2's analysis go here (e.g. AMBIGUOUS), but that may need further analysis.
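The three candidate vocabularies above could be captured as enumerations; here is a minimal sketch in hypothetical Python (the class names are illustrative, not part of any published framework API):

```python
from enum import Enum

class Dimension(Enum):
    """Candidate data quality DIMENSION vocabulary."""
    COMPLETENESS = "Completeness"
    CONFORMANCE = "Conformance"
    RELIABILITY = "Reliability"
    CONSISTENCY = "Consistency"
    LIKELIHOOD = "Likelihood"
    RESOLUTION = "Resolution"

class Result(Enum):
    """Candidate controlled vocabulary for Response.Result in a Data Quality Report."""
    COMPLETE = "COMPLETE"
    NOT_COMPLETE = "NOT_COMPLETE"
    COMPLIANT = "COMPLIANT"
    NOT_COMPLIANT = "NOT_COMPLIANT"
    PROBLEM = "PROBLEM"
    NOT_PROBLEM = "NOT_PROBLEM"

class Status(Enum):
    """Candidate controlled vocabulary for Response.Status in a Data Quality Report."""
    RUN_HAS_RESULT = "RUN_HAS_RESULT"
    DATA_PREREQUISITES_NOT_MET = "DATA_PREREQUISITES_NOT_MET"
    EXTERNAL_PREREQUISITES_NOT_MET = "EXTERNAL_PREREQUISITES_NOT_MET"
    FILLED_IN = "FILLED_IN"
    AMENDED = "AMENDED"
```

Keeping each vocabulary as a closed enumeration makes it straightforward for implementations to reject values outside the controlled list.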

@ArthurChapman
Collaborator Author

We seem to have several different places where we have the Glossary definitions for the Framework. The one on GitHub (https://tdwg.github.io/bdq/tg1/site/glossary.html) doesn't include Likelihood, which we are using in the TG2 Tests and believe should be included. What document are you using, @chicoreus, as you have Likelihood mentioned above? We need to have one place where we are getting the definitions, for consistency. We also need these published sooner rather than later.

We find the definition of COMPLETENESS in the Glossary too specific, and I suggest something like "The extent to which data are present and are sufficiently comprehensive for use". As is, the definition can't be used for the Tests and we would need to redefine it - something I would prefer to avoid.

There are also a number of terms that are instances (i.e. forms of Non-Compliance) of the DQ Dimensions - I believe it would be good to include these in the Framework Glossary if possible. The ones we use in the Tests are:

  • Ambiguous
  • Incomplete
  • Inconsistent
  • Invalid
  • Unlikely
  • Amendment

See definitions in #152

@ArthurChapman
Collaborator Author

ArthurChapman commented Oct 1, 2018

BTW - in discussions within Tests and Assertions, we prefer Likeliness rather than Likelihood, and that is what we have used throughout the tests.

@chicoreus
Collaborator

@ArthurChapman COMPLETENESS is fundamental to measures, and the definition in the framework likely can't and shouldn't change. Adding the phrase "are sufficiently comprehensive for use" adds substantial ambiguity, and would make implementation of measures effectively untenable and indistinguishable from a subset of the validations.

@chicoreus
Collaborator

@ArthurChapman I am not using a document which contains definitions of the glossary terms. We do have the OWL representation of the framework, with definitions of framework concepts, at: https://github.com/kurator-org/kurator-ffdq/blob/master/competencyquestions/rdf/ffdq.owl

@ArthurChapman
Collaborator Author

@chicoreus The definition of COMPLETENESS as given in the Glossary is "Measure the extent to which every meaningful and necessary data are present and sufficient for use in a specific Use Case". This is not very clear and is not a good definition. My suggestion (based on definitions of data completeness found in an online search) does not, I believe, add any ambiguity that is not already there. Perhaps you can suggest some wording that is useful for both purposes. All the other definitions there are satisfactory, but I don't believe we can use this one for either purpose as is.

@ArthurChapman
Collaborator Author

@chicoreus the OWL link you mention doesn't include a definition of either Completeness or Likelihood.

@chicoreus
Collaborator

@ArthurChapman That's right, the OWL is primarily a representation of classes and relationships; it isn't a comprehensive treatment of all of the vocabularies of values in the framework.

@chicoreus
Collaborator

@ArthurChapman @Tasilee @allankv Yes, definitely an issue with multiple copies, in multiple versions. Need to converge.

If a definition of Completeness used for measures that return complete/incomplete includes "sufficient for use", then measures become hard to distinguish from validations. However, if that is the definition of Completeness as a data quality dimension, then there isn't a problem with either the definition in the glossary or the definition you propose. The issue is probably confusion between a Measure Result Value of COMPLETE and a Data Quality Dimension of Completeness.

@ArthurChapman
Collaborator Author

Thanks @chicoreus - that makes sense. As I understand it, and the way the Glossary is laid out by @allankv, the DQ Dimension of Completeness refers to more than just the MEASURES, and I think my suggested definition doesn't change the overall thrust of the definition in that way at all. I have just removed the reference to Use Case and changed "sufficient for use" to "sufficiently comprehensive" to come more into line with definitions used elsewhere.

@allankv
Collaborator

allankv commented Oct 7, 2018

Here is a first draft version of a controlled vocabulary for discussing and improving.

| #REF | Vocabulary | Term | Definition |
|------|------------|------|------------|
| 1 | DQ_DIMENSION | COMPLETENESS | The extent to which data are present and are sufficiently comprehensive for use. |
| 2 | DQ_DIMENSION | CONFORMANCE | Conforms to a format, syntax, type, range, standard, or to the nature of the information element itself. |
| 3 | DQ_DIMENSION | CONSISTENCY | Agreement among related information elements in the data. |
| 4 | DQ_DIMENSION | RELIABILITY | A measure of how well the data values agree with an identified source of truth; the degree to which data correctly describe the truth (an object, event, or any abstract or real "thing"). |
| 5 | DQ_DIMENSION | RESOLUTION | Refers to whether the data have sufficient detail; a measure of the granularity of the data; the smallest measurable increment. |
| 6 | DQ_DIMENSION | LIKELINESS | The probability of data having the expected value; the likelihood of data having true values rather than false values. |
| 7 | RESULT | COMPLIANT | Refers to data that were validated as compliant with a DQ Criterion. |
| 8 | RESULT | NOT_COMPLIANT | Refers to data that were validated as not compliant with a DQ Criterion. |
| 9 | RESULT | PROBLEM | Refers to data that have some specific DQ Problem. |
| 10 | RESULT | NOT_PROBLEM | Refers to data that do not have some specific DQ Problem. |
| 11 | RESULT_STATUS | RUN_HAS_RESULT | The result was correctly generated. |
| 12 | RESULT_STATUS | INTERNAL_PREREQUISITES_NOT_MET | A Response was not generated because an internal prerequisite was not met (e.g. a field required to run a test is missing or empty). |
| 13 | RESULT_STATUS | EXTERNAL_PREREQUISITES_NOT_MET | A Response was not generated because an external prerequisite was not met (e.g. a targeted source authority could not be found). |
| 14 | RESULT_STATUS | FILLED_IN | Data were altered by filling in value(s). |
| 15 | RESULT_STATUS | AMENDED | Data were amended by the modification or addition of a value or values following defined criteria. |

@ArthurChapman
Collaborator Author

Is there a need for PROBLEM and NOT_PROBLEM - aren't these just cases of COMPLIANT and NOT_COMPLIANT, or am I misreading something? We have used COMPLIANT and NOT_COMPLIANT within the Tests.

@ArthurChapman
Collaborator Author

Should DATA_PREREQUISITES_NOT_MET be INTERNAL_PREREQUISITES_NOT_MET, for consistency?

@allankv
Collaborator

allankv commented Oct 14, 2018

> Is there a need for PROBLEM and NOT_PROBLEM - aren't these just cases of COMPLIANT and NOT_COMPLIANT or am I misreading something? We have used COMPLIANT and NOT_COMPLIANT within the Tests.

There is a need for PROBLEM and NOT_PROBLEM when we have the concept of DQ Problem (the opposite sense of Data Quality).

> Should DATA_PREREQUISITES_NOT_MET be INTERNAL_PREREQUISITES_NOT_MET? for consistency?

Maybe, I think INTERNAL_PREREQUISITES could be better than DATA_PREREQUISITES in this context.
What do you think @chicoreus?

chicoreus added a commit to kurator-org/kurator-ffdq that referenced this issue Oct 19, 2018
…DESCRIPTION: Adding Dimensions Conformance, Consistency, Likelyhood, Resolution (used in BDQ TG2 test descriptions) to the ffdq model.
@ArthurChapman
Collaborator Author

Where are we with this, and are we likely to get it published before too long, @allankv? We need to be able to refer to these in the Tests and not have to duplicate the definitions. Everyone, please comment on the above if necessary - or, if happy, give a thumbs up to @allankv's post above. Are there any terms, @Tasilee, that you see are missing? Any non-framework terms that are missing can be added to the TG2 Vocabulary at #152.

@ArthurChapman
Collaborator Author

I am not sure we are using the concept of DQ Problem - I think we have written most of the tests to get around that, so we now have just COMPLIANT or NOT_COMPLIANT. I think PROBLEM and NOT_PROBLEM can be deleted.

@ArthurChapman
Collaborator Author

Looking at these again, and at the definitions above:
PROBLEM is a synonym of NOT_COMPLIANT, and NOT_PROBLEM is a synonym of COMPLIANT. Perhaps they could be defined as synonyms under COMPLIANT and NOT_COMPLIANT.

@chicoreus
Collaborator

@ArthurChapman the distinction that @allankv makes between DQ Criterion and DQ Problem is key. PROBLEM and NOT_COMPLIANT have similar meanings, but in distinctly different contexts.

@ArthurChapman
Collaborator Author

@chicoreus - then the definitions need to refer to the context; the definitions in this issue (#164) don't.

@chicoreus
Collaborator

The context of PROBLEM/NOT_PROBLEM/POSSIBLE_PROBLEM is a DQ Problem (think Issue). The context of COMPLIANT and NOT_COMPLIANT is a DQ Criterion (ValidationPolicy) (think Validation). There are also COMPLETE/NOT_COMPLETE as Response.result values for Measures (DQ MeasurementPolicy).

@ArthurChapman
Collaborator Author

Thanks @chicoreus. I am not sure that they all fit together under RESULT then - it seems you need another term (ISSUE) at the level of RESULT in the above table of definitions. Now explain to me what a POSSIBLE_PROBLEM is, please (with an example). I have been thinking on this for several days, and for all intents and purposes, if something was identified as a POSSIBLE_PROBLEM (or ISSUE etc.), then wouldn't it be a PROBLEM that needed to be identified as such? I think somewhere you also used the term POTENTIAL_PROBLEM (it may have been in error), but I can't think of an example that wouldn't already be identified as a PROBLEM or ISSUE. That would sound like a temporal issue: it is NOT_PROBLEM now when I look or run a test, etc., but the next time I run it, it is a PROBLEM. So at any one time it is either a PROBLEM or it is NOT a PROBLEM.

@chicoreus
Collaborator

@ArthurChapman There are 4 test types in the framework (ignoring the formal differences in terminology between the DQ Needs layer and the DQ Report layer): Validation, Measure, Amendment, and Issue. Issue was @allankv's addition to handle tests phrased in the negative sense; an Issue is a Validation with its specification phrased to find things that fail rather than things that pass.

A data quality Report, at the DQ Report layer, consists of a set of what we seem to have settled on calling Responses. A Response consists of three parts, which we have been calling Response.status, Response.result, and Response.comment; there is also a proposal on the table for a fourth, optional element, something along the lines of Response.qualifier. At various times we have called the Response a result, and the Response.result a response.value or value, but we seem to be settling on Response.status, Response.result, and Response.comment. Formally, in the framework, Responses are typed, so that the response for a validation has a different set of allowed values for Response.status and Response.result than the response for an amendment.

Validation
Response.status (RUN_HAS_RESULT, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result (COMPLIANT, NOT_COMPLIANT)
Response.comment {any human readable explanation of the response}

Measure
Response.status (RUN_HAS_RESULT, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result ({a numeric value}, COMPLETE, NOT_COMPLETE)
Response.comment {any human readable explanation of the response}

Amendment
Response.status (AMENDED, NOT_AMENDED, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result {a list of key:value pairs specifying for which terms changes are proposed, and what the proposed new values are}
Response.comment {any human readable explanation of the response}

Issue
Response.status (RUN_HAS_RESULT, EXTERNAL_PREREQUISITES_NOT_MET, INTERNAL_PREREQUISITES_NOT_MET)
Response.result (PROBLEM, NOT_PROBLEM)
Response.comment {any human readable explanation of the response}
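The typed Responses described above can be sketched as follows (hypothetical Python; the names `ALLOWED` and `make_response` are invented for illustration and are not part of the framework or kurator-ffdq):

```python
# Per-test-type controlled vocabularies for Response.status and Response.result.
# A result set of None means the result is free-form (key:value pairs for Amendments).
ALLOWED = {
    "Validation": {
        "status": {"RUN_HAS_RESULT", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": {"COMPLIANT", "NOT_COMPLIANT"},
    },
    "Measure": {
        "status": {"RUN_HAS_RESULT", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": {"COMPLETE", "NOT_COMPLETE"},  # a numeric value is also allowed
    },
    "Amendment": {
        "status": {"AMENDED", "NOT_AMENDED", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": None,  # key:value pairs of proposed changes
    },
    "Issue": {
        "status": {"RUN_HAS_RESULT", "EXTERNAL_PREREQUISITES_NOT_MET",
                   "INTERNAL_PREREQUISITES_NOT_MET"},
        "result": {"PROBLEM", "NOT_PROBLEM"},
    },
}

def make_response(test_type, status, result, comment=""):
    """Build a Response, enforcing the per-type controlled vocabularies."""
    allowed = ALLOWED[test_type]
    if status not in allowed["status"]:
        raise ValueError(f"{status!r} is not a valid Response.status for a {test_type}")
    numeric_ok = test_type == "Measure" and isinstance(result, (int, float))
    if allowed["result"] is not None and not numeric_ok and result not in allowed["result"]:
        raise ValueError(f"{result!r} is not a valid Response.result for a {test_type}")
    return {"status": status, "result": result, "comment": comment}
```

For example, `make_response("Issue", "RUN_HAS_RESULT", "COMPLIANT")` would be rejected, since COMPLIANT belongs to the Validation vocabulary, not the Issue vocabulary.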

This gives the context for POSSIBLE_PROBLEM/POTENTIAL_PROBLEM: some third allowed value for Response.result for an Issue.

In the tests we have described "Notifications". There has seemed to be no place in the framework for these, and I haven't seen any way to fit them into the framework. POSSIBLE_PROBLEM gives us a way to fit them in: we phrase the Notifications as Issues, and phrase their specifications (the expected response in the GitHub issue tables) so as to return POSSIBLE_PROBLEM when the issue they are looking for (e.g. dwc:dataGeneralizations is not EMPTY) is found.

Under quality assurance, consumers of the data quality report could choose to filter out, or not to filter out, the SingleRecords that had a POSSIBLE_PROBLEM, understanding this as distinctly different from filtering out NOT_COMPLIANT records and PROBLEM records, where the presence of that Response.result explicitly excludes the data as unfit for their uses. POSSIBLE_PROBLEM gives a way to note things that may or may not be fit for use under CORE data quality needs.

Under quality control, POSSIBLE_PROBLEM flags a distinct set of potential problems for review; for example, in combination with other tests, a non-EMPTY dwc:dataGeneralizations may flag distinct sets of NOT_COMPLIANT issues in spatial data. This is what Notifications are intended for: flagging potential, but not certain, issues in the data. Adding some variant of POSSIBLE_PROBLEM and phrasing all of the Notifications as Issues seems a logical way to work the things we've been calling Notifications into the framework.
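The quality assurance filtering described here might look like the following sketch (hypothetical Python; the record and response shapes are invented for illustration):

```python
def quality_assure(records, responses, exclude_possible=False):
    """Keep records whose responses contain no PROBLEM or NOT_COMPLIANT result.

    POSSIBLE_PROBLEM is treated separately: the consumer chooses whether
    SingleRecords flagged this way are filtered out or retained.
    """
    reject = {"PROBLEM", "NOT_COMPLIANT"}
    if exclude_possible:
        reject = reject | {"POSSIBLE_PROBLEM"}
    kept = []
    for record_id, record in records.items():
        # Collect all Response.result values asserted for this record.
        results = {r["result"] for r in responses.get(record_id, [])}
        if not (results & reject):
            kept.append(record)
    return kept
```

The design point is that POSSIBLE_PROBLEM never forces a record out of the data under CORE needs; it only gives the consumer the option to exclude it.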

@chicoreus
Collaborator

For a specific example:

We could rephrase #72 from:

"72","13d5a10e-188e-40fd-a22c-dbaa87b91df2","NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY","Space, Time, Name","Record_level Terms","dwc:dataGeneralizations","","REPORT if dwc:dataGeneralizations is not EMPTY; otherwise NOT_REPORTED","#72 Notification SingleRecord Resolution: datageneralizations notempty","Notification","SingleRecord","Resolution"

To:

"72","13d5a10e-188e-40fd-a22c-dbaa87b91df2","ISSUE_DATAGENERALIZATIONS_NOTEMPTY","Space, Time, Name","Record_level Terms","dwc:dataGeneralizations","","POSSIBLE_PROBLEM if dwc:dataGeneralizations is not EMPTY; otherwise NOT_PROBLEM","#72 Issue SingleRecord Resolution: datageneralizations notempty","Issue","SingleRecord","Resolution"

@chicoreus
Collaborator

Formally, the top level class in the report is Assertion; Issue, Measure, Validation, and Amendment are subclasses of Assertion, and Assertions have properties. So we should probably call the thing we've been calling a Response an Assertion, with an Assertion having hasStatus, hasDatatypeValue, hasObjectPropertyValue, and hasComment properties.
