TG1 - Develop a Vocabulary based on the Framework for a TDWG Data Standard #164
Some of the definitions in the Framework Vocabulary will need re-evaluating in the light of TG2 use and circumscription.
We've got at least three candidates for vocabularies:
We seem to have several different places where we have the Glossary definitions for the Framework. The one on GitHub (https://tdwg.github.io/bdq/tg1/site/glossary.html) doesn't include Likelihood, which we are using in the TG2 Tests and believe should be included. What document are you using, @chicoreus, as you have Likelihood mentioned above? We need one place where we get the definitions, for consistency, and we need these published sooner rather than later.

We find the definition of COMPLETENESS in the Glossary too specific, and I suggest something like "The extent to which data are present and are sufficiently comprehensive for use". As it stands, the definition can't be used for the Tests and we would need to redefine it - something I would prefer to avoid.

There are also a number of terms that are instances (i.e. forms of Non-Compliance) of the DQ Dimensions - I believe it would be good to include these in the Framework Glossary if possible. The ones we use in the Tests are
See definitions in #152
BTW - in discussion within Tests and Assertions, we prefer Likeliness rather than Likelihood, and that is what we have used throughout the tests.
@ArthurChapman COMPLETENESS is fundamental to measures, and the definition in the framework likely can't and shouldn't change. Adding the phrase "are sufficiently comprehensive for use" adds substantial ambiguity, and would make implementation of measures effectively untenable and indistinguishable from a subset of the validations.
@ArthurChapman I am not using a document which contains definitions of the glossary terms. We do have the OWL representation of the framework, with definitions of framework concepts, at: https://github.com/kurator-org/kurator-ffdq/blob/master/competencyquestions/rdf/ffdq.owl
@chicoreus The definition of COMPLETENESS as given in the Glossary is "Measure the extent to which every meaningful and necessary data are present and sufficient for use in a specific Use Case". This is not a clear or good definition. My suggestion (based on definitions of data completeness found in an online search) doesn't, I believe, add ambiguity that is not already there. Perhaps you can suggest some wording that is useful for both purposes. All the other definitions there are satisfactory, but I don't believe we can use this one for either purpose as is.
@chicoreus the OWL link you mention doesn't include a definition of either Completeness or Likelihood.
@ArthurChapman That's right, the OWL is primarily a representation of classes and relationships; it isn't a comprehensive treatment of all of the vocabularies of values in the framework.
@ArthurChapman @Tasilee @allankv Yes, there is definitely an issue with multiple copies, in multiple versions. We need to converge. If a definition of Completeness used for measures that return complete/incomplete includes "sufficient for use", then measures become hard to distinguish from validations. However, if that is a definition of Completeness as a data quality dimension, then there isn't a problem with either the definition in the glossary or the definition you propose. The issue is probably confusion between a Measure Result Value of COMPLETE and a Data Quality Dimension of Completeness.
Thanks @chicoreus - that makes sense. As I understand it, and the way the Glossary is laid out by @allankv, the DQ Dimension of Completeness refers to more than just the MEASURES, and I think my suggested new definition doesn't change the overall thrust of the definition in that way at all. I have just removed the reference to Use Case and changed "sufficient for use" to "sufficiently comprehensive" to bring it more into line with definitions used elsewhere.
Here is a first draft version of a controlled vocabulary, for discussion and improvement.
Is there a need for PROBLEM and NOT_PROBLEM - aren't these just cases of COMPLIANT and NOT_COMPLIANT, or am I misreading something? We have used COMPLIANT and NOT_COMPLIANT within the Tests.
Should DATA_PREREQUISITES_NOT_MET be INTERNAL_PREREQUISITES_NOT_MET, for consistency?
There is a need for PROBLEM and NOT_PROBLEM when we have the concept of DQ Problem (the opposite sense of Data Quality).
Maybe; I think INTERNAL_PREREQUISITES could be better than DATA_PREREQUISITES in this context.
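To make the naming question above concrete, the Response.status values under discussion can be sketched as a small enumeration. This is an illustrative sketch, not the official vocabulary: INTERNAL_PREREQUISITES_NOT_MET follows the rename suggested above, while EXTERNAL_PREREQUISITES_NOT_MET and RUN_HAS_RESULT are assumed companion values for illustration.

```python
from enum import Enum

class ResponseStatus(Enum):
    """Illustrative sketch of draft Response.status values (not the
    official controlled vocabulary)."""
    # Assumed value: the test ran and produced a Response.result.
    RUN_HAS_RESULT = "RUN_HAS_RESULT"
    # The rename suggested above, replacing DATA_PREREQUISITES_NOT_MET:
    # prerequisites internal to the record were not met.
    INTERNAL_PREREQUISITES_NOT_MET = "INTERNAL_PREREQUISITES_NOT_MET"
    # Assumed value: prerequisites external to the record were not met.
    EXTERNAL_PREREQUISITES_NOT_MET = "EXTERNAL_PREREQUISITES_NOT_MET"
```

An enumeration keeps the vocabulary closed: any status string outside the agreed list fails fast instead of silently propagating through a data quality report.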
…DESCRIPTION: Adding Dimensions Conformance, Consistency, Likelyhood, Resolution (used in BDQ TG2 test descriptions) to the ffdq model.
Where are we with this, and are we likely to get it published before too long, @allankv? We need to be able to refer to these in the Tests and not have to duplicate the definitions. Everyone please comment on the above if necessary - or, if happy, give a thumbs up to @allankv's post above. Are there any terms you see missing, @Tasilee? Any non-framework terms that are missing can be added to the TG2 Vocabulary at #152.
I am not sure we are using the concept of DQ Problem - I think we have written most of the tests to get around that, so we now have just COMPLIANT or NOT_COMPLIANT. I think PROBLEM and NOT_PROBLEM can be deleted.
Looking at these again and at the definitions above
@ArthurChapman the distinction that @allankv makes between DQ Criterion and DQ Problem is key. PROBLEM and NOT_COMPLIANT have similar meanings, but in distinctly different contexts.
@chicoreus - then the definitions need to refer to the context; the definitions in this issue (#164) don't.
The context of PROBLEM/NOT_PROBLEM/POSSIBLE_PROBLEM is DQ Problem (think Issue). The context of COMPLIANT and NOT_COMPLIANT is DQ Criterion (ValidationPolicy) (think Validation). There are also COMPLETE/NOT_COMPLETE as Response.result values for Measures, DQ MeasurementPolicy.
Thanks @chicoreus. I am not sure that they all fit together under RESULT then - it seems you need another term (ISSUE) at the level of RESULT in the above table of definitions. Now explain to me what a POSSIBLE_PROBLEM is, please (with an example). I have been thinking on this for several days, and for all intents and purposes, if something was identified as a POSSIBLE_PROBLEM (or ISSUE etc.) then wouldn't it be a PROBLEM that needed to be identified as such? I think somewhere you also used the term POTENTIAL_PROBLEM (may have been in error), but I can't think of an example that wouldn't already be identified as a PROBLEM or ISSUE. That would sound like a temporal issue: it is NOT_PROBLEM now when I look or run a test, etc., but next time I run it, it is a PROBLEM. So at any one time it is either a PROBLEM or it is NOT a PROBLEM.
@ArthurChapman There are 4 test types in the framework (ignoring the formal differences in terminology between the DQ Needs layer and the DQ Report layer): Validation, Measure, Amendment, and Issue. Issue was @allankv's addition to handle tests phrased in the negative sense; it is Validations with specifications phrased to find things that fail rather than things that pass.

A data quality Report, at the DQ Report layer, consists of a set of what we seem to have settled on calling Responses. A Response consists of three parts, as we've been calling them: Response.status, Response.result, and Response.comment; we've got a proposal on the table for a 4th optional element, something along the lines of Response.qualifier. At various times we have called the Response a result, and the Response.result response.value or value, but we seem to be settling on Response.status, Response.result, and Response.comment. Formally in the framework, Responses are typed, so that the Response for a Validation has a different set of allowed values for Response.status and Response.result than the Response for a Measure, an Amendment, or an Issue.

This gives the context for POSSIBLE_PROBLEM/POTENTIAL_PROBLEM: some third allowed value of Response.result for an Issue. In the tests we have described "Notifications". There has seemed to be no place in the framework for these, and I haven't seen any way to fit them into the framework. POSSIBLE_PROBLEM gives us a way to fit these in. We phrase the Notifications as Issues, and phrase their specification (expected response in the GitHub issue tables) so as to return POSSIBLE_PROBLEM when the issue they are looking for (e.g. dwc:dataGeneralizations is not EMPTY) is found.
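The typed Responses described above can be sketched in a few lines. This is a hypothetical illustration, not the framework's implementation: the per-type allowed Response.result values for Validation, Measure, and Issue come from this discussion, while the Amendment values (AMENDED, NOT_AMENDED) are assumptions for completeness.

```python
from dataclasses import dataclass

# Illustrative mapping of test type to allowed Response.result values.
# Amendment values are assumed; the others appear in this discussion.
ALLOWED_RESULTS = {
    "Validation": {"COMPLIANT", "NOT_COMPLIANT"},
    "Measure": {"COMPLETE", "NOT_COMPLETE"},
    "Amendment": {"AMENDED", "NOT_AMENDED"},  # assumed names
    "Issue": {"PROBLEM", "NOT_PROBLEM", "POSSIBLE_PROBLEM"},
}

@dataclass
class Response:
    """A Response in the DQ Report layer: status, result, comment."""
    test_type: str  # "Validation", "Measure", "Amendment", or "Issue"
    status: str     # e.g. a has-result or prerequisites-not-met status
    result: str     # must be drawn from the vocabulary for test_type
    comment: str

    def __post_init__(self):
        # Responses are typed: reject a result value that is not
        # in the allowed vocabulary for this test type.
        if self.result and self.result not in ALLOWED_RESULTS[self.test_type]:
            raise ValueError(
                f"{self.result!r} is not a valid result for {self.test_type}")
```

So a POSSIBLE_PROBLEM result is constructible only on an Issue; attempting to attach it to a Validation raises an error, which captures the "distinctly different contexts" point above.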
Under quality assurance, consumers of the data quality report could choose to filter out, or choose not to filter out, the SingleRecords that had a POSSIBLE_PROBLEM, understanding this as distinctly different from filtering out NOT_COMPLIANT records and PROBLEM records, where the presence of that Response.result explicitly excludes the data as unfit for their uses. POSSIBLE_PROBLEM gives a way to note things that may or may not be fit for their uses under CORE data quality needs. Under quality control, POSSIBLE_PROBLEM flags a distinct set of potential problems for review; in combination with other tests, dataGeneralizations not EMPTY may flag distinct sets of NOT_COMPLIANT issues in spatial data. This is what Notifications are intended for: flagging potential, but not certain, issues in the data. Adding some variant of POSSIBLE_PROBLEM and phrasing all of the Notifications as Issues seems a logical way to work the things we've been calling Notifications into the framework.
For a specific example, we could rephrase #72 from:

"72","13d5a10e-188e-40fd-a22c-dbaa87b91df2","NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY","Space, Time, Name","Record_level Terms","dwc:dataGeneralizations","","REPORT if dwc:dataGeneralizations is not EMPTY; otherwise NOT_REPORTED","#72 Notification SingleRecord Resolution: datageneralizations notempty","Notification","SingleRecord","Resolution"

to:

"72","13d5a10e-188e-40fd-a22c-dbaa87b91df2","ISSUE_DATAGENERALIZATIONS_NOTEMPTY","Space, Time, Name","Record_level Terms","dwc:dataGeneralizations","","POSSIBLE_PROBLEM if dwc:dataGeneralizations is not EMPTY; otherwise NOT_PROBLEM","#72 Issue SingleRecord Resolution: datageneralizations notempty","Issue","SingleRecord","Resolution"
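The rephrased specification for #72 is simple enough to express directly. A minimal sketch, assuming EMPTY means null or whitespace-only (the function name here mirrors the proposed test label and is illustrative, not an implementation from the framework):

```python
def issue_datageneralizations_notempty(data_generalizations):
    """Sketch of rephrased test #72 (ISSUE_DATAGENERALIZATIONS_NOTEMPTY):
    return POSSIBLE_PROBLEM if dwc:dataGeneralizations is not EMPTY,
    otherwise NOT_PROBLEM."""
    # Assumption: EMPTY covers a missing value or a whitespace-only string.
    if data_generalizations is None or data_generalizations.strip() == "":
        return "NOT_PROBLEM"
    return "POSSIBLE_PROBLEM"
```

Note that the expected response never returns PROBLEM: a populated dwc:dataGeneralizations only signals that the record may be unfit for some uses, which is exactly the POSSIBLE_PROBLEM semantics described above.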
Formally, the top-level class in the report is Assertion; Issue, Measure, Validation, and Amendment are subclasses of Assertion, and Assertions have properties. So we should probably call the thing we've been calling Response an Assertion, with an Assertion having hasStatus, hasDatatypeValue, hasObjectPropertyValue, and hasComment properties.
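The proposed Assertion hierarchy can be sketched as a class diagram in miniature. This is only an illustration of the naming proposed above (property names rendered in snake_case); the actual framework classes live in the FFDQ OWL model:

```python
class Assertion:
    """Top-level class in the report, per the naming proposed above."""
    def __init__(self, has_status, has_comment,
                 has_datatype_value=None, has_object_property_value=None):
        self.has_status = has_status                # hasStatus
        self.has_comment = has_comment              # hasComment
        self.has_datatype_value = has_datatype_value            # hasDatatypeValue
        self.has_object_property_value = has_object_property_value  # hasObjectPropertyValue

# The four test types as subclasses of Assertion.
class Validation(Assertion): pass
class Measure(Assertion): pass
class Amendment(Assertion): pass
class Issue(Assertion): pass
```

Splitting hasDatatypeValue from hasObjectPropertyValue mirrors the OWL distinction between datatype and object properties: a Measure's numeric result is a literal, while an Amendment's proposed change could point at another resource.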
Discussion under #152 has strongly suggested that a separate Vocabulary should be developed for the terms used in the Tests and Assertions, distinct from the Vocabulary being developed under the Framework on Data Quality. Relevant terms in the Tests Vocabulary will refer to the definitions used in the Framework Vocabulary. See discussion under #152.