Whitespace in authority #978
Comments
Sorry, I forgot to say that GBIF did enforce a uniform authorship style. @dhobern maybe this is something for the taxonomy group to discuss?
Thanks for pointing that out. I'll fix it in the new World Ferns/World Plants pipeline that we are working on.
It is a mess and not part of what (at least for ICZN) the code addresses. How easy would it be to canonicalise this in the creation of the COL product? I would expect it to get very messy indeed, particularly with names that may include multiple parts to the "family" name.
I think that is doable. Let me explain the exact situation a little. In addition to this parsed model, the Name class also keeps the entire authorship as a single String, which can contradict the information that was actually parsed (or provided in structured form by ColDP). Initially I developed the system to avoid that: if the name can be parsed, trust only the parsed version and reconstruct the full authorship from it. This leads to changes to whitespace and punctuation. We could rather easily go back and reconstruct the authorship in a canonical form based on the parsed version, at least when we sync the data into the project, or even already when we import the sources into CLB. We already did this in the GBIF backbone, and parsing has improved since. I might produce a simple overview listing all names currently in COL, with a new column showing what the canonical authorship would look like. That should help us judge whether such a change would be for better or worse.
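To make the idea of rebuilding the authorship from the parsed version concrete, here is a minimal Python sketch. It is not the CLB implementation: the `ParsedAuthorship` class, the `canonical_authorship` function, the `&` joiner and the `comma_before_year` flag are all assumptions made for illustration. The flag exists because the separator before the year can depend on the nomenclatural code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedAuthorship:
    """Hypothetical stand-in for a parsed authorship (not the real CLB model)."""
    authors: List[str] = field(default_factory=list)
    year: Optional[str] = None


def canonical_authorship(parsed: ParsedAuthorship, comma_before_year: bool = True) -> str:
    """Rebuild one authorship string from parsed parts.

    Assumed style: authors joined with ", " and a final " & ",
    then the year, optionally preceded by a comma.
    """
    # Collapse stray whitespace inside each author and drop empty entries.
    authors = [" ".join(a.split()) for a in parsed.authors if a.strip()]
    if len(authors) > 1:
        team = ", ".join(authors[:-1]) + " & " + authors[-1]
    else:
        team = "".join(authors)
    if parsed.year:
        if not team:
            return parsed.year
        separator = ", " if comma_before_year else " "
        return team + separator + parsed.year
    return team


example = ParsedAuthorship(authors=["F.Muell.", "  Benth. "], year="1863")
print(canonical_authorship(example))
print(canonical_authorship(example, comma_before_year=False))
```

Because the string is always derived from the parsed parts, the verbatim authorship and the parsed model can no longer disagree; the cost is that whitespace and punctuation in the source are not preserved.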
Here is the names.csv.zip result. Note that the name formatting in some parts depends on the code. E.g. the bacterial code does not use a comma between authors and the year. So I have dumped all COL names with these columns:
whitespace seems to get fixed:
The only one I can see to go wrong so far is this:
Of the 5,292,279 COL names, the authorship of 716,235 would be different afterwards.
When performing a dataset comparison between COL25.2 and GBIF, we noticed that the treatment of whitespace in the taxonomic authority differs. COL25.2 is adding a whitespace character after the author's initials. For example:
Conventionally, there is no whitespace between initials and surname: F.Muell. is correct, not F. Muell., and C.K.Schneid. is correct, not C. K. Schneid.
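As an illustration of the convention, a small Python sketch could strip the space after an initial. The function name `strip_space_after_initials` and the rule it applies are assumptions made here, not anything taken from COL or GBIF code: an initial is taken to be a single ASCII capital letter followed by a period, and the whitespace after it is dropped only when another capital letter follows.

```python
import re

# Assumed rule: a single capital letter at a word boundary, followed by a period,
# then whitespace, then another capital letter. Only the whitespace is removed.
_SPACE_AFTER_INITIAL = re.compile(r"\b([A-Z]\.)\s+(?=[A-Z])")


def strip_space_after_initials(authorship: str) -> str:
    """Return the authorship with whitespace after initials removed."""
    return _SPACE_AFTER_INITIAL.sub(r"\1", authorship)


print(strip_space_after_initials("F. Muell."))
print(strip_space_after_initials("C. K. Schneid."))
```

Abbreviations longer than one letter, such as "DC.", are left alone, as is a space before a lowercase token or "&". The sketch does not handle non-ASCII capitals, so it is a starting point rather than a complete rule.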
We thought we would find a reference to this in the International Code of Nomenclature for algae, fungi, and plants, but have been unable to.
It is referred to in https://www.stylemanual.gov.au/grammar-punctuation-and-conventions/names-and-terms/plants-and-animals:
Are there plans for COL to remove the additional whitespace?