Whitespace in authority #978

Open
nigelcharman opened this issue Mar 12, 2025 · 9 comments

@nigelcharman

When performing a dataset comparison between COL25.2 and GBIF, we noticed that the treatment of whitespace in the taxonomic authority differs. COL25.2 is adding a whitespace character after the author's initials. For example:

Image

Conventionally, there is no whitespace between initials and surname, e.g. F.Muell. is correct, not F. Muell., and C.K.Schneid. is correct, not C. K. Schneid.

We expected to find a reference to this in the International Code of Nomenclature for algae, fungi, and plants but have been unable to.

It is referred to in https://www.stylemanual.gov.au/grammar-punctuation-and-conventions/names-and-terms/plants-and-animals:

Image

Are there plans for COL to remove the additional whitespace?

@mdoering
Member

COL does not enforce a uniform authorship standard. We use whatever the sources use for their parts in COL.
Personally, I wish we enforced a standard format to increase consistency.

For example the fern Abrodictyum cumingii C. Presl
is given to us by World Ferns with a space in C.<space>Presl:
https://www.checklistbank.org/dataset/1140/taxon/Hymenophyllales-Hymenophyllaceae-Trichomanoideae-Abrodictyum-cumingii-C.%20Presl

@gdower although this might be an artifact of the ColDP converter? The original website does not have that space!
https://www.worldplants.de/world-plants-complete-list/complete-plant-list/?name=Abrodictyum-cumingii#plantUid-2007

Neither does IPNI: https://ipni.org/n/17000070-1
https://www.checklistbank.org/dataset/2006/taxon/17000070-1

WCVP does not use space either:
https://www.checklistbank.org/dataset/308133/taxon/64KXD

WFO does not use space:
https://www.checklistbank.org/dataset/308133/taxon/63Z6T

ITIS does use space:
https://www.checklistbank.org/dataset/308133/taxon/TB2X
https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=1004013#null

Geranium uses space:
https://www.checklistbank.org/dataset/308133/taxon/6KCD7

Fossil Ginkgos don't seem to use any initials but use the full surname:
https://www.checklistbank.org/dataset/308133/names?nomCode=botanical&rank=species&sectorDatasetKey=1201&status=accepted

Radiolaria seems to use the zoological style, although the names are considered to belong to the botanical code:
https://www.checklistbank.org/dataset/308133/names?nomCode=botanical&rank=species&sectorDatasetKey=1109&status=accepted
@yroskov we might want to make these zoological records to reflect the authorship style? It is an ambiregnal group.
I have opened an issue: #977

@mdoering mdoering transferred this issue from CatalogueOfLife/checklistbank Mar 12, 2025
@mdoering
Member

mdoering commented Mar 12, 2025

Sorry, forgot to say that GBIF did enforce a uniform authorship style. @dhobern maybe this is something for the taxonomy group to discuss?

@gdower
Collaborator

gdower commented Mar 12, 2025

Thanks for pointing that out. I'll fix it in the new World Ferns/World Plants pipeline that we are working on.

@dhobern

dhobern commented Mar 13, 2025

It is a mess and not part of what (at least for ICZN) the code addresses. How easy would it be to canonicalise this in the creation of the COL product? I would expect it to get very messy indeed, particularly with names that may include multiple parts to the "family" name.

@mdoering
Member

I think that is doable. Let me explain the exact situation a little.
The Name data model captures parsed authorships rather well. We split the authorship into basionym and combination authorships, nomenclatural notes (e.g. nom.illeg.), taxon concept notes (e.g. sensu XYZ), and any remaining unparsable "name phrase". Each of the two Authorship instances is in turn a list of author strings, a list of ex-author strings, and a year (string). The individual authors are kept as strings only, but we have routines to parse them further in most cases, though definitely not all.
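
As a rough illustration of the structure described above, here is a minimal Python sketch. The class and field names (`Authorship`, `ParsedName`, `verbatim_authorship`, and so on) are hypothetical and are not taken from the CLB code base; only the overall shape follows the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Authorship:
    """One author team: authors, optional ex authors, and a year kept as a string."""
    authors: List[str] = field(default_factory=list)
    ex_authors: List[str] = field(default_factory=list)
    year: Optional[str] = None


@dataclass
class ParsedName:
    """Hypothetical container for the parts listed above (not the actual CLB class)."""
    scientific_name: str
    combination_authorship: Authorship = field(default_factory=Authorship)
    basionym_authorship: Authorship = field(default_factory=Authorship)
    nomenclatural_note: Optional[str] = None   # e.g. "nom.illeg."
    taxonomic_note: Optional[str] = None       # e.g. "sensu XYZ"
    unparsed: Optional[str] = None             # leftover "name phrase"
    verbatim_authorship: Optional[str] = None  # authorship string as given by the source


# Example using a name quoted later in this thread; author strings are stored as given.
name = ParsedName(
    scientific_name="Orumbella macounii",
    combination_authorship=Authorship(authors=["J. M. Coult.", "Rose"]),
    basionym_authorship=Authorship(authors=["J. M. Coult.", "Rose"]),
    verbatim_authorship="(J. M. Coult. & Rose) J. M. Coult. & Rose",
)
```

Keeping the verbatim string next to the parsed parts is what allows the two to disagree, which is the situation the next paragraph describes.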

In addition to this parsed model, the Name class also keeps the entire authorship as a single string, which can contradict the parsed (or structurally provided by ColDP) information. Initially I developed the system to avoid that: if the name could be parsed, trust only the parsed version and reconstruct the full authorship from it. This leads to changes in whitespace, punctuation, "&" vs "et", and some normalisation of the notes included in the authorships. @yroskov did not want to see any changes to the authorship string as provided by the source, so we modified CLB to strictly keep the original string in parallel with a parsed version that might differ slightly.

We could rather easily go back and reconstruct the authorship in a canonical form based on the parsed version, at least when we sync the data into the project, or even earlier, when we import the sources into CLB. We already did this in the GBIF backbone, and parsing has improved since.
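
To make "reconstruct the authorship in a canonical form" more concrete, here is a small Python sketch that rebuilds a string from already-parsed author teams. The function names `format_team` and `rebuild_authorship` and the `comma_before_year` flag are assumptions made for illustration; this is not the CLB or GBIF implementation, and it leaves out the notes and name phrases mentioned above.

```python
from typing import List, Optional


def format_team(authors: List[str], ex_authors: Optional[List[str]] = None,
                year: Optional[str] = None, comma_before_year: bool = True) -> str:
    """Format one author team as '[ex authors ex ]A, B & C[, year]'."""
    def join(names: List[str]) -> str:
        names = [n.strip() for n in names if n and n.strip()]
        if len(names) <= 1:
            return "".join(names)
        return ", ".join(names[:-1]) + " & " + names[-1]

    text = join(authors)
    if ex_authors:
        text = f"{join(ex_authors)} ex {text}"
    if year:
        # Whether a comma precedes the year depends on the nomenclatural code.
        sep = ", " if comma_before_year else " "
        text = f"{text}{sep}{year}" if text else year
    return text


def rebuild_authorship(combination: str, basionym: str = "") -> str:
    """Combine pre-formatted teams: basionym team in parentheses, then combination team."""
    parts = []
    if basionym:
        parts.append(f"({basionym})")
    if combination:
        parts.append(combination)
    return " ".join(parts)


canonical = rebuild_authorship(
    combination=format_team(["Z.M.Tan", "X.Liang Zhang"]),
    basionym=format_team(["Pamp."]),
)
print(canonical)
```

Because the output is assembled only from the parsed parts, differences in spacing or in "&" vs "et" in the source string disappear, which is both the benefit and the risk being discussed here.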

I might produce a simple overview listing all names currently in COL, with a new column showing what the canonical authorship would look like. That should help us judge whether such a change would be for better or worse.

@mdoering
Member

Here is the names.csv.zip result. Note that the name formatting in some parts depends on the code. E.g. the bacterial code does not use a comma between authors and the year. So I have dumped all COL names with these columns:

  • name type
  • isParsed
  • code
  • rank
  • scientificName
  • authorship
  • rebuildAuthorship

@mdoering
Member

mdoering commented Mar 13, 2025

whitespace seems to get fixed:

SCIENTIFIC,true,BOTANICAL,SPECIES,Orumbella macounii,(J. M. Coult. & Rose) J. M. Coult. & Rose,(J.M.Coult. & Rose) J.M.Coult. & Rose

(Pamp.) Z. M. Tan & X. Liang Zhang
vs
(Pamp.) Z.M.Tan & X.Liang Zhang
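
For readers who want to reproduce this kind of whitespace change on their own data, a regular expression gets close. This is only a sketch, assuming ASCII initials; the function name `collapse_initials` is made up here, and the rebuilt strings above come from the parsed model rather than from a regex like this.

```python
import re

# A single capital-letter initial, a period, whitespace, then another capital letter.
# The lookahead keeps the following capital available for the next match.
_INITIAL_GAP = re.compile(r"\b([A-Z])\.\s+(?=[A-Z])")


def collapse_initials(authorship: str) -> str:
    """Remove whitespace after single-letter initials, e.g. 'F. Muell.' -> 'F.Muell.'."""
    return _INITIAL_GAP.sub(r"\1.", authorship)


for original in [
    "(J. M. Coult. & Rose) J. M. Coult. & Rose",
    "(Pamp.) Z. M. Tan & X. Liang Zhang",
]:
    print(original, "->", collapse_initials(original))
```

Abbreviated surnames such as "Coult." are left alone because the pattern only matches a single capital letter before the period. Initials with diacritics, or forms like "L. f.", are not handled by this sketch.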

@mdoering
Member

The only case I have seen go wrong so far is this:

SCIENTIFIC,true,BOTANICAL,SPECIES,Elphidium fijiense,"Hayward, 1997 in Hayward, Hollis & Grenfell, 1997","Hayward, 1997,1997"
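
One way to hunt for more cases like this in the dump would be a simple heuristic that flags rebuilt authorships where the same year appears twice in a row. The sketch below is an assumption about what might be useful, not something that exists in CLB; `has_doubled_year` is a hypothetical helper.

```python
import re

# The same four-digit year repeated with only a comma (and optional spaces) in between.
_DOUBLED_YEAR = re.compile(r"\b(\d{4})\s*,\s*\1\b")


def has_doubled_year(rebuilt: str) -> bool:
    """Return True if a rebuilt authorship contains a year repeated back to back."""
    return _DOUBLED_YEAR.search(rebuilt) is not None


print(has_doubled_year("Hayward, 1997,1997"))
print(has_doubled_year("Hayward, 1997 in Hayward, Hollis & Grenfell, 1997"))
```

This only catches the pattern seen in the row above; other kinds of parsing damage would need their own checks.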

@mdoering
Member

Of 5,292,279 COL names, the authorship of 716,235 (about 13.5%) would be different afterwards.
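
A count like this can be reproduced from the dump by comparing the `authorship` and `rebuildAuthorship` columns row by row. The Python sketch below runs on a tiny in-memory sample; the column identifiers, the absence of a header row, and the third sample row are assumptions made for illustration, while the first two rows are the ones quoted earlier in this thread.

```python
import csv
import io

# Column names follow the list given in the thread; the exact header spelling in
# names.csv is an assumption, so the names are passed explicitly.
COLUMNS = ["nameType", "isParsed", "code", "rank",
           "scientificName", "authorship", "rebuildAuthorship"]

# Rows 1 and 2 are quoted in the thread; row 3 is made up to show an unchanged name.
SAMPLE = '''SCIENTIFIC,true,BOTANICAL,SPECIES,Orumbella macounii,(J. M. Coult. & Rose) J. M. Coult. & Rose,(J.M.Coult. & Rose) J.M.Coult. & Rose
SCIENTIFIC,true,BOTANICAL,SPECIES,Elphidium fijiense,"Hayward, 1997 in Hayward, Hollis & Grenfell, 1997","Hayward, 1997,1997"
SCIENTIFIC,true,ZOOLOGICAL,SPECIES,Aus bus,"Smith, 1900","Smith, 1900"
'''


def count_changed(rows):
    """Return (total rows, rows whose rebuilt authorship differs from the original)."""
    total = changed = 0
    for row in rows:
        total += 1
        if row["authorship"] != row["rebuildAuthorship"]:
            changed += 1
    return total, changed


reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=COLUMNS)
total, changed = count_changed(reader)
print(f"{changed} of {total} authorships would change ({changed / total:.1%})")
```

For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle on the unzipped CSV, dropping `fieldnames` if the file turns out to have a header row.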
