
Add tsv-duplicates.py script #33


Draft: wants to merge 3 commits into base: main


Conversation

goodmami (Collaborator) commented Apr 6, 2023

This pull request adds a script for detecting potential duplicate lemmas in OMW .tab files. We can also use this PR to fix the duplicate issues.

There are 3 kinds of duplicates detected:

  • underscores, e.g., canard colvert and canard_colvert
  • case differences, e.g., Renaissance and renaissance
  • diacritics, e.g., nanoséconde and nanoseconde

For all of the above, duplicates are detected only by normalizing the lemma forms within a single synset. There may be duplicate synsets, but the script does not test for these. Here's an example of two synsets that each have redundant lemmas:

WARNING:tsv-duplicates:duplicate of 11640645-n: 'sequoia', 'séquoia'
WARNING:tsv-duplicates:duplicate of 11640898-n: 'sequoia', 'séquoia'
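
The three normalizations could be sketched roughly as follows. This is an illustration only, not the code in scripts/tsv-duplicates.py: the names `normalize` and `find_duplicates` are hypothetical, and stripping diacritics via NFD decomposition is an assumption about how such a check might work.

```python
import unicodedata
from collections import defaultdict


def normalize(lemma, ignore_case=True, underscore=True, diacritics=True):
    """Return a comparison key for a lemma (hypothetical helper)."""
    key = lemma
    if underscore:
        # Treat 'canard_colvert' and 'canard colvert' as the same form.
        key = key.replace("_", " ")
    if ignore_case:
        # Treat 'Renaissance' and 'renaissance' as the same form.
        key = key.casefold()
    if diacritics:
        # Decompose, then drop combining marks: 'séquoia' -> 'sequoia'.
        decomposed = unicodedata.normalize("NFD", key)
        key = "".join(c for c in decomposed if not unicodedata.combining(c))
    return key


def find_duplicates(lemmas, **options):
    """Group the lemmas of ONE synset by key; keep groups of two or more."""
    groups = defaultdict(list)
    for lemma in lemmas:
        forms = groups[normalize(lemma, **options)]
        if lemma not in forms:  # ignore exact repeats of the same string
            forms.append(lemma)
    return [forms for forms in groups.values() if len(forms) > 1]
```

Because the grouping is done per synset, calling `find_duplicates(["sequoia", "séquoia"])` reports one group containing both forms, while the same two forms in different synsets would never be compared with each other.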

You can run the script as follows:

$ python scripts/tsv-duplicates.py --ignore-case --underscore --diacritics wns/{arb,fra,msa}/*.tab
wn-data-arb.tab duplicates	1502 synsets	3013 lemmas
wn-nodia-arb.tab duplicates	176 synsets	355 lemmas
wn-data-fra.tab duplicates	3821 synsets	7702 lemmas
wn-data-ind.tab duplicates	466 synsets	934 lemmas
wn-data-zsm.tab duplicates	366 synsets	732 lemmas
total duplicates	6331 synsets	12736 lemmas

It takes a variable number of paths, so you can check one at a time or many at once. The --verbose option will print a warning for every duplicate it finds (best when only checking a single .tab file).
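
As a rough illustration of how per-file and total counts like those above might be produced, here is a minimal sketch. It assumes each data line has the tab-separated shape `synset<TAB>lang:lemma<TAB>form`, that lines starting with `#` are headers, and that "synsets" counts synsets with at least one duplicate group while "lemmas" counts the forms in those groups. The names `count_duplicates` and `report` are hypothetical, and the default key (case folding only) stands in for whatever normalization the actual script applies.

```python
from collections import defaultdict


def count_duplicates(lines, key=str.casefold):
    """Return (synsets, lemmas) involved in duplicates for one .tab file."""
    # synset id -> normalized key -> set of distinct surface forms
    by_synset = defaultdict(lambda: defaultdict(set))
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines (assumed file layout)
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3 or not parts[1].endswith(":lemma"):
            continue  # only lemma rows are compared
        synset, _, lemma = parts[:3]
        by_synset[synset][key(lemma)].add(lemma)

    synsets = lemmas = 0
    for groups in by_synset.values():
        dup_groups = [forms for forms in groups.values() if len(forms) > 1]
        if dup_groups:
            synsets += 1
            lemmas += sum(len(forms) for forms in dup_groups)
    return synsets, lemmas


def report(paths, key=str.casefold):
    """Print one summary line per file, then a total line."""
    total_synsets = total_lemmas = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            synsets, lemmas = count_duplicates(f, key=key)
        print(f"{path} duplicates\t{synsets} synsets\t{lemmas} lemmas")
        total_synsets += synsets
        total_lemmas += lemmas
    print(f"total duplicates\t{total_synsets} synsets\t{total_lemmas} lemmas")
```

Passing a list of paths to `report` handles one file or many in the same way, which mirrors the variable number of path arguments described above.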

ekaf (Contributor) commented Oct 31, 2024

The two sequoia synsets are not duplicates: one is for the "tree" and the other denotes the "wood" (i.e. material to make furniture). The issue was discussed in OEWN some years ago.

fcbond (Contributor) commented Oct 31, 2024 via email

ekaf (Contributor) commented Oct 31, 2024

Thanks @fcbond, I should have made it clearer that I was responding to the opening comment above, where @goodmami suggests that these two synsets might be duplicates, which they aren't:

There may be duplicate synsets, but the script does not test for these. But here's an example of two synsets that may be duplicates:

WARNING:tsv-duplicates:duplicate of 11640645-n: 'sequoia', 'séquoia'
WARNING:tsv-duplicates:duplicate of 11640898-n: 'sequoia', 'séquoia'

goodmami (Collaborator, Author) commented

Apologies, @ekaf, my wording was imprecise and my example ill-chosen. @fcbond is correct that the script looks for near-duplicate lemmas within a synset and not for multiple synsets that are duplicates of each other (I've updated the issue text above to make this clearer). The idea is that the TSV files contain some lemmas that differ only trivially within the same synset, and we'd rather not keep all of them. This example might be more illustrative:

WARNING:tsv-duplicates:duplicate of 15277118-n: 'mortalite', 'mortalité'

If we look at all the lemmas for that synset, we see two others that differ in more substantial ways:

$ grep 15277118-n wns/fra/wn-data-fra.tab
15277118-n	fra:lemma	taux de mortalité
15277118-n	fra:lemma	mortalite
15277118-n	fra:lemma	morbidité
15277118-n	fra:lemma	mortalité

My guess is that the mortalite without diacritics is redundant and can be removed from the TSV file.

ekaf (Contributor) commented Jan 22, 2025

@goodmami, rather than just redundant, "the mortalite without diacritics" is incorrect. But, as you wrote earlier, only native lexicographers can make such corrections:

I don't think we can get around having human annotators to fix the upper/lower case, diacritics, and plurals.

Maybe a spell checker could detect some incorrect forms, but any orthographic editing would need to be approached with great caution.

goodmami mentioned this pull request Jan 30, 2025

goodmami (Collaborator, Author) commented

I created #48 only for the scripts so we can move forward with that. The modifications to the data should happen in other PRs.

I think this PR should be closed without merging since the commits adding the scripts would cause conflicts. The manual modifications to the Icelandic wordnet could be cherry-picked for a new PR. Alternatively, we could repurpose this PR with a force-push that rebases without those commits. Let me know if you want help with either of those options.


3 participants