wrong implementation of BertWordpieceTokenizer #1192

Closed
juancavallotti opened this issue Oct 26, 2022 · 2 comments · Fixed by #1231

Comments

@juancavallotti

juancavallotti commented Oct 26, 2022

This implementation of BertWordpieceTokenizer selects subtokens in the wrong order. For example, if you use the bert-base-uncased vocabulary and tokenize the word "vegan", your implementation returns v ##egan instead of vega ##n.

To Reproduce
Steps to reproduce the behavior:

  1. Load the WordPiece tokenizer with the bert-base-uncased vocabulary file. I used the one from Hugging Face.
  2. Tokenize 'vegan'. The expected result is vega ##n, but you get v ##egan.
  3. Try other words like 'cumin' or 'ajinomoto'.

Expected behavior
The WordPiece specification says that you have to find the longest prefixes, not the longest suffixes! Your recursive implementation does precisely the opposite of what it should: it picks the longest chunks from back to front, where it should be looking from front to back.
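To make the difference concrete, here is a minimal Python sketch of both strategies. It is not the library's code and not the code from the gist linked below. The function names and the toy vocabulary are made up for illustration; the toy vocabulary only assumes that v, vega, ##n and ##egan are all present, which the outputs reported above imply for bert-base-uncased.

```python
from typing import List, Optional, Set

UNK = "[UNK]"  # placeholder for words that cannot be split (assumed name)


def wordpiece_tokenize(word: str, vocab: Set[str], unk_token: str = UNK) -> List[str]:
    """Greedy longest-prefix-first matching, scanning front to back."""
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


def longest_suffix_tokenize(word: str, vocab: Set[str], unk_token: str = UNK) -> List[str]:
    """Back-to-front greedy matching: one way to get the reported wrong output."""
    tokens: List[str] = []
    end = len(word)
    while end > 0:
        start = 0
        match: Optional[str] = None
        # Shrink the candidate from the left until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            start += 1
        if match is None:
            return [unk_token]
        tokens.insert(0, match)
        end = start
    return tokens


# Hypothetical toy vocabulary, not the real bert-base-uncased file.
TOY_VOCAB = {"v", "vega", "##n", "##egan"}

print(wordpiece_tokenize("vegan", TOY_VOCAB))       # ['vega', '##n']
print(longest_suffix_tokenize("vegan", TOY_VOCAB))  # ['v', '##egan']
```

Both functions are greedy and differ only in which end of the word they start from, which is enough to flip the result for 'vegan'.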

Please see my working version here: https://gist.github.com/juancavallotti/25d619f81dabb9c0476dcec87acc3a0a

@Apollon77
Contributor

Could you please provide a PR with your changes? This can also be done very easily in the GitHub web interface: browse to the file, click the pencil button, make the needed changes, then choose "Propose file change" and then "Create pull request".

@ericzon
Collaborator

ericzon commented Nov 28, 2022

Thanks for pointing out the problem and the solution, @juancavallotti. As @Apollon77 suggested, it would be nice if you could add this change using a PR. If this is not possible for whatever reason, we'll apply this fix and you will be mentioned in the commit.
