This implementation of BertWordpieceTokenizer selects subtokens in the wrong order. For example, if you use the vocabulary of bert-base-uncased and try to tokenize the word "vegan", the resulting tokens from your implementation are `v ##egan` and not `vega ##n`.
To Reproduce
Steps to reproduce the behavior:
1. Load the wordpiece tokenizer with the bert-base-uncased vocabulary file. I used the one from Hugging Face.
2. Tokenize 'vegan'; the expected result is `vega ##n`, but you get `v ##egan` (see the cross-check sketch below).
3. Try other words like 'cumin' or 'ajinomoto'.
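If it helps to cross-check the expected segmentation, the reference BERT WordPiece implementation shipped with the separate `transformers` package can be used for comparison (this assumes `transformers` is installed; it is not part of this project):

```python
# Cross-check against the reference BERT WordPiece implementation
# (separate HuggingFace `transformers` package, used only for comparison).
from transformers import BertTokenizer

ref = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["vegan", "cumin", "ajinomoto"]:
    # WordpieceTokenizer expects already lower-cased, whitespace-split input.
    print(word, "->", ref.wordpiece_tokenizer.tokenize(word))
```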
Expected behavior
The specification of WordPiece says that you have to find the longest prefixes, not the longest suffixes! So your recursive implementation is doing precisely the opposite of what it should: it picks the longest chunks from back to front, when it should scan from front to back.

Please see my working version here: https://gist.github.com/juancavallotti/25d619f81dabb9c0476dcec87acc3a0a
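For reference, here is a minimal sketch of the greedy longest-prefix matching described above. The names (`tokenize_word`, `vocab`) and the toy vocabulary are illustrative only; this is not this project's code nor the gist linked above:

```python
# Greedy longest-prefix WordPiece matching (standard BERT behaviour),
# shown only to illustrate the expected front-to-back order.

def tokenize_word(word, vocab, unk_token="[UNK]", prefix="##"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_substr = None
        # Scan front to back: take the longest piece starting at `start`
        # that exists in the vocabulary.
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = prefix + substr
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:
            # No piece matched: the whole word maps to [UNK].
            return [unk_token]
        tokens.append(cur_substr)
        start = end
    return tokens

# With a toy vocabulary containing "vega" and "##n" (but not "vegan"),
# the longest *prefix* is taken first, yielding ['vega', '##n'].
print(tokenize_word("vegan", {"vega", "##n", "v", "##egan"}))
```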
Could you please provide a PR with your changes? It can also be done very easily via the GitHub web UI by browsing to the file, clicking the pencil button, making the needed changes there, then choosing "Propose file change" and "Create pull request".
Thanks for pointing out the problem and the solution, @juancavallotti. As @Apollon77 suggested, it would be nice if you could add this change via a PR. If that is not possible for whatever reason, we'll apply the fix ourselves and you will be mentioned in the commit.