wrong implementation of BertWordpieceTokenizer #1192

juancavallotti · 2022-10-26T15:31:56Z

This implementation of BertWordpieceTokenizer selects subtokens in the wrong order. For example, if you use the bert-base-uncased vocabulary and tokenize the word "vegan", your implementation returns v ##egan instead of the expected vega ##n.

To Reproduce
Steps to reproduce the behavior:

1. Load the wordpiece tokenizer with the bert-base-uncased vocabulary file. I used the one from Hugging Face.
2. Tokenize 'vegan'. The expected result is vega ##n, but you get v ##egan.
3. Try other words like 'cumin' or 'ajinomoto'.

Expected behavior
The WordPiece specification says that you have to find the longest prefixes, not the longest suffixes. Your recursive implementation does precisely the opposite of what it should: it picks the longest chunks from back to front, where it should be matching from front to back.

Please see my working version here: https://gist.github.com/juancavallotti/25d619f81dabb9c0476dcec87acc3a0a
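
For readers who want the rule spelled out, here is a minimal Python sketch of greedy longest-prefix-first matching as described above. It is not the code from the linked gist or from this library. The function name `wordpiece_tokenize`, the `[UNK]` fallback, and the four-entry `toy_vocab` are assumptions made for illustration; the toy vocabulary only stands in for the relevant bert-base-uncased entries mentioned in this issue.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate starting at `start`, shrinking from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the continuation prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece in the vocabulary fits at this position.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen so that both splits from the issue are possible.
toy_vocab = {"v", "vega", "##n", "##egan"}
print(wordpiece_tokenize("vegan", toy_vocab))  # ['vega', '##n']
```

Both "v" and "vega" are valid first pieces here, and matching front to back picks the longer one. A search that grabs the longest chunk from the end of the word first would take "##egan" and leave "v", which is the output reported in this issue.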

Apollon77 · 2022-10-26T16:18:42Z

Could you please provide a PR with your changes? This can also be done very easily on the GitHub website: browse to the file, click the pencil button, make the needed changes there, and then choose "propose file change" followed by "create Pull request".

ericzon · 2022-11-28T06:53:31Z

Thanks for pointing out the problem and the solution, @juancavallotti. As @Apollon77 suggested, it would be nice if you could submit this change as a PR. If that is not possible for whatever reason, we'll apply the fix ourselves and mention you in the commit.

Apollon77 mentioned this issue Dec 8, 2022

fix: correct the search logic for BertWordpieceTokenizer #1231

Merged

ericzon closed this as completed in #1231 May 25, 2023
