Tokenizer is very large #22
Comments
The large size of the compiled tokeniser is a known issue. There are a few things you can do to reduce the size a bit; @flammie is the best person to answer this, but major reductions in size are not possible without major work on the compiler itself, which is non-trivial. An example: the Optimising the …
Makes sense. I've done similar kinds of transformations for learner errors, and the combinatorics explode pretty quickly. If @flammie doesn't know of any low-hanging fruit that could make a significant dent, then feel free to close this issue.
I think there is an old github issue somewhere where I have tested a lot of things, but I'm on a limited travel conference laptop right now and github search doesn't seem to find anything, of course. Titlecasing is legit one example that always has a chance to n-tuple automata sizes. And touching pmatch2fst and tokenise code is always a big effort :-D
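To make the "n-tupling" point concrete, here is a minimal Python sketch (a toy model, not the actual pmatch2fst implementation): if every lexicon entry gets an optional titlecased variant at compile time, the number of distinct surface forms roughly doubles, and k independent orthographic transformations can scale it by up to 2**k.

```python
def titlecase_variants(words):
    """Add a titlecased variant for every entry (toy model of
    compile-time case expansion in a tokeniser lexicon)."""
    out = set(words)
    for w in words:
        out.add(w.title())
    return out

# Hypothetical mini-lexicon for illustration only.
lexicon = {"привет", "мир", "токен"}
expanded = titlecase_variants(lexicon)

# One extra variant per word doubles the entry count here;
# stacking k such transformations multiplies it by up to 2**k.
print(len(lexicon), len(expanded))  # 3 6
```

In a real FST the shared structure is compressed, so the blow-up is usually less than the worst case, but the tendency is the same: each transformation layer multiplies rather than adds.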
I am gearing up to use lang-rus in a wasm environment, and I was surprised to discover that the compiled tokenizer `tokeniser-disamb-gt-desc.pmhfst` is 394MB, almost 50 times bigger than `analyser-gt-desc.hfstol`. I haven't worked with tokenizers much. Any pointers for how I can get that size down?