Tokenizer is very large #22

Open
reynoldsnlp opened this issue Mar 4, 2025 · 3 comments

@reynoldsnlp
Collaborator

I am gearing up to use lang-rus in a wasm environment, and I was surprised to discover that the compiled tokenizer tokeniser-disamb-gt-desc.pmhfst is 394MB, almost 50 times bigger than analyser-gt-desc.hfstol. I haven't worked with tokenizers much. Any pointers for how I can get that size down?

@snomos
Member

snomos commented Mar 4, 2025

The large size of the compiled tokeniser is a known issue. There are a few things you can do to reduce the size a bit (@flammie is the best person to answer this), but major reductions in size are not possible without major work on the compiler itself, which is non-trivial. An example: the pmatch language contains several functions for text transformations, like Upper, Lower, etc. These are presently implemented as operations on the full net, thus increasing the size of the resulting FST for each operation. The same goes for other pmatch functions, even though many of them could be implemented in a more optimal way.
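To make the growth concrete, here is a toy Python sketch (not HFST or pmatch code; the Russian word list is invented for illustration). A plain trie acceptor stands in for the lexical FST, and its state count stands in for transducer size. Admitting the title-case variant of every word, roughly what a case transformation like Upper does over the whole net, nearly doubles the number of states:

```python
# Toy illustration only: count the states of a trie accepting exactly
# `words` (one state per distinct prefix, plus the root).

def trie_states(words):
    states = {""}  # the root (empty prefix)
    for w in words:
        for i in range(1, len(w) + 1):
            states.add(w[:i])  # one state per distinct prefix
    return len(states)

words = ["собака", "кошка", "корова", "лошадь"]
titled = [w.title() for w in words]  # title-case variants

print(trie_states(words))           # 22 states: lowercase only
print(trie_states(words + titled))  # 43 states: nearly doubled
```

In a real pmatch net this growth compounds across every such transformation, which matches the per-operation size increases described above.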

Optimising the hfst-pmatch compiler is a major undertaking though, and won't happen in the near future. It is nevertheless high on my wish list 🙂

@reynoldsnlp
Collaborator Author

Makes sense. I've done similar kinds of transformations for learner errors and the combinatorics explode pretty quickly. If @flammie doesn't know of any low-hanging fruit that could make a significant dent, then feel free to close this issue.
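For readers unfamiliar with that explosion, a small sketch of the combinatorics (the confusion pairs are hypothetical, purely for illustration): if each of k characters in a word can independently surface as either the target letter or an error variant, the word has 2**k surface forms, and a net accepting all of them grows accordingly.

```python
from itertools import product

def error_variants(word, confusions):
    """Yield every variant of `word` where each confusable character
    may either stay as-is or be replaced by its error counterpart."""
    options = [(ch, confusions[ch]) if ch in confusions else (ch,)
               for ch in word]
    for combo in product(*options):
        yield "".join(combo)

# Hypothetical confusion pairs, e.g. vowel-reduction errors.
confusions = {"о": "а", "е": "и"}

variants = list(error_variants("молоко", confusions))
print(len(variants))   # 2**3 = 8 variants for the three confusable о's
print(variants[:4])
```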

@flammie
Contributor

flammie commented Mar 4, 2025

I think there is an old GitHub issue somewhere where I tested a lot of things, but I'm on a limited travel conference laptop right now and GitHub search doesn't seem to find anything, of course. Titlecasing is indeed one example that can multiply automaton sizes n-fold. And touching the pmatch2fst and tokenise code is always a big effort :-D
