Tokenizer is very large #22

Open
reynoldsnlp opened this issue Mar 4, 2025 · 3 comments

@reynoldsnlp
Collaborator

I am gearing up to use lang-rus in a wasm environment, and I was surprised to discover that the compiled tokenizer tokeniser-disamb-gt-desc.pmhfst is 394MB, almost 50 times bigger than analyser-gt-desc.hfstol. I haven't worked with tokenizers much. Any pointers for how I can get that size down?

@snomos
Member

snomos commented Mar 4, 2025

The large size of the compiled tokeniser is a known issue. There are a few things you can do to reduce the size a bit (@flammie is the best person to answer this), but major reductions in size are not possible without major work on the compiler itself, which is non-trivial. An example: the pmatch language contains several functions for text transformations, like Upper, Lower, etc. These are presently implemented as operations on the full net, thus increasing the size of the resulting FST for each operation. The same goes for other pmatch functions, even though many of them could be implemented in a more optimal way.
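To make the growth concrete, here is a toy Python sketch (not HFST or pmatch code; the Russian word list is invented for illustration). A plain trie acceptor stands in for the lexical FST, and its state count stands in for transducer size. Admitting the title-case variant of every word, roughly what a case transformation like Upper does over the whole net, nearly doubles the number of states:

```python
# Toy illustration only: count the states of a trie accepting exactly
# `words` (one state per distinct prefix, plus the root).

def trie_states(words):
    states = {""}  # the root (empty prefix)
    for w in words:
        for i in range(1, len(w) + 1):
            states.add(w[:i])  # one state per distinct prefix
    return len(states)

words = ["собака", "кошка", "корова", "лошадь"]
titled = [w.title() for w in words]  # title-case variants

print(trie_states(words))           # 22 states: lowercase only
print(trie_states(words + titled))  # 43 states: nearly doubled
```

In a real pmatch net this growth compounds across every such transformation, which matches the per-operation size increases described above.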

Optimising the hfst-pmatch compiler is a major undertaking though, and won't happen in the near future. It is nevertheless high on my wish list 🙂

@reynoldsnlp
Collaborator Author

Makes sense. I've done similar kinds of transformations for learner errors and the combinatorics explode pretty quickly. If @flammie doesn't know of any low-hanging fruit that could make a significant dent, then feel free to close this issue.
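For readers unfamiliar with that explosion, a small sketch of the combinatorics (the confusion pairs are hypothetical, purely for illustration): if each of k characters in a word can independently surface as either the target letter or an error variant, the word has 2**k surface forms, and a net accepting all of them grows accordingly.

```python
from itertools import product

def error_variants(word, confusions):
    """Yield every variant of `word` where each confusable character
    may either stay as-is or be replaced by its error counterpart."""
    options = [(ch, confusions[ch]) if ch in confusions else (ch,)
               for ch in word]
    for combo in product(*options):
        yield "".join(combo)

# Hypothetical confusion pairs, e.g. vowel-reduction errors.
confusions = {"о": "а", "е": "и"}

variants = list(error_variants("молоко", confusions))
print(len(variants))   # 2**3 = 8 variants for the three confusable о's
print(variants[:4])
```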

@flammie
Contributor

flammie commented Mar 4, 2025

I think there is an old GitHub issue somewhere where I tested a lot of things, but I'm on a limited travel conference laptop right now and GitHub search doesn't seem to find anything, of course. Titlecasing is indeed one example that can multiply automaton sizes n-fold. And touching the pmatch2fst and tokenise code is always a big effort :-D
