Incorrect detection #17

debuggio · 2024-11-26T06:25:23Z

Hi guys!
I use py3langid==0.2.2 and found that in some cases Chinese gets a higher probability than it should. For example:

from py3langid.langid import LanguageIdentifier, MODEL_FILE
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
identifier.rank("Al furjan")

outputs:
[('zh', 0.24405981600284576), ('fi', 0.16715779900550842), ('mt', 0.1392195224761963), ('et', 0.10675894469022751), ('sl', 0.07787516713142395), ('en', 0.05285739526152611)......]

I understand that the text is quite short and that it may return languages other than English, but Chinese?

adbar · 2024-12-02T12:58:04Z

The original model is error-prone on short texts. As you say, though, this is clearly a bug.
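
Until short inputs are handled better, one possible stopgap is to distrust a top result whose normalized probability is low. The snippet below is only a sketch, not part of py3langid: `top_language`, the 0.5 threshold, and the "und" fallback code are assumptions chosen for illustration, and the input is the first few entries of the ranking reported above.

```python
def top_language(ranked, min_prob=0.5, fallback="und"):
    """Return the best-scoring language, or `fallback` if its probability is too low.

    `ranked` is a list of (language_code, probability) pairs, such as the
    output of `identifier.rank(...)` with norm_probs=True.
    The helper name, threshold, and fallback are hypothetical, not library API.
    """
    if not ranked:
        return fallback
    lang, prob = max(ranked, key=lambda pair: pair[1])
    return lang if prob >= min_prob else fallback


# First entries of the ranking reported for "Al furjan" in this issue.
ranked = [
    ("zh", 0.24405981600284576),
    ("fi", 0.16715779900550842),
    ("mt", 0.1392195224761963),
]
print(top_language(ranked))  # the top score (about 0.24) is below 0.5, so this prints "und"
```

This does not correct the ranking itself; it only avoids acting on a guess like 'zh' when no language clearly dominates.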

adbar added the bug label Dec 2, 2024
