misc/language: make sure primary subtag has a match when guessing lang #16040

Dudemanguy · 2025-03-11T16:29:17Z

The old logic blindly accepted any string combination that had the same format as an IETF. Do at least a little bit of validation to make sure the first part matches an actual language. The downside is that this requires having a giant whitelist of language tags. But it's not like the amount of languages on planet earth is growing at a rapid pace or something. Living languages with ISO 639-2 tags are chosen as the basis of the whitelist since we accept those and having a list of everything in ISO 639-3 is just overkill. Fixes #16038.

github-actions · 2025-03-11T17:01:57Z

Download the artifacts for this pull request:

Windows

macOS

kasper93 · 2025-03-11T17:07:20Z

test/language.c

@@ -99,5 +99,15 @@ int main(void)
    TEST_LANG_GUESS("foo(en-US)(sdh).srt", "en-US", 3, true);
    TEST_LANG_GUESS("foo().srt", "", -1, false);

+    TEST_LANG_GUESS("foo.bar.srt", "", 3, false);


I think we should reset pos to -1 if there is no match.

Yeah that makes more sense.

dyphire · 2025-03-12T16:18:07Z

This doesn't really fix #16038, the real problem there is that the video file name is not stripped when guessing the language tag on the external track file, it just guesses from the name suffix that matches. If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used?

I think the correct fix is that when guessing language tags for external track files, you should first strip the video file name and then match and guess.

In addition, there are many ISO 639-2 tags, and the new whitelist doesn't look complete, at least I didn't see chi and zho representing Chinese. And this match is not correct, it can also be ISO 639-1 tags. See-also #13916 (comment)

Dudemanguy · 2025-03-12T17:09:23Z

The approach of stripping the file name only works for your specific example. If I attempt to use the same subtitle but with any other video with a different name, then mpv will continue to think the language is AAC-CMCTV.

If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used?

Aside from maybe adding adding an option to disable guessing, I don't see why not. If a file happens to have something like en-US at the end of it, it only makes to assume it's english if you're going to guess from the filename at all.

In addition, there are many ISO 639-2 tags, and the new whitelist doesn't look complete, at least I didn't see chi and zho representing Chinese.

When building the list, my query omitted macrolanguages and it should not have. That has been corrected now. Note that we first canonicalize all tags to the ISO 639-3 name and use that for matching. Chinese is actually not being canonicalized to the correct tag. It should normalize to zho not chi like what currently is happening in master. Anyways, that has been corrected as well.

Edit: Actually Chinese is not the only one. Looks like it uses the 639-2/B name. I'll change them all to use the T code instead.

dyphire · 2025-03-12T17:29:39Z

The approach of stripping the file name only works for your specific example. If I attempt to use the same subtitle but with any other video with a different name, then mpv will continue to think the language is AAC-CMCTV.

Let me clarify: I am not against the normalization of this pr for language tag guessing, which is good. I just don't think this completely solves the problem that is reflected in #16038.

If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used?

Aside from maybe adding adding an option to disable guessing, I don't see why not. If a file happens to have something like en-US at the end of it, it only makes to assume it's english if you're going to guess from the filename at all.

The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt), then the subtitle file itself has no additional identification but mpv still tries to parse the language tag, but its actual subtitle content may be in any language. I think the language tag should be parsed only if the subtitle file name does have extra content relative to the video file name(etc. test.en-US.mp4, test.en-US.zho-Hans.srt). We've stripped the public part in the display of track titles, why not do it too when guessing language tags?

The old logic blindly accepted any string combination that had the same format as an IETF. Do at least a little bit of validation to make sure the first part matches an actual language. The downside is that this requires having a giant whitelist of language tags. But it's not like the amount of languages on planet earth is growing at a rapid pace or something. Living languages with ISO 639-2 tags are chosen as the basis of the whitelist since we accept those and having a list of everything in ISO 639-3 is just overkill. Fixes mpv-player#16038.

We need to map three letter codes to another three letter code sometimes, so put an extra space to make it more readable.

ISO 639-3 uses T codes so it makes more sense to match that.

Dudemanguy · 2025-03-12T18:08:47Z

The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt)

I'm not sure I agree with this example. I could easily see someone expecting that the combination of these two files result in test.en-US.srt being read as english subtitles regardless of the fact the names happen to match. This is what currently happens so it would be a behavior change if we decided to strip the names.

dyphire · 2025-03-12T18:23:09Z

The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt)

I'm not sure I agree with this example. I could easily see someone expecting that the combination of these two files result in test.en-US.srt being read as english subtitles regardless of the fact the names happen to match. This is what currently happens so it would be a behavior change if we decided to strip the names.

Ah, what you said makes sense. This is an extreme example, and more likely to encounter an example similar to test.sc.mp4 and test.sc.srt.
But given that I have given an example that works in extreme cases, I will not stick to this anymore.

kasper93 · 2025-03-13T02:05:44Z

See those documents for more information on validating bcp47 https://www.rfc-editor.org/info/bcp47

Dudemanguy force-pushed the ext-language branch 2 times, most recently from a4191bb to 1a55f4c Compare March 11, 2025 17:05

kasper93 reviewed Mar 11, 2025

View reviewed changes

Dudemanguy force-pushed the ext-language branch from 1a55f4c to f3ef425 Compare March 11, 2025 17:15

kasper93 added this to the Release v0.41.0 milestone Mar 11, 2025

Dudemanguy force-pushed the ext-language branch from f3ef425 to 2d442ad Compare March 12, 2025 17:03

Dudemanguy force-pushed the ext-language branch from 2d442ad to d764617 Compare March 12, 2025 17:13

Dudemanguy added 3 commits March 12, 2025 12:47

misc/language: format langmap

a14a4a8

We need to map three letter codes to another three letter code sometimes, so put an extra space to make it more readable.

misc/language: canonicalize to T code

aab9cd1

ISO 639-3 uses T codes so it makes more sense to match that.

Dudemanguy force-pushed the ext-language branch from d764617 to aab9cd1 Compare March 12, 2025 17:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

misc/language: make sure primary subtag has a match when guessing lang #16040

misc/language: make sure primary subtag has a match when guessing lang #16040

Dudemanguy commented Mar 11, 2025 •

edited

Loading

github-actions bot commented Mar 11, 2025 •

edited

Loading

kasper93 Mar 11, 2025

Dudemanguy Mar 11, 2025

dyphire commented Mar 12, 2025 •

edited

Loading

Dudemanguy commented Mar 12, 2025 •

edited

Loading

dyphire commented Mar 12, 2025 •

edited

Loading

Dudemanguy commented Mar 12, 2025 •

edited

Loading

dyphire commented Mar 12, 2025

kasper93 commented Mar 13, 2025

misc/language: make sure primary subtag has a match when guessing lang #16040

Are you sure you want to change the base?

misc/language: make sure primary subtag has a match when guessing lang #16040

Conversation

Dudemanguy commented Mar 11, 2025 • edited Loading

github-actions bot commented Mar 11, 2025 • edited Loading

kasper93 Mar 11, 2025

Choose a reason for hiding this comment

Dudemanguy Mar 11, 2025

Choose a reason for hiding this comment

dyphire commented Mar 12, 2025 • edited Loading

Dudemanguy commented Mar 12, 2025 • edited Loading

dyphire commented Mar 12, 2025 • edited Loading

Dudemanguy commented Mar 12, 2025 • edited Loading

dyphire commented Mar 12, 2025

kasper93 commented Mar 13, 2025

Dudemanguy commented Mar 11, 2025 •

edited

Loading

github-actions bot commented Mar 11, 2025 •

edited

Loading

dyphire commented Mar 12, 2025 •

edited

Loading

Dudemanguy commented Mar 12, 2025 •

edited

Loading

dyphire commented Mar 12, 2025 •

edited

Loading

Dudemanguy commented Mar 12, 2025 •

edited

Loading