Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misc/language: make sure primary subtag has a match when guessing lang #16040

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Dudemanguy
Copy link
Member

@Dudemanguy Dudemanguy commented Mar 11, 2025

The old logic blindly accepted any string combination that had the same format as an IETF. Do at least a little bit of validation to make sure the first part matches an actual language. The downside is that this requires having a giant whitelist of language tags. But it's not like the amount of languages on planet earth is growing at a rapid pace or something. Living languages with ISO 639-2 tags are chosen as the basis of the whitelist since we accept those and having a list of everything in ISO 639-3 is just overkill. Fixes #16038.

Copy link

github-actions bot commented Mar 11, 2025

Download the artifacts for this pull request:

Windows
macOS

@Dudemanguy Dudemanguy force-pushed the ext-language branch 2 times, most recently from a4191bb to 1a55f4c Compare March 11, 2025 17:05
test/language.c Outdated
@@ -99,5 +99,15 @@ int main(void)
TEST_LANG_GUESS("foo(en-US)(sdh).srt", "en-US", 3, true);
TEST_LANG_GUESS("foo().srt", "", -1, false);

TEST_LANG_GUESS("foo.bar.srt", "", 3, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should reset pos to -1 if there is no match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes more sense.

@kasper93 kasper93 added this to the Release v0.41.0 milestone Mar 11, 2025
@dyphire
Copy link
Contributor

dyphire commented Mar 12, 2025

This doesn't really fix #16038, the real problem there is that the video file name is not stripped when guessing the language tag on the external track file, it just guesses from the name suffix that matches. If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used?

I think the correct fix is ​​that when guessing language tags for external track files, you should first strip the video file name and then match and guess.

In addition, there are many ISO 639-2 tags, and the new whitelist doesn't look complete, at least I didn't see chi and zho representing Chinese. And this match is not correct, it can also be ISO 639-1 tags. See-also #13916 (comment)

@Dudemanguy
Copy link
Member Author

Dudemanguy commented Mar 12, 2025

The approach of stripping the file name only works for your specific example. If I attempt to use the same subtitle but with any other video with a different name, then mpv will continue to think the language is AAC-CMCTV.

If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used?

Aside from maybe adding adding an option to disable guessing, I don't see why not. If a file happens to have something like en-US at the end of it, it only makes to assume it's english if you're going to guess from the filename at all.

In addition, there are many ISO 639-2 tags, and the new whitelist doesn't look complete, at least I didn't see chi and zho representing Chinese.

When building the list, my query omitted macrolanguages and it should not have. That has been corrected now. Note that we first canonicalize all tags to the ISO 639-3 name and use that for matching. Chinese is actually not being canonicalized to the correct tag. It should normalize to zho not chi like what currently is happening in master. Anyways, that has been corrected as well.

Edit: Actually Chinese is not the only one. Looks like it uses the 639-2/B name. I'll change them all to use the T code instead.

@dyphire
Copy link
Contributor

dyphire commented Mar 12, 2025

The approach of stripping the file name only works for your specific example. If I attempt to use the same subtitle but with any other video with a different name, then mpv will continue to think the language is AAC-CMCTV.

Let me clarify: I am not against the normalization of this pr for language tag guessing, which is good. I just don't think this completely solves the problem that is reflected in #16038.

If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used?

Aside from maybe adding adding an option to disable guessing, I don't see why not. If a file happens to have something like en-US at the end of it, it only makes to assume it's english if you're going to guess from the filename at all.

The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt), then the subtitle file itself has no additional identification but mpv still tries to parse the language tag, but its actual subtitle content may be in any language. I think the language tag should be parsed only if the subtitle file name does have extra content relative to the video file name(etc. test.en-US.mp4, test.en-US.zho-Hans.srt). We've stripped the public part in the display of track titles, why not do it too when guessing language tags?

The old logic blindly accepted any string combination that had the same
format as an IETF. Do at least a little bit of validation to make sure
the first part matches an actual language. The downside is that this
requires having a giant whitelist of language tags. But it's not like
the amount of languages on planet earth is growing at a rapid pace or
something. Living languages with ISO 639-2 tags are chosen as the basis
of the whitelist since we accept those and having a list of everything
in ISO 639-3 is just overkill.

Fixes mpv-player#16038.
We need to map three letter codes to another three letter code
sometimes, so put an extra space to make it more readable.
ISO 639-3 uses T codes so it makes more sense to match that.
@Dudemanguy
Copy link
Member Author

Dudemanguy commented Mar 12, 2025

The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt)

I'm not sure I agree with this example. I could easily see someone expecting that the combination of these two files result in test.en-US.srt being read as english subtitles regardless of the fact the names happen to match. This is what currently happens so it would be a behavior change if we decided to strip the names.

@dyphire
Copy link
Contributor

dyphire commented Mar 12, 2025

The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt)

I'm not sure I agree with this example. I could easily see someone expecting that the combination of these two files result in test.en-US.srt being read as english subtitles regardless of the fact the names happen to match. This is what currently happens so it would be a behavior change if we decided to strip the names.

Ah, what you said makes sense. This is an extreme example, and more likely to encounter an example similar to test.sc.mp4 and test.sc.srt.
But given that I have given an example that works in extreme cases, I will not stick to this anymore.

@kasper93
Copy link
Contributor

See those documents for more information on validating bcp47 https://www.rfc-editor.org/info/bcp47

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mpv attempts to parse language tags from video filename for the external track file with the same name
3 participants