-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
misc/language: make sure primary subtag has a match when guessing lang #16040
base: master
Are you sure you want to change the base?
Conversation
Download the artifacts for this pull request: |
a4191bb
to
1a55f4c
Compare
test/language.c
Outdated
@@ -99,5 +99,15 @@ int main(void) | |||
TEST_LANG_GUESS("foo(en-US)(sdh).srt", "en-US", 3, true); | |||
TEST_LANG_GUESS("foo().srt", "", -1, false); | |||
|
|||
TEST_LANG_GUESS("foo.bar.srt", "", 3, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should reset pos to -1 if there is no match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes more sense.
1a55f4c
to
f3ef425
Compare
This doesn't really fix #16038, the real problem there is that the video file name is not stripped when guessing the language tag on the external track file, it just guesses from the name suffix that matches. If the character matched at the end of the video file name happens to hit ISO 639-* tags, should it be regarded as a valid language tag and used? I think the correct fix is that when guessing language tags for external track files, you should first strip the video file name and then match and guess. In addition, there are many ISO 639-2 tags, and the new whitelist doesn't look complete, at least I didn't see |
f3ef425
to
2d442ad
Compare
The approach of stripping the file name only works for your specific example. If I attempt to use the same subtitle but with any other video with a different name, then mpv will continue to think the language is AAC-CMCTV.
Aside from maybe adding adding an option to disable guessing, I don't see why not. If a file happens to have something like en-US at the end of it, it only makes to assume it's english if you're going to guess from the filename at all.
When building the list, my query omitted macrolanguages and it should not have. That has been corrected now. Note that we first canonicalize all tags to the ISO 639-3 name and use that for matching. Chinese is actually not being canonicalized to the correct tag. It should normalize to Edit: Actually Chinese is not the only one. Looks like it uses the 639-2/B name. I'll change them all to use the T code instead. |
2d442ad
to
d764617
Compare
Let me clarify: I am not against the normalization of this pr for language tag guessing, which is good. I just don't think this completely solves the problem that is reflected in #16038.
The real problem here is that if the subtitle file has the same name as the video file(etc. test.en-US.mp4, test.en-US.srt), then the subtitle file itself has no additional identification but mpv still tries to parse the language tag, but its actual subtitle content may be in any language. I think the language tag should be parsed only if the subtitle file name does have extra content relative to the video file name(etc. test.en-US.mp4, test.en-US.zho-Hans.srt). We've stripped the public part in the display of track titles, why not do it too when guessing language tags? |
The old logic blindly accepted any string combination that had the same format as an IETF. Do at least a little bit of validation to make sure the first part matches an actual language. The downside is that this requires having a giant whitelist of language tags. But it's not like the amount of languages on planet earth is growing at a rapid pace or something. Living languages with ISO 639-2 tags are chosen as the basis of the whitelist since we accept those and having a list of everything in ISO 639-3 is just overkill. Fixes mpv-player#16038.
We need to map three letter codes to another three letter code sometimes, so put an extra space to make it more readable.
ISO 639-3 uses T codes so it makes more sense to match that.
d764617
to
aab9cd1
Compare
I'm not sure I agree with this example. I could easily see someone expecting that the combination of these two files result in |
Ah, what you said makes sense. This is an extreme example, and more likely to encounter an example similar to |
See those documents for more information on validating bcp47 https://www.rfc-editor.org/info/bcp47 |
The old logic blindly accepted any string combination that had the same format as an IETF. Do at least a little bit of validation to make sure the first part matches an actual language. The downside is that this requires having a giant whitelist of language tags. But it's not like the amount of languages on planet earth is growing at a rapid pace or something. Living languages with ISO 639-2 tags are chosen as the basis of the whitelist since we accept those and having a list of everything in ISO 639-3 is just overkill. Fixes #16038.