Add a review command to evaluate the quality of translations #31

jonkan · 2024-05-17T21:22:31Z

(This is a continuation of #30)

I have a lot of failed (and poor) translations in my string catalogs that I want to correct. Basically #14. So I went ahead to try and add this.

Add a mark-needs-review command that marks all translated strings as NEEDS REVIEW. To allow for existing translations to be reviewed.
Add a review command to evaluate the quality of translations marked as NEEDS REVIEW. Good translations are marked as translated.

Example output:

Evaluating key `Restore Purchases` [Comment: n/a]
ro     ✅ Good:    Restabilire achiziții
nl     ✅ Good:    Aankopen herstellen
el     ✅ Good:    Επαναφορά Αγορών
fi     ✅ Good:    Palauta ostot
it     ✅ Good:    Ripristina Acquisti
es     ✅ Good:    Restaurar Compras
pl     ✅ Good:    Przywróć zakupy
hu     ✅ Good:    Vásárlások visszaállítása
de     ✅ Good:    Käufe wiederherstellen
nb     ✅ Good:    Gjenopprett kjøp
tr     ✅ Good:    Satın Alımları Geri Yükle
fr     ✅ Good:    Restaurer les achats
hr     ⚠️ Poor:    Vrati kupnje
-      🤖 Reason:  The translation correctly reflects the correct language (Croatian) and the general meaning; however, the specific term 'Restore Purchases' often refers to a functionality in apps or software. A more appropriate translation in this context could be 'Obnovi kupnje'. Therefore, the quality is marked as 'poor' rather than 'good'.
sv     ✅ Good:    Återställ köp
sk     ✅ Good:    Obnoviť nákupy
da     ⚠️ Poor:    Gendan Køb
-      🤖 Reason:  The translation 'Gendan Køb' is almost correct but not entirely accurate. A better translation would be 'Gendan køb' using lowercase 'k' for 'køb'.
cs     ✅ Good:    Obnovit nákupy
pt-PT  ✅ Good:    Restaurar Compras

Still work in progress and a couple of todo's etc.:

Still some false positives that should be able to be fixed by adjusting the prompt.
Fix/decide how the --lang option should work. I think "all" should mean "all in string catalog" and not "some common ones".
Also, I've refactored a couple of things to try and get this to fit better into the code. There's probably room for improvement.
It's rather (very!) slow. Maybe it would be possible to parallelize...
Some breaking changes that might be avoidable (e.g. moving the previous root command into a translate subcommand).
...

Let me know if you would be interested in merging this. If so, I'll happily take feedback and discuss improvements.

Cheers,
Jonas

The idea with the mark-needs-review command is that evaluate/reivew command should be updated to only reivew strings marked as "needs review" and change them to translated if the evaluation result is good

…eview translations marked as NEEDS REVIEW

…ong running evaluations

andrewtheis · 2024-05-22T18:36:39Z

Going to take some time to review this. Regarding the language input -- I believe the way I built this originally was if you omit that flag it will look at what is in the String Catalog. The issue is, unless there are existing translations, the target languages are known only at the project file, and not the catalog files

andrewtheis · 2024-06-02T23:44:01Z

Seems like there's a lot of overlap between these PRs and there's a lot going on here. I'd like these broken down into more digestible PRs.

The sourceKey property change should be reverted...

sourceKey in the context of localization is a reference to the identifier (key) of the text to translate. This could either be the literal display text in the source language such as Cannot save file, no disk space remaining or something like ALERT_OUT_OF_DISK_SPACE, which then would have an English value in the String Catalog for the text to display.

It's often recommended to use the latter, especially when it comes to multi-line text. This allows you to just use the key in your Text() or String(localized:) definitions, reducing clutter in your source code.

jonkan · 2024-06-03T21:16:02Z

Seems like there's a lot of overlap between these PRs and there's a lot going on here.

Yes, I know. I've added a lot of features I needed/wanted for my use-case. It's all in one straight line of commits. I tried to "package" it the best I could into review:able chunks. Unfortunately, when creating PR:s to a fork it's not possible to create PR dependencies and have each chunk/PR only show the diff to the previous one.

I'd like these broken down into more digestible PRs.

I totally understand. I'll can try and do that. What would you like me to start with? I've basically added these sub-commands:

review Review the quality of translations marked as NEEDS REVIEW. Marking good translations as translated/green.
mark-needs-review Mark translated strings as NEEDS REVIEW. E.g. if you want to review your translations using the review command.
lint Lint translations in String Catalog(s), optionally marking failed as NEEDS REVIEW.
remove Remove languages from String Catalog(s).

Then I've also added:

Support for fastlane metadata (I refactored the spec format slightly from what I first mentioned in Add support for fastlane metadata or other text files #12, see commit message in 74fc715).
Parallelization support to the translate and review commands, to speed them up. Very basic though, this should probably be improved, but it works, which was good enough for me at this point.

The sourceKey property change should be reverted...

sourceKey in the context of localization is a reference to the identifier (key) of the text to translate.
This could either be the literal display text in the source language such as Cannot save file, no disk space remaining or something like ALERT_OUT_OF_DISK_SPACE, which then would have an English value in the String Catalog for the text to display.

It's often recommended to use the latter, especially when it comes to multi-line text. This allows you to just use the key in your Text() or String(localized:) definitions, reducing clutter in your source code.

Well, I see what you're saying. But if the key is "ALERT_OUT_OF_DISK_SPACE" and the source language translation/value is "Cannot save file, no disk space remaining" I very much expect a variable named sourceKey to return "ALERT_OUT_OF_DISK_SPACE" and not "Cannot save file, no disk space remaining" like it currently does.

andrewtheis · 2024-06-04T14:31:32Z

How break these down is difficult. There's a lot of architectural changes--I had a mental roadmap of how the existing architecture could work with other features planned. So, some of these changes conflict with that. That's on me for not documenting this in the tickets... this unfortunately has taken a backseat to many other projects I'm involved in.

Before moving on the rest of this, tests are probably the priority. Mainly for StringCatalog, as it's the most finicky and really the core of how this all works. The actual translation code is really basic in comparison. I'll try and find time to document in tickets how I planned to implement the other features. There's a lot of usable stuff in your work, so I appreciate that. I just need to figure out how all the pieces fit together.

Some more notes so far:

Regarding sourceKey: after checking again this does appear to be an oversight after I did some refactor work to correctly support variations. It was previously the key, but that has moved to the key of the var localizableStringGroups Dictionary in StringCatalog. So I'm ok with this change, can be done alongside unit tests.
Setting Needs Review: Could be an issue, as this is something that Xcode does when it's not sure how to reconcile the strings found during compilation with what's already in the catalog. You can set it manually, but I don't know how that might change what Xcode operates. If you know or can point to any docs that would be useful.
Evaluating translations: The way I planned this to work wasn't to ask ChatGPT about the quality (as it could just hallucinate the quality of it's own work), but instead to have it translate the value back into the source language and diff the strings to see how well it matches. Or, even better, to use a secondary service like Google Translate to translate back to the source language. This is how I was manually verifying ChatGPT's work in our products. I have a branch open for Google Translate support using their basic v2 API. Their v3 API which is LLM based is more complicated, support can be added at a later time.
Prompts/etc, I still think using Functions is unnecessary now, as the system prompt seems to work 100% of the time to get ChatGPT to only output translations in it's response. Check develop and run against your own catalogs. Or use the ExampleCatalogs/LargeTestBench.xcstrings I added, which is pulled from one our products. It includes every possible scenario of strings/variances/etc in it.
Config: Using YML is def how this feature should be implemented. There are Swift libraries for parsing it. Just want to make a plan on the structure and how it all fits together.

All that being said, I have a hunch that Apple might announce something similar to this library directly integrated into Xcode next week. So may be good to pause on any work until after WWDC keynotes Monday.

jonkan added 9 commits May 15, 2024 21:08

Add support for gpt4 models via the newer chat api (WIP)

7e90a50

Update OpenAI

3cebe39

Support passing model as an argument

ace9c02

Refactor the chat completion to use function calling and JSON mode

a6852ec

Add support for evaluating the quality of string catalog translations

67b4109

Add a mark-needs-review command by introducing a concept of subcommands

98d7970

The idea with the mark-needs-review command is that evaluate/reivew command should be updated to only reivew strings marked as "needs review" and change them to translated if the evaluation result is good

Refactor the --evaluate flag into its own command (review) and only r…

04c8278

…eview translations marked as NEEDS REVIEW

Add saving the string catalog every 5 strings to avoid dataloss for l…

b114047

…ong running evaluations

Add logging the reason when a translation is valuated poor or bad

3cd1f82

This was referenced May 22, 2024

Add GPT-4 support #30

Closed

Add translation tests and improvements #32

Closed

jonkan mentioned this pull request Jun 3, 2024

Add lint and remove commands #33

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a review command to evaluate the quality of translations #31

Add a review command to evaluate the quality of translations #31

jonkan commented May 17, 2024 •

edited

Loading

andrewtheis commented May 22, 2024 •

edited

Loading

andrewtheis commented Jun 2, 2024 •

edited

Loading

jonkan commented Jun 3, 2024

andrewtheis commented Jun 4, 2024

Add a review command to evaluate the quality of translations #31

Are you sure you want to change the base?

Add a review command to evaluate the quality of translations #31

Conversation

jonkan commented May 17, 2024 • edited Loading

andrewtheis commented May 22, 2024 • edited Loading

andrewtheis commented Jun 2, 2024 • edited Loading

jonkan commented Jun 3, 2024

andrewtheis commented Jun 4, 2024

jonkan commented May 17, 2024 •

edited

Loading

andrewtheis commented May 22, 2024 •

edited

Loading

andrewtheis commented Jun 2, 2024 •

edited

Loading