@@ -33,17 +33,38 @@ EXAMPLE
-------

...............................
+ v1
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
...............................

+ The first line is a version identifier.
In this example, all instances of the '2' character sequence '''' will
*always* be replaced by the '1' character sequence '"'; a '1' character
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
the '3' character sequence 'iii' *may* be replaced by the '1' character
sequence 'm'.

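The v1 layout described above (a count, that many unichars, a second count, that many unichars, then the type) can be sketched as a small reader. This is a minimal illustration, not Tesseract's own loader: the names `Ambig`, `parse_v1_line` and `parse_v1` are hypothetical, and the sketch assumes fields are separated by whitespace and that no unichar itself contains whitespace.

```python
from typing import List, NamedTuple


class Ambig(NamedTuple):
    """One ambiguity rule (hypothetical container, not a Tesseract type)."""
    wrong: List[str]   # unichars of the sequence that may be misrecognized
    right: List[str]   # unichars of the replacement sequence
    mandatory: bool    # True for type 1 (always), False for type 0 (may)


def parse_v1_line(line: str) -> Ambig:
    """Parse one v1 rule: <n> <n unichars> <m> <m unichars> <type>."""
    fields = line.split()  # assumption: whitespace-separated fields
    n_wrong = int(fields[0])
    wrong = fields[1:1 + n_wrong]
    n_right = int(fields[1 + n_wrong])
    start = 2 + n_wrong
    right = fields[start:start + n_right]
    kind = fields[start + n_right]
    if len(wrong) != n_wrong or len(right) != n_right or kind not in ("0", "1"):
        raise ValueError(f"malformed v1 rule: {line!r}")
    return Ambig(wrong, right, kind == "1")


def parse_v1(text: str) -> List[Ambig]:
    """Parse a whole v1 file, skipping the leading version identifier."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines and lines[0].strip() == "v1":
        lines = lines[1:]
    return [parse_v1_line(ln) for ln in lines]
```

Applied to the three example rules, the first yields a mandatory rule mapping two apostrophes to one double quote, and the other two yield optional rules.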
+ Version 3.03 and later support a new, simpler format for the unicharambigs
+ file:
+
+ ...............................
+ v2
+ '' " 1
+ m rn 0
+ iii m 0
+ ...............................
+
+ In this format, the "error" and "correction" are simple UTF-8 strings
+ separated by a space, followed, after another space, by the same type
+ specifier as in v1 (0 for optional and 1 for mandatory substitution). The
+ downside of this simpler format is that Tesseract has to encode the UTF-8
+ strings into the components of the unicharset. In complex scripts, this
+ encoding may be ambiguous. In that case, the encoding is chosen so as to
+ use the fewest UTF-8 characters for each component, i.e. the shortest
+ unicharset components will make up the encoding.
+
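For comparison, a v2 rule reduces to three space-separated fields. The sketch below only splits the lines; it does not attempt the encoding of strings into unicharset components, which Tesseract performs itself. The names `AmbigV2`, `parse_v2_line` and `parse_v2` are hypothetical, and the sketch assumes neither string contains whitespace.

```python
from typing import List, NamedTuple


class AmbigV2(NamedTuple):
    """One v2 ambiguity rule (hypothetical container, not a Tesseract type)."""
    wrong: str        # UTF-8 "error" string
    right: str        # UTF-8 "correction" string
    mandatory: bool   # True for type 1, False for type 0


def parse_v2_line(line: str) -> AmbigV2:
    """Parse one v2 rule: <error> <correction> <type>."""
    fields = line.split()  # assumption: strings contain no whitespace
    if len(fields) != 3 or fields[2] not in ("0", "1"):
        raise ValueError(f"malformed v2 rule: {line!r}")
    wrong, right, kind = fields
    return AmbigV2(wrong, right, kind == "1")


def parse_v2(text: str) -> List[AmbigV2]:
    """Parse a whole v2 file; the first line must be the version identifier."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].strip() != "v2":
        raise ValueError("expected 'v2' version identifier on the first line")
    return [parse_v2_line(ln) for ln in lines[1:]]
```

Requiring the version line here is a design choice of the sketch: it keeps a v1 file from being silently misread as v2.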
HISTORY
-------
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
@@ -60,6 +81,7 @@ letters in the unicharset.
SEE ALSO
--------
tesseract(1), unicharset(5)
+ https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-unicharambigs-file

AUTHOR
------