Commit aefcbac

add info about unicharambigs file v2; fixes #165
1 parent 32c1e4f commit aefcbac

1 file changed: 22 additions, 0 deletions

doc/unicharambigs.5.asc
@@ -33,17 +33,38 @@ EXAMPLE
 -------
 
 ...............................
+v1
 2 ' ' 1 " 1
 1 m 2 r n 0
 3 i i i 1 m 0
 ...............................
 
+The first line is a version identifier.
 In this example, all instances of the '2' character sequence '''' will
 *always* be replaced by the '1' character sequence '"'; a '1' character
 sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
 the '3' character sequence *may* be replaced by the '1' character
 sequence 'm'.
 
+Version 3.03 and on supports a new, simpler format for the unicharambigs
+file:
+
+...............................
+v2
+'' " 1
+m rn 0
+iii m 0
+...............................
+
+In this format, the "error" and "correction" are simple UTF-8 strings
+separated by a space, and, after another space, the same type specifier
+as v1 (0 for optional and 1 for mandatory substitution). Note the downside
+of this simpler format is that Tesseract has to encode the UTF-8 strings
+into the components of the unicharset. In complex scripts, this encoding
+may be ambiguous. In this case, the encoding is chosen such as to use the
+least UTF-8 characters for each component, ie the shortest unicharset
+components will make up the encoding.
+
 HISTORY
 -------
 The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
@@ -60,6 +81,7 @@ letters in the unicharset.
 SEE ALSO
 --------
 tesseract(1), unicharset(5)
+https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05#the-unicharambigs-file
 
 AUTHOR
 ------
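
The two formats described in this change carry the same information, so the v1 example can be translated line by line into the v2 example. The following is a minimal Python sketch of that translation, not Tesseract's own reader: the names Ambig, parse_v1_line, parse_v2_line, parse_ambigs and to_v2_line are hypothetical, fields are assumed to be separated by whitespace as in the examples above, and the first line is assumed to be the version identifier.

```python
from typing import List, NamedTuple


class Ambig(NamedTuple):
    """One ambiguity rule (hypothetical container, not a Tesseract type)."""
    error: List[str]       # v1: list of components; v2: one UTF-8 string
    correction: List[str]
    mandatory: bool        # type 1 = always replace, type 0 = may replace


def _parse_type(field: str) -> bool:
    if field not in ("0", "1"):
        raise ValueError(f"type specifier must be 0 or 1, got {field!r}")
    return field == "1"


def parse_v1_line(line: str) -> Ambig:
    # v1 layout: <n> <n components> <m> <m components> <type>
    fields = line.split()
    n = int(fields[0])
    m = int(fields[1 + n])
    if len(fields) != n + m + 3:
        raise ValueError(f"malformed v1 line: {line!r}")
    error = fields[1:1 + n]
    correction = fields[2 + n:2 + n + m]
    return Ambig(error, correction, _parse_type(fields[-1]))


def parse_v2_line(line: str) -> Ambig:
    # v2 layout: <error string> <correction string> <type>
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"malformed v2 line: {line!r}")
    return Ambig([fields[0]], [fields[1]], _parse_type(fields[2]))


def parse_ambigs(text: str) -> List[Ambig]:
    # The first non-empty line is taken as the version identifier.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    version = lines[0].strip()
    if version == "v1":
        return [parse_v1_line(ln) for ln in lines[1:]]
    if version == "v2":
        return [parse_v2_line(ln) for ln in lines[1:]]
    raise ValueError(f"unknown version identifier: {version!r}")


def to_v2_line(rule: Ambig) -> str:
    # Joining the components drops the v1 counts and inner spaces.
    kind = 1 if rule.mandatory else 0
    return f"{''.join(rule.error)} {''.join(rule.correction)} {kind}"
```

Applied to the v1 example, parse_ambigs followed by to_v2_line on each rule yields the three rule lines of the v2 example. The reverse direction is deliberately left out: going from a v2 string back to unicharset components requires the unicharset itself, which is the encoding step the added text warns may be ambiguous in complex scripts.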
