Commit 00abf57

Update documentation for unicharset_extractor
1 parent 7d3e132 commit 00abf57

1 file changed, +10 -20 lines changed

doc/unicharset_extractor.1.asc

@@ -3,38 +3,28 @@ UNICHARSET_EXTRACTOR(1)
 
 NAME
 ----
-unicharset_extractor - extract unicharset from Tesseract boxfiles
+unicharset_extractor - Reads box or plain text files to extract the unicharset.
 
 SYNOPSIS
 --------
-*unicharset_extractor* '[-D dir]' 'FILE'...
+*unicharset_extractor* [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]
+
+Where mode means:
+1=combine graphemes (use for Latin and other simple scripts)
+2=split graphemes (use for Indic/Khmer/Myanmar)
+3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
 
 DESCRIPTION
 -----------
 Tesseract needs to know the set of possible characters it can output.
 To generate the unicharset data file, use the unicharset_extractor
-program on the same training pages bounding box files as used for
-clustering:
+program on training pages bounding box files or a plain text file:
 
 unicharset_extractor fontfile_1.box fontfile_2.box ...
 
-The unicharset will be put into the file 'dir/unicharset', or simply
-'./unicharset' if no output directory is provided.
-
-Tesseract also needs to have access to character properties isalpha,
-isdigit, isupper, islower, ispunctuation. all of this auxilury data
-and more is encoded in this file. (See unicharset(5))
-
-If your system supports the wctype functions, these values will be set
-automatically by unicharset_extractor and there is no need to edit the
-unicharset file. On some older systems (eg Windows 95), the unicharset
-file must be edited by hand to add these property description codes.
+The unicharset will be put into the file './unicharset' if no output filename is provided.
 
-*NOTE* The unicharset file must be regenerated whenever inttemp, normproto
-and pffmtable are generated (i.e. they must all be recreated when the box
-file is changed) as they have to be in sync. This is made easier than in
-previous versions by running unicharset_extractor before mftraining and
-cntraining, and giving the unicharset to mftraining.
+*NOTE* Use the appropriate norm_mode based on the language.
 
 SEE ALSO
 --------
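
A minimal sketch of how the new options might be combined, assuming the flags behave as the updated SYNOPSIS describes. The helper `norm_mode_for` and the chosen script name are hypothetical and only restate the mode table from the diff; the box file names are the placeholders used in the DESCRIPTION. The real command is skipped when `unicharset_extractor` is not on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: map a script name to a --norm_mode value,
# following the mode table in the updated SYNOPSIS.
norm_mode_for() {
  case "$1" in
    Latin) echo 1 ;;                        # combine graphemes
    Indic|Khmer|Myanmar) echo 2 ;;          # split graphemes
    Arabic|Hebrew|Thai|Tibetan) echo 3 ;;   # pure unicode
    *) echo "unknown script: $1" >&2; return 1 ;;
  esac
}

# Assumed input for illustration; change the script name to match your data.
script="Khmer"
mode=$(norm_mode_for "$script") || exit 1

if command -v unicharset_extractor >/dev/null 2>&1; then
  # Placeholder box files, as in the DESCRIPTION section.
  unicharset_extractor --output_unicharset ./unicharset \
    --norm_mode "$mode" fontfile_1.box fontfile_2.box
else
  echo "unicharset_extractor not found; would use --norm_mode $mode"
fi
```

Keeping the mapping in one function means the *NOTE* about choosing the appropriate norm_mode is applied in a single place rather than repeated per invocation.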
