@@ -3,38 +3,28 @@ UNICHARSET_EXTRACTOR(1)

NAME
----
- unicharset_extractor - extract unicharset from Tesseract boxfiles
+ unicharset_extractor - Reads box or plain text files to extract the unicharset.

SYNOPSIS
--------
- *unicharset_extractor* '[-D dir]' 'FILE' ...
+ *unicharset_extractor* [-- output_unicharset filename] [-- norm_mode mode] box_or_text_file [... ]
+
+ Where mode means:
+ 1=combine graphemes (use for Latin and other simple scripts)
+ 2=split graphemes (use for Indic/Khmer/Myanmar)
+ 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)

DESCRIPTION
-----------
Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor
- program on the same training pages bounding box files as used for
- clustering:
+ program on training pages bounding box files or a plain text file:

unicharset_extractor fontfile_1.box fontfile_2.box ...

- The unicharset will be put into the file 'dir/unicharset', or simply
- './unicharset' if no output directory is provided.
-
- Tesseract also needs to have access to character properties isalpha,
- isdigit, isupper, islower, ispunctuation. all of this auxilury data
- and more is encoded in this file. (See unicharset(5))
-
- If your system supports the wctype functions, these values will be set
- automatically by unicharset_extractor and there is no need to edit the
- unicharset file. On some older systems (eg Windows 95), the unicharset
- file must be edited by hand to add these property description codes.
+ The unicharset will be put into the file './unicharset' if no output filename is provided.

- *NOTE* The unicharset file must be regenerated whenever inttemp, normproto
- and pffmtable are generated (i.e. they must all be recreated when the box
- file is changed) as they have to be in sync. This is made easier than in
- previous versions by running unicharset_extractor before mftraining and
- cntraining, and giving the unicharset to mftraining.
+ *NOTE* Use the appropriate norm_mode based on the language.

SEE ALSO
--------