Commit ecfee53

Don't set page segmentation mode for hocr, pdf and tsv configs
Setting the page segmentation mode in those config files gives unexpected results: the text recognized with no config, or with only txt, changes as soon as txt is combined with any of hocr, pdf or tsv.

In a test set of nearly 200 pages from historical books, segmentation mode 1 is typically slightly better than the default, but in some cases it is much worse. The user should therefore be able to decide which page segmentation mode is best.

Reproducing old results for hocr, pdf or tsv now requires an explicit `--psm 1`.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
1 parent b15fbf1 commit ecfee53

3 files changed: 0 additions, 3 deletions

tessdata/configs/hocr

-1
@@ -1,3 +1,2 @@
 tessedit_create_hocr 1
-tessedit_pageseg_mode 1
 hocr_font_info 0

tessdata/configs/pdf

-1
@@ -1,2 +1 @@
 tessedit_create_pdf 1
-tessedit_pageseg_mode 1

tessdata/configs/tsv

-1
@@ -1,2 +1 @@
 tessedit_create_tsv 1
-tessedit_pageseg_mode 1
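
The commit message says that old hocr, pdf or tsv results now need an explicit `--psm 1`. The sketch below only assembles the corresponding command lines and does not run Tesseract. The helper name `build_tesseract_cmd` and the file names `page.png` and `page` are hypothetical, and the argument order assumes the usual `tesseract imagename outputbase [options] [configfile...]` usage.

```python
from typing import List, Optional, Sequence


def build_tesseract_cmd(
    image: str,
    output_base: str,
    configs: Sequence[str],
    psm: Optional[int] = None,
) -> List[str]:
    """Assemble a tesseract argument list (hypothetical helper, not part of Tesseract)."""
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        # Page segmentation mode is passed explicitly, since the
        # hocr/pdf/tsv configs no longer set tessedit_pageseg_mode.
        cmd += ["--psm", str(psm)]
    # Config names such as "txt", "hocr", "pdf" or "tsv" come last.
    cmd += list(configs)
    return cmd


# After this commit: hocr output with the default page segmentation mode.
new_default = build_tesseract_cmd("page.png", "page", ["hocr"])

# Reproducing pre-commit hocr output: request mode 1 explicitly.
old_behaviour = build_tesseract_cmd("page.png", "page", ["hocr"], psm=1)

print(" ".join(old_behaviour))  # prints: tesseract page.png page --psm 1 hocr
```

The resulting list can be handed to `subprocess.run` on a machine where Tesseract is installed. Leaving `psm` unset keeps the segmentation mode consistent between txt-only runs and runs that also request hocr, pdf or tsv.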
