all: add dpi parameter as manual override to image metadata #108

bertsky · 2020-01-23T23:06:37Z

Fixes #102.

This does influence segmentation and OSD a lot, but I am not so sure about (LSTM) recognition. Maybe we should document this somewhere: it is one of the very few parameters we have to influence the quality of Tesseract's layout analysis (i.e. deliberately setting a fake value can improve results).

Note that Tesseract is optimised for modern fonts (which are typically smaller than historic ones) and is thus biased against historic prints. So setting a DPI higher than the factual value could be a useful recommendation for block (and maybe even line) segmentation. But this is still conjecture, and I have only little evidence supporting it; someone would have to make systematic measurements first.
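
For illustration, the override logic could look roughly like the sketch below. It is not the code from this PR. The helper name `effective_dpi` is hypothetical, and it assumes that a positive parameter value wins over the image metadata while zero or a negative value means "no override".

```python
from typing import Optional, Tuple, Union

Number = Union[int, float]


def effective_dpi(metadata_dpi: Optional[Union[Number, Tuple[Number, Number]]],
                  dpi_override: Number = 0) -> Optional[int]:
    """Pick the pixel density to report to Tesseract.

    Assumption (not confirmed by the PR text): a positive ``dpi_override``
    replaces whatever the image metadata says; zero or negative disables it.
    """
    if dpi_override > 0:
        # Manual override, possibly a deliberately "fake" value.
        return round(dpi_override)
    if metadata_dpi is None:
        # Nothing known: leave it to Tesseract to estimate the resolution.
        return None
    if isinstance(metadata_dpi, tuple):
        # Image metadata often carries separate (x, y) densities;
        # this sketch simply uses the smaller of the two.
        metadata_dpi = min(metadata_dpi)
    if metadata_dpi <= 0:
        # Treat nonsensical metadata as unknown.
        return None
    return round(metadata_dpi)
```

With tesserocr, a non-`None` result could then be handed to the engine, for example via `api.SetVariable('user_defined_dpi', str(dpi))`. Whether the processors here do it that way is not stated in this thread.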

codecov · 2020-01-23T23:11:33Z

Codecov Report

Merging #108 into master will decrease coverage by 1.57%.
The diff coverage is 7.14%.

@@            Coverage Diff             @@
##           master     #108      +/-   ##
==========================================
- Coverage   39.14%   37.57%   -1.58%     
==========================================
  Files           9        9              
  Lines         894      942      +48     
  Branches      190      204      +14     
==========================================
+ Hits          350      354       +4     
- Misses        492      528      +36     
- Partials       52       60       +8

| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/segment_table.py | `0% <0%> (ø)` | ⬆️ |
| ocrd_tesserocr/crop.py | `12.93% <0%> (-0.84%)` | ⬇️ |
| ocrd_tesserocr/deskew.py | `16.19% <0%> (-1.16%)` | ⬇️ |
| ocrd_tesserocr/segment_word.py | `72.88% <12.5%> (-7.89%)` | ⬇️ |
| ocrd_tesserocr/segment_line.py | `70.76% <12.5%> (-6.82%)` | ⬇️ |
| ocrd_tesserocr/recognize.py | `50.72% <12.5%> (-0.96%)` | ⬇️ |
| ocrd_tesserocr/segment_region.py | `57.03% <12.5%> (-2.48%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8ccde94...3a684a5.

bertsky · 2020-01-24T12:29:29Z

@kba when you merge, please don't forget to update CHANGELOG.md this time.

As for versioning, I think 108, 109 and 110 are all new features, therefore I suggest 0.8.0.

kba · 2020-01-24T14:11:59Z

> please don't forget to update CHANGELOG.md this time

I'll try to, but I'm grateful for contributions. You have the best overview to outline your PR in broad strokes. Updated for 0.7.0 in master.

bertsky · 2020-01-24T14:26:44Z

> I'll try to, but I'm grateful for contributions. You have the best overview to outline your PR in broad strokes.

ok, first, for 0.7.0, I believe these need to move from Added to Changed:

- interpret overwrite_regions more consistently, improve segmentation #104
- annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks, improve segmentation #104
- no separators and noise regions in reading order, improve segmentation #104

then for 0.8.0, how about this:

Added:
- parameter dpi overriding pixel density meta-data in all processors, Additional parameter for DPI override #102
- parameters char_whitelist, char_blacklist, char_unblacklist in recognize, recognize: expose character white/blacklisting parameters #107
- set lstm_choice_mode=2 when textequiv_level=glyph to get character alternatives from LSTM in recognize, also fill PAGE's glyphs and its variants and confidences via GetIterator() in recognize.py #7

bertsky added the enhancement label (New feature or request) Jan 23, 2020

bertsky requested review from kba and wrznr January 23, 2020 23:06

stweil reviewed Jan 24, 2020

ocrd_tesserocr/crop.py (outdated)

kba approved these changes Jan 24, 2020

ocrd_tesserocr/deskew.py (outdated)

ocrd_tesserocr/recognize.py (outdated)

all: add dpi parameter as manual override to image metadata

3a684a5

bertsky force-pushed the dpi-overrides branch from 10871e0 to 3a684a5 Compare January 24, 2020 12:22

kba merged commit 7d7315d into OCR-D:master Jan 24, 2020

bertsky deleted the dpi-overrides branch February 21, 2020 16:52
