Added JPEG quality option parameter (-c jpg_quality=n) #1265

Merged
merged 3 commits into from
Sep 18, 2018

Conversation

tleegwater

I needed to be able to specify the JPEG quality level in PDF output files, so I made this parameter optional. Default JPEG quality will still be 85.

@amitdo
Collaborator

amitdo commented Jan 11, 2018

bce2cd5f331b66

@amitdo
Collaborator

amitdo commented Jan 11, 2018

If my memory serves me well, @jbreiden (author of the pdf renderer code) didn't like this feature.

@tleegwater
Author

tleegwater commented Jan 11, 2018

@amitdo
Collaborator

amitdo commented Jan 11, 2018

https://web.archive.org/web/20150413012101/https://code.google.com/p/tesseract-ocr/issues/detail?id=1300
@jbreiden commented:

I don't have a strong opinion about tessedit_pdf_jpg_quality. It might cause some confusion. But at least it will be harmless.

@jbreiden
Contributor

jbreiden commented Jan 11, 2018

Let's talk about this for a bit. Tesseract's PDF module tries really hard to inline images instead of transcoding them, which means that the JPEG quality parameter should be rarely used. @tleegwater can you tell us what sort of image files you are feeding to Tesseract? Some sort of TIFF? Maybe attach one if possible?

I mainly want to make sure we don't have an accidental transcode situation.

@tleegwater
Author

I'm feeding TIFF files to Tesseract. And I'm aware that as soon as I feed it JPEG, or something else that's supported, Tesseract will try to inline the input file.
But even if someone feeds it a JPEG and sets the quality parameter to 50, Tesseract still inlines the image without transcoding to q=50, or q=85 for that matter.
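
To illustrate the point above, here is a minimal C++ sketch of the behavior as it is described in this thread. It is a simplified model and not Tesseract's or Leptonica's actual code: `InputFormat`, `kInlined`, and `EffectiveJpegQuality` are hypothetical names, only the two input types mentioned here are modeled, and it assumes the TIFF is an image that ends up JPEG-encoded in the PDF.

```cpp
// Hypothetical model of the behavior described in this thread.
// None of these names exist in Tesseract.
enum class InputFormat { kJpeg, kUncompressedTiff };

// Sentinel meaning "image is inlined as-is, no JPEG re-encoding happens".
constexpr int kInlined = -1;

// Returns the JPEG quality that would take effect for the given input,
// or kInlined when the input is embedded without transcoding.
int EffectiveJpegQuality(InputFormat format, int jpg_quality) {
  if (format == InputFormat::kJpeg) {
    return kInlined;  // JPEG input: the quality setting is ignored.
  }
  // Assumption: this input is transcoded to JPEG, so the setting applies.
  return jpg_quality;
}
```

Under this model, `EffectiveJpegQuality(InputFormat::kJpeg, 50)` yields `kInlined`, while `EffectiveJpegQuality(InputFormat::kUncompressedTiff, 50)` yields 50.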

@jbreiden
Contributor

That's fine, and I'm fine with this changelist.

Can you please tell me what flavor of TIFF you are working with? Uncompressed? LZW? Pack? Deflate? JPEG? CCITT Group 4? You can find out with tiffinfo from libtiff, or identify -verbose from ImageMagick. I'm considering supporting a few more image formats for inlining, and this might influence prioritization.

@tleegwater
Author

We use no compression at all for our TIFFs. We are generating them ourselves, so there's no chance we'll get anything other than uncompressed.

@jbreiden
Contributor

Got it. Please be aware that the built-in JPEG encoder is standard libjpeg. If you ever want more precise control (for example, turning off chroma subsampling) or a fancier encoder like Guetzli, then JPEG-encode the images before feeding them to Tesseract.

https://research.googleblog.com/2017/03/announcing-guetzli-new-open-source-jpeg.html

David Thornley added 2 commits September 13, 2018 16:03
Merge branch 'master' into jpg_quality_option

* master: (577 commits)
  fix issue tesseract-ocr#1889
  Add badges for download , licence and lgtm
  Replace macro MINGW by __MINGW32__
  EquationDetectBase: Define virtual destructor in .cpp file
  BlobGrid: Define virtual destructor in .cpp file
  GridBase: Define virtual destructor in .cpp file
  AlignedBlob: Define virtual destructor in .cpp file
  TransposedArray: Define virtual destructor in .cpp file
  IndexMapBiDi: Define virtual destructor in .cpp file
  Add missing include file (fixes linker error for Visual Studio)
  NthItemTest: Add definition for virtual destructor
  HeapTest: Add definition for virtual destructor
  IcuErrorCode: Define virtual destructor in .cpp file
  Validator: Define virtual destructor in .cpp file
  Dawg: Define virtual destructor in .cpp file
  CUtil: Define virtual destructor in .cpp file
  IndexMap: Define virtual destructor in .cpp file
  CCUtil: Define virtual destructor in .cpp file
  MATRIX: Define virtual destructor in .cpp file
  CCStruct: Define virtual destructor in .cpp file
  ...
@zdenop zdenop merged commit 62a5e8c into tesseract-ocr:master Sep 18, 2018
@amitdo
Collaborator

amitdo commented Oct 9, 2018

I think the API break is unnecessary.

bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api)
can use api->GetIntVariable("jpg_quality", &jpg_quality);
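
A minimal sketch of that suggestion, using stand-in classes so it compiles without Tesseract. `FakeApi`, `FakePdfRenderer`, `SetIntVariable`, and `JpegQualityFor` are hypothetical names; only the `GetIntVariable(name, &value)` call shape is taken from the comment above. The point is that the renderer reads the quality from the API object at image time, so its constructor signature does not need to change.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for TessBaseAPI, reduced to one parameter lookup.
class FakeApi {
 public:
  void SetIntVariable(const std::string& name, int value) {
    ints_[name] = value;
  }
  // Same call shape as quoted above: writes the value and returns true if
  // the parameter is known, otherwise leaves *value alone and returns false.
  bool GetIntVariable(const char* name, int* value) const {
    auto it = ints_.find(name);
    if (it == ints_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::map<std::string, int> ints_;
};

// Hypothetical renderer whose constructor takes no quality argument.
class FakePdfRenderer {
 public:
  // Returns the JPEG quality that would be used for the current image.
  int JpegQualityFor(const FakeApi* api) const {
    int jpg_quality = 85;  // Default stated in the pull request description.
    api->GetIntVariable("jpg_quality", &jpg_quality);
    return jpg_quality;
  }
};
```

With no parameter set, `JpegQualityFor` returns 85; after `SetIntVariable("jpg_quality", 50)` on the same `FakeApi`, it returns 50.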

@zdenop
Contributor

zdenop commented Oct 9, 2018

Thanks for pointing out this issue. Which is better for maintaining compatibility of the C API:

  1. TESS_API TessResultRenderer* TESS_CALL TessPDFRendererCreate(const char* outputbase, const char* datadir, BOOL textonly, int jpg_quality=85);
  2. or to define TessPDFRendererCreate (the old call) and TessPDFRendererCreate2 (new call with jpg_quality )?
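
For reference, option 2 could look roughly like the sketch below. Note that C itself has no default arguments, so a declaration like the one in option 1 would only be usable from C++ callers. The names `Renderer`, `RendererCreate`, `RendererCreate2`, and `RendererDelete` are hypothetical, and the signatures are simplified: the real function also takes `outputbase`, `datadir`, and `textonly`.

```cpp
// Hypothetical handle standing in for TessResultRenderer.
struct Renderer {
  int jpg_quality;
};

extern "C" {

// New entry point that accepts the extra argument.
Renderer* RendererCreate2(int jpg_quality) {
  return new Renderer{jpg_quality};
}

// Old entry point keeps its original signature and forwards the default,
// so existing callers keep compiling and linking unchanged.
Renderer* RendererCreate() {
  return RendererCreate2(85);
}

// Releases a handle returned by either create function.
void RendererDelete(Renderer* renderer) {
  delete renderer;
}

}  // extern "C"
```

Existing code calling `RendererCreate()` gets quality 85, and new code can call `RendererCreate2(50)` explicitly.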

@amitdo
Collaborator

amitdo commented Oct 9, 2018

It's not just a C API break; it's also a C++ API break.
There are at least 2 bindings that use the C++ API directly (Python, R).

@amitdo
Collaborator

amitdo commented Oct 9, 2018

As I said, I think the C++ code can be changed so that it does not break the previous API.
Then you can revert the C API change.

@zdenop
Contributor

zdenop commented Oct 9, 2018

OK, I got it. At first I thought about extending the API, but that is not needed because the JPEG quality is handled by a Tesseract parameter.

zdenop added a commit that referenced this pull request Oct 9, 2018
@zdenop
Contributor

zdenop commented Oct 9, 2018

Done. Please check.

@amitdo
Collaborator

amitdo commented Oct 9, 2018

I didn't test it, but the change LGTM.

zdenop added a commit that referenced this pull request Oct 9, 2018
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Remove code for _MSC_VER < 1900
  keep API compatibility with #1265
  Update googletest submodule to release v1.8.1
  Update test submodule
  Always use isascii() with isspace()
  Avoid crash with --psm 0 and LSTM traineddata
  SVPaint: Remove empty block
  Classify: Don't hide debug parameter
  UNICHARMAP: Remove comparison which is always false
  svpaint: Change a variable from global to local
  pgedit: remove unused declaration of display_bln_lines
  Plumbing: Remove comparison which is always false
  Release candidate 2
  use pdf L_FLATE_ENCODE only for png input; fixes #1961
@amitdo amitdo added the PDF label Oct 2, 2022