v3 API: general XPath 2.0 mechanism, generateDS true reverse mapping, ocrd-filter #21

Closed
wants to merge 218 commits into from

bertsky
Owner

@bertsky bertsky commented Sep 16, 2024

I initially published the first version of the builtin processor ocrd-filter directly on OCR-D#1240, but since this is new functionality and involves lots of other changes, I rebased and split it off into this PR for easier reviewing.

The idea behind ocrd-filter is that the user gets to write powerful XPath expressions as runtime parameters, and the processor takes care of removing the matching segments from PAGE (including the ReadingOrder update, and optionally saving images of the removed segments for quick visual inspection).
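A minimal sketch of the removal step, not the actual ocrd-filter code: it uses only the standard library, drops the PAGE namespace for brevity, selects segments by id instead of by XPath, and assumes that remaining RegionRefIndexed entries get re-indexed. The function name remove_regions is hypothetical.

```python
import xml.etree.ElementTree as ET

def remove_regions(root, region_ids):
    """Remove regions by id and drop their ReadingOrder references (hypothetical helper)."""
    # ElementTree has no getparent(), so build a child -> parent lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = []
    for node in list(root.iter("TextRegion")):
        if node.get("id") in region_ids:
            parent_of[node].remove(node)
            removed.append(node.get("id"))
    # Drop reading-order references that point at removed regions.
    for ref in list(root.iter("RegionRefIndexed")):
        if ref.get("regionRef") in removed:
            parent_of[ref].remove(ref)
    # Assumption: indexes should stay contiguous after removal.
    for group in root.iter("OrderedGroup"):
        for index, ref in enumerate(group.findall("RegionRefIndexed")):
            ref.set("index", str(index))
    return removed

# Simplified, namespace-free stand-in for a PAGE document.
SAMPLE = (
    '<Page><ReadingOrder><OrderedGroup id="g">'
    '<RegionRefIndexed index="0" regionRef="r1"/>'
    '<RegionRefIndexed index="1" regionRef="r2"/>'
    '<RegionRefIndexed index="2" regionRef="r3"/>'
    '</OrderedGroup></ReadingOrder>'
    '<TextRegion id="r1"/><TextRegion id="r2"/><TextRegion id="r3"/></Page>'
)
root = ET.fromstring(SAMPLE)
removed = remove_regions(root, {"r2"})
remaining_refs = [(r.get("index"), r.get("regionRef")) for r in root.iter("RegionRefIndexed")]
remaining_regions = [r.get("id") for r in root.iter("TextRegion")]
```

In the processor itself, the set of ids would come from evaluating the user's XPath expression rather than being passed in directly.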

To make this as expressive as possible, we need

  1. custom functions (like pixelarea or concatenated textequiv, but more to come surely)
  2. XPath 2.0 operators and functions

For the former I initially (see first commits) experimented with lxml's builtin etree.FunctionNamespace, but this turned out to be quite buggy. (It crashes with segmentation faults when using the global namespace registration with a namespace prefix, even in single-threaded mode. It did work using local namespace registration, though.) I briefly looked at SaxonC-HE, but found it does not allow for extension functions in Python (only in Java). So I ended up with pure-Python elementpath, which is slower, but really powerful and easy to use.
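To illustrate the kind of computation a custom function such as pixelarea might perform, here is a sketch in plain Python using the standard library only. It is an assumption about what such a function computes (the polygon area of Coords/@points via the shoelace formula), not the implementation in this PR; the names polygon_area and regions_smaller_than are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

def polygon_area(points: str) -> float:
    """Area of a polygon given as a PAGE-style 'x1,y1 x2,y2 ...' string (shoelace formula)."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    total = 0
    # Pair each vertex with its successor, wrapping around to the first vertex.
    for (x1, y1), (x2, y2) in zip(coords, coords[1:] + coords[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def regions_smaller_than(root, min_area):
    """Return ids of TextRegions whose outline covers less than min_area pixels."""
    ns = {"pc": PAGE_NS}
    small = []
    for region in root.iterfind(".//pc:TextRegion", ns):
        coords = region.find("pc:Coords", ns)
        if coords is not None and polygon_area(coords.get("points")) < min_area:
            small.append(region.get("id"))
    return small

SAMPLE = f"""<PcGts xmlns="{PAGE_NS}"><Page>
<TextRegion id="r1"><Coords points="0,0 100,0 100,50 0,50"/></TextRegion>
<TextRegion id="r2"><Coords points="0,0 4,0 4,3 0,3"/></TextRegion>
</Page></PcGts>"""
root = ET.fromstring(SAMPLE)
small_ids = regions_smaller_than(root, 100)
```

Registered as an XPath extension function, the same computation could then be used inside a predicate, so the user would not need to write any Python.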

Then I figured it would be really helpful (for ocrd-filter, but also other processors) if our OcrdPage.revmap actually did contain a reverse lookup mechanism (from tree node to generated DOM object). And since generateds (after v2.40) does now support that, while it does not have the problems with simple type enums anymore, I decided to try and update ocrd_page_generateds again – and it worked. So now we can really do page.xpath() → page.revmap → page.pcgts.
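The reverse lookup idea can be sketched as follows, using only the standard library. This is a conceptual illustration, not the generateDS API: Region and build_with_revmap are hypothetical stand-ins for the generated binding classes and the mapping they maintain.

```python
import xml.etree.ElementTree as ET

class Region:
    """Hypothetical stand-in for a generated binding class."""
    def __init__(self, node):
        self.id = node.get("id")
        self.type = node.get("type")

def build_with_revmap(root):
    """Create one object per TextRegion node and remember node -> object."""
    objects = []
    revmap = {}
    for node in root.iter("TextRegion"):
        obj = Region(node)
        objects.append(obj)
        revmap[node] = obj  # elements are hashable by identity
    return objects, revmap

root = ET.fromstring(
    '<Page><TextRegion id="r1" type="paragraph"/>'
    '<TextRegion id="r2" type="footnote"/></Page>'
)
objects, revmap = build_with_revmap(root)
# Query on the tree, then map each matching node back to its object.
hits = [revmap[node] for node in root.findall(".//TextRegion[@type='footnote']")]
```

The point of the mapping is that a query result (tree nodes) can be turned back into the objects that processors actually work with, without re-parsing or searching by id.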

I placed the first two extension functions under ocrd_models.xpath_functions (as we might also want to write some for METS or MODS or whatever), but this is just an idea ATM.

Besides more XPath extension functions (e.g. a function for the ratio of foreground pixels when binarized derived images are present) I am also planning on extending generated PcGtsType via user methods directly (e.g. a method for TextEquiv consistency across the hierarchy, and another for Coords projection)...

kba and others added 30 commits September 11, 2024 11:38
Signed-off-by: Stefan Weil <sw@weilnetz.de>
…e, avoid buggy lxml global registration mechanism
… 'query'), use 'elementpath.XPathParser.external_function' with global registration instead of 'etree.FunctionNamespace' with local extension
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
@kba kba self-requested a review November 18, 2024 14:17
Collaborator

@kba kba left a comment

Thanks, LGTM. Also good to be up-to-date with generateDS again. Testing now.

@bertsky
Owner Author

bertsky commented Jan 6, 2025

Note to self: we need to know whether refactoring the AlternativeImage selection logic out of Workspace.image_from_* into a stateless function (without any download_file behaviour) would break any existing API in the future, hence whether it must be done prior to 3.0 or can be done later.

@kba kba mentioned this pull request Jan 8, 2025
@bertsky
Owner Author

bertsky commented Jan 15, 2025

> we need to know whether refactoring the AlternativeImage selection logic out of Workspace.image_from_* into a stateless function (without any download_file behaviour) would break any existing API in the future, hence whether it must be done prior to 3.0 or can be done later.

@kba I don't think we need to break anything here in the future. The methods Workspace.image_from_page and Workspace.image_from_segment could be re-implemented as follows:

  • delegate to new generateds user methods PageType.get_image and [*Region|TextLine|Word]Type.get_image,
    but pass as new kwarg resolve a function with the following definition:
    def resolve(image_url):
        try:
            f = next(self.mets.find_files(local_filename=str(image_url)))
            return f.local_filename
        except StopIteration:
            try:
                f = next(self.mets.find_files(url=str(image_url)))
                return self.download_file(f).local_filename
            except StopIteration:
                with download_temporary_file(image_url) as f:
                    return f.name
  • replace calls to resolve_image_exif by calls to exif_from_filename directly,
    but allow overriding filename via resolve
  • replace calls to resolve_image_as_pil by calls to a new function image_from_filename,
    which merely contains the parts that do Image.open() and .load() to give up the FD, as well as array conversion for the badly supported color modes I and F,
    but allow overriding filename via resolve
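The delegation pattern in the list above can be sketched with a stateless function that takes the resolver as a keyword argument. This is only an illustration of the design: read_image_bytes is a hypothetical name, reading raw bytes stands in for the actual image loading, and the dictionary lookup stands in for the METS-based resolve shown above.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

def read_image_bytes(image_url: str,
                     resolve: Optional[Callable[[str], str]] = None) -> bytes:
    """Stateless loader: the caller decides how a URL maps to a local file."""
    # Without a resolver, treat the reference as a local path already.
    filename = resolve(image_url) if resolve else image_url
    return Path(filename).read_bytes()

with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "page.bin"
    local.write_bytes(b"fake image data")
    # Stand-in for a workspace-aware resolver (METS lookup, download, ...).
    lookup = {"https://example.org/page.png": str(local)}
    data = read_image_bytes("https://example.org/page.png",
                            resolve=lookup.__getitem__)
    direct = read_image_bytes(str(local))
```

Keeping the download and lookup behaviour in the injected callable is what lets the function itself stay free of workspace state.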

@bertsky
Owner Author

bertsky commented Jan 20, 2025

closing – see OCR-D#1300 and OCR-D#1301

@bertsky bertsky closed this Jan 20, 2025
5 participants