Skip to content

[test][rfct] propagate pytest #779

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 5, 2022

Conversation

M3ssman
Copy link
Contributor

@M3ssman M3ssman commented Dec 29, 2021

Here are some more refactorings for both test_ocrd_page and test_workspace modules.
The major challenge was to get where paths-n-files actually are located, which is a quite hard for someone not familiar with the pushd_popd functionalities. Further, I divided test cases where beside regular assertions also exceptions were risen to have the exception handling being placed in it's own, separate test case.

One point that still puzzles me is the behavior of the ocrd_workspace remove_...- calls with the force parameter for invalid data (i.e., data which doesn't exist at all). I know we had already discussions about this, but still, I don't get it, since one might check the returned value from these calls, which in will result into 'None' for invalid data but are considered success.

Please take into account all the TODO marks I've placed. They can be dropped if the reasons clear out.

@kba
Copy link
Member

kba commented Jan 12, 2022

One point that still puzzles me is the behavior of the ocrd_workspace remove_...- calls with the force parameter for invalid data (i.e., data which doesn't exist at all). I know we had already discussions about this, but still, I don't get it, since one might check the returned value from these calls, which in will result into 'None' for invalid data but are considered success.

There's different conditions here that are relevant for remove_file:

  1. Does the mets:file referenced by the kwargs of remove_file exist or not in the METS XML?
  2. Does the local file in the file system exist

With force=True, the method will succeed if either of these is False. Will only return None for the case that a file was not found but the ID was a regular expression IIRC.

And for remove_file_group:

  1. Does the fileGrp exist or not in the METS-XML
  2. Do all the files match the conditions for remove_file?

With force=True, will succeed if either of these is False.

Without force, you might be left with inconsistent data, e.g. an expected file to be deleted was not there (which might be because using the wrong mets:file/@ID for removal) or one of the AlternativeImages was not available. And for remove_file, without force, might lead to a half-deleted file group because one of the files in it was not found.

Would separating these aspects of concerns into different (and less generically named) kwargs help?

@M3ssman
Copy link
Contributor Author

M3ssman commented Jan 17, 2022

If there's a fight on two fronts behind the scenes, I guess this can be reflected by a attenzione:just proposing returned tuple of two numbers, where the first counts deleted Metadata-structures an the 2nd counts deleted physical files - both should match in an ideal world.

The use case with the half-life files because of a broken removal and therefore missing or ghost items sounds very reasonable. When processing large batches in automated fashion, this is a rather familiar scenario.

Instead of modifying the API, I prefer to return more meaningful responses.
If you can get along with this, too, I'll do an update.

@kba kba requested a review from bertsky February 2, 2022 10:13
@M3ssman
Copy link
Contributor Author

M3ssman commented Feb 3, 2022

To clear out my latest comment about the parameter force on example both remove_file/remove_file_group:

It checks if a) the METS itself matches the provided condition and if b) there are some physical files that would be affected. That's two different things. Now I propose, instead of returning None, to return a tuple where the first element accords to the metadata, and the second to physical files.

Having a file 0001.jpg and a group MAX, you end up with 4 combinations:

  1. file referenced, but not physical existing => file removed from metadata: (1,0, actual using force: success)
  2. file not referenced, but physical existing => file removed from directory: (0,1, actual using force: success)
  3. file referenced, and physical existing => file removed from both: (1,1)
  4. file neither referenced, nor physical existing => nothing done (0,0, actual None)

This reflects all 4 behaviors which can be checked/tested independently.
With the current implementation one cannot check from outside if any inconsistencies were encountered.
But realize inconsistencies might reveal some different problems/errors as well.

If you follow me, I would adjust the implementations from the remove-calls this way. And since you did not differentiate this before, it wouldn't affect existing client-code.

@M3ssman
Copy link
Contributor Author

M3ssman commented Feb 3, 2022

This holds for removals of complete mets:fileGrp as well, where the behavior could even reflect how many single files belonging to this group were dropped. One might even think of an inconsistent situation, where x files are referenced in the METS, but actually y files do exists physically.

@kba
Copy link
Member

kba commented Feb 3, 2022

I am still not 100% sure wether I understand how conditions 2) and 4) might arise - if a file is not referenced in the metadata, it will never be deleted. We don't delete files not related to the workspace, at least we should not.

But you seem to have a clear idea on how you'd go about implementing it and having a more expressive return value instead of a void method is indeed backwards-compatible, so by all means, go ahead, it probably becomes clearer with a (proof-of-concept) implementation 👍

@M3ssman
Copy link
Contributor Author

M3ssman commented Feb 4, 2022

You're right concerning this, maybe these scenarios are somewhat too academical.

I looked again at Workspace.remove_file , and still the behavior and outcome is very different. The more I gaze, it feels like the parameters force and keep_file do overlap, at least when both are negated. Setting parameter force implicitly to True when the main object's property is set to overwrite feels like this parameter should also be an object property. Further, according to the doc, using the page_recursive and page_same_group one can push out similar images - a behavior that intersects very much with Workspace.remove_file_group. Since from method name I'd expect that a call of Workspace.remove_file removes at most a single file.

To get on with the overall topic of just transforming the tests, I've updated the PR from recent master and I guess we should get along with that before this freezes again.

Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Great work! Let's just make sure that all asserts are actually asserting and it runs in Python 3.7.

@kba kba merged commit 197437b into OCR-D:master Feb 5, 2022
@kba kba deleted the feat/rfct/pytest4 branch February 5, 2022 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants