IndexError in MsWordDocumentBackend when processing text without equations #1284
Comments
You can reproduce this using the docx document in this attachment
I think I've identified the problem. The main text is stripped of leading and trailing whitespace using text.strip(), but the entries in the equations array are not stripped the same way. When text_tmp is split using an equation (eq) as the delimiter, eq may still contain extra whitespace and so no longer matches the stripped text_tmp. In that case split returns a single-element list, and accessing text_tmp.split(eq, maxsplit=1)[1] raises an IndexError.
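To illustrate the mismatch described above, here is a minimal sketch in plain Python. It is not the docling code or the fix that was merged; split_on_equation is a hypothetical helper, and the sample strings are made up. It only shows why the unguarded split fails and one way to guard it, assuming the equation text can be stripped the same way as the main text.

```python
def split_on_equation(text: str, eq: str) -> tuple[str, str]:
    """Hypothetical helper: split text around the first occurrence of eq.

    Returns (before, after). If eq is not found, returns (text, "")
    instead of raising.
    """
    needle = eq.strip()  # normalize the delimiter the same way as the text
    if not needle:
        return text, ""
    before, found, after = text.partition(needle)
    if not found:
        return text, ""
    return before, after


# Made-up sample data: the text is stripped, the equation is not.
text_tmp = "The result is E = mc^2 ".strip()
eq = "E = mc^2 "  # the trailing space survives because eq was never stripped

# The unguarded pattern: the delimiter is absent, so split returns one element.
parts = text_tmp.split(eq, maxsplit=1)
assert len(parts) == 1  # parts[1] would raise IndexError here

# The guarded version finds the equation once the delimiter is stripped.
before, after = split_on_equation(text_tmp, eq)
```

Using str.partition rather than indexing into the result of split means the "not found" case is an explicit branch instead of an exception. Whether returning the whole text unchanged is the right fallback depends on how the caller uses the pieces.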
@rateixei The previous document issue was resolved; thanks for the quick fix! However, when testing another DOCX file with the latest fixed code, I still encountered problems. Here’s the document for reference:
@yssAI Thanks for checking again! This new file was a treasure trove of edge cases, which was extremely useful. Please check the export here to see if you spot anything undesired. You can also check the draft PR #1295 for the new logic. If you have any other interesting examples, I'd be happy to take a look!
Thank you for your quick response! My project is still ongoing, so I’ll definitely provide feedback if I run into any other issues. Appreciate your support! |
Bug
When processing certain Word documents (.docx), the handle_text_elements method in msword_backend.py throws an IndexError: list index out of range when attempting to split text that doesn't contain equation markers.
Steps to reproduce
Docling version
docling 2.28.2
docling-core 2.24.0
docling-ibm-models 3.4.1
docling-parse 4.0.0
Python version
python==3.12.9
In some cases, an extra space in the equation causes the error.