IndexError in MsWordDocumentBackend when processing text without equations #1284

Closed
yssAI opened this issue Apr 2, 2025 · 6 comments · Fixed by #1295
Assignee: rateixei
Labels: bug (Something isn't working), docx (issue related to docx backend)

Comments

yssAI commented Apr 2, 2025

Bug

When processing certain Word documents (.docx), the handle_text_elements method in msword_backend.py raises IndexError: list index out of range when it attempts to split text that doesn't contain equation markers.

Steps to reproduce

  1. Process a Word document containing paragraphs without equation markers (EQ)
  2. The error occurs in docling/backend/msword_backend.py, line 380
  3. The code assumes all text elements contain equations and tries to split on "EQ"
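
The failure mode in step 3 can be reproduced in plain Python, independent of docling. The sketch below is not the code from msword_backend.py; `text_after_marker` is a hypothetical helper that only mirrors the unguarded `split(...)[1]` pattern described in this report.

```python
def text_after_marker(text: str, marker: str) -> str:
    """Return the text following the first occurrence of marker.

    Hypothetical helper, not docling code: it mirrors the unguarded
    pattern split(marker, maxsplit=1)[1] described in this report.
    """
    # If marker is absent, split() returns a one-element list,
    # so index 1 does not exist and IndexError is raised.
    return text.split(marker, maxsplit=1)[1]


# Marker present: the split yields two parts and index 1 is valid.
print(repr(text_after_marker("before EQ after", "EQ")))

# Marker absent: the same call fails with "list index out of range".
try:
    text_after_marker("a paragraph with no marker", "EQ")
except IndexError as exc:
    print(f"IndexError: {exc}")
```

Any paragraph that does not contain the delimiter takes the second path, which matches the traceback reported here.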

Docling version

docling 2.28.2
docling-core 2.24.0
docling-ibm-models 3.4.1
docling-parse 4.0.0

Python version

python==3.12.9

In some cases, an extra space in the equation causes the error.

yssAI added the bug label on Apr 2, 2025

yssAI commented Apr 2, 2025

test.docx

You can reproduce this with the .docx document attached above.

yssAI commented Apr 2, 2025

I think I've identified the problem.

The main text is stripped of leading and trailing whitespace using text.strip(), but the entries in the equations array are not stripped in the same way.

When text_tmp is split using an equation (eq) as the delimiter, the split finds no match if eq contains extra whitespace that is no longer present in the stripped text_tmp.

In that case split returns a single-element list, so accessing text_tmp.split(eq, maxsplit=1)[1] raises an IndexError.
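
A minimal sketch of the mismatch described above, and one way to guard against it. This is an illustration under stated assumptions, not the fix in #1268 or #1295: `split_around_equation` is a hypothetical helper, the sample strings are made up, and retrying with the stripped equation is only one possible strategy.

```python
def split_around_equation(text: str, eq: str) -> tuple[str, str]:
    """Split text around the first occurrence of eq.

    Hypothetical helper for illustration only. It tries the equation
    as given, then with surrounding whitespace stripped. Returns
    (before, after); if eq is not found, returns (text, "").
    """
    for candidate in (eq, eq.strip()):
        if not candidate:
            continue  # partition() rejects an empty separator
        before, sep, after = text.partition(candidate)
        if sep:  # sep is empty when candidate was not found
            return before, after
    return text, ""


paragraph = "  Energy is E = mc^2 "
text = paragraph.strip()   # the paragraph text loses its trailing space
eq = "E = mc^2 "           # the equation string keeps its trailing space

print(eq in text)                        # the unstripped equation no longer matches
print(split_around_equation(text, eq))   # the stripped equation does
```

Using `str.partition` instead of `split(...)[1]` means a missing delimiter produces an empty separator rather than an exception, so the caller can decide how to handle the no-match case.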

rateixei commented Apr 2, 2025

@yssAI Thanks for bringing this up! I had observed this myself and had it in the pipeline. This should be fixed in PR #1268. I'm uploading here the markdown output of your test file.

1284_test.md

rateixei self-assigned this on Apr 2, 2025
rateixei added the docx label on Apr 2, 2025

yssAI commented Apr 3, 2025

@rateixei The previous document issue was resolved—thanks for the quick fix! However, when testing another DOCX file with the latest fixed code, I still encountered problems. Here’s the document for reference:

38213-i40.docx

rateixei commented Apr 3, 2025

@yssAI Thanks for checking again! This new file was a treasure trove of edge cases, which was extremely useful. Please check the export here to see if you spot anything undesired. You can also check draft PR #1295 for the new logic. If you have any other interesting examples, I'd be happy to take a look!

38213-i40.md

yssAI commented Apr 7, 2025

Thank you for your quick response! My project is still ongoing, so I’ll definitely provide feedback if I run into any other issues. Appreciate your support!
