IndexError in MsWordDocumentBackend when processing text without equations #1284

yssAI · 2025-04-02T08:53:58Z

Bug

When processing certain Word documents (.docx), the handle_text_elements method in msword_backend.py raises IndexError: list index out of range when it tries to split text that doesn't contain equation markers.

Steps to reproduce

Process a Word document containing paragraphs without equation markers (EQ)
The error occurs in docling/backend/msword_backend.py, line 380
The code assumes all text elements contain equations and tries to split on "EQ"
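
The failure mode described in these steps can be shown in isolation. This is a minimal sketch in plain Python, not the actual docling code: the sample text and variable names are made up for illustration, and only the "EQ" marker comes from the report above.

```python
# Illustrative only: this is not the code in msword_backend.py.
text = "A plain paragraph with no marker in it."

# str.split returns a one-element list when the delimiter is absent.
parts = text.split("EQ", maxsplit=1)
print(parts)

try:
    remainder = parts[1]  # assumes the delimiter was found
except IndexError as exc:
    print(f"IndexError: {exc}")
```

Because split never fails on a missing delimiter, the error only appears at the later [1] lookup, which is why the traceback points at the indexing line.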

Docling version

docling 2.28.2
docling-core 2.24.0
docling-ibm-models 3.4.1
docling-parse 4.0.0

Python version

python==3.12.9

In some cases, an extra space in the equation causes the error.

yssAI · 2025-04-02T08:59:11Z

test.docx

You can reproduce this using the docx document in this attachment

yssAI · 2025-04-02T09:18:33Z

I think I've identified the problem.

The main text is stripped of leading and trailing whitespace with text.strip(), but the entries in the equations array are not stripped the same way.

When text_tmp is split using an equation (eq) as the delimiter, the split finds no match because eq may still carry extra whitespace that is no longer present in the stripped text_tmp.

This leads to an IndexError when trying to access text_tmp.split(eq, maxsplit=1)[1] if the split fails to find a match.
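
One way to guard against this mismatch is sketched below. It is an assumption about how the problem could be handled, not the fix that was merged: split_around_equations is a hypothetical helper, and only the names text_tmp and eq echo the description above. It strips each equation the same way as the text and uses str.partition, which reports a missing delimiter through an empty separator instead of a short list.

```python
def split_around_equations(text, equations):
    """Hypothetical helper: split text into text and equation pieces."""
    text_tmp = text.strip()
    pieces = []
    for eq in equations:
        # Normalize the equation the same way as the text.
        eq_clean = eq.strip()
        if not eq_clean:
            continue
        before, sep, after = text_tmp.partition(eq_clean)
        if not sep:
            # Equation not found: skip it rather than index into a split.
            continue
        if before.strip():
            pieces.append(before.strip())
        pieces.append(eq_clean)
        text_tmp = after
    if text_tmp.strip():
        pieces.append(text_tmp.strip())
    return pieces


# The equation has a trailing space that the stripped text no longer has.
print(split_around_equations("Then x = 1 ", ["x = 1 "]))
# An equation that never appears in the text is ignored.
print(split_around_equations("no math here", ["a + b"]))
```

Skipping an unmatched equation silently is a design choice made for this sketch; a real backend might prefer to log a warning so that dropped equations are visible.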

rateixei · 2025-04-02T09:32:57Z

@yssAI Thanks for bringing this up! I had observed this myself and had it in the pipeline. This should be fixed in PR #1268. I'm uploading here the markdown output of your test file.

1284_test.md

yssAI · 2025-04-03T03:36:36Z

@rateixei The previous document issue was resolved—thanks for the quick fix! However, when testing another DOCX file with the latest fixed code, I still encountered problems. Here’s the document for reference:

38213-i40.docx

rateixei · 2025-04-03T16:03:44Z

@yssAI Thanks for checking again! This new file was a treasure trove of edge cases, which was extremely useful. Please check the export here to see if you spot anything undesired. You can also check draft PR #1295 for the new logic. If you have any other interesting examples, I'd be happy to take a look!

38213-i40.md

yssAI · 2025-04-07T01:52:36Z

Thank you for your quick response! My project is still ongoing, so I’ll definitely provide feedback if I run into any other issues. Appreciate your support!

yssAI added the bug label ("Something isn't working") Apr 2, 2025

rateixei mentioned this issue Apr 2, 2025

fix(docx): Improve text parsing #1268

Merged

3 tasks

rateixei self-assigned this Apr 2, 2025

rateixei added the docx label ("issue related to docx backend") Apr 2, 2025

cau-git mentioned this issue Apr 7, 2025

fix(docx): Adding new latex symbols, simplifying how equations are added to text #1295

Merged

3 tasks

ceberam closed this as completed in #1295 Apr 8, 2025
