feat: adding support for images inside docx #277

PedroMiolaSilva · 2025-01-10T12:47:21Z

No description provided.

PedroMiolaSilva · 2025-01-10T14:03:32Z

@microsoft-github-policy-service agree

joshjm · 2025-04-28T01:40:23Z

src/markitdown/_markitdown.py

+            text_content = result.text_content
+
+            # Find all base64 image markdown patterns
+            base64_pattern = r'!\[[\s\S]*?\]\(data:image/[a-z]+;base64.*?\)'


this might be a little error prone: if the source doc itself contains text matching this pattern, there could be extra matches floating around in the doc

also, at least in my default use, images appear as ![](data:image/png;base64...), so I think there needs to be another check that the data actually exists in the output, or another way to fetch it, or a way to ensure the flags that embed the image data into the md are enabled

I think it's because of this: #1140
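
One way to address both points above is to match only data URIs that carry a real payload and to confirm the payload decodes before doing anything with it. This is a minimal sketch, not the PR's code: `DATA_URI_IMAGE` and `find_embedded_images` are hypothetical names, and the pattern is an assumption about what the converter emits when image data is kept.

```python
import base64
import binascii
import re

# Stricter than the pattern in the diff: requires the comma after "base64" and a
# payload made only of base64 alphabet characters, so the truncated placeholder
# "data:image/png;base64..." never matches.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<ext>[a-zA-Z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/]+={0,2})\)"
)


def find_embedded_images(markdown: str) -> list[tuple[str, str, bytes]]:
    """Return (alt, ext, raw_bytes) for every image whose payload really decodes."""
    found = []
    for match in DATA_URI_IMAGE.finditer(markdown):
        try:
            raw = base64.b64decode(match.group("data"), validate=True)
        except (binascii.Error, ValueError):
            # Looks like a data URI but is not valid base64: skip it.
            continue
        if raw:
            found.append((match.group("alt"), match.group("ext"), raw))
    return found
```

An empty result here would be a signal that the image data was stripped upstream, which is cheaper to detect before any LLM call is made.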

joshjm · 2025-04-28T02:39:29Z

src/markitdown/_markitdown.py

+        client = kwargs.get("llm_client")
+        model = kwargs.get("llm_model")
+        prompt = kwargs.get("llm_prompt")
+        result = self._get_llm_description_from_base64(base64_str, extension, client, model, prompt)


convert from base64 does more than just convert from base64; probably better to rename it

joshjm · 2025-04-28T03:44:58Z

After adding the keep_data_uris flag, I'm just doing some post-processing with some vibe-coded utils, and it's working great. Would love to see this capability make it into master.

import mimetypes
import re
from typing import Any, Optional


def _get_llm_description_from_base64(base64_str: str, extension: str, client: Any, model: str, prompt: Optional[str] = None) -> str:
    """Get LLM description for a base64-encoded image string."""
    if prompt is None or prompt.strip() == "":
        prompt = "Write a detailed caption for this image."
    # Remove data URI prefix if present
    if ',' in base64_str:
        base64_str = base64_str.split(',', 1)[1]
    # Create data URI
    content_type, _ = mimetypes.guess_type("_dummy." + extension)
    if content_type is None:
        content_type = "image/jpeg"
    data_uri = f"data:{content_type};base64,{base64_str}"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": data_uri},
                },
            ],
        }
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content.strip()

def replace_base64_images_with_descriptions(md_result, llm_client, llm_model, llm_prompt: Optional[str] = None, filename: Optional[str] = None):
    """
    Replace all base64 image markdown in md_result.text_content with LLM-generated descriptions, using the filename as the reference.
    """
    text_content = md_result.text_content
    base64_pattern = r'!\[([^\]]*)\]\((data:image/([a-zA-Z0-9]+);base64,([^\)]+))\)'
    image_counter = 1
    replacements = []
    def _repl(match):
        nonlocal image_counter
        alt_text = match.group(1)
        extension = match.group(3)
        base64_str = match.group(4)
        description = _get_llm_description_from_base64(base64_str, extension, llm_client, llm_model, llm_prompt)
        # Use provided filename or generate a placeholder
        ref = filename if filename else f"image{image_counter}"
        image_counter += 1
        replacements.append((description, ref))
        return f"![{description}][{ref}]"
    text_content = re.sub(base64_pattern, _repl, text_content)
    md_result.text_content = text_content
    return md_result

joshjm · 2025-04-28T11:31:01Z

src/markitdown/_markitdown.py

+
+        # Extract any base64 encoded images from the HTML
+        descriptions = []
+        if kwargs.get("llm_client") and kwargs.get("llm_model"):


yeah, this should also check for keep_data_uris when calling convert; I'd imagine that gets passed along in the args
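
A rough sketch of that extra guard, assuming keep_data_uris does arrive in the same kwargs as llm_client and llm_model (that is a guess, as noted above). `should_describe_images` is a hypothetical helper name, not something in the PR.

```python
from typing import Any, Mapping


def should_describe_images(kwargs: Mapping[str, Any]) -> bool:
    """True only when an LLM client, a model and keep_data_uris are all set."""
    has_llm = bool(kwargs.get("llm_client")) and bool(kwargs.get("llm_model"))
    # Without keep_data_uris the base64 payload is assumed to be stripped,
    # so there would be nothing to send to the model.
    return has_llm and bool(kwargs.get("keep_data_uris", False))
```

The condition in the diff could then become `if should_describe_images(kwargs):`, so the base64 extraction is skipped whenever the image data would not be in the output anyway.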

feat: adding support for images inside docx

928ddab


joshjm mentioned this pull request Apr 28, 2025

Images in docx files cannot be converted to md documents #1222

Open
