feat: adding support for images inside docx #277
base: main
@microsoft-github-policy-service agree
text_content = result.text_content

# Find all base64 image markdown patterns
base64_pattern = r'!\[[\s\S]*?\]\(data:image/[a-z]+;base64.*?\)'
This might be a little error-prone: if the source doc itself contains text that matches this pattern, there could be extra matches floating around in the doc.
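One way to narrow the match, as a rough sketch rather than a proposed patch: require the comma after `base64`, forbid `]` in the alt text, and limit the payload to the base64 alphabet. The names `STRICT_DATA_URI_IMAGE` and `find_embedded_images` are made up for illustration and do not appear in the PR.

```python
import re

# Hypothetical stricter pattern: the alt text may not contain "]", the MIME
# subtype is captured, and the payload is limited to the base64 alphabet.
STRICT_DATA_URI_IMAGE = re.compile(
    r"!\[([^\]]*)\]\(data:image/([a-zA-Z0-9.+-]+);base64,([A-Za-z0-9+/]+={0,2})\)"
)


def find_embedded_images(markdown: str) -> list:
    """Return (alt_text, subtype, payload) for each embedded base64 image."""
    return [m.groups() for m in STRICT_DATA_URI_IMAGE.finditer(markdown)]


sample = (
    "![logo](data:image/png;base64,iVBORw0KGgo=) "
    "![cut](data:image/png;base64...) "
    "![web](https://example.com/a.png)"
)
images = find_embedded_images(sample)
```

Only the first image in `sample` carries a real payload, so it is the only one returned; the truncated and remote images are skipped.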
Also, at least in my default use, I get images appearing as […], so I think there needs to be another check for whether the data actually exists in the output, or a way to fetch it differently, or a way to ensure the flags that embed the image data into the Markdown are enabled.
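A check along these lines could look like the sketch below. It assumes that a usable image is a data URI with a non-empty payload that decodes as strict base64; `has_image_payload` is a hypothetical helper name, not something from the PR or the library.

```python
import base64
import binascii
import re

# Matches an image data URI; the payload group is optional so that URIs
# without any data after "base64" can be recognized and rejected.
DATA_URI = re.compile(r"data:image/[a-zA-Z0-9.+-]+;base64(?:,([^)\s]*))?")


def has_image_payload(data_uri: str) -> bool:
    """Return True only if the URI carries a non-empty, decodable payload."""
    match = DATA_URI.fullmatch(data_uri)
    if match is None or not match.group(1):
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(match.group(1), validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```

The converter could skip the LLM call, or fall back to another way of fetching the image, whenever this returns False.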
I think that's because of this: #1140
client = kwargs.get("llm_client")
model = kwargs.get("llm_model")
prompt = kwargs.get("llm_prompt")
result = self._get_llm_description_from_base64(base64_str, extension, client, model, prompt)
convert from base64 does more than just convert from base64; it's probably better to rename it.
After adding the

import mimetypes
import re
from typing import Any, Optional


def _get_llm_description_from_base64(
    base64_str: str,
    extension: str,
    client: Any,
    model: str,
    prompt: Optional[str] = None,
) -> str:
    """Get LLM description for a base64-encoded image string."""
    if prompt is None or prompt.strip() == "":
        prompt = "Write a detailed caption for this image."

    # Remove data URI prefix if present
    if ',' in base64_str:
        base64_str = base64_str.split(',', 1)[1]

    # Create data URI
    content_type, _ = mimetypes.guess_type("_dummy." + extension)
    if content_type is None:
        content_type = "image/jpeg"
    data_uri = f"data:{content_type};base64,{base64_str}"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": data_uri},
                },
            ],
        }
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content.strip()


def replace_base64_images_with_descriptions(
    md_result,
    llm_client,
    llm_model,
    llm_prompt: Optional[str] = None,
    filename: Optional[str] = None,
):
    """
    Replace all base64 image markdown in md_result.text_content with LLM-generated descriptions, using the filename as the reference.
    """
    text_content = md_result.text_content
    # Groups: 1 = alt text, 2 = full data URI, 3 = image subtype, 4 = base64 payload
    base64_pattern = r'!\[([^\]]*)\]\((data:image/([a-zA-Z0-9]+);base64,([^\)]+))\)'
    image_counter = 1
    replacements = []

    def _repl(match):
        nonlocal image_counter
        alt_text = match.group(1)
        extension = match.group(3)
        base64_str = match.group(4)
        description = _get_llm_description_from_base64(base64_str, extension, llm_client, llm_model, llm_prompt)
        # Use provided filename or generate a placeholder
        ref = filename if filename else f"image{image_counter}"
        image_counter += 1
        replacements.append((description, ref))
        return f"![{description}][{ref}]"

    text_content = re.sub(base64_pattern, _repl, text_content)
    md_result.text_content = text_content
    return md_result
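To see the substitution step in isolation, here is a small sketch that reuses the pattern from the diff but swaps the LLM call for a stub. It also collapses the description onto one line and escapes square brackets, on the assumption that a model reply may contain either; that hardening is an addition for illustration, and `replace_images`, `describe`, and `fake_describe` are hypothetical names.

```python
import re
from typing import Callable

# Same pattern as in the diff above.
BASE64_IMAGE = re.compile(
    r'!\[([^\]]*)\]\((data:image/([a-zA-Z0-9]+);base64,([^\)]+))\)'
)


def replace_images(text: str, describe: Callable[[str, str], str]) -> str:
    """Swap each embedded image for ![description][imageN].

    `describe(payload, extension)` is a hypothetical stand-in for the LLM call.
    """
    counter = 0

    def _repl(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        description = describe(match.group(4), match.group(3))
        # Keep the description on one line and escape brackets so it cannot
        # end the alt text early.
        safe = " ".join(description.split())
        safe = safe.replace("[", r"\[").replace("]", r"\]")
        return f"![{safe}][image{counter}]"

    return BASE64_IMAGE.sub(_repl, text)


def fake_describe(payload: str, extension: str) -> str:
    # Stub used only for this example; no network call is made.
    return f"a {extension} image\n[{len(payload)} chars]"


result = replace_images(
    "before ![x](data:image/png;base64,AAAA) after", fake_describe
)
```

Passing the describer in as a callable keeps the regex logic testable without an LLM client.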
# Extract any base64 encoded images from the HTML
descriptions = []
if kwargs.get("llm_client") and kwargs.get("llm_model"):
Yeah, this should also check for keep_data_uris when calling convert; I'd imagine that gets passed along in the args.
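A guard of that kind might look like the sketch below. It assumes `keep_data_uris` arrives in the same `kwargs` as `llm_client` and `llm_model`, which the diff does not confirm; `should_describe_images` is a hypothetical helper name.

```python
from typing import Any


def should_describe_images(**kwargs: Any) -> bool:
    """Hypothetical guard: only call the LLM when it can see real image data."""
    has_llm = bool(kwargs.get("llm_client")) and bool(kwargs.get("llm_model"))
    # Assumption: keep_data_uris reaches the converter through kwargs, as the
    # comment above suggests; this is not confirmed by the diff.
    return has_llm and bool(kwargs.get("keep_data_uris", False))
```

With this in place, the `if kwargs.get("llm_client") and kwargs.get("llm_model"):` condition could become `if should_describe_images(**kwargs):`.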
No description provided.