-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-50002: xml.dom.minidom now preserves whitespaces in attributes #107947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-50002: xml.dom.minidom now preserves whitespaces in attributes #107947
Conversation
Also double quotes (") are now only quoted in attributes.
It turned out that there were no any tests for quoting in the XML output. The tests would pass even if quoting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Seeing that there are issues with whitespace that suggest a change in output, I think it's ok to solve #81555 (quote escaping) along the way. I'll reopen the ticket.
Lib/xml/dom/minidom.py
Outdated
data = data.replace("&", "&").replace("<", "<"). \ | ||
replace("\"", """).replace(">", ">") | ||
writer.write(data) | ||
if text: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually prefer representing easy special cases in a short
if text: | |
if not text: | |
return |
rather than adding indentation to long chunks of code that only makes readers look down to see if something joined is still done at the end. Better make it clear right away that we're done handling this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I just kept the style of the old code.
Lib/xml/dom/minidom.py
Outdated
if "\n" in text: | ||
text = text.replace("\n", " ") | ||
if "\t" in text: | ||
text = text.replace("\t", "	") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, why 	
and not simply 	
? Is there a reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No technical reason. Probably based on Python's hex character spelling or due to consistency with the other two-digit codes above. The XML character spec does not need (or mention) leading zeros.
I'm happy to keep the leading zero. If you need compact data, use compression. That's way more effective than stripping some zeros from rare tab characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xml.sax.saxutils.quoteattr()
uses 	
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which makes me wonder why we need a new implementation here, rather than importing the existing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now looked up the implementation in saxutils.py
– it looks fairly slow. minidom
will probably not become high-performance by any accident, but it doesn't feel good to slow it down even more. It's probably worth a new implementation..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is sad, but there is a copy of escaping function in almost every module which outputs XML or HTML. On one hand, it is a trivial function, and we want to avoid unneeded dependencies. On other hand, efficient and complete implementation is not so trivial. But xml.sax.saxutils
version is too generalized and far from been efficient.
It was worse in the past. Now many code just use html.escape()
.
Lib/xml/dom/minidom.py
Outdated
if "\n" in text: | ||
text = text.replace("\n", " ") | ||
if "\t" in text: | ||
text = text.replace("\t", "	") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is sad, but there is a copy of escaping function in almost every module which outputs XML or HTML. On one hand, it is a trivial function, and we want to avoid unneeded dependencies. On other hand, efficient and complete implementation is not so trivial. But xml.sax.saxutils
version is too generalized and far from been efficient.
It was worse in the past. Now many code just use html.escape()
.
Co-authored-by: scoder <stefan_ml@behnel.de>
There's a new commit after the PR has been approved. @scoder: please review the changes made to this pull request. |
Thank you for your review @scoder! |
|
Also double quotes (") are now only quoted in attributes.
It is based on changes in bpo-17582 and bpo-39011 for ElementTree.
It also fixes #81555.