Skip to content

gh-50002: xml.dom.minidom now preserves whitespaces in attributes #107947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Aug 14, 2023

Also double quotes (") are now only quoted in attributes.

It is based on changes in bpo-17582 and bpo-39011 for ElementTree.

It also fixes #81555.

Also double quotes (") are now only quoted in attributes.
@serhiy-storchaka
Copy link
Member Author

It turned out that there were no any tests for quoting in the XML output. The tests would pass even if quoting & and < was broken.

Copy link
Contributor

@scoder scoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Seeing that there are issues with whitespace that suggest a change in output, I think it's ok to solve #81555 (quote escaping) along the way. I'll reopen the ticket.

data = data.replace("&", "&amp;").replace("<", "&lt;"). \
replace("\"", "&quot;").replace(">", "&gt;")
writer.write(data)
if text:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually prefer representing easy special cases in a short

Suggested change
if text:
if not text:
return

rather than adding indentation to long chunks of code that only makes readers look down to see if something joined is still done at the end. Better make it clear right away that we're done handling this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I just kept the style of the old code.

if "\n" in text:
text = text.replace("\n", "&#10;")
if "\t" in text:
text = text.replace("\t", "&#09;")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, why &#09; and not simply &#9;? Is there a reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No technical reason. Probably based on Python's hex character spelling or due to consistency with the other two-digit codes above. The XML character spec does not need (or mention) leading zeros.

I'm happy to keep the leading zero. If you need compact data, use compression. That's way more effective than stripping some zeros from rare tab characters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xml.sax.saxutils.quoteattr() uses &#9;.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which makes me wonder why we need a new implementation here, rather than importing the existing one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now looked up the implementation in saxutils.py – it looks fairly slow. minidom will probably not become high-performance by any accident, but it doesn't feel good to slow it down even more. It's probably worth a new implementation..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is sad, but there is a copy of escaping function in almost every module which outputs XML or HTML. On one hand, it is a trivial function, and we want to avoid unneeded dependencies. On other hand, efficient and complete implementation is not so trivial. But xml.sax.saxutils version is too generalized and far from been efficient.

It was worse in the past. Now many code just use html.escape().

if "\n" in text:
text = text.replace("\n", "&#10;")
if "\t" in text:
text = text.replace("\t", "&#09;")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is sad, but there is a copy of escaping function in almost every module which outputs XML or HTML. On one hand, it is a trivial function, and we want to avoid unneeded dependencies. On other hand, efficient and complete implementation is not so trivial. But xml.sax.saxutils version is too generalized and far from been efficient.

It was worse in the past. Now many code just use html.escape().

@bedevere-bot
Copy link

There's a new commit after the PR has been approved.

@scoder: please review the changes made to this pull request.

@serhiy-storchaka serhiy-storchaka deleted the minidom-preserve-whitespaces-in-attributes branch August 23, 2023 12:23
@bedevere-bot bedevere-bot requested a review from scoder August 23, 2023 12:23
@serhiy-storchaka
Copy link
Member Author

Thank you for your review @scoder!

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 RHEL8 LTO + PGO 3.x has failed when building commit 154477b.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/568/builds/4615) and take a look at the build logs.
  4. Check if the failure is related to this commit (154477b) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/568/builds/4615

Summary of the results of the build (if available):

== Tests result: ENV CHANGED ==

431 tests OK.

10 slowest tests:

  • test_concurrent_futures: 2 min 14 sec
  • test_multiprocessing_spawn: 1 min 33 sec
  • test_multiprocessing_forkserver: 1 min 12 sec
  • test_multiprocessing_fork: 1 min 6 sec
  • test_signal: 1 min 1 sec
  • test_io: 34.8 sec
  • test_socket: 31.9 sec
  • test_imaplib: 30.4 sec
  • test_subprocess: 29.5 sec
  • test_xmlrpc: 27.5 sec

1 test altered the execution environment:
test_ssl

15 tests skipped:
test.test_asyncio.test_windows_events
test.test_asyncio.test_windows_utils test_devpoll test_gdb
test_ioctl test_kqueue test_launcher test_startfile test_tkinter
test_ttk test_winconsoleio test_winreg test_winsound test_wmi
test_zipfile64

Total duration: 2 min 18 sec

Click to see traceback logs
Note: switching to '154477be722ae5c4e18d22d0860e284006b09c4f'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 154477be72 gh-50002: xml.dom.minidom now preserves whitespaces in attributes (GH-107947)
Switched to and reset branch 'main'

find: ‘build’: No such file or directory
find: ‘build’: No such file or directory
find: ‘build’: No such file or directory
find: ‘build’: No such file or directory
make[2]: [Makefile:2809: clean-retain-profile] Error 1 (ignored)
./Modules/_decimal/libmpdec/context.c:57: warning: mpd_setminalloc: ignoring request to set MPD_MINALLOC a second time

./Modules/_decimal/libmpdec/context.c:57: warning: mpd_setminalloc: ignoring request to set MPD_MINALLOC a second time


make: *** [Makefile:2034: buildbottest] Error 3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Minidom does not have to escape quote inside text segments
3 participants