
PERF: 7% faster pip install, remove fsync call in temporary file, not necessary #12859


Open: wants to merge 5 commits into main

Conversation

@morotti (Contributor) commented Jul 19, 2024

Hello,

Could I have a review of this code?

Profiling a run of pip install jupyter:

The _install_wheel() function takes 3670 ms; that's the installation phase.
fsync accounts for 594 ms of that, about 15% of the install phase (or 7% of the whole pip run).

fsync is called 3 times per wheel, via calls like:

    with _generate_file(installer_path) as installer_file:

That's at the end of the wheel installation, in the _generate_file() context manager, which is used:

  • to create the dist-info/INSTALLER file, which always contains the string "pip"
  • to create the dist-info/direct_url.json file, which contains a one-line JSON document with the install path/URL
  • to create the dist-info/RECORD file, which contains a list of all installed files (allegedly the most important file)

The way it works: it creates a temporary file in the same directory as the final path, then flush() + fsync() + close() + an atomic rename() of the file to the final path (see the sketch below).
I think the fsync call is not necessary.
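For reference, the pattern looks roughly like this (a simplified sketch of the write-flush-fsync-rename sequence, not pip's exact source; write_atomic is an illustrative name):

    import os
    from tempfile import NamedTemporaryFile

    def write_atomic(path: str, data: bytes) -> None:
        # The temp file lives in the same directory as the target, so
        # os.replace() stays on one filesystem and remains atomic.
        with NamedTemporaryFile(dir=os.path.dirname(path) or ".", delete=False) as f:
            f.write(data)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # the call this PR proposes to remove
        os.replace(f.name, path)   # atomic rename to the final path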

fsync calls are surprisingly expensive: I consistently see multiple milliseconds on various types of disks on my machine, even on /tmp.
Only 3 files are fsync'ed per extracted wheel, yet they take almost as long as everything else.
I think the run has 5564 calls to save() to extract files, plus 116 calls to fsync() for metadata files, for a total of 58 wheels to install.

[profiler screenshots]

@pfmoore (Member) left a comment:

This looks reasonable, but I don't know enough about filesystem management to be sure that the fsync call is unnecessary. The history of the code doesn't give any indication.

Assuming the tests pass, and none of the other maintainers feels that this is a risk, I'm in favour. We can revert and add a note explaining why the fsync is needed if we do hit issues.

@morotti (Contributor, Author) commented Jul 19, 2024

> This looks reasonable, but I don't know enough about filesystem management to be sure that the fsync call is unnecessary. The history of the code doesn't give any indication.

This was added in 2019 with a comment "The file is created securely and is ensured to be written to disk after the context reaches its end."

My best guess: the person who wrote this code went over the top with both flush() and fsync() to ensure the content is written. He probably thought the RECORD file is very important and went all in! :D
(The initial commit might have been bugged because it flushed the file, result.file.flush(), instead of the buffer, result.flush().)

As far as I understand, the fsync() is unnecessary because the flush() will flush AND the end of the with block will autoclose and flush too (closing implies flushing).
If something crashes, like a power loss, before the write is complete and the rename is done, it's okay because the file still has a temporary filename. It's the atomic rename() after the write that ensures consistency.
(By the way, I don't think that fsync() provides any guarantee that the data will be written to the media on modern hardware.)

The code itself was probably a copy/pasted snippet from somewhere else (that didn't do atomic rename) or from previous code in pip (the commit message implies pip was writing in place before).

@ichard26 (Member) left a comment:

Taking a look at how adjacent_tmp_file() is used across the codebase, it's only used as part of a write-replace sequence to ensure atomic writes:

  • In the HTTP cache write implementation:

        with adjacent_tmp_file(path) as f:
            f.write(data)
        replace(f.name, path)

  • For the self-check state file:

        with adjacent_tmp_file(self._statefile_path) as f:
            f.write(text.encode())

        try:
            # Since we have a prefix-specific state file, we can just
            # overwrite whatever is there, no need to check.
            replace(f.name, self._statefile_path)
        except OSError:
            pass

  • For _generate_file() as mentioned in this PR:

        @contextlib.contextmanager
        def _generate_file(path: str, **kwargs: Any) -> Generator[BinaryIO, None, None]:
            with adjacent_tmp_file(path, **kwargs) as f:
                yield f
            os.chmod(f.name, generated_file_mode)
            replace(f.name, path)

I did some reading, notably https://blog.gocept.com/2013/07/15/reliable-file-updates-with-python/ and https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/. Assuming I haven't misunderstood something, the main benefit of os.fsync() is system crash resistance (durability). While flushing ensures that the written data is handed off to the OS (and isn't stored in a Python-level buffer which could be easily lost if pip crashes), it provides zero guarantee that the data was written to the physical media. In many situations, the disk write is deferred until some later time. If the entire system crashes or the power goes out, the OS buffer and any pending writes will be lost.

os.fsync() ensures that the writes have been accepted by the storage controller (i.e. flushed from the OS buffers). This is necessary for durable[1] writes of the temporary file and the eventually renamed file. Say a user pip installs six. During that install, pip writes six's RECORD file using adjacent_tmp_file(). While at no point during the install will the OS or other programs see a partially written RECORD file, thanks to the atomic replace, if the system crashes after the install but before the OS has flushed the writes to the original temporary file to disk, the RECORD file may be corrupted.

It's certainly nice to have atomic and durable RECORD files[2], but I'm not certain whether that's a guarantee pip advertises/documents for package installation or any other part of its functionality. I'll also note that the replace operation is currently not durable. We'd have to fsync the containing directory to ensure the rename has been written to disk (see the sketch below).

TL;DR: the proposed logic is atomic, but only when the system does not crash before the relevant data is stored safely on disk (i.e. it isn't durable).
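For contrast, a fully durable version would also fsync the containing directory after the rename. A rough sketch of what that would look like (POSIX-only, since os.open() on a directory for fsync isn't supported on Windows; illustrative, not current pip behaviour):

    import os
    from tempfile import NamedTemporaryFile

    def write_durable(path: str, data: bytes) -> None:
        dirname = os.path.dirname(path) or "."
        with NamedTemporaryFile(dir=dirname, delete=False) as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # hand the data to the storage controller
        os.replace(f.name, path)   # atomic swap into place
        # Persist the rename itself by fsyncing the directory entry.
        dir_fd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)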

Footnotes

  1. I wrote "accepted by the storage controller" for a reason. The controller may store the data in a buffer or cache temporarily before writing it to non-volatile storage. So, in theory a poorly timed power outage could still result in only some of the data being written to disk.

  2. I haven't thought about the importance of atomicity/durability for the self-check and cache use cases.

@morotti (Contributor, Author) commented Jul 30, 2024

I think this patch is ready to merge.

@ichard26 per the link you gave, the ext4 filesystem ensures that a rename can only be committed after pending writes to the file have completed, so the fsync is not necessary.
From the messages on the mailing list, they fixed the filesystem because yum, the package manager on CentOS/RHEL, was also doing rename-after-write to ensure atomic changes, and the filesystem had to handle it.

@ichard26 (Member) commented:

Isn't that a kernel/filesystem-level hack? You know that we have other operating systems and filesystems to support. I'm not saying that we need to support durable installations, but if we want to, then relying on an ext4 feature is not the way to go.

@morotti (Contributor, Author) commented Dec 11, 2024

Reping, I forgot to follow up.

It would be nice to merge the patch to get the performance improvement! :D

@webknjaz (Member) commented Dec 11, 2024

@morotti it looks like you forgot to respond to Richard's concern? It sounds like said guarantee is filesystem-specific, and other filesystems might behave differently. This might be important in light of pip's cross-platform nature. Even among GNU/Linux distros, people tend to use different filesystems, and beyond that, it's rarely possible to use the same filesystems under different operating systems. Moreover, on GNU/Linux it's quite common for block devices to be layered on top of others (LUKS/LVM/btrfs). Do you know what implications this would have on them and on filesystems like NTFS/HFS/exFAT, etc.? What are the implications of sync-on-rename for those? Note that a rename across filesystem boundaries is typically implemented as copy+remove.
It sounds safer to do the fsync syscall than not.

@pradyunsg (Member) left a comment:

> relying on an ext4 feature is not the way to go.

I'd like us to have confirmation that this isn't relying on a filesystem-specific thing or, if it is, let's only use it when the filesystem is known to be using a specific format that's compatible.

@pradyunsg (Member) left a second comment:

Marking as request changes, since the above is a blocking concern.

@morotti (Contributor, Author) commented Dec 15, 2024

Sorry for the delay; there is no issue as far as I am aware.

The long discussion above is misleading: it was mainly around a bug ticket on RHEL/CentOS/ext4. Early versions of ext4 started reordering filesystem operations in unsafe ways and caused data corruption. This was noticed with yum install leaving empty files, and that was quite a problem. It was fixed forever ago, after the issue was understood.
I'm not aware of any filesystem being affected by a similar issue.

By the way, the reason this code does an fsync() in the first place is that it is decades-old code that used to write files in place, which was unsafe (regardless of whether fsync was called). The code was fixed years ago to write to a separate file + flush + atomic rename, but they forgot to remove the fsync call (see the illustration below).
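To illustrate (a hypothetical contrast, not the old pip source): with an in-place write, a crash mid-write leaves a truncated file at the final path, and fsync does not help with that at all:

    import os

    def write_in_place(path: str, data: bytes) -> None:
        # Hypothetical old-style in-place write: if the process or the
        # machine dies between the truncating open() and close(), the
        # destination holds a partial RECORD, fsync or not.
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # durability doesn't fix atomicity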

@pfmoore (Member) commented Dec 15, 2024

> but they forgot to remove the fsync call

You've asserted this a few times now, but what's your proof? You said yourself that this is your "best guess". But isn't it just as likely that the fsync was left in because it addressed some corner case (possibly on an obscure platform) that you're not aware of?

Your arguments seem to suggest that the fsync syscall is essentially useless. If that's the case, why does it even exist? It's clearly not a no-op, as it takes a non-trivial amount of time to execute.

While performance improvements are nice, we shouldn't make them at the expense of correctness. So the blocking question here is twofold:

  1. Do we have documentation explaining why the fsync call in this situation is unnecessary? It's not sufficient for you to just assert that's the case.
  2. Can we check for the conditions that make it unnecessary, and only omit the fsync in those cases? Note, for example, that on Windows, the os.fsync function calls the MS CRT function _commit. Neither of the articles you linked to above discuss Windows at all, so why do you assume the behaviour there is the same as Linux?

As a reminder - I originally said that I was OK with this change as long as the other maintainers didn't have concerns. It turns out the other maintainers do have concerns, and if you don't address them, this PR will remain blocked. I'm simply trying to clarify what the blocking concern is, but it's not me you have to persuade, it's @ichard26 and @pradyunsg

@ichard26 added the type: performance label (Commands take too long to run) on Dec 26, 2024.