Skip to content

GH-120754: Disable buffering in Path.read_bytes #122111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 16, 2024

Conversation

cmaloney
Copy link
Contributor

@cmaloney cmaloney commented Jul 22, 2024

Path.read_bytes() is used to read a whole file. buffering= / BufferedIO is focused around making small, possibly interleaved, read and write efficient which doesn't add value for this case. This makes seek + isatty system call no longer called in this case, as well as removing the construction of the BufferedIO, allocation of a buffer, etc.

Benchmark

Run on my Mac

Benchmark code
import pyperf
from pathlib import Path

def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_file(path_obj):
    path_obj.read_bytes()

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found rst files"
assert all_py, "Should have found python source files"

runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))

Before:

.....................
read_file_small: Mean +- std dev: 6.80 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 10.8 us +- 0.2 us

After:

.....................
read_file_small: Mean +- std dev: 5.67 us +- 0.05 us
.....................
read_file_large: Mean +- std dev: 9.77 us +- 0.52 us

`Path.read_bytes()` is used to read a whole file. buffering /
BufferedIO is focused around making small, possibly interleaved,
read/write efficient which doesn't add value in this case.

On my Mac, running the benchmark:

```python
import pyperf
from pathlib import Path

def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_file(path_obj):
    path_obj.read_bytes()

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found rst files"
assert all_py, "Should have found python source files"

runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))
```

before:
```python
.....................
read_file_small: Mean +- std dev: 6.80 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 10.8 us +- 0.2 us
````

after:
```python
.....................
read_file_small: Mean +- std dev: 5.67 us +- 0.05 us
.....................
read_file_large: Mean +- std dev: 9.77 us +- 0.52 us
```
@hauntsaninja hauntsaninja merged commit 35d8ac7 into python:main Aug 16, 2024
35 checks passed
@cmaloney cmaloney deleted the cmaloney/read_bytes_nobuffer branch August 18, 2024 03:35
jeremyhylton pushed a commit to jeremyhylton/cpython that referenced this pull request Aug 19, 2024
`Path.read_bytes()` is used to read a whole file. buffering /
BufferedIO is focused around making small, possibly interleaved,
read/write efficient which doesn't add value in this case.

On my Mac, running the benchmark:

```python
import pyperf
from pathlib import Path

def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_file(path_obj):
    path_obj.read_bytes()

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found rst files"
assert all_py, "Should have found python source files"

runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))
```

before:
```python
.....................
read_file_small: Mean +- std dev: 6.80 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 10.8 us +- 0.2 us
````

after:
```python
.....................
read_file_small: Mean +- std dev: 5.67 us +- 0.05 us
.....................
read_file_large: Mean +- std dev: 9.77 us +- 0.52 us
```
blhsing pushed a commit to blhsing/cpython that referenced this pull request Aug 22, 2024
`Path.read_bytes()` is used to read a whole file. buffering /
BufferedIO is focused around making small, possibly interleaved,
read/write efficient which doesn't add value in this case.

On my Mac, running the benchmark:

```python
import pyperf
from pathlib import Path

def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_file(path_obj):
    path_obj.read_bytes()

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found rst files"
assert all_py, "Should have found python source files"

runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))
```

before:
```python
.....................
read_file_small: Mean +- std dev: 6.80 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 10.8 us +- 0.2 us
````

after:
```python
.....................
read_file_small: Mean +- std dev: 5.67 us +- 0.05 us
.....................
read_file_large: Mean +- std dev: 9.77 us +- 0.52 us
```
cmaloney added a commit to cmaloney/cpython that referenced this pull request Aug 28, 2024
NOTE: This needs a full buildbot test pass before merge, see: python#121143 (comment).

1. Added `statx` to set of allowed syscall forms (Should make Raspian bot pass).
2. Check that the `fd` returned from `open` is passed to all future calls. This helps ensure things like the `stat` call uses the file descriptor rather than the `filename` to avoid TOCTOU isuses.
3. Update the `Path().read_bytes()` test case to additionally validate the reduction in`isatty`/`ioctl` + `seek` calls from python#122111
4. Better diagnostic assertion messagess from @gpshead, so when the test fails have first information immediately available. Makes remote CI debugging much simpler.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants