LD_AUDIT sacrifices some performance #15

Closed
amonakov opened this issue Oct 12, 2017 · 12 comments

@amonakov

amonakov commented Oct 12, 2017

There's an issue in Glibc: when LD_AUDIT is non-empty, all calls via the PLT (i.e. normally all calls to functions implemented in shared libraries) go through a hook that saves some registers (including vector registers that may hold passed arguments) on the stack, calls into the dynamic linker to invoke la_pltenter hooks (even if none are registered), restores the registers, invokes the original destination function, invokes la_pltexit hooks, and finally returns to the caller. Obviously this is slow and unnecessary if all the audit module wants is to redirect some libraries. The Glibc bug report is here: https://sourceware.org/bugzilla/show_bug.cgi?id=15533 (I hit this issue back then when playing with an idea similar to yours).

It appears you're setting LD_AUDIT for all child processes including games, so that slows games to some degree. If not, can you add a clarifying comment somewhere?

(edited: grammar and clarity)

@ikeydoherty
Member

That's a fairly old bug report @amonakov, but thank you for bringing it up. I'm happy to do some benchmarking if you have a copy of the original test program, so we can test it against liblsi-intercept to see if it still holds. (I've not noticed any slowdowns here tbf.)

If there is still a slowdown, then we'll just patch glibc and document that, as the module only implements la_objsearch - any slow-down would be very much a bug in the libc implementation and not LSI.
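For readers unfamiliar with the rtld-audit interface, here is a minimal sketch of an audit module of the kind described above: it defines only `la_version` and `la_objsearch` and no PLT hooks. It is not the liblsi-intercept source; the library name and replacement path are hypothetical, chosen purely for illustration.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical redirect, for illustration only (not taken from LSI). */
static const char OLD_NAME[] = "libexample.so.1";
static char NEW_PATH[] = "/opt/example/lib/libexample.so.1";

/* Handshake with the dynamic linker: accept the version it offers. */
unsigned int la_version(unsigned int version)
{
    return version;
}

/* Called when the dynamic linker searches for a library.
 * Returning a different string redirects the search; returning
 * the name unchanged lets the search proceed as usual. */
char *la_objsearch(const char *name, uintptr_t *cookie, unsigned int flag)
{
    (void)cookie;
    (void)flag;

    if (strcmp(name, OLD_NAME) == 0)
        return NEW_PATH;

    return (char *)name;
}
```

Built as a shared object (`-fPIC -shared`, as in the build logs in this thread) and named in LD_AUDIT, a module like this never defines la_pltenter or la_pltexit, which is why any per-call PLT overhead would come from the dynamic linker rather than from the module.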

@amonakov
Author

The original test program is attached to the aforementioned bug report; here's a direct link to the attachment: https://sourceware.org/bugzilla/attachment.cgi?id=7044

Note that if your toolchain enables hardening by default (-z relro -z now) you won't see the slowdown because the test program won't use PLT (but games aren't usually compiled like that).

The reason I've brought it up is exactly because this Glibc bug remains unfixed.

If you prefer to patch Glibc on your end, what would be your recommendation to people packaging this on other distros?

@ikeydoherty
Member

ikeydoherty commented Oct 12, 2017

I'm definitely seeing a minor regression with your test case here:

0.14user 0.00system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 5904maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
1.15user 0.00system 0:01.16elapsed 99%CPU (0avgtext+0avgdata 9104maxresident)k
0inputs+0outputs (0major+158minor)pagefaults 0swaps

However, when I build your libaudit with the distro CFLAGS:

cc    -c -o main.o main.c
cc  -o main main.o -lm
cc    -c -o libaudit.o libaudit.c
cc "-g2 -O3 -pipe -fPIC -Wformat -Wformat-security -fno-omit-frame-pointer -fexceptions -D_FORTIFY_SOURCE=2 -fstack-protector --param ssp-buffer-size=32 -fasynchronous-unwind-tables -ftree-vectorize -feliminate-unused-debug-types -Wall -Wno-error -Wp,-D_REENTRANT" -shared -o libaudit.so libaudit.o
time ./main
0.03user 0.00system 0:00.03elapsed 100%CPU (0avgtext+0avgdata 5728maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
0.03user 0.00system 0:00.04elapsed 95%CPU (0avgtext+0avgdata 9296maxresident)k
0inputs+0outputs (0major+159minor)pagefaults 0swaps

Note that libaudit is being built with the CFLAGS, not the binary (representing a proprietary game).
Also note that changing -O3 to the normalised package -O2 makes zero difference.

If I reintroduce your -fno-builtin-sqrt flag, then the regression is back:

CFLAGS='-fno-builtin-sqrt' make
cc -fno-builtin-sqrt   -c -o main.o main.c
cc -fno-builtin-sqrt -o main main.o -lm
cc -fno-builtin-sqrt   -c -o libaudit.o libaudit.c
cc "-g2 -O2 -pipe -fPIC -Wformat -Wformat-security -fno-omit-frame-pointer -fexceptions -D_FORTIFY_SOURCE=2 -fstack-protector --param ssp-buffer-size=32 -fasynchronous-unwind-tables -ftree-vectorize -feliminate-unused-debug-types -Wall -Wno-error -Wp,-D_REENTRANT" -shared -o libaudit.so libaudit.o
time ./main
0.12user 0.00system 0:00.12elapsed 100%CPU (0avgtext+0avgdata 5904maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
1.09user 0.00system 0:01.09elapsed 99%CPU (0avgtext+0avgdata 8736maxresident)k
0inputs+0outputs (0major+154minor)pagefaults 0swaps

So I assume this is more about symbol resolution time; I hacked the demo to make some gtk_ calls:

+ ./main

real	0m0.128s
user	0m0.113s
sys	0m0.010s
+ env LD_AUDIT=./libaudit.so ./main

real	0m0.150s
user	0m0.135s
sys	0m0.011s

Even building everything with hardening didn't make a significant difference after that.

Finally, after installing your patch, even with a hardened toolchain (which Solus uses by default), and having done tests with full relro on the main binary and audit lib, and finally replacing it with the LSI lib:

+ ./main

real	0m0.127s
user	0m0.112s
sys	0m0.011s
+ env LD_AUDIT=/usr/lib64/liblsi-intercept.so ./main

real	0m0.128s
user	0m0.113s
sys	0m0.010s

Basically, we need the rtld-audit interface, and we also need your patch. Given that LSI is aimed at distribution integrators, my hope is that they also integrate your patch (we can add this to Solus without issue). It seems your original patch thread died out; perhaps now is the time to upstream it so that all the distributions benefit from it?

Distributions like Ubuntu are more willing to import an out of series patch to fix a bug when it has already landed in the VCS of the upstream project. :)

@ikeydoherty
Member

Oh, and as a final metric, using your installed patches and your original test:

0.13user 0.00system 0:00.13elapsed 99%CPU (0avgtext+0avgdata 5936maxresident)k
0inputs+0outputs (0major+70minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
0.12user 0.00system 0:00.12elapsed 99%CPU (0avgtext+0avgdata 9472maxresident)k
0inputs+0outputs (0major+163minor)pagefaults 0swaps

@ikeydoherty
Member

Thinking further on this, and correct me if I'm wrong, but the performance regression should only come from initial symbol resolution, thus affecting startup time and module load time, right? During the initial mapping.

Anyway, this further illustrates the need for a self-contained LSI bundle that is free from distro issues.

@amonakov
Author

> Thinking further on this, and correct me if I'm wrong, but the performance regression should only come from initial symbol resolution, thus affecting startup time and module load time, right? During the initial mapping.

No, of course not, please read main.c in the testcase (note that it deliberately calls the same function in a loop many times to highlight the issue) and the initial report. The issue is that every runtime call that goes via PLT gets slower, not just initial calls!

If only initial calls get slower, that's not a major issue for games in the first place.
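The attached main.c is not reproduced here, but the shape of the loop described above can be sketched as follows. This is a hypothetical stand-in: the function name is invented, and it calls `strtol` from libc where the test case calls `sqrt`, so that it needs neither `-lm` nor `-fno-builtin-sqrt`.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the hot loop in the test case:
 * the same shared-library function is called on every iteration,
 * so every iteration pays the cost of one call through the PLT. */
long sum_parsed(long iterations)
{
    long total = 0;

    for (long i = 0; i < iterations; i++)
        total += strtol("3", NULL, 10);

    return total;
}
```

Calling this with a large iteration count from a small `main`, and timing the binary with and without LD_AUDIT set as in the logs earlier in the thread, is one way to see whether the cost scales with the number of calls rather than with startup. As noted earlier, a binary linked with `-z relro -z now` may not show the effect.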

@ikeydoherty
Member

Ah well that's not good at all. Just read properly through _dl_relocate_object, apologies, not awake that long. :)

OK so I'm going to document this issue within the README, just so integrators know the story. Obviously it would be fantastic if upstream accepts your patch (thank you for that!). FWIW LSI does allow you to turn off the intercept module, which may actually come in useful for those wanting to do benchmarks with and without the patch inside the games themselves.

FWIW I'm aware of the pressure on distributions when faced with integrating Steam, and it is becoming a heavy burden for them. This is why I'm looking to third party application systems with the view of building a specialised (ABI compatible) runtime containing a strict-mode LSI (and your glibc patch ofc!) that would effectively be a Solus-based runtime to provide the same Steam experience everywhere, even on distributions not supporting multilib.

In these third party systems we can ensure only our own libraries are used, and there is no more cross contamination, and distributions wouldn't have to worry about these issues anymore. :)

ikeydoherty added a commit that referenced this issue Oct 12, 2017
This goes some way to satisfy the concerns of issue #15 so that everyone
knows where they stand, how to mitigate this **now**, and what we intend
to do about this in future.

Signed-off-by: Ikey Doherty <ikey@solus-project.com>
@ikeydoherty
Member

^ I've documented this in the README - if you feel it needs more clarification or details, please let me know :)

@amonakov
Author

Users with older AVX-capable CPUs, especially the famous SandyBridge generation (i5-2500 and such), should especially beware, since the penalty due to this issue is highest there. My test indicates roughly 420 extra cycles per call (this is very high!); of those, I believe 140 are two AVX transition penalties of 70 cycles each. I didn't try to accurately analyze the rest.
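Taking the figures in the comment above at face value, the split works out as follows (a back-of-the-envelope breakdown, not a new measurement):

```latex
\begin{align*}
\text{AVX transition penalty} &\approx 2 \times 70 = 140 \text{ cycles} \\
\text{unanalyzed remainder}   &\approx 420 - 140 = 280 \text{ cycles}
\end{align*}
```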

@ikeydoherty
Member

Damn - very common CPU too.

@ikeydoherty
Member

Gonna close this now as the issue is documented, Solus is patched, and we're gonna provide a Snap with a patched glibc.

joebonrichie pushed a commit to solus-packages/glibc that referenced this issue Aug 14, 2023
The patch ensures that performance handlers aren't installed (breaking PLT
lookup) when the RTLD interface (`liblsi-intercept`) doesn't do any PLT
mangling, but only implements basic la_objsearch functions.

LSI issue: solus-project/linux-steam-integration#15
glibc issue: https://sourceware.org/bugzilla/show_bug.cgi?id=15533

Signed-off-by: Ikey Doherty <ikey@solus-project.com>