Builder is sometimes unexpectedly killed #2176
Comments
I am not able to reproduce, probably due to my current environment being a bit unusual. Are you able to capture backtraces or the like? Any details would be useful, thanks! :) |
I could not replicate this (I went to 3000). You should include the hardware you used to replicate this as well as details of the configuration like kernel version and the file system you are using. @nlewo Interesting find, though. |
@coretemp what version of Nix are you using? I'm not able to reproduce with Nix 1.11.16. @dtzWill I'm not able to get any backtraces and logs are not really useful. The
The logs when it succeeds:
Note also the Nix expression can not be built if sandbox is enabled. |
I also use ext4.
|
I reproduced the issue 3 times (56th iteration, 96th iteration, 994th iteration). On Archlinux, nixpkgs-18.09pre134746.9430efbe49d
|
Just curious-- why not? Needed paths aren't available? |
It seems the worker gets EOF on the child's logger pipe and sends a kill signal before the child process reaches the terminated state. I don't know yet how to fix this... I could also reproduce by manually building the master of Nix. I ran it without any daemon and it takes more iterations (about 2000) to hit the bug. @dtzWill With sandboxing, it fails with
|
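The race described in the comment above can be illustrated outside Nix. What follows is a minimal Python sketch, not Nix's actual worker code: a hypothetical parent treats EOF on the child's log pipe as "builder finished" and kills immediately, while the child closes stdout/stderr a moment before it exits. The `CHILD` snippet, the function name and the sleep are assumptions for illustration; the sleep only widens the window so the outcome is deterministic. POSIX only.

```python
import subprocess
import sys

# Hypothetical stand-in for a builder: it closes stdout/stderr first and
# only exits successfully a little later (the sleep widens the race window).
CHILD = "import os, time; os.close(1); os.close(2); time.sleep(0.5); os._exit(0)"


def run_and_kill_on_eof() -> int:
    """Treat EOF on the log pipe as 'finished' and kill right away."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    proc.stdout.read()    # returns once the child has closed both descriptors
    proc.kill()           # SIGKILL arrives while the child is still running
    status = proc.wait()  # a negative value means "terminated by that signal"
    proc.stdout.close()
    return status


if __name__ == "__main__":
    print(run_and_kill_on_eof())
```

On Linux the returned status is -9 (killed by SIGKILL), although the child would otherwise have exited with status 0.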
@edolstra I think this has been introduced in 21948de. If I remove the code that kills the process, I never hit this issue. |
Sometimes, the builder process is killed while the build succeeded. It seems this is because the signal is sent after file descriptors are closed (in `do_exit`) and before the process reaches the terminated state. This leads to a build failure (failed due to signal 9). To mitigate this issue, a delay (~10s) is introduced before sending the kill signal to the build process if it doesn't reach the terminated state. Fixes NixOS#2176
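The mitigation described in this commit message (wait a while before killing) might look roughly like the following Python sketch. It is an illustration under assumptions, not the actual Nix patch: `run_with_grace`, the `CHILD` snippet and the `grace` parameter are hypothetical names, and the ~10s figure from the message would be passed in as `grace`. POSIX only.

```python
import subprocess
import sys

# Hypothetical builder stand-in: closes stdout/stderr, exits 0 shortly after.
CHILD = "import os, time; os.close(1); os.close(2); time.sleep(0.3); os._exit(0)"


def run_with_grace(child_code: str, grace: float) -> int:
    """After EOF on the log pipe, wait up to `grace` seconds before killing."""
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    proc.stdout.read()  # EOF: the child closed its end of the log pipe
    try:
        status = proc.wait(timeout=grace)  # give it time to really exit
    except subprocess.TimeoutExpired:
        proc.kill()                        # still alive after the delay
        status = proc.wait()
    proc.stdout.close()
    return status


if __name__ == "__main__":
    print(run_with_grace(CHILD, 10.0))
```

With the delay in place, a child that closes its descriptors slightly early is reaped with its real exit status, and only a child that stays alive past the grace period is killed.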
This appears to be due to coreutils, since it closes stdout/stderr prior to quitting (see https://github.com/coreutils/gnulib/commits/master/lib/closeout.c). Probably if you change |
For example:
whereas
|
Good find, but what are you saying? Is it that coreutils would need to be fixed, or Nix? I have the impression that what coreutils does is correct, which would suggest that Nix still needs to be fixed. |
Nobody is incorrect here. Nix just requires that builders don't close stdout/stderr. |
@edolstra yeah, good catch. Thanks! The point is that it has been really hard to troubleshoot (for me at least :/) and it would be nice to avoid this kind of issue. This issue comes from the way the worker knows the builder process is finished. The git history also confirms this has been a long story, with some other issues (builders which never terminate). |
From the manual:
The manual states Please explain how given the above argument you still think Nix does not have a bug here. |
I suddenly experience random SIGKILLs after an upgrade to nix 2.3. Not sure if that's the same issue. I also consistently cannot build the expression mentioned in the beginning of this thread. |
I'm not sure of the root cause yet but we started seeing kill -9 after using this script. See NixOS/nix#2176 (comment)
Sometimes the builds get killed with -9. According to Edolstra the coreutils tend to close stdout/stderr which leads to that error. NixOS/nix#2176 (comment)
Seeing it all of a sudden since yesterday. nix-shell with cmake and make (doing what the derivation says) works fine. building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 34 bytes |
We had a bunch of instances of NixOS/nix#2176, where nix would exit with a “killed by signal 9” error. According to Eelco in that issue, this is perfectly normal behaviour of course, and appears if the last command in a loop closes `stdout` or `stdin`, then the builder will SIGKILL it immediately. This is of course also a perfectly fine error message for that case. It turns out that mainly GNU coreutils exhibit this behaviour … Let’s see if using a more sane tool suite fixes that.
I marked this as stale due to inactivity. → More info |
I'm seeing this on nix 2.11. |
I'm seeing this on remote builds from my CI (from Are there any flags that I can use to produce more verbose logs which might help figure out what is sending the SIGKILL? Edit: I built it with |
FWIW, I tried the latest |
I believe we may be seeing this very often in GHC CI, particularly when building Haskell packages. |
We are also experiencing this once in a while in our CI. A very rough estimate is that it happens a bit more than 1 in 100 times. We just hit it today. This is the derivation we're building:
pkgs.runCommand "foo" {} ''
mkdir -p $out/bin
echo "#!${pkgs.bash}/bin/bash" > $out/bin/foo
echo echo Hello World [we put a UUID here to make sure builds happen all the time] > $out/bin/foo
chmod u+x $out/bin/foo
# We're doing exit 0 here to avoid issue https://github.com/NixOS/nix/issues/2176
exit 0
''
It looks like the |
As a consequence of this decision we also have We have two potential events to indicate termination
We should wait for both and have a timer to catch both edge cases where one happens but the other takes too long, as also mentioned in the other issue. |
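The "wait for both events, with a timer" idea from this comment can be sketched as follows. This is a hedged Python illustration, not a proposal for the Nix code base: `wait_for_builder`, its parameters and the 50 ms polling interval are assumptions, and log data read from the pipe is simply discarded. POSIX only.

```python
import os
import select
import subprocess
import sys
import time


def wait_for_builder(child_code: str, grace: float):
    """Wait for both EOF on the log pipe and process exit.

    Once one event has been seen, the other gets `grace` seconds to follow.
    Returns (exit_status, saw_eof).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    fd = proc.stdout.fileno()
    saw_eof = False
    deadline = None  # armed when the first of the two events occurs

    while not (saw_eof and proc.poll() is not None):
        if not saw_eof:
            readable, _, _ = select.select([fd], [], [], 0.05)
            if readable and os.read(fd, 4096) == b"":
                saw_eof = True
        else:
            time.sleep(0.05)
        exited = proc.poll() is not None
        if (saw_eof or exited) and deadline is None:
            deadline = time.monotonic() + grace
        if deadline is not None and time.monotonic() > deadline:
            if not exited:
                proc.kill()  # EOF seen, but the process never terminated
                proc.wait()
            break            # otherwise: exited, but the pipe never hit EOF
    proc.stdout.close()
    return proc.returncode, saw_eof
```

A child that closes its descriptors early and then exits normally is reported with its real status; a child that never exits is killed only after the grace period; and a child whose pipe is still held open by a descendant is reported as exited without EOF instead of blocking forever.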
Nix sometimes fails to build the expression
Generally, the build works but for ~1% of attempts, it fails with
I can reproduce on my laptop (nix 2.0) and in a vm (nix 2.0.2) with:
Note I'm not able to reproduce if a delay is introduced. The following expression has been successfully built 6000 times:
Are you also able to reproduce?