Skip to content

Builder is sometimes unexpectedly killed #2176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nlewo opened this issue May 23, 2018 · 23 comments
Open

Builder is sometimes unexpectedly killed #2176

nlewo opened this issue May 23, 2018 · 23 comments
Labels
bug derivation-build The process of building an individual derivation (see also sandbox label)

Comments

@nlewo
Copy link
Member

nlewo commented May 23, 2018

Nix sometimes fails to build the expression

with import <nix/config.nix>;

{ simple = builtins.derivation {
    name = "simple";
    system = "x86_64-linux";
    PATH = coreutils;
    builder = shell;
    args = ["-c" "touch $out"];
  };
}

Generally, the build works but for ~1% of attempts, it fails with

builder for '/nix/store/94yfdqk4r6vmja7qhqkyy0hl25bvvfjj-simple.drv' failed due to signal 9 (Killed)

I can reproduce on my laptop (nix 2.0) and in a vm (nix 2.0.2) with:

$ nix-build -E 'with import <nix/config.nix>; { simple = builtins.derivation {name="simple";system="x86_64-linux";PATH=coreutils;builder=shell;args=["-c" "touch $out"];};}'

$ i=0; while nix-build -E 'with import <nix/config.nix>; { simple = builtins.derivation {name="simple";system="x86_64-linux";PATH=coreutils;builder=shell;args=["-c" "touch $out"];};}' --check; do echo $i; i=$(($i+1)); done

Note I'm not able to reproduce if a delay is introduced. The following expression has been successfully built 6000 times:

with import <nix/config.nix>;

{ simple = builtins.derivation {
    name = "simple";
    system = "x86_64-linux";
    PATH = coreutils;
    builder = shell;
    args = ["-c" "sleep 0.5; touch $out"];
  };
}

Are you also able to reproduce?

@nlewo nlewo changed the title Builder is sometimes unexpectidly killed Builder is sometimes unexpectedly killed May 24, 2018
@dtzWill
Copy link
Member

dtzWill commented May 26, 2018

I am not able to reproduce, probably due to my current environment being a bit unusual.

Are you able to capture backtraces or the like? Any details would be useful, thanks! :)

@coretemp
Copy link

I could not replicate this (I went to 3000).

You should include the hardware you used to replicate this as well as details of the configuration like kernel version and the file system you are using.

@nlewo Interesting find, though.

@nlewo
Copy link
Member Author

nlewo commented May 28, 2018

@coretemp what version of Nix are you using? I'm not able to reproduce with Nix 1.11.16.
@coretemp Kernel 4.9.86 and ext4 FS.

@dtzWill I'm not able to get any backtraces and logs are not really useful.

The nix-store -vvv --realize logs when it fails to build:

evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/derivation.nix'
evaluating attribute 'simple'
processing attribute 'PATH'
resolved search path element '/nix/var/nix/profiles/per-user/root/channels' to '/nix/var/nix/profiles/per-user/root/channels'
resolved search path element '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs' to '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs'
evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/config.nix'
processing attribute 'args'
processing attribute 'builder'
processing attribute 'name'
processing attribute 'system'
locking this thread to CPU 3
acquiring global GC lock '/nix/var/nix/gc.lock'
acquiring read lock on '/nix/var/nix/temproots/20175'
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
instantiated 'simple' -> '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'
querying info about missing paths...
starting pool of 3 threads
querying info about missing paths...
starting pool of 3 threads
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
entered goal loop
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': waitee 'substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'' done; 0 left
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': loading derivation
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': have derivation
acquiring write lock on '/nix/var/nix/temproots/20175'
downgrading to read lock on '/nix/var/nix/temproots/20175'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all outputs substituted (maybe)
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all inputs realised
building path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
added input paths
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': trying to build
locking path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
lock acquired on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
found build user 'nixbld1'
found build user 'nixbld10'
found build user 'nixbld11'
found build user 'nixbld12'
found build user 'nixbld13'
found build user 'nixbld14'
found build user 'nixbld15'
found build user 'nixbld16'
found build user 'nixbld17'
found build user 'nixbld18'
found build user 'nixbld19'
found build user 'nixbld2'
found build user 'nixbld20'
found build user 'nixbld21'
found build user 'nixbld22'
found build user 'nixbld23'
found build user 'nixbld24'
found build user 'nixbld25'
found build user 'nixbld26'
found build user 'nixbld27'
found build user 'nixbld28'
found build user 'nixbld29'
found build user 'nixbld3'
found build user 'nixbld30'
found build user 'nixbld31'
found build user 'nixbld32'
found build user 'nixbld4'
found build user 'nixbld5'
found build user 'nixbld6'
found build user 'nixbld7'
found build user 'nixbld8'
found build user 'nixbld9'
trying user 'nixbld1'
killing all processes running under uid '30001'
executing builder '/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash'
closing leaked FD 3
closing leaked FD 4
closing leaked FD 5
closing leaked FD 6
closing leaked FD 7
closing leaked FD 8
closing leaked FD 9
closing leaked FD 10
closing leaked FD 11
closing leaked FD 12
closing leaked FD 13
closing leaked FD 14
checking outputs of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'...
waiting for children
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': got EOF
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': build done
killing process 20178
builder process for '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' finished
killing all processes running under uid '30001'
builder for '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' failed due to signal 9 (Killed)
lock released on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
error: build of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' failed

The logs when it succeeds:

evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/derivation.nix'
evaluating attribute 'simple'
processing attribute 'PATH'
resolved search path element '/nix/var/nix/profiles/per-user/root/channels' to '/nix/var/nix/profiles/per-user/root/channels'
resolved search path element '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs' to '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs'
evaluating file '/nix/store/j4di8j9awar03dfz2c91hd0yrdw427v1-nix-2.0.2/share/nix/corepkgs/config.nix'
processing attribute 'args'
processing attribute 'builder'
processing attribute 'name'
processing attribute 'system'
locking this thread to CPU 3
acquiring global GC lock '/nix/var/nix/gc.lock'
acquiring read lock on '/nix/var/nix/temproots/20165'
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
instantiated 'simple' -> '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'
querying info about missing paths...
starting pool of 3 threads
querying info about missing paths...
starting pool of 3 threads
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
entered goal loop
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': created
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': init
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': waitee 'substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'' done; 0 left
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
substitution of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': loading derivation
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': have derivation
acquiring write lock on '/nix/var/nix/temproots/20165'
downgrading to read lock on '/nix/var/nix/temproots/20165'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all outputs substituted (maybe)
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': all inputs realised
building path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
added input paths
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': trying to build
locking path '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
lock acquired on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
found build user 'nixbld1'
found build user 'nixbld10'
found build user 'nixbld11'
found build user 'nixbld12'
found build user 'nixbld13'
found build user 'nixbld14'
found build user 'nixbld15'
found build user 'nixbld16'
found build user 'nixbld17'
found build user 'nixbld18'
found build user 'nixbld19'
found build user 'nixbld2'
found build user 'nixbld20'
found build user 'nixbld21'
found build user 'nixbld22'
found build user 'nixbld23'
found build user 'nixbld24'
found build user 'nixbld25'
found build user 'nixbld26'
found build user 'nixbld27'
found build user 'nixbld28'
found build user 'nixbld29'
found build user 'nixbld3'
found build user 'nixbld30'
found build user 'nixbld31'
found build user 'nixbld32'
found build user 'nixbld4'
found build user 'nixbld5'
found build user 'nixbld6'
found build user 'nixbld7'
found build user 'nixbld8'
found build user 'nixbld9'
trying user 'nixbld1'
killing all processes running under uid '30001'
executing builder '/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash'
closing leaked FD 3
closing leaked FD 4
closing leaked FD 5
closing leaked FD 6
closing leaked FD 7
closing leaked FD 8
closing leaked FD 9
closing leaked FD 10
closing leaked FD 11
closing leaked FD 12
closing leaked FD 13
closing leaked FD 14
checking outputs of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv'...
waiting for children
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': got EOF
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': woken up
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': build done
killing process 20168
builder process for '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv' finished
killing all processes running under uid '30001'
warning: rewriting hashes in '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'; cross fingers
scanning for references inside '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': done
lock released on '/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple.lock'
building of '/nix/store/dpgvr0sxqixdbqdf9w3cqavx8xxlfxk5-simple.drv': goal destroyed
acquiring global GC lock '/nix/var/nix/gc.lock'
/nix/store/ifgmbbxn61i0wc4qp610r7jg0yf2qiwr-simple

Note also the Nix expression can not be built if sandbox is enabled.

@coretemp
Copy link

coretemp commented May 28, 2018

I also use ext4.

@eonpatapon
Copy link

I reproduced the issue 3 times (56th iteration, 96th iteration, 994th iteration).

On Archlinux, nixpkgs-18.09pre134746.9430efbe49d

evaluating file '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs/derivation.nix'
evaluating attribute 'simple'
resolved search path element '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs' to '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs'
evaluating file '/nix/store/lw4c9f9x4nnhlpby2q8sjkvrcgphnym7-nix-2.0/share/nix/corepkgs/config.nix'
acquiring global GC lock '/nix/var/nix/gc.lock'
acquiring read lock on '/nix/var/nix/temproots/2776'
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
instantiated 'simple' -> '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv'
querying info about missing paths...
starting pool of 3 threads
querying info about missing paths...
starting pool of 3 threads
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': created
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
entered goal loop
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': init
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': created
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': init
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': done
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': waitee 'substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv'' done; 0 left
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
substitution of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': goal destroyed
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': loading derivation
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': have derivation
acquiring write lock on '/nix/var/nix/temproots/2776'
downgrading to read lock on '/nix/var/nix/temproots/2776'
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': all outputs substituted (maybe)
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': all inputs realised
building path '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple'
added input paths 
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': trying to build
locking path '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple'
lock acquired on '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple.lock'
executing builder '/nix/store/q1g0rl8zfmz7r371fp5p42p4acmv297d-bash-4.4-p19/bin/bash'
closing leaked FD 3
closing leaked FD 4
closing leaked FD 5
closing leaked FD 6
closing leaked FD 7
closing leaked FD 8
closing leaked FD 9
closing leaked FD 10
closing leaked FD 11
closing leaked FD 12
checking outputs of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv'...
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': got EOF
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': woken up
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': build done
killing process 2781
builder process for '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv' finished
builder for '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv' failed due to signal 9 (Killed)
lock released on '/nix/store/v7gvm5w920zr66385aw8y2l3afp3d3kx-simple.lock'
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': done
building of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv': goal destroyed
error: build of '/nix/store/5va03xj28lq8i3709mrzrb63xpvf7kh9-simple.drv' failed

@dtzWill
Copy link
Member

dtzWill commented May 29, 2018

Note also the Nix expression can not be built if sandbox is enabled.

Just curious-- why not? Needed paths aren't available?

@nlewo
Copy link
Member Author

nlewo commented May 30, 2018

It seems the worker gets the EOF on the child logger and sends a kill signal before the child process reach the terminated state. I don't know yet how to fix this...

I could also reproduce by manually building the master of Nix. I run it without any deamon and it takes more iterations (about 2000) to hit the bug.

@dtzWill With sandboxing, it fails with

...
...
checking outputs of '/nix/store/2nh58nwgxp376phzw8vlvmfdhnl8ky30-simple.drv'...
waiting for children
building of '/nix/store/2nh58nwgxp376phzw8vlvmfdhnl8ky30-simple.drv': read 146 bytes
�while setting up the build environment: executing '/nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash': No such file or directory
waiting for children
building of '/nix/store/2nh58nwgxp376phzw8vlvmfdhnl8ky30-simple.drv': got EOF
...
...

@nlewo
Copy link
Member Author

nlewo commented May 31, 2018

@edolstra I think this has been introduced in 21948de. If I remove the code that kill the process, I never hit this issue.
I don't really know what is happening... maybe the kill signal could be emitted after FDs are removed (in do_exit) and before the process reaches the terminated state: even if the builder succeeds, the builder process is then known as killed.

nlewo added a commit to nlewo/nix that referenced this issue Jun 1, 2018
Sometimes, the builder process is killed while the build succeeded. It
seems this is because the signal is sent after file descriptors are
closed (in `do_exit`) and before the process reaches the terminated
state. This leads to a build failure (failed due to signal 9).

To mitigate this issue, a delay (~10s) is introduced before sending the kill
signal to the build process if it doesn't reach the terminated state.

Fixes NixOS#2176
nlewo added a commit to nlewo/nix that referenced this issue Jun 1, 2018
Sometimes, the builder process is killed while the build succeeded. It
seems this is because the signal is sent after file descriptors are
closed (in `do_exit`) and before the process reaches the terminated
state. This leads to a build failure (failed due to signal 9).

To mitigate this issue, a delay (~10s) is introduced before sending the kill
signal to the build process if it doesn't reach the terminated state.

Fixes NixOS#2176
@edolstra
Copy link
Member

edolstra commented Jun 4, 2018

This appears to be due to coreutils, since it closes stdout/stderr prior to quitting (see https://github.com/coreutils/gnulib/commits/master/lib/closeout.c).

Probably if you change touch $out to touch $out; exit 0, this problem won't occur (since then bash won't do a tail-call of touch).

@edolstra
Copy link
Member

edolstra commented Jun 4, 2018

For example:

$ strace bash -c 'touch x'
...
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?

whereas

$ strace bash -c 'touch x; exit 0'
...
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f43e2441250) = 28751
...
wait4(-1, 0x7fffc3ac6bd0, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})                 = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0)                           = ?

@coretemp
Copy link

coretemp commented Jun 4, 2018

Good find, but what are you saying. Is it that coreutils would need to be fixed or Nix? I have the impression that what coreutils does is correct, which would suggest that Nix still needs to be fixed.

@edolstra
Copy link
Member

edolstra commented Jun 4, 2018

Nobody is incorrect here. Nix just requires that builders don't close stdout/stderr.

@nlewo
Copy link
Member Author

nlewo commented Jun 5, 2018

@edolstra yeah, good catch. Thanks!
I confirme touch $out; exit 0 works well. In fact, I've already observed this (and used it as workaround) but I didn't have understood why it worked...

The point is that it has been really hard to troubleshoot (for me at least:/) and it would be nice to avoid this kind of issue.

This issue comes from way the worker knows the builder process is finished. The git history also confirms this has been a long story, with some other issues (builder which never termintaes).
Why are we waiting for closed file descriptor to know the child is terminated? Just for sake of simplicity?

@coretemp
Copy link

coretemp commented Jun 5, 2018

From the manual:

So we have to build a package. Building something from other stuff is called a derivation in Nix (as 
opposed to sources, which are built by humans instead of computers). We perform a derivation by calling stdenv.mkDerivation. mkDerivation is a function provided by stdenv that builds a package from a set of attributes. A set is just a list of key/value pairs where each key is a string and each value is an arbitrary Nix expression. They take the general form { name1 = expr1; ... nameN = exprN; }.

The manual states arbitrary. Additionally, the manual does not specify any conditions in which a builder should not work when a build process would work outside of Nix (which is the situation we have here).

Please explain how given the above argument you still think Nix does not have a bug here.

@knedlsepp
Copy link
Member

knedlsepp commented Nov 5, 2019

I suddenly experience random sigkills after an upgrade to nix 2.3. Not sure if that's the same issue. I also consistenly cannot build the expression mentioned in the beginning of this thread.

zimbatm added a commit to zimbatm/nix-build-uncached that referenced this issue Aug 27, 2020
I'm not sure of the root cause yet but we started seeing kill -9 after using this script.

See NixOS/nix#2176 (comment)
@zimbatm zimbatm closed this as completed Aug 27, 2020
zimbatm added a commit to numtide/devshell that referenced this issue Aug 28, 2020
Sometimes the builds get killed with -9. According to Edolstra the
coreutils tend to close stdout/stderr which leads to that error.

NixOS/nix#2176 (comment)
@domenkozar domenkozar reopened this Aug 29, 2020
@yakman2020
Copy link

Seeing it all of a sudden since yesterday. nix-shell with cmake and make (doing what the derivation says) works fine.

building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 34 bytes
[ 0%] Built target oeedger8r-cpp
waiting for children
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 89 bytes
make[1]: Leaving directory '/nix/store/i1cdkbcab2wag039a9wb4xfb5qib0mim-openenclave-sdk'
waiting for children
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': read 38 bytes
make: *** [Makefile:182: all] Error 2
waiting for children
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': got EOF
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': woken up
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': build done
killing process 29772
builder process for '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv' finished
builder for '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv' failed with exit code 2
lock released on '/nix/store/i1cdkbcab2wag039a9wb4xfb5qib0mim-openenclave-sdk.lock'
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': done
building of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv': goal destroyed
error: build of '/nix/store/ajkgsssfq8gfqvv707lyxv35lg528hik-openenclave-sdk.drv' failed

Profpatsch added a commit to openlab-aux/vuizvui that referenced this issue Mar 26, 2021
We had a bunch of instances of
NixOS/nix#2176,
where nix would exit with a “killed by signal 9” error.

According to Eelco in that issue, this is perfectly normal behaviour
of course, and appears if the last command in a loop closes `stdout`
or `stdin`, then the builder will SIGKILL it immediately. This is of
course also a perfectly fine error message for that case.

It turns out that mainly GNU coreutils exhibit this behaviour …

Let’s see if using a more sane tool suite fixes that.
@stale
Copy link

stale bot commented Jun 2, 2021

I marked this as stale due to inactivity. → More info

tennox pushed a commit to tennox/devshell that referenced this issue May 17, 2022
Sometimes the builds get killed with -9. According to Edolstra the
coreutils tend to close stdout/stderr which leads to that error.

NixOS/nix#2176 (comment)
@kjeremy
Copy link
Contributor

kjeremy commented Oct 4, 2022

I'm seeing this on nix 2.11.

@stale stale bot removed the stale label Oct 4, 2022
@luizribeiro
Copy link

luizribeiro commented Jan 7, 2023

I'm seeing this on remote builds from my CI (from nix flake check). Getting "unexpected Nix daemon error: error: writing to file: Broken pipe" on the nix daemon logs. opensshd logs don't show anything interesting (just the builder user authentication when logging in). It's generally a different build that gets killed with SIGKILL, not a specific derivation.

Are there any flags to that I can use to produce more verbose logs which might be able to help figure out what is sending the SIGKILL?

Edit: I built it with -vvv and noticed that it is the same issue as others mentioned earlier in the thread, where the worker is receiving an EOF from the builder

@srid
Copy link

srid commented Mar 27, 2023

FWIW, I tried the latest master of Nix (thinking that #8049 may have fixed it), but the problem still remains.

@bgamari
Copy link
Contributor

bgamari commented Mar 27, 2023

I believe we may be seeing this very often in GHC CI, particularly when building Haskell packages.

@mupdt
Copy link
Contributor

mupdt commented Mar 28, 2023

We are also experiencing this once in a while in our CI. A very rough estimate is that it happens a bit more than 1 in 100 times.

We just hit it today. This is the derivation we're building:

pkgs.runCommand "foo" {} ''
  mkdir -p $out/bin
  echo "#!${pkgs.bash}/bin/bash" > $out/bin/foo
  echo echo Hello World [we put a UUID here to make sure builds happen all the time] > $out/bin/foo
  chmod u+x $out/bin/foo
  # We're doing exit 0 here to avoid issue https://github.com/NixOS/nix/issues/2176
  exit 0
''

It looks like the exit 0 didn't help.

@roberth
Copy link
Member

roberth commented Jul 11, 2024

Nobody is incorrect here. Nix just requires that builders don't close stdout/stderr.

As a consequence of this decision we also have

We have two potential events to indicate termination

  • outputs have closed
  • process has exited

We should wait for both and have a timer to catch both edge cases where one happens but the other takes too long, as also mentioned in the other issue.

@roberth roberth added bug derivation-build The process of building an individual derivation (see also sandbox label) labels Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug derivation-build The process of building an individual derivation (see also sandbox label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.