CI: tests crash with exit code 42 #598

Closed
lmb opened this issue Mar 15, 2022 · 6 comments

@lmb
Collaborator

lmb commented Mar 15, 2022

We're having problems with our CI, where tests fairly often fail with exit code 42. That error is generated when the VM doesn't output anything and no success file is generated.

ebpf/run-tests.sh

Lines 60 to 62 in bf256fd

if [[ ! -e "${output}/success" ]]; then
  exit 42
fi

This happens across all of the packages we test and across multiple major kernel versions. It doesn't always reproduce, but usually rebuilding a PR once or twice will trigger the problem at least once.
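The marker-file convention behind that check can be sketched as follows. This is only an illustration: the helper name `check_success_marker` and the temporary directories are hypothetical and not part of run-tests.sh. The only thing taken from the snippet above is that a missing `success` file maps to exit code 42.

```shell
# Hypothetical helper: return 0 if "<dir>/success" exists, 42 otherwise.
check_success_marker() {
  if [ ! -e "$1/success" ]; then
    return 42
  fi
  return 0
}

# Simulate one guest that finished and one that died without output.
ok_dir="$(mktemp -d)"
dead_dir="$(mktemp -d)"
touch "${ok_dir}/success"   # assumption: a guest that completes its tests creates this file

status=0
check_success_marker "${ok_dir}" || status=$?
echo "finished guest: ${status}"

status=0
check_success_marker "${dead_dir}" || status=$?
echo "crashed guest: ${status}"

rm -rf "${ok_dir}" "${dead_dir}"
```

A guest that crashes before writing anything is indistinguishable from one that never started, which is why the code alone says so little about the cause.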

@lmb lmb added the bug Something isn't working label Mar 15, 2022
@lmb lmb self-assigned this Mar 15, 2022
@lmb
Collaborator Author

lmb commented Mar 15, 2022

I've been banging my head against this for a while, and finally made some progress. I enabled tracing of the kvm_run_exit event in qemu using -trace kvm_run_exit:

15268@1647341556.924605:kvm_run_exit cpu_index 0, reason 2
15268@1647341556.928341:kvm_run_exit cpu_index 0, reason 8

reason is given to us by the kernel, as the field exit_reason of struct kvm_run. 2 is KVM_EXIT_IO, 8 is KVM_EXIT_SHUTDOWN. The latter sounds innocuous, but is actually only generated in very rare circumstances: https://elixir.bootlin.com/linux/v5.16.14/A/ident/KVM_EXIT_SHUTDOWN. It seems to be triggered when the VM experiences a triple fault.
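To make such traces easier to read, the numeric reasons can be annotated with their names. The following is a minimal sketch: the function names `kvm_exit_name` and `annotate_kvm_run_exit` are made up for illustration, and the table only covers the first few KVM_EXIT_* values from include/uapi/linux/kvm.h. It assumes trace lines in the format shown above, ending in `reason <number>`.

```shell
# Map a numeric KVM exit reason to its name (partial table).
kvm_exit_name() {
  case "$1" in
    0) echo KVM_EXIT_UNKNOWN ;;
    1) echo KVM_EXIT_EXCEPTION ;;
    2) echo KVM_EXIT_IO ;;
    3) echo KVM_EXIT_HYPERCALL ;;
    4) echo KVM_EXIT_DEBUG ;;
    5) echo KVM_EXIT_HLT ;;
    6) echo KVM_EXIT_MMIO ;;
    7) echo KVM_EXIT_IRQ_WINDOW_OPEN ;;
    8) echo KVM_EXIT_SHUTDOWN ;;
    9) echo KVM_EXIT_FAIL_ENTRY ;;
    10) echo KVM_EXIT_INTR ;;
    *) echo "unknown($1)" ;;
  esac
}

# Read qemu trace output on stdin and append the reason name to each
# kvm_run_exit line; all other lines pass through unchanged.
annotate_kvm_run_exit() {
  while IFS= read -r line; do
    case "${line}" in
      *kvm_run_exit*reason\ [0-9]*)
        reason="${line##*reason }"   # keep only what follows the last "reason "
        printf '%s (%s)\n' "${line}" "$(kvm_exit_name "${reason}")"
        ;;
      *)
        printf '%s\n' "${line}"
        ;;
    esac
  done
}
```

Piping the two lines above through `annotate_kvm_run_exit` would tag the first as KVM_EXIT_IO and the second as KVM_EXIT_SHUTDOWN.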

This means digging into what's happening in the kernel. Using perf to record all kvm tracepoints I managed to capture the following:

...
[001]   770.850155:                      kvm:kvm_entry: vcpu 0, rip 0x1000fe
[001]   770.850177:                       kvm:kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT rip 0x100107 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x800000fb error_code 0x00000000
[001]   770.850207:                      kvm:kvm_entry: vcpu 0, rip 0x100107
[001]   770.850228:                       kvm:kvm_exit: vcpu 0 reason CR_ACCESS rip 0x100143 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
[001]   770.850234:                         kvm:kvm_cr: cr_write 0 = 0x80000001
[001]   770.850287:                      kvm:kvm_entry: vcpu 0, rip 0x100146
[001]   770.850307:                       kvm:kvm_exit: vcpu 0 reason TRIPLE_FAULT rip 0x100146 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
[001]   770.850313:                        kvm:kvm_fpu: unload
[001]   770.850316:             kvm:kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)
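When a capture like this is long, it helps to cut it down to the events leading up to the fault. Below is a rough sketch of such a filter. The function name `events_before_triple_fault` is hypothetical, and it assumes textual trace output in the format shown above (for example, as printed by `perf script`), one event per line.

```shell
# Print the N events preceding the first TRIPLE_FAULT exit, plus the exit
# itself. Reads the trace on stdin; N defaults to 5.
events_before_triple_fault() {
  awk -v n="${1:-5}" '
    # Keep the most recent n + 1 lines in a ring buffer.
    { buf[NR % (n + 1)] = $0 }
    /kvm_exit:.*reason TRIPLE_FAULT/ {
      start = NR - n
      if (start < 1) start = 1
      for (i = start; i <= NR; i++) print buf[i % (n + 1)]
      exit
    }
  '
}

# Possible usage (assumes a perf.data file recorded with the kvm tracepoints):
#   perf script | events_before_triple_fault 3
```

Applied to the trace above with N set to 3, this would show the CR_ACCESS exit, the cr_write to CR0, and the following kvm_entry right before the TRIPLE_FAULT line.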

@lmb
Collaborator Author

lmb commented Mar 16, 2022

Sent an email to the KVM mailing list: https://lore.kernel.org/kvm/95c1dc01-4aa0-46a6-95b1-bbc62588ac6e@www.fastmail.com/T/#u

@lmb
Collaborator Author

lmb commented Apr 23, 2022

No takers on the mailing list, unfortunately. I've decided to report it to Ubuntu: https://bugs.launchpad.net/ubuntu/+source/linux-meta-hwe-5.13/+bug/1970034

lmb added a commit to lmb/ebpf that referenced this issue May 3, 2022
Let's add a botch since I've not made any progress with fixing the problem.
Retry running a test twice if we hit the error 42 condition and are
executing on CI.

The CI variable is set by Semaphore: https://docs.semaphoreci.com/ci-cd-environment/environment-variables/#ci

Updates cilium#598
lmb added a commit that referenced this issue May 3, 2022
Let's add a botch since I've not made any progress with fixing the problem.
Retry running a test twice if we hit the error 42 condition and are
executing on CI.

The CI variable is set by Semaphore: https://docs.semaphoreci.com/ci-cd-environment/environment-variables/#ci

Updates #598
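The retry logic described in the commit message could look roughly like this. It is a sketch, not the code from the commit: the function name `run_with_retry` is hypothetical, "retry twice" is read here as up to two additional attempts, and the check assumes that CI is set to `true` on the CI system.

```shell
# Run "$@"; if it exits with 42 while CI=true, retry up to two more times.
# Any other exit code, or any run outside CI, is returned immediately.
run_with_retry() {
  attempt=0
  max_retries=2   # assumption: "twice" means two extra attempts
  while :; do
    rc=0
    "$@" || rc=$?
    if [ "${rc}" -ne 42 ] || [ "${CI:-false}" != "true" ] || [ "${attempt}" -ge "${max_retries}" ]; then
      return "${rc}"
    fi
    attempt=$((attempt + 1))
    echo "exit code 42, retrying (${attempt}/${max_retries})" >&2
  done
}

# Possible usage (the command and its argument are placeholders):
#   run_with_retry ./run-tests.sh <kernel-version>
```

Restricting the retry to exit code 42 keeps genuine test failures visible, and gating it on CI means local runs still fail on the first crash.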
@lmb
Collaborator Author

lmb commented Sep 1, 2022

The workaround works, so I'm closing this.

@lmb lmb closed this as completed Sep 1, 2022
@juhoarvid

Hi,

What was the workaround? I am interested, as my VM is not coming up successfully and I am getting
"CPU-27633 [000] 67926.303544: kvm_exit: reason CR_ACCESS rip 0xcf39 info 100 0"

@lmb
Collaborator Author

lmb commented Sep 29, 2022

The workaround is to restart qemu a couple of times! Your error message looks different than ours though.
