runtime: TestStackWrapperStackPanic failure on windows-amd64-2016 #35773
Marking as release-blocker since […]
2019-11-21T14:55:12-8bbfc51/windows-amd64-2008
The failure is always when running with […]
The failure is always […]
The test makes a call to […]. This doesn't seem all that crucial, though it does mean that in some failure cases we will print the wrong stack trace.
If we preempt the thread right at the point of the exception, we may inject an asyncPreempt call, and then the exception handler runs (before asyncPreempt does) and injects a sigpanic call. Then we would see sigpanic right on top of asyncPreempt. In traceback, we normally do not elide a wrapper if it is the immediate caller of sigpanic, but we do elide it if the immediate caller is asyncPreempt. If that is what happened, we would see a stack trace like this one. I'm not familiar with the ordering of Windows' SuspendThread/ResumeThread relative to exception handling, so I'm not sure whether this can actually happen.
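To make the elision rule concrete, here is a compilable sketch of the decision being described. The type, constants, and function name are illustrative stand-ins, not the runtime's actual traceback code:

```go
package main

import "fmt"

type funcID int

const (
	funcIDNormal funcID = iota
	funcIDSigpanic
	funcIDAsyncPreempt
)

// keepWrapper reports whether a wrapper frame is kept in the printed
// traceback, given the funcID of the function it called.
func keepWrapper(calleeID funcID) bool {
	// Wrappers are normally elided, except when the wrapper is the
	// immediate caller of sigpanic (i.e. it is the faulting frame).
	return calleeID == funcIDSigpanic
}

func main() {
	// If sigpanic was injected on top of an injected asyncPreempt,
	// the wrapper's callee is asyncPreempt, so the faulting wrapper
	// is (wrongly) elided from the trace.
	fmt.Println(keepWrapper(funcIDSigpanic))     // true: wrapper shown
	fmt.Println(keepWrapper(funcIDAsyncPreempt)) // false: wrapper elided
}
```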
I've been seeing this pretty often in TryBot runs. (Not as frequent as the […])
2019-12-09T15:12:29-f6774bc/windows-amd64-2012
This hasn't happened on the builders since 2019-12-09. The findflakes program reports only a 0.14% chance that the failure is still happening. That said, I have no idea what would have changed to fix this.
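As a back-of-the-envelope illustration of how such a probability can arise (made-up numbers, and not necessarily findflakes' exact model): if failures arrive as a Poisson process with rate λ while the flake is active, the chance of a quiet stretch of length t is e^(−λt).

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	lambda := 0.5     // hypothetical failure rate per day while flaky
	quietDays := 13.0 // hypothetical quiet stretch since 2019-12-09
	// Probability of observing zero failures over the quiet stretch
	// if the flake were still active.
	fmt.Printf("%.2f%%\n", 100*math.Exp(-lambda*quietDays)) // ~0.15%
}
```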
I can easily reproduce this at current master (817afe8) and at f6774bc with the following patch and script:

```diff
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4469,6 +4469,7 @@ func sysmon() {
 		if delay > 10*1000 { // up to 10ms
 			delay = 10 * 1000
 		}
+		delay = 10
 		usleep(delay)
 		now := nanotime()
 		next := timeSleepUntil()
@@ -4563,7 +4564,8 @@ type sysmontick struct {
 
 // forcePreemptNS is the time slice given to a G before it is
 // preempted.
-const forcePreemptNS = 10 * 1000 * 1000 // 10ms
+//const forcePreemptNS = 10 * 1000 * 1000 // 10ms
+const forcePreemptNS = 10 * 1000 // 10us
 
 func retake(now int64) uint32 {
 	n := 0
```

```sh
cd $GOROOT/src/runtime
VM=$(gomote create windows-amd64-2016)
GOOS=windows go test -c && \
gomote put $VM runtime.test.exe && \
gomote run $VM ./runtime.test.exe -test.run TestStackWrapperStackPanic/sigpanic/CallersFrames -test.count 1000000
```

Running with […] (normally it stops when it reaches […]). What I don't fully understand is how we go from […]
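For readers without a runtime tree handy, here is a rough standalone sketch of the pattern the test exercises (not the actual runtime test; the type S, the iface interface, and all other names are invented): a nil-pointer sigpanic inside a compiler-generated method wrapper, with the stack then walked via runtime.CallersFrames from a deferred recover.

```go
package main

import (
	"fmt"
	"runtime"
)

type S struct{ x int }

// M has a value receiver; storing a *S in the interface below makes
// the compiler synthesize a (*S).M wrapper that dereferences the
// pointer to copy the receiver.
func (s S) M() int { return s.x }

type iface interface{ M() int }

func main() {
	defer func() {
		recover()
		pcs := make([]uintptr, 32)
		n := runtime.Callers(0, pcs)
		frames := runtime.CallersFrames(pcs[:n])
		for {
			f, more := frames.Next()
			fmt.Printf("%#x %s\n", f.PC, f.Function)
			if !more {
				break
			}
		}
	}()
	var p *S
	var i iface = p // a nil *S still satisfies iface
	i.M()           // nil dereference (sigpanic) inside the wrapper
}
```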
If we tweak the test to print the PC of each frame in the stack trace, we see that the […]
This is also my previous guess. One possibility would be to print the wrapper function if the previous frame is asyncPreempt. That would make the wrapper appear in the stack trace, but the trace would look like […], which looks as if asyncPreempt itself panicked. I think this can happen with nil pointer dereference panics in general. Another possibility: when injecting a sigpanic, if the PC is the entry PC of asyncPreempt, just overwrite the PC instead of pushing another frame. That would make the stack trace correct. Losing the preemption is probably not a problem, as sigpanic enters the runtime, which will probably preempt it synchronously at some point.
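A self-contained sketch of that second option follows. Everything here is a stand-in (the real change belongs in the runtime's Windows exception handler; the context type, the entry-PC constants, and prepareSigpanic are invented for illustration):

```go
package main

import "fmt"

// context models the thread's register state at the exception.
type context struct {
	pc, sp uintptr
}

const (
	asyncPreemptEntry uintptr = 0x1000 // stand-in for asyncPreempt's entry PC
	sigpanicEntry     uintptr = 0x2000 // stand-in for sigpanic's entry PC
)

// prepareSigpanic rewrites the context so that resuming the thread
// runs sigpanic.
func prepareSigpanic(c *context) {
	if c.pc == asyncPreemptEntry {
		// An asyncPreempt call was injected at the exact point of the
		// exception but has not executed yet. Drop it by overwriting
		// the PC instead of pushing another frame, so the traceback
		// stays correct. The lost preemption is harmless: sigpanic
		// enters the runtime, which can preempt synchronously later.
		c.pc = sigpanicEntry
		return
	}
	// Normal case: make it look like the faulting frame called
	// sigpanic, so the fault is attributed to the right function.
	// (A real handler would also store c.pc at the new c.sp; elided.)
	c.sp -= 8
	c.pc = sigpanicEntry
}

func main() {
	c := &context{pc: asyncPreemptEntry, sp: 0xf000}
	prepareSigpanic(c)
	fmt.Printf("pc=%#x sp=%#x\n", c.pc, c.sp) // PC overwritten, SP unchanged
}
```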
Change https://golang.org/cl/213879 mentions this issue: |
Change https://golang.org/cl/213957 mentions this issue: |
I implemented both approaches. Using @aclements's repro, both fix this. Personally, I prefer option 2. |
Thanks! I assume by option 2 you mean overwriting the asyncPreempt frame. I certainly like that option better, as it's much more localized to the cause of the issue. Changing the traceback logic, on the other hand, is several dominoes away from the actual cause. |
Yes, by option 2 I meant overwriting the asyncPreempt frame. Thanks. |
https://storage.googleapis.com/go-build-log/07ca05df/windows-amd64-2016_8338c7a2.log
CC @aclements @cherrymui @mknyszek @danscales