runtime: fatal error: stopg: invalid status #17099
I feel like I've seen that error elsewhere, though. @aclements?
Not a go test, but a test of something written in Go.
@bradfitz, you saw it in the trybot run of CL 26762: https://storage.googleapis.com/go-build-log/2b07a28a/nacl-amd64p32_70247f4c.log
Comment 1 is the only time this has ever shown up on the build dashboard, at least for master.
CL https://golang.org/cl/29280 mentions this issue. |
Issue #17099 shows a failure that indicates we rescanned a stack twice concurrently during mark termination, which suggests that the rescan list became inconsistent. Add a simple check when we dequeue something from the rescan list that it claims to be at the index where we found it.

Change-Id: I6a267da4154a2e7b7d430cb4056e6bae978eaf62
Reviewed-on: https://go-review.googlesource.com/29280
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
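For readers following along, the check described above boils down to verifying, when a G is dequeued for rescanning, that the index recorded in the G matches the slot it was actually found in. Below is a minimal sketch of that idea; the types and names are simplified stand-ins, not the actual runtime declarations.

```go
// Package rescansketch illustrates the dequeue-time consistency check
// described in CL 29280, using simplified stand-ins for the runtime's
// internal structures.
package rescansketch

import "fmt"

type g struct {
	goid     int64
	gcRescan int32 // index of this G in the rescan list, or -1 if not queued
}

type rescanList struct {
	list []*g
	n    int
}

// drain dequeues every queued G for rescanning. If a G's recorded index does
// not match the slot it was found in, the list has been corrupted (for
// example, by the same G being queued twice), so we fail loudly instead of
// rescanning the wrong stack.
func (r *rescanList) drain(scan func(*g)) error {
	for i := 0; i < r.n; i++ {
		gp := r.list[i]
		if gp.gcRescan != int32(i) {
			return fmt.Errorf("g %d claims rescan index %d but was found at %d",
				gp.goid, gp.gcRescan, i)
		}
		gp.gcRescan = -1 // dequeued
		scan(gp)
	}
	r.n = 0
	return nil
}
```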
The Android build (which uses Go) was updated to 1.7 on Aug 17, since then we've seen 4 occurrences of "fatal error: stopg: invalid status" on the build bots, and I just saw one on a local build.
We've had one failure since I put in the consistency check: 2016-09-26T15:46:44-375092b/linux-amd64:
I'm not sure what's going on here, but it's interesting that a whole pile of Gs were in the wrong places on the rescan list (all of these errors happened while the first error was just trying to get the panic out the door). This suggests we have some sort of ordering issue that's completely smashing the state of the rescan list.
Another failure on the s390x builder: https://build.golang.org/log/2fe597deb692316d5bf7b07bcb0a787e94a8c185
Some of the pointers seem to match up with the amd64 failure:
Another trybot flake on linux-amd64:
Just got an internal bug report of a "rescan list overflow" panic:
This is further evidence that we're adding a G to the rescan list more than once.
A couple more failures on the builders @ 94f49fd: amd64: https://build.golang.org/log/052fde13671350d57666b95b60eaf455dfed36e1
I'm not at all an expert on concurrent programming, but there doesn't seem to me to be much preventing a G from being added to the rescan list twice. If there are two concurrent calls to queueRescan with the same g, the one that ends up waiting on the lock will still add the g to the rescan list. Shouldn't it check gp.gcRescan != -1 (again, or instead) after taking the lock?
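For illustration, the window being asked about looks roughly like this. This is a simplified sketch of the pattern in question, not the actual runtime source, with the suggested re-check under the lock included.

```go
// Package queuesketch shows the check-then-act window: gp.gcRescan is tested
// before the list lock is taken, so if two callers could ever race on the
// same G, both would pass the check and the G would be appended twice. The
// re-check under the lock closes that window.
package queuesketch

import "sync"

type g struct {
	gcRescan int32 // index in the rescan list, or -1 if not queued
}

type rescanWork struct {
	mu   sync.Mutex
	list []*g
}

func (w *rescanWork) queueRescan(gp *g) {
	// Check made without holding the lock, as in the code being discussed.
	if gp.gcRescan != -1 {
		panic("g already on rescan list")
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	// Suggested extra check: a concurrent caller may have queued gp while we
	// were blocked acquiring the lock.
	if gp.gcRescan != -1 {
		return
	}
	gp.gcRescan = int32(len(w.list))
	w.list = append(w.list, gp)
}
```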
The theory, at least, is that everywhere that modifies gp.gcRescan requires exclusive ownership of gp, which is ensured by a different set of locks. So, gcRescan shouldn't be changing underfoot in queueRescan, even if it blocks on the lock. Of course, it's possible we got that wrong, but I suspect we didn't: if there were two concurrent calls to queueRescan, I would expect it to fail at least sometimes with "g already on rescan list", but that's never happened on the build bots.
Looking back over the recent failures, there's a very interesting pattern: the indexes printed by the panic are all off by the same amount (that is, the goroutine at rescan index N was supposed to be at index N+M, for a fixed M). Even weirder, M is specifically 33 in most of the failures.
In all but one of the logs, M is one more than the number of background mark workers. Rescanning of mark workers is handled specially, so there may well be a bug around this.
In fact, when markroot queues a mark worker for rescan (which is the only type of goroutine markroot queues), it does not own the goroutine. I've been racking my brain trying to find some sequence of events that could cause this to be an actual problem, without any luck, but it certainly is suspicious. Unfortunately, I also haven't been able to reproduce this. However, I'm about to send out a CL that removes the queueRescan from markroot, both because I don't trust it and because it interferes with the hybrid barrier. I suspect that will fix this, but we may have to wait and keep an eye on the builders to confirm that.
CL https://golang.org/cl/31820 mentions this issue. |
CL https://golang.org/cl/31766 mentions this issue. |
Currently, markroot delays scanning mark worker stacks until mark termination by putting the mark worker G directly on the rescan list when it encounters one during the mark phase. Without this, since mark workers are non-preemptible, two mark workers that attempt to scan each other's stacks can deadlock.

However, this is annoyingly asymmetric and causes some real problems. First, markroot does not own the G at that point, so it's not technically safe to add it to the rescan list. I haven't been able to find a specific problem this could cause, but I suspect it's the root cause of issue #17099. Second, this will interfere with the hybrid barrier, since there is no stack rescanning during mark termination with the hybrid barrier.

This commit switches to a different approach. We move the mark worker's call to gcDrain to the system stack and set the mark worker's status to _Gwaiting for the duration of the drain to indicate that it's preemptible. This lets another mark worker scan its G stack while the drain is running on the system stack. We don't return to the G stack until we can switch back to _Grunning, which ensures we don't race with a stack scan. This lets us eliminate the special case for mark worker stack scans and scan them just like any other goroutine. The only subtlety to this approach is that we have to disable stack shrinking for mark workers; they could be referring to captured variables from the G stack, so it's not safe to move their stacks.

Updates #17099 and #17503.

Change-Id: Ia5213949ec470af63e24dfce01df357c12adbbea
Reviewed-on: https://go-review.googlesource.com/31820
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
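Reduced to a sketch, the approach the commit describes looks like the following. The functions and constants here are stubs that only mimic the runtime internals they are named after (casgstatus, systemstack, gcDrain), so that the shape compiles on its own; only the ordering of the status changes around the system-stack switch is the point.

```go
// Package drainsketch mirrors the shape of the fix described in the commit
// message above, with stubbed stand-ins for runtime internals.
package drainsketch

type g struct{ status uint32 }

const (
	_Grunning uint32 = iota
	_Gwaiting
)

func casgstatus(gp *g, oldStatus, newStatus uint32) { gp.status = newStatus } // stub: the runtime does an atomic CAS
func systemstack(fn func())                         { fn() }                  // stub: the runtime switches to the g0 stack
func gcDrain()                                      {}                        // stub: drains GC work buffers

// markWorkerDrain flips the worker to _Gwaiting and runs the drain on the
// system stack, so another mark worker can scan this goroutine's stack while
// the drain is in progress. It only switches back to _Grunning (and returns
// to the G stack) afterwards, avoiding any race with a concurrent stack scan.
func markWorkerDrain(gp *g) {
	casgstatus(gp, _Grunning, _Gwaiting) // stack is now safe for others to scan
	systemstack(gcDrain)
	casgstatus(gp, _Gwaiting, _Grunning) // back on the G stack, no longer scannable
}
```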
Sadly, it looks like CL 31820 (d6625ca) did not fix this. It just happened again at 0acefdb: https://build.golang.org/log/cf689c09c427cfd6494a80b19480c49a6cc392f9
With the hybrid barrier in place, we can now disable stack rescanning by default. This commit adds a "gcrescanstacks" GODEBUG variable that is off by default but can be set to re-enable STW stack rescanning. The plan is to leave this off but available in Go 1.8 for debugging and as a fallback.

With this change, worst-case mark termination time at GOMAXPROCS=12 *not* including time spent stopping the world (which is still unbounded) is reliably under 100 µs, with a 95%ile around 50 µs in every benchmark I tried (the go1 benchmarks, the x/benchmarks garbage benchmark, and the gcbench activegs and rpc benchmarks). Including time spent stopping the world usually adds about 20 µs to total STW time at GOMAXPROCS=12, but I've seen it add around 150 µs in these benchmarks when a goroutine takes time to reach a safe point (see issue #10958) or when stopping the world races with goroutine switches. At GOMAXPROCS=1, where this isn't an issue, worst case STW is typically 30 µs.

The go-gcbench activegs benchmark is designed to stress large numbers of dirty stacks. This commit reduces 95%ile STW time for 500k dirty stacks by nearly three orders of magnitude, from 150ms to 195µs.

This has little effect on the throughput of the go1 benchmarks or the x/benchmarks benchmarks.

name        old time/op  new time/op  delta
XGarbage-12 2.31ms ± 0%  2.32ms ± 1%  +0.28% (p=0.001 n=17+16)
XJSON-12    12.4ms ± 0%  12.4ms ± 0%  +0.41% (p=0.000 n=18+18)
XHTTP-12    11.8µs ± 0%  11.8µs ± 1%  ~      (p=0.492 n=20+18)

It reduces the tail latency of the x/benchmarks HTTP benchmark:

name     old p50-time  new p50-time  delta
XHTTP-12 489µs ± 0%    491µs ± 1%    +0.54% (p=0.000 n=20+18)
name     old p95-time  new p95-time  delta
XHTTP-12 957µs ± 1%    960µs ± 1%    +0.28% (p=0.002 n=20+17)
name     old p99-time  new p99-time  delta
XHTTP-12 1.76ms ± 1%   1.64ms ± 1%   -7.20% (p=0.000 n=20+18)

Comparing to the beginning of the hybrid barrier implementation ("runtime: parallelize STW mcache flushing") shows that the hybrid barrier trades a small performance impact for much better STW latency, as expected. The magnitude of the performance impact is generally small:

name                     old time/op  new time/op  delta
BinaryTree17-12          2.37s ± 1%   2.42s ± 1%   +2.04% (p=0.000 n=19+18)
Fannkuch11-12            2.84s ± 0%   2.72s ± 0%   -4.00% (p=0.000 n=19+19)
FmtFprintfEmpty-12       44.2ns ± 1%  45.2ns ± 1%  +2.20% (p=0.000 n=17+19)
FmtFprintfString-12      130ns ± 1%   134ns ± 0%   +2.94% (p=0.000 n=18+16)
FmtFprintfInt-12         114ns ± 1%   117ns ± 0%   +3.01% (p=0.000 n=19+15)
FmtFprintfIntInt-12      176ns ± 1%   182ns ± 0%   +3.17% (p=0.000 n=20+15)
FmtFprintfPrefixedInt-12 186ns ± 1%   187ns ± 1%   +1.04% (p=0.000 n=20+19)
FmtFprintfFloat-12       251ns ± 1%   250ns ± 1%   -0.74% (p=0.000 n=17+18)
FmtManyArgs-12           746ns ± 1%   761ns ± 0%   +2.08% (p=0.000 n=19+20)
GobDecode-12             6.57ms ± 1%  6.65ms ± 1%  +1.11% (p=0.000 n=19+20)
GobEncode-12             5.59ms ± 1%  5.65ms ± 0%  +1.08% (p=0.000 n=17+17)
Gzip-12                  223ms ± 1%   223ms ± 1%   -0.31% (p=0.006 n=20+20)
Gunzip-12                38.0ms ± 0%  37.9ms ± 1%  -0.25% (p=0.009 n=19+20)
HTTPClientServer-12      77.5µs ± 1%  78.9µs ± 2%  +1.89% (p=0.000 n=20+20)
JSONEncode-12            14.7ms ± 1%  14.9ms ± 0%  +0.75% (p=0.000 n=20+20)
JSONDecode-12            53.0ms ± 1%  55.9ms ± 1%  +5.54% (p=0.000 n=19+19)
Mandelbrot200-12         3.81ms ± 0%  3.81ms ± 1%  +0.20% (p=0.023 n=17+19)
GoParse-12               3.17ms ± 1%  3.18ms ± 1%  ~      (p=0.057 n=20+19)
RegexpMatchEasy0_32-12   71.7ns ± 1%  70.4ns ± 1%  -1.77% (p=0.000 n=19+20)
RegexpMatchEasy0_1K-12   946ns ± 0%   946ns ± 0%   ~      (p=0.405 n=18+18)
RegexpMatchEasy1_32-12   67.2ns ± 2%  67.3ns ± 2%  ~      (p=0.732 n=20+20)
RegexpMatchEasy1_1K-12   374ns ± 1%   378ns ± 1%   +1.14% (p=0.000 n=18+19)
RegexpMatchMedium_32-12  107ns ± 1%   107ns ± 1%   ~      (p=0.259 n=18+20)
RegexpMatchMedium_1K-12  34.2µs ± 1%  34.5µs ± 1%  +1.03% (p=0.000 n=18+18)
RegexpMatchHard_32-12    1.77µs ± 1%  1.79µs ± 1%  +0.73% (p=0.000 n=19+18)
RegexpMatchHard_1K-12    53.6µs ± 1%  54.2µs ± 1%  +1.10% (p=0.000 n=19+19)
Template-12              61.5ms ± 1%  63.9ms ± 0%  +3.96% (p=0.000 n=18+18)
TimeParse-12             303ns ± 1%   300ns ± 1%   -1.08% (p=0.000 n=19+20)
TimeFormat-12            318ns ± 1%   320ns ± 0%   +0.79% (p=0.000 n=19+19)
Revcomp-12 (*)           509ms ± 3%   504ms ± 0%   ~      (p=0.967 n=7+12)
[Geo mean]               54.3µs       54.8µs       +0.88%

(*) Revcomp is highly non-linear, so I only took samples with 2 iterations.

name        old time/op  new time/op  delta
XGarbage-12 2.25ms ± 0%  2.32ms ± 1%  +2.74% (p=0.000 n=16+16)
XJSON-12    11.6ms ± 0%  12.4ms ± 0%  +6.81% (p=0.000 n=18+18)
XHTTP-12    11.6µs ± 1%  11.8µs ± 1%  +1.62% (p=0.000 n=17+18)

Updates #17503.

Updates #17099, since you can't have a rescan list bug if there's no rescan list. I'm not marking it as fixed, since gcrescanstacks can still be set to re-enable the rescan lists.

Change-Id: I6e926b4c2dbd4cd56721869d4f817bdbb330b851
Reviewed-on: https://go-review.googlesource.com/31766
Reviewed-by: Rick Hudson <rlh@golang.org>
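(Aside: the variable mentioned in this commit is set like any other GODEBUG option when running a Go binary, e.g. `GODEBUG=gcrescanstacks=1 ./yourprogram`; the binary name here is just a placeholder.)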
Rescan lists will be disabled in Go 1.8 unless a GODEBUG flag is set. I'm moving this to Go 1.9 with the intent of closing it when we remove rescan lists entirely. If we decide in the next few weeks that this is worth fixing for a 1.7 point release, we can re-re-milestone this issue.
CL https://golang.org/cl/36619 mentions this issue. |
Seen once on the builder so far: https://build.golang.org/log/4c0ec95e51e895280902515d8fa6c608487d01a3
Possibly a subtle bug in the new SSA backend.