[feature]: Resolve race conditions in WithBlockingCG option of fn.ContextGuard #9412

starius · 2025-01-10T17:24:22Z

Context

The package fn includes the type ContextGuard, which provides a wait group and a main quit channel to create guarded contexts. When the Quit() method is called, all contexts returned by ContextGuard are canceled. The WgWait method waits for all contexts to expire. The ContextGuard.Create method includes an option, WithBlockingCG, which prevents the returned context from being canceled by the Quit() method. Such contexts are used to block the shutdown of important tasks.

Is your feature request related to a problem? Please describe.

The issue arises when Quit() and Create(ctx, fn.WithBlockingCG()) are called in parallel. In this scenario, the wait group counter might reach 0 before the blocking context is created (and increments the counter back to 1). As a result, WgWait() may unblock when the counter first reaches 0, which is undesirable since it should wait for blocking contexts as well.

A related problem is that WgWait() waits for context cancellations rather than the actual completion of goroutines using those contexts. This means WgWait() could return before the associated goroutine finishes. Typically, a context created by the Create() method is passed to a goroutine in the code using ContextGuard.

Describe the solution you'd like

To address the issue of WgWait() unblocking prematurely, we could merge ContextGuard into GoroutineManager. The latter already supports waiting for the completion of goroutines rather than just context expiration. Since the two share much of the same functionality, and the typical usage pattern of ContextGuard aligns well with GoroutineManager, this integration seems logical.

GoroutineManager internally creates a child context from the one passed to the Go method, then launches a goroutine. We could factor out the context creation logic from the Go method into a Create method, effectively replicating the functionality of ContextGuard. If a context is created using Create(), the Stop method would wait for its expiration instead of the corresponding goroutine's completion.

(Discussion point: Do we need a Create method at all? If created contexts are always passed to goroutines, it might make sense to have Go method only.)

If we adopt this approach, there is still the issue of blocking contexts or goroutines. Currently, the Go method has a boolean return value indicating whether the operation succeeded. It prevents creating new goroutines after the Stop method has been called. However, this behavior is incompatible with blocking contexts or goroutines, as their creation might be required after Stop() is called, e.g. a goroutine might be needed to sync data to the database during shutdown. This raises an open question about how to handle blocking tasks. Let's brainstorm!

One possible approach is to require blocking goroutines to start before Stop() is called. The Stop() method would not cancel blocking contexts but would wait for the corresponding goroutines to complete. Users of GoroutineManager would need to start such goroutines in advance, perhaps in the Start() method of their implementation, and handle context expiration or shutdown logic themselves. If the creation of blocking goroutines in Start() fails, Start() will return an error. This can only occur if Stop() is called while Start() is still running.

Additional context

Related issue #9308

The text was updated successfully, but these errors were encountered:

starius · 2025-01-22T18:40:08Z

It is straight-forward to create the version of Go method creating a goroutine which is not cancelled by Stop() method of GoroutineManager, but still blocking the waiting in Stop():

// GoBlocking tries to start a new blocking goroutine and returns a boolean
// indicating its success. It returns true if the goroutine was successfully
// created and false otherwise. A goroutine will fail to be created iff the
// goroutine manager is stopping.
//
// The difference from Go is that GoroutineManager doesn't manage contexts in
// this case, so the goroutine can run as long as needed. GoroutineManager will
// still wait for its completion from Stop() method. But it is the caller's
// responsibility to stop the launched goroutine and to pass a context to it
// if needed.
//
// This method is intended to perform shutdown of important tasks, where
// interruption is not desirable. The user can start the blocking goroutine
// in advance, e.g. from Start() method of their type, to make sure GoBlocking
// succeeds.
func (g *GoroutineManager) GoBlocking(f func()) bool {
        // Calling wg.Add(1) and wg.Wait() when the wg's counter is 0 is a race
        // condition, since it is not clear if Wait() should block or not. This
        // kind of race condition is detected by Go runtime and results in a
        // crash if running with `-race`. To prevent this, we protect the calls
        // to wg.Add(1) and wg.Wait() with a mutex. If we block here because
        // Stop is running first, then Stop will close the quit channel which
        // will cause the context to be cancelled, and we will exit before
        // calling wg.Add(1). If we grab the mutex here before Stop does, then
        // Stop will block until after we call wg.Add(1).
        g.mu.Lock()
        defer g.mu.Unlock()

        // Ensure that the goroutine is not started if the manager has stopped.
        select {
        case <-g.quit:
                return false
        default:
        }

        g.wg.Add(1)
        go func() {
                defer func() {
                        g.wg.Done()
                }()

                f(ctx)
        }()

        return true
}

The issue with this approach is that the user has to pre-create the blocking goroutine in advance (e.g. from Start() method of their object) and that the goroutine can't create sub-workers after Stop() is called, however it would be safe, since at least one goroutine is still running (the one creating sub-workers).

A better version would have a different requirement for GoBlocking: that at least one goroutine is running (either created by Go or by GoBlocking). Then it would be safe to call GoBlocking from any goroutine created by GoroutineManager - very convenient! However I don't know how to do it with a WaitGroup. We can switch the implementation to wait on a sync.Cond and use a counter, then we can implement this approach.

It is needed to implement blocking goroutines such that they could be launched from any goroutine launched by the GoroutineManager. It is not clear how to implement such a feature using a WaitGroup. See lightningnetwork#9412 (comment)

This method is intended to perform shutdown of important tasks, where interruption is not desirable. See lightningnetwork#9412

starius added the enhancement Improvements to existing features / behaviour label Jan 10, 2025

starius added a commit to starius/lnd that referenced this issue Jan 23, 2025

fn: add method GoBlocking to GoroutineManager

21ff732

This method is intended to perform shutdown of important tasks, where interruption is not desirable. See lightningnetwork#9412

starius mentioned this issue Jan 23, 2025

fn: add method GoBlocking to GoroutineManager #9437

Draft

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature]: Resolve race conditions in WithBlockingCG option of fn.ContextGuard #9412

[feature]: Resolve race conditions in WithBlockingCG option of fn.ContextGuard #9412

starius commented Jan 10, 2025

starius commented Jan 22, 2025

[feature]: Resolve race conditions in WithBlockingCG option of fn.ContextGuard #9412

[feature]: Resolve race conditions in WithBlockingCG option of fn.ContextGuard #9412

Comments

starius commented Jan 10, 2025

starius commented Jan 22, 2025