Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a shutdown race condition in ControlCore #18632

Merged
merged 2 commits into from
Mar 14, 2025

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Feb 26, 2025

I found multiple issues while investigating this:

  • Render thread shutdown is racy, because it doesn't actually stop the render thread.
  • Lifetime management in ControlCore failed to account for the circular dependency of render thread --> renderer --> render data --> terminal --> renderer --> render thread. Fixed by reordering the ControlCore members to ensure their correct destruction.
  • Ensured that the connection setter calls close on the previous connection.

(Hopefully) Closes #18598

Validation Steps Performed

  • Can't repro the original failure ❌
  • Opening and closing tabs as fast as possible doesn't crash anymore ✅
  • Detaching and reattaching a tab producing continuous output ✅

@microsoft-github-policy-service microsoft-github-policy-service bot added Issue-Bug It either shouldn't be doing this or needs an investigation. Area-Quality Stability, Performance, Etc. Product-Terminal The new Windows Terminal. Severity-Crash Crashes are real bad news. labels Feb 26, 2025
@lhecker lhecker force-pushed the dev/lhecker/18598-shutdown-crash branch from 5302525 to 45d2519 Compare February 26, 2025 16:15
_outputRevoker = wrappedConnection.TerminalOutput(winrt::auto_revoke, { this, &DebugTapConnection::_OutputHandler });
_stateChangedRevoker = wrappedConnection.StateChanged(winrt::auto_revoke, [this](auto&& /*s*/, auto&& /*e*/) {
StateChanged.raise(*this, nullptr);
_outputRevoker = wrappedConnection.TerminalOutput(winrt::auto_revoke, { get_weak(), &DebugTapConnection::_OutputHandler });
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_weak() to ensure pending calls during revocation can complete without this being deallocated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this by chance fix the issue where closing a connection while the debug tap is on it crashes terminal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware we had such an issue. It's quite likely that this fixes it.

_connectionStateChangedRevoker = newConnection.StateChanged(winrt::auto_revoke, [this](auto&& /*s*/, auto&& /*v*/) {
ConnectionStateChanged.raise(*this, nullptr);
});
_connectionStateChangedRevoker = newConnection.StateChanged(winrt::auto_revoke, { get_weak(), &ControlCore::_connectionStateChangedHandler });
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_weak() to ensure pending calls during revocation can complete without this being deallocated. (Same below.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fuck. I just realized this can't work because it allows this to be destructed on a background thread which breaks WinUI. Got to use a lock or something instead.

// TODO: This manual event revoking doesn't make much sense.
// We could just drop the old connection. Why have all that Close() stuff?
// It also shouldn't need to be exposed to the outside. I suspect we can only
// improve this though, once drag/drop of tabs doesn't use "startup actions" anymore.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For future code archealogists.)

Comment on lines -145 to -151
// First create the render thread.
// Then stash a local pointer to the render thread so we can initialize it and enable it
// to paint itself *after* we hand off its ownership to the renderer.
// We split up construction and initialization of the render thread object this way
// because the renderer and render thread have circular references to each other.
auto renderThread = std::make_unique<::Microsoft::Console::Render::RenderThread>();
auto* const localPointerToThread = renderThread.get();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has always been a thorn in my eye, so I fixed it in this PR by moving the RenderThread into the Renderer. No more ownership issues.

Comment on lines 230 to 232

_renderer.reset();
_renderEngine.reset();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this, because I fixed the member order of the class.

{
AddRenderEngine(rgpEngines[i]);
}
THROW_IF_FAILED(_pThread->Initialize(this));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always forget... is this guaranteed to be usable in the constructor? is Initialize safe for this if this is partially constructed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a valid pointer in the constructor, it's just that the members aren't necessarily fully initialized yet.

@DHowett
Copy link
Member

DHowett commented Feb 26, 2025

I know the tests probably interact with RenderThread in weird ways.

@lhecker lhecker force-pushed the dev/lhecker/18598-shutdown-crash branch from dc45cbb to 8fe4350 Compare March 11, 2025 16:04
Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I think I need you to explain how moving from the current architecture where we have "is the thread allowed to spin" and "does the thread need to actually produce a frame" is OK, and what implications all of this simplification should have on the application. I can see that there's real issues fixed here, but it is a big change in a touchy area.

@@ -434,7 +445,10 @@ namespace winrt::Microsoft::Terminal::Control::implementation
// - <none>
void ControlCore::_sendInputToConnection(std::wstring_view wstr)
{
_connection.WriteInput(winrt_wstring_to_array_view(wstr));
if (_connection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does [[likely]] make a difference here? ergo, is input slower because of the branch? is that too much of a micro-optimization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it won't. likely/unlikely rarely make a difference, but if they do it's primarily in tight code.


// ↑ This one is tricky - all of these are raw pointers:
// 1. _terminal depends on _renderer (for invalidations)
// 2. _renderer depends on _terminal (for IRenderData)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comments below say 3 3 and 1. there is no 2.

@@ -91,7 +63,6 @@ IRenderData* Renderer::GetRenderData() const noexcept
if (--tries == 0)
{
// Stop trying.
_pThread->DisablePainting();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wha?

Copy link
Member Author

@lhecker lhecker Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's way safer if this function simply signals the end of painting to the renderer via a return value. This reminds me that I didn't test this, shame on me.

@lhecker
Copy link
Member Author

lhecker commented Mar 14, 2025

I have good news: this PR fixes another bug where restarting a connection would fail to terminate the previous one. (It would leak the connection by not calling Close on it, since it is unfortunately self-referential.)

Regarding why all this is necessary: I realized during development that the entire logic around _hPaintCompletedEvent in the RenderThread is flawed as it doesn't actually reliably protect against another frame being scheduled. This rewrite fixes the issue which prevents random AVs during shutdown.
…hopefully without any regressions though.

@DHowett DHowett merged commit 70f85a4 into main Mar 14, 2025
19 checks passed
@DHowett DHowett deleted the dev/lhecker/18598-shutdown-crash branch March 14, 2025 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Quality Stability, Performance, Etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Terminal The new Windows Terminal. Severity-Crash Crashes are real bad news.
Projects
Status: To Consider
Status: To Consider
Development

Successfully merging this pull request may close these issues.

Microsoft::Console::VirtualTerminal::StateMachine::ProcessString Crash
2 participants