
Enable running tests on Helix (distributed workload platform) #4248


Closed
natemcmaster opened this issue Nov 27, 2018 · 21 comments

Comments

@natemcmaster
Contributor

natemcmaster commented Nov 27, 2018

Currently, running all ASP.NET Core tests can take hours, even with assembly parallelization. Other large .NET repos have used Helix, a platform for distributing and running workloads in parallel. We should set up our PRs and CI builds to use this.

Scenarios

  1. Running tests without Helix.
    This must continue to work so contributors do not need to run tests on Helix. This should work via dotnet test for most projects (and/or Visual Studio Test Explorer).

  2. Microsoft developers: running tests on Helix. Those with an access token to Helix should be able to run tests on Helix with a few simple parameters, such as build.cmd -helixTest /p:HelixApiToken=xyz123

  3. CI: by default, PRs and CI builds run tests on Helix. build.cmd -ci.

  4. Configuring Helix test settings can be done in the test's .csproj file.

<!-- MyTests.csproj -->
<Project Sdk="Microsoft.NET.Sdk">

    <PropertyGroup>
        <TargetFramework>netcoreapp3.0</TargetFramework>

        <!-- Projects can opt-out of Helix -->
        <DisableDistributedTests>true</DisableDistributedTests>

        <!-- By default, all test projects would run on the same set of platforms, as defined in $(DefaultTestPlatforms). Exact platform names TBD. -->
        <DefaultTestPlatforms>Windows10.x64;macOS.x64;Ubuntu16.x64</DefaultTestPlatforms>
    </PropertyGroup>

    <ItemGroup>
        <!-- Most projects should not need to specify. This would be set by default for all test projects. $(DisableDefaultTestPlatforms) can be used to opt-out. -->
        <DistributedTestPlatform Include="$(DefaultTestPlatforms)" Condition=" '$(DisableDefaultTestPlatforms)' != 'true' " />

        <!-- For special cases, projects can configure additional platforms -->
        <DistributedTestPlatform Include="Windows10.x86" />

        <!-- Projects can also Include/Exclude to filter the defaults -->
        <DistributedTestPlatform Include="$(DefaultTestPlatforms)" Exclude="macOS.x64" />
    </ItemGroup>

</Project>
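
To make the intended Include/Exclude behavior concrete, here is a minimal sketch of how the final platform list for a test project could be resolved. It only models the proposal above and is not how MSBuild evaluates items; the function name `resolve_platforms` and its parameters are hypothetical, and the platform names are the placeholder values from the snippet (exact names are still TBD).

```python
def resolve_platforms(defaults, disable_defaults=False, include=(), exclude=()):
    """Return the ordered, de-duplicated platforms a test project would run on.

    Hypothetical model of the proposal: start from the defaults (unless the
    project opts out), append any additional platforms, then drop exclusions.
    """
    candidates = ([] if disable_defaults else list(defaults)) + list(include)
    platforms = []
    for name in candidates:
        if name not in exclude and name not in platforms:
            platforms.append(name)
    return platforms


# Placeholder platform names taken from the snippet above.
DEFAULTS = ["Windows10.x64", "macOS.x64", "Ubuntu16.x64"]

print(resolve_platforms(DEFAULTS))                             # defaults only
print(resolve_platforms(DEFAULTS, include=["Windows10.x86"]))  # extra platform
print(resolve_platforms(DEFAULTS, exclude=["macOS.x64"]))      # filtered defaults
```

In this model, a project that sets the opt-out flag and lists no platforms of its own ends up with an empty list, which would correspond to not running on Helix at all.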

cref https://github.com/aspnet/AspNetCore-Internal/issues/981 - additional details for the implementer

@natemcmaster natemcmaster added the area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework label Nov 27, 2018
@HaoK
Member

HaoK commented Nov 27, 2018

Note: currently, submitting the Helix job with the 92 work items for extensions (test projects * net461/netcore20/21) just on Windows takes 20 minutes. Each publish directory is around 15 MB, so that's around 1.3 GB in total, which might account for why it takes so long :/

@natemcmaster
Contributor Author

That number is likely to be even higher on aspnet/AspNetCore. We should coordinate with the Helix and corefx guys to see if there are tricks we can use to reduce the size of payloads.

@HaoK
Member

HaoK commented Nov 27, 2018

Yeah, each target queue is another multiple too, so if we are targeting mac/linux/windows queues that's already an hour just to publish the 3 jobs for extensions :(
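
The arithmetic behind these estimates can be checked with a short sketch. The inputs (92 work items, roughly 15 MB per publish directory, 20 minutes per job, three queues) come from the comments above; the function name and the assumption that jobs are submitted one after another, so that time scales linearly with the number of queues, are illustrative only.

```python
def estimate_submission(work_items, mb_per_item, minutes_per_job, queues):
    """Estimate payload size per job and total submission time across queues.

    Assumes every queue receives the same payload and that jobs are
    submitted sequentially, which is a simplification.
    """
    payload_mb = work_items * mb_per_item
    total_minutes = minutes_per_job * queues
    return payload_mb, total_minutes


payload_mb, total_minutes = estimate_submission(
    work_items=92, mb_per_item=15, minutes_per_job=20, queues=3
)
print(f"payload per job: {payload_mb} MB (~{payload_mb / 1024:.2f} GiB)")
print(f"time to submit all queues: {total_minutes} minutes")
```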

@HaoK
Member

HaoK commented Feb 7, 2019

@natemcmaster good to close this or do we want this open to track further helix work? There are a lot of projects/tests disabled (individual issues are tracking them but there's no meta issue tracking the full set), and there's also the open question of next steps, as keeping helix green is going to be a constant battle in the short term (between new tests that just don't work, and sporadic flaky tests which I've mostly skipped already)

@natemcmaster
Contributor Author

natemcmaster commented Feb 7, 2019

There are still important action items on this. I wouldn't close it yet. From my perspective, the remaining work is

  • Stabilize tests on Helix. This doesn't mean @HaoK fixes them. Skipping + opening issues seems to be the only reasonable approach.
  • Merge the 'helix-test' definition with the main CI. Implication: this makes Helix tests required to pass to merge PRs/produce a release
  • Stop running tests on the CI build agents altogether.
  • Add docs in docs/ explaining how developers can run Helix locally.

@HaoK
Member

HaoK commented Feb 20, 2019

Files with skipped tests on helix (Total of 24 on 4/22)

  • src/Identity/ApiAuthorization.IdentityServer/test/Configuration/ConfigureSigningCredentialsTests.cs
  • src/Identity/ApiAuthorization.IdentityServer/test/Configuration/SigningKeysLoaderTests.cs
  • src/Hosting/Hosting/test/WebHostTests.cs
  • src/Servers/Kestrel/Kestrel/test/GeneratedCodeTests.cs
  • src/Servers/Kestrel/test/Interop.FunctionalTests/H2SpecTests.cs
  • src/Servers/Kestrel/test/InMemory.FunctionalTests/HttpsConnectionAdapterTests.cs
  • src/Servers/Kestrel/test/InMemory.FunctionalTests/Http2/Http2TimeoutTests.cs
  • src/Servers/Kestrel/test/InMemory.FunctionalTests/Http2/Http2StreamTests.cs
  • src/Servers/Kestrel/test/BindTests/AddressRegistrationTests.cs
  • src/Servers/Kestrel/test/InMemory.FunctionalTests/Http2/TlsTests.cs
  • src/Servers/Kestrel/test/FunctionalTests/RequestTests.cs
  • src/Servers/HttpSys/test/FunctionalTests/ResponseCachingTests.cs
  • src/Servers/HttpSys/test/FunctionalTests/HttpsTests.cs
  • src/Servers/IIS/IIS/test/Common.FunctionalTests/Inprocess/StartupTests.cs
  • src/Servers/IIS/IIS/test/IIS.Shared.FunctionalTests/MofFileTests.cs
  • src/DataProtection/Extensions/test/DataProtectionProviderTests.cs
  • src/Components/Blazor/Build/test/RuntimeDependenciesResolverTest.cs
  • src/Components/Components/test/RendererTest.cs
  • src/Tools/FirstRunCertGenerator/test/CertificateManagerTests.cs
  • src/Tools/dotnet-watch/test/GlobbingAppTests.cs
  • src/Tools/dotnet-watch/test/NoDepsAppTests.cs
  • src/Tools/dotnet-watch/test/DotNetWatcherTests.cs
  • src/Security/Authentication/test/SecureDataFormatTests.cs

@Eilon
Contributor

Eilon commented Apr 17, 2019

@HaoK / @aspnet/build / @ajcvickers - what milestone should this be in?

@ajcvickers
Contributor

Moved it to 6 for now.

@HaoK
Member

HaoK commented Jun 1, 2019

@natemcmaster do we still need this meta issue open? Moving to preview7 for now since preview6 is done

@HaoK HaoK modified the milestones: 3.0.0-preview6, 3.0.0-preview7 Jun 1, 2019
@natemcmaster
Contributor Author

There are a bunch of checkbox lists in the comments above. If there are individual issues tracking that work, then I guess we could close this. I've been tracking this issue as I haven't seen any other open issues tracking the work to make Helix the default method of testing. Right now it's still an optional check that only runs on PRs.

@HaoK
Member

HaoK commented Jun 3, 2019

Okay, sounds like this is still the main issue tracking making Helix the default, then.

@dougbu
Contributor

dougbu commented Jul 16, 2019

@HaoK we keep pushing this down the road without making changes to the issue. What is the current status?

If much is left open, especially if too much is on your plate to make further progress, suggest creating smaller issues and spreading them to team members with the most context in each test area.

/cc @anurse @mkArtakMSFT @ajcvickers @Pilchie

@HaoK
Member

HaoK commented Jul 16, 2019

Since I think we have tracking items for all of the tests that aren't running on Helix yet, this issue is mostly about making Helix required for merging PRs, which is somewhat subjective. The build team probably has more overall insight on how consistently the Helix tests have been green. But for my own personal PRs in the past month or so, Helix runs have still almost always been red, which leads me to conclude it's not ready to be a required check yet.

I will spend some time this week gathering more info and seeing what's causing the consistent failures (last time I looked, it was a high number of Helix tests timing out due to queue contention).

@HaoK
Member

HaoK commented Jul 16, 2019

Okay, I think the issue is that Helix checks show as red even when tests pass on an automatic retry, for example in this PR today: https://dev.azure.com/dnceng/public/_build/results?buildId=266896

Two tests passed on retry and the overview of the job shows green for these two tests (since they passed on rerun), but if you go to the individual work items you see the failures

All green overview:
https://mc.dot.net/#/user/aspnetcore/pr~2Faspnet~2Faspnetcore/ci/20190715.58.x64.1

Fireballs (even though they passed on rerun):
https://mc.dot.net/#/user/aspnetcore/pr~2Faspnet~2Faspnetcore/ci/20190715.58.x64.1/workItem/Sockets.FunctionalTests~2Fnetcoreapp3.0
https://mc.dot.net/#/user/aspnetcore/pr~2Faspnet~2Faspnetcore/ci/20190715.58.x64.1/workItem/IIS.NewHandler.FunctionalTests~2Fnetcoreapp3.0

I'll follow up with the helix folks to figure out how to ignore the failures if they pass on retry
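
As a rough illustration of the behavior being asked for, the sketch below treats a work item as green when every test's final attempt passed, regardless of earlier failed attempts. The data model (`AttemptHistory`, `final_outcome`, `work_item_is_green`) is hypothetical and is not the Helix or Azure DevOps API; it only shows the aggregation rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttemptHistory:
    """One test's attempts in order; True means that attempt passed."""
    name: str
    attempts: List[bool]


def final_outcome(history: AttemptHistory) -> str:
    """Classify a test by its last attempt, noting whether a retry was needed."""
    if not history.attempts or not history.attempts[-1]:
        # No attempts, or the last attempt failed: count as a failure.
        return "failed"
    return "passed" if all(history.attempts) else "passed-on-retry"


def work_item_is_green(histories: List[AttemptHistory]) -> bool:
    """A work item is green when no test ends in the 'failed' state."""
    return all(final_outcome(h) != "failed" for h in histories)


# Hypothetical test names; the attempt data is made up for illustration.
histories = [
    AttemptHistory("FlakyTestA", [False, True]),  # failed once, passed on rerun
    AttemptHistory("StableTestB", [True]),        # passed first time
]
print([final_outcome(h) for h in histories])
print(work_item_is_green(histories))
```

Keeping "passed-on-retry" as a distinct outcome means the retried tests can still be reported as flaky even though they no longer turn the check red.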

@ajcvickers
Contributor

@dougbu Has Helix become a priority again? My last understanding was that we were not prioritizing Helix currently.

@dougbu
Contributor

dougbu commented Jul 16, 2019

@ajcvickers I commented because it wasn't clear whether the empty checkboxes were a true reflection of the current state or whether this was now irrelevant given other issues. I didn't say we should change the issue's priority.

In any case, I don't remember exactly how we left Helix. I would be very interested in the technology if

  • it were more reliable
  • we partitioned the tests to use multiple agents per OS and sped things up significantly, making it a viable way to shorten our pipeline times
  • flaky tests were better supported

Side note: The Helix tests have found a few platform-specific issues we'd have otherwise missed.

A topic for our sync this afternoon…

@ajcvickers
Contributor

@dougbu We do need to move to Helix; that is the plan. It's just not currently something we are spending time on.

@HaoK
Member

HaoK commented Jul 19, 2019

Moving out of 3.0

@HaoK HaoK modified the milestones: 3.0.0-preview8, Backlog Jul 19, 2019
@HaoK
Member

HaoK commented Jul 19, 2019

Parked in backlog (not sure if 3.1 is more appropriate)

@HaoK
Member

HaoK commented Nov 8, 2019

Closing this in favor of smaller tracking items now that helix is part of the required checks for builds

@HaoK HaoK closed this as completed Nov 8, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2019

7 participants