
Production - [Alerting] Servicing jobs in R&D queues alert #5277

Open · dotnet-eng-status bot opened this issue Mar 19, 2025 · 8 comments
Labels: Active Alert · Critical · Grafana Alert · Ops - First Responder · Production

dotnet-eng-status bot commented Mar 19, 2025
💔 Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. FR is expected to investigate why the jobs weren't redirected. The most common reasons are:

  • The job was sent to an on-prem queue (one whose name contains osx, arm64, or perf).
    • We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, update the query and add the queue name to the third line, where on-prem queues are listed (see the sketch after this list).
  • The job was sent to a queue that doesn't have a corresponding servicing queue.
    • We need to create the missing queue in the helix-machines repo.
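
For illustration, a minimal sketch of that first fix, assuming a hypothetical new on-prem queue whose name contains 'example.onprem' (the exclusion list is abbreviated here; the full production query appears later in this thread):

// Hypothetical: append the new on-prem identifier to the exclusion list
let UntrackedQueues = Jobs
| project QueueName = tolower(QueueName)
| where QueueName has_any ('osx','perf','arm64','example.onprem') // 'example.onprem' is the added entry
| distinct QueueName;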

Next steps:

  1. Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
  2. Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it

For more context go here

  • ServicingJobs 1

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

dougbu (Member) commented Mar 19, 2025

Kusto shows all results relate to use of NetCore1ESPool-Publishing-Internal. I think we need to either

  1. carve out an exception in the query for this alert, i.e. accept that we don't have (or want) a NetCore1ESPool-Publishing-Svc-Internal pool, or
  2. create such a pool and update the eng/common/ code in multiple Arcade branches to use it.

I lean toward (1) but am unsure. thoughts on which path to take, @premun, @mmitche, @ilyas1974?

ilyas1974 (Contributor) commented:

I think the "better" question to ask is why servicing-related jobs are running in this pool at all. This pool has been around for a couple of years, and this is the first time it has been used for this purpose.

@mmitche please correct me if I'm wrong, but I seem to remember this pool being created to speed up publishing during release. In my opinion, it shouldn't be used for jobs outside that scope.

dougbu (Member) commented Mar 19, 2025

the jobs are all publishing-related: Publish Assets, Publish Using Darc, and Publish to Build Asset Registry

dougbu (Member) commented Mar 19, 2025

actually, build jobs aren't the issue here. please ignore everything I said above. this is about Helix queues, and the problem is perhaps threefold:

  1. should building preview releases use the .svc queues❓
  2. we're not excluding the .cet on-premises queues in the current query; that's an error
  3. the link to the documentation in the issue description goes to a blank page in our Wiki instead of https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki/1178/-Alerts-Servicing-jobs-in-R-D-queues-alert (my search in the Wiki didn't find the correct page)

the current rule shows the following for the past week:
[image: table of alert results omitted]

the current query for this rule is

// Note: If you are changing how we filter jobs, remember to make the same changes in the graph and table
let UntrackedQueues = Jobs 
| project QueueName = tolower(QueueName)
| where QueueName has_any ('osx','perf','armarch','arm64','arcade','coreappcompat','iot','ppc64le.experimental','s390x') 
or QueueName matches regex 'windows.*amd64.android.open'
or QueueName == 'windows.10.amd64.x86.rt'
| distinct QueueName;
Jobs
| where  $__timeFilter(Queued)
| where tolower(QueueName) !in (UntrackedQueues)
| extend TargetBranch=parse_json(Properties)["System.PullRequest.TargetBranch"]
| where (Branch contains "/release/" or Branch startswith "release/" or TargetBranch startswith "release/" or TargetBranch contains "/release/") and QueueName !endswith ".svc"
| project JobId, Queued, Repository, Branch, TargetBranch, QueueName

note the .cet queues aren't untracked and preview branches aren't excluded. as smaller details,

  1. all armarch queues use Azure VMs, not on-premises machines; we should remove that clause
  2. the arm64 check catches both VM-based and on-premises queues; changing that to Cedarcrest would be an improvement (see the sketch below)
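
A rough sketch of what those two changes might look like in the UntrackedQueues filter (hypothetical; per the follow-up comments below, this change was deferred):

// Hypothetical revision: 'armarch' removed, 'arm64' narrowed to the on-premises 'cedarcrest' queues
| where QueueName has_any ('osx','perf','cedarcrest','arcade','coreappcompat','iot','ppc64le.experimental','s390x')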

ilyas1974 (Contributor) commented:

I don't have a problem with us implementing your above suggestions.

dougbu self-assigned this Mar 21, 2025
dougbu (Member) commented Mar 22, 2025

basically, logic in dotnet-helix-service does the redirect and only warns if the new queue name doesn't exist. so, minimal action here unless we update JobController and RedirectHelper in dotnet-helix-service

  • can't address the armarch and arm64 problem at the moment. aspnetcore, machinelearning, and runtime repos make extensive use of Arm queues w/o .svc suffixes and we don't currently provide .svc queues for VM-based ARM queues. oops…
  • it also seems those repos are using .svc queues for preview branches. if we decide that's not fine, dotnet-helix-service code should change before we update the alert

for now, we can add 'cet' and 'cedarcrest' to the query's exclusion list and clean up these alerts (avoiding on-premises queues a bit more); a sketch follows
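
Concretely, a sketch of the updated exclusion line, assuming the two identifiers are simply appended to the existing has_any list:

// Sketch: append 'cet' and 'cedarcrest' so those on-premises queues no longer trigger the alert
| where QueueName has_any ('osx','perf','armarch','arm64','arcade','coreappcompat','iot','ppc64le.experimental','s390x','cet','cedarcrest')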

dougbu (Member) commented Mar 22, 2025

I created a few sub-issues and expect #5291 will clear this alert, allowing us to close it. the other two sub-issues need a bit more work and/or discussion

dougbu (Member) commented Mar 24, 2025

/fyi @ilyas1974 I'm leaving this as assigned to me since #5291 is assigned to me and waiting for rollout
