
Production - [Alerting] Servicing jobs in R&D queues alert #5277

Open · dotnet-eng-status bot opened this issue Mar 19, 2025 · 8 comments
Labels: Active Alert · Critical · Grafana Alert · Ops - First Responder · Production

dotnet-eng-status bot commented Mar 19, 2025
💔 Metric state changed to alerting

One or more servicing jobs were executed in an R&D queue. FR is expected to investigate why the jobs weren't redirected. The most common reasons are:

  • The job was sent to an on-prem queue (one whose name contains osx, arm64, or perf).
    • We don't have physical hardware for servicing work, so on-prem queues should be excluded from this effort. To fix the alert, update the query and add the queue name to the third line, where on-prem queues are listed (see the sketch after this list).
  • The job was sent to a queue that doesn't have a corresponding servicing queue.
    • We need to create the missing queue in the helix-machines repo.
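
For illustration, a minimal sketch of that first fix, assuming a hypothetical new on-prem queue whose name contains 'example.onprem' (the exclusion list is abbreviated here; the full production query appears later in this thread):

// Hypothetical: append the new on-prem identifier to the exclusion list
let UntrackedQueues = Jobs
| project QueueName = tolower(QueueName)
| where QueueName has_any ('osx','perf','arm64','example.onprem') // 'example.onprem' is the added entry
| distinct QueueName;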

Next steps:

  1. Go to https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/historical/backend-status?orgId=1&viewPanel=72
  2. Investigate every job in the table and decide whether we need to update the alert to exclude the job or create a servicing queue for it

For more context go here

  • ServicingJobs 1

Go to rule

@dotnet/dnceng, @dotnet/prodconsvcs, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-5aa74f27ef6445ce9d3d8d3d382e7e35

dougbu (Member) commented Mar 19, 2025

Kusto shows all results relate to use of NetCore1ESPool-Publishing-Internal. I think we need to either

  1. carve out an exception in the query for this alert, i.e. accept that we don't have (or want) a NetCore1ESPool-Publishing-Svc-Internal pool, or
  2. create such a pool and update the eng/common/ code in multiple Arcade branches to use it.

I lean toward (1) but am unsure. thoughts on which path to take, @premun, @mmitche, @ilyas1974?

ilyas1974 (Contributor) commented:

I think the "better" question to ask is why servicing-related jobs are running in this pool at all. This pool has been around for a couple of years, and this is the first time it has been used for this purpose.

@mmitche please correct me if I'm wrong, but I seem to remember this pool being created to speed up publishing during release. In my opinion, it shouldn't be used for jobs outside that scope.

dougbu (Member) commented Mar 19, 2025

the jobs are all publishing-related: Publish Assets, Publish Using Darc, and Publish to Build Asset Registry

dougbu (Member) commented Mar 19, 2025

actually, build jobs aren't the issue here. please ignore everything I said above. this is about Helix queues, and the problem is perhaps threefold:

  1. should building preview releases use the .svc queues❓
  2. we're not excluding the .cet on-premises queues in the current query; that's an error
  3. the link to the documentation in the issue description goes to a blank page in our Wiki instead of https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki/1178/-Alerts-Servicing-jobs-in-R-D-queues-alert (my search in the Wiki didn't find the correct page)

the current rule shows the following for the past week:
[image: table of alert results omitted]

the current query for this rule is

// Note: If you are changing how we filter jobs, remember to make the same changes in the graph and table
let UntrackedQueues = Jobs 
| project QueueName = tolower(QueueName)
| where QueueName has_any ('osx','perf','armarch','arm64','arcade','coreappcompat','iot','ppc64le.experimental','s390x') 
or QueueName matches regex 'windows.*amd64.android.open'
or QueueName == 'windows.10.amd64.x86.rt'
| distinct QueueName;
Jobs
| where  $__timeFilter(Queued)
| where tolower(QueueName) !in (UntrackedQueues)
| extend TargetBranch=parse_json(Properties)["System.PullRequest.TargetBranch"]
| where (Branch contains "/release/" or Branch startswith "release/" or TargetBranch startswith "release/" or TargetBranch contains "/release/") and QueueName !endswith ".svc"
| project JobId, Queued, Repository, Branch, TargetBranch, QueueName

note the .cet queues aren't untracked and preview branches aren't excluded. as smaller details,

  1. all armarch queues use Azure VMs, not on-premises machines; we should remove that clause
  2. the arm64 check catches both VM-based and on-premises queues; changing that to Cedarcrest would be an improvement (see the sketch below)
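
A rough sketch of what those two changes might look like in the UntrackedQueues filter (hypothetical; per the follow-up comments below, this change was deferred):

// Hypothetical revision: 'armarch' removed, 'arm64' narrowed to the on-premises 'cedarcrest' queues
| where QueueName has_any ('osx','perf','cedarcrest','arcade','coreappcompat','iot','ppc64le.experimental','s390x')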

ilyas1974 (Contributor) commented:

I don't have a problem with us implementing your above suggestions.

dougbu self-assigned this Mar 21, 2025
dougbu (Member) commented Mar 22, 2025

basically, logic in dotnet-helix-service does the redirect and only warns if the new queue name doesn't exist. so, minimal action here unless we update JobController and RedirectHelper in dotnet-helix-service

  • can't address the armarch and arm64 problem at the moment. aspnetcore, machinelearning, and runtime repos make extensive use of Arm queues w/o .svc suffixes and we don't currently provide .svc queues for VM-based ARM queues. oops…
  • it also seems those repos are using .svc queues for preview branches. if we decide that's not fine, dotnet-helix-service code should change before we update the alert

for now, we can add 'cet' and 'cedarcrest' to the query's exclusion list and clean up these alerts (avoiding on-premises queues a bit more); a sketch follows
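
Concretely, a sketch of the updated exclusion line, assuming the two identifiers are simply appended to the existing has_any list:

// Sketch: append 'cet' and 'cedarcrest' so those on-premises queues no longer trigger the alert
| where QueueName has_any ('osx','perf','armarch','arm64','arcade','coreappcompat','iot','ppc64le.experimental','s390x','cet','cedarcrest')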

dougbu (Member) commented Mar 22, 2025

I created a few sub-issues and expect #5291 will clear this alert, allowing us to close it. the other two sub-issues need a bit more work and/or discussion

dougbu (Member) commented Mar 24, 2025

/fyi @ilyas1974 I'm leaving this as assigned to me since #5291 is assigned to me and waiting for rollout
