
DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100) #184

Open
haardm opened this issue Aug 14, 2024 · 4 comments


haardm commented Aug 14, 2024

Hi team,

We are observing PCIe violations on p5 instances consistently from the time each instance is launched.
We act on these violations with terminate-and-replace logic, but that is an expensive operation, both in time and in cost, given that the instance type is an EC2 P5. Also, the violation threshold is a fixed default and can't be adjusted by the client when subscribing to the policy.

A few asks:

- Is there an upstream fix planned from NVIDIA?
- Are there any repercussions to temporarily unsubscribing from this policy?
- What would go wrong if we let the PCIe errors keep happening silently?

@haardm haardm changed the title DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100) Aug 15, 2024
nikkon-dev (Collaborator) commented

Hi,

This is a known issue that will be resolved in an upcoming release. In short, the diagnostic didn't correctly normalize the error-rate thresholds to the throughput of the PCIe generation in use.
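To illustrate the normalization issue (a sketch of the idea only, not DCGM's actual implementation): a fixed absolute error-count threshold tuned for a slower link fires too often on a faster one, since a Gen5 x16 link moves roughly four times the data of a Gen3 x16 link in the same window. Normalizing the budget to errors per GB transferred keeps it meaningful across generations; the throughput figures below are approximate.

```go
package main

import "fmt"

// Approximate PCIe x16 per-direction throughput in GB/s by link generation.
// Illustrative values only; not DCGM's internal constants.
var x16ThroughputGBs = map[int]float64{3: 15.75, 4: 31.5, 5: 63.0}

// allowedErrors scales an errors-per-GB budget by how much data a link of
// the given generation actually moves in the sampling window.
func allowedErrors(errsPerGB, windowSec float64, gen int) float64 {
	return errsPerGB * x16ThroughputGBs[gen] * windowSec
}

func main() {
	const budget = 0.01 // hypothetical tolerated errors per GB transferred
	for _, gen := range []int{3, 4, 5} {
		fmt.Printf("Gen%d x16: up to %.0f errors tolerated per 60s window\n",
			gen, allowedErrors(budget, 60, gen))
	}
}
```

With a per-byte budget, the tolerated absolute count on a Gen5 H100 link is about four times that of Gen3; a threshold that ignores this fires roughly four times too eagerly on Gen5 hardware at the same underlying error rate.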


haardm commented Mar 10, 2025

Hello, checking whether this issue still exists. I am still sporadically seeing "too many PCIe errors" detections on my hardware.

nikkon-dev (Collaborator) commented

@haardm, could you clarify which FieldID you are monitoring, or whether you are using health monitoring? I'd like to understand your use case better. It would also be helpful to see the nv-hostengine debug logs: restart the host engine with `nv-hostengine -f debug.log --log-level debug`.

We have indeed fixed the PCIe error thresholds based on the PCIe generation and expected throughput. However, I would like to know whether the PCIe generation is accurately detected on your system.
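If you want to check the detected generation yourself, something along these lines should work with go-dcgm (a minimal sketch: the field IDs are taken from dcgm_fields.h, and `GetLatestValuesForFields` is assumed to be available in your go-dcgm version; verify both against the release you use):

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

// Field IDs from dcgm_fields.h (values assumed; verify against your DCGM headers).
const (
	fiPCIeMaxLinkGen dcgm.Short = 235 // DCGM_FI_DEV_PCIE_MAX_LINK_GEN
	fiPCIeLinkGen    dcgm.Short = 237 // DCGM_FI_DEV_PCIE_LINK_GEN
)

func main() {
	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		log.Fatal(err)
	}
	defer cleanup()

	gpus, err := dcgm.GetSupportedDevices()
	if err != nil {
		log.Fatal(err)
	}

	for _, gpu := range gpus {
		// Depending on the DCGM version, these fields may need an
		// explicit watch before the first read.
		vals, err := dcgm.GetLatestValuesForFields(gpu, []dcgm.Short{fiPCIeMaxLinkGen, fiPCIeLinkGen})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("GPU %d: max PCIe gen=%d, current PCIe gen=%d\n",
			gpu, vals[0].Int64(), vals[1].Int64())
	}
}
```

On a P5 (H100) the current link generation should report as 5; anything lower would explain mis-scaled thresholds.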


haardm commented Mar 11, 2025

> FieldID

I didn't find a fieldID for the "PCI error" policy condition: https://github.com/NVIDIA/go-dcgm/blob/5f1fc9afaa3194d378193ffd8fc49a271e1afd5c/pkg/dcgm/policy.go#L181,L215

policyCondition("PCI error") this is what we subscribe to in our method that calls ListenForPolicyViolations api

I do see one for `NvlinkPolicy`, though. Should we be using that to monitor PCI errors?
