DCGM v4 diag info message count regression #215
Comments
Thank you for your report. I can see two potential concerns:
Following our Reporting An Issue process would help collect diagnostic information (logs, version information, etc.) that would be valuable for thoroughly investigating both aspects of this behavior.
I thought I had provided enough details to reproduce the issue. Regardless, I have attached the raw JSON files (compressed) and the nvidia-smi files.
Thank you for the additional details. The issue reported here has been reproduced and confirmed. We are exploring potential solutions.
With prior versions of `dcgmi diag -r 4`, we could reliably expect informational messages about each test run on each GPU to be output to an `info` field in both the JSON and text-based output. For instance:
When we run the same tests with DCGM version 4, the number of informational messages returned per diagnostic run is reduced to at most 16 entries.
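One rough way to compare the two versions is to count the entries stored under `info` keys in the JSON output. The sketch below assumes only that the messages live under a key named `info`, as described above. It walks the document recursively instead of relying on a particular schema, and the sample data is hypothetical, not real `dcgmi` output.

```python
import json


def count_info_messages(node):
    """Recursively count the entries stored under any "info" key."""
    if isinstance(node, dict):
        total = 0
        for key, value in node.items():
            if key == "info":
                # Assumption: an info field holds a list of strings or a single string.
                total += len(value) if isinstance(value, list) else 1
            else:
                total += count_info_messages(value)
        return total
    if isinstance(node, list):
        return sum(count_info_messages(item) for item in node)
    return 0


# Hypothetical sample; the real output layout may differ.
sample = json.loads(
    '{"tests": ['
    '{"name": "a", "results": [{"gpu": 0, "info": ["x", "y"]}, {"gpu": 1, "info": "z"}]},'
    '{"name": "b", "results": [{"gpu": 0}]}'
    ']}'
)
print(count_info_messages(sample))
```

Running the same function over the JSON from a pre-4 release and from version 4 should make the difference in message counts directly comparable.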
Tracing through the code, there is an attempt to cap the number of informational messages per test run to `16`:
https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_structs.h#L2318
That cap is then reused/renamed here to cap the maximum number of informational responses:
https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_structs.h#L2763
That value is then used here to set the maximum array size for those informational messages:
https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_structs.h#L2794
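To illustrate why a per-test cap reused as the size of a single response-wide array would produce this symptom, here is a toy model. It is not DCGM code: the class, constant name, and test names are all hypothetical, and only the value 16 comes from the header linked above.

```python
MAX_INFO_MSGS = 16  # value from the issue; the constant name is hypothetical


class DiagResponse:
    """Toy response holding info messages for all tests in one fixed-capacity array."""

    def __init__(self, capacity=MAX_INFO_MSGS):
        self.capacity = capacity
        self.info = []    # entries are (test_name, gpu_id, message)
        self.dropped = 0  # messages discarded once the array is full

    def add_info(self, test_name, gpu_id, message):
        if len(self.info) >= self.capacity:
            self.dropped += 1
            return False
        self.info.append((test_name, gpu_id, message))
        return True


# Three hypothetical tests, each reporting one message per GPU on an 8-GPU node.
response = DiagResponse()
for test_name in ("test_a", "test_b", "test_c"):
    for gpu_id in range(8):
        response.add_info(test_name, gpu_id, f"{test_name} ran on GPU {gpu_id}")

tests_with_info = sorted({entry[0] for entry in response.info})
print(len(response.info), response.dropped, tests_with_info)
```

In this model the first two tests fill the array and the third test contributes no messages at all, which matches the pattern of an info section that ends early and is absent from later test results.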
Within the version 4 output, this is seen as an info section being truncated early and then just missing from subsequent test results:
I would expect to be able to see the informational messages on a per-test instance basis.