H800 can not open profile feature? Help... #183

Open
cc8476 opened this issue Aug 6, 2024 · 2 comments

cc8476 commented Aug 6, 2024

dcgmi profile --resume
Error: unable to resume profiling metrics: Feature not supported.

Any help would be appreciated.

My environment is as follows:

Hostengine build info:
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H800                    Off | 00000000:16:00.0 Off |                    0 |
| N/A   29C    P0              73W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H800                    Off | 00000000:17:00.0 Off |                    0 |
| N/A   32C    P0              71W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H800                    Off | 00000000:40:00.0 Off |                    0 |
| N/A   33C    P0             117W / 700W |    743MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H800                    Off | 00000000:41:00.0 Off |                    0 |
| N/A   35C    P0              74W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H800                    Off | 00000000:96:00.0 Off |                    0 |
| N/A   29C    P0              72W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H800                    Off | 00000000:97:00.0 Off |                    0 |
| N/A   33C    P0              72W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H800                    Off | 00000000:C0:00.0 Off |                    0 |
| N/A   29C    P0              73W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H800                    Off | 00000000:C1:00.0 Off |                    0 |
| N/A   32C    P0              72W / 700W |      3MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

[root@dcef26e4e4ae /]# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
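
The module listing above shows the Profiling module in the "Failed to load" state, which lines up with the "Feature not supported" error from `dcgmi profile --resume`. For anyone comparing their own setup, here is a minimal sketch of how that state could be picked out of the listing automatically. The function name `failed_modules` is hypothetical (not part of DCGM), and the parsing assumes the pipe-separated layout shown in this thread; other DCGM versions may format the table differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `dcgmi modules -l` style output on stdin and
# prints "ID Name" for every module whose State column is "Failed to load".
failed_modules() {
  awk -F'|' '
    # Data rows have at least 4 pipe-separated fields; borders and the
    # "List Modules" / "Status" lines have fewer and are skipped.
    NF >= 4 {
      id = $2; name = $3; state = $4
      # Trim surrounding whitespace from each column.
      gsub(/^[ \t]+|[ \t]+$/, "", id)
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", state)
      # Only rows with a numeric module ID are real module entries.
      if (id ~ /^[0-9]+$/ && state == "Failed to load") print id, name
    }
  '
}

# Usage (assumes dcgmi is on PATH and a host engine is reachable):
#   dcgmi modules -l | failed_modules
```

An empty result would suggest no module reported a load failure, in which case the profiling error probably has a different cause.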


cc8476 commented Aug 6, 2024

Additional info: starting the host engine with debug logging fails as well:
nv-hostengine -f host.debug.log --log-level debug
Err: Failed to start DCGM Server: -7
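
The thread does not establish why the host engine fails to start here. One thing that may be worth ruling out, purely as an assumption, is that another `nv-hostengine` instance is already running and holding the listening port: `dcgmi modules -l` succeeded earlier, so some host engine was presumably reachable. The sketch below checks `ss -ltn` output for a listener on TCP port 5555, DCGM's default. The helper name `port_in_use` is hypothetical, and the column layout of `ss` is assumed to be the usual iproute2 one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ss -ltn` output on stdin and returns 0 if a
# socket is listening on the TCP port given as $1, otherwise returns 1.
port_in_use() {
  local port="$1"
  awk -v p="$port" '
    $1 == "LISTEN" {
      # The 4th column is "Local Address:Port"; the port is the text
      # after the last colon (this also covers IPv6 such as [::]:5555).
      n = split($4, parts, ":")
      if (parts[n] == p) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (assumes ss and pgrep are installed):
#   ss -ltn | port_in_use 5555 && echo "port 5555 already has a listener"
#   pgrep -a nv-hostengine   # lists any host engine processes already running
```

If a listener is found, stopping the existing host engine before starting a new one in the foreground would be the obvious next thing to try. This would not by itself explain the Profiling module failing to load.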


HH-66 commented Mar 18, 2025

I see the same issue on an L20 with DCGM 4.1.1:
Version : 4.1.1
Build ID : 11087
Build Date : 2025-02-14
Build Type : RelWithDebInfo
Commit ID : 3965d2e947bcea4c496759177222de6115bd58d0
Branch Name : v4.1.1
CPU Arch : x86_64
Build Platform : Linux 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64
CRC : 84de5921dcda2d8986924b6bcea05213

Image

Image

Some debug log output:
Image
