Good day, seeing a similar profiling issue with the L4s and H100s
OS: SLES15 SP5
CUDA Version: 12.6.0 (updated here given this version was compiled against 12.6.3)
Driver Version: 560.28.03
GPU: On a single chassis: 4 L4s and 2 H100s
CPU: Intel XEON Platinum 8462Y+
Kernel: 5.14.22-150500.55.7
Tried version 4.1.1 from the NVIDIA CUDA repos and am hitting issues with the new version. Everything seemingly works well with version 3.3.9, but we have to move away from it because of a bug in the dcgm-exporter available at that version: NVIDIA/dcgm-exporter#412. The errors generally state
Also of note: I tested on SLES15 SP5 with A2s, and profiling seems to work fine on that generation, but the Hopper and Ada architectures are failing. This seems similar to #210.
The things that stand out a bit more in the debug logs are:
The logs show the steps that repeat for a single L4, and then what happens for the 2 H100s on the same host:
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Building LOP metrics for GPU with CC 8.9 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1195] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id GrActive, lopTag gr__cycles_active.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id SmActive, lopTag sm__cycles_active.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id SmOccupancy, lopTag sm__warps_active_realtime.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PipeTensorActive, lopTag sm__pipe_tensor_cycles_active_realtime_v2.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id DramActive, lopTag dram__throughput.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PipeFp64Active, lopTag NULL is not supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1245] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PipeFp32Active, lopTag sm__inst_executed_pipe_fma_type_fp32_realtime.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PipeFp16Active, lopTag sm__inst_executed_pipe_fma_type_fp16_realtime.avg.pct_of_peak_sustained_elapsed is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PcieTxBytes, lopTag pcie__write_bytes.sum.per_second is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PcieRxBytes, lopTag pcie__read_bytes.sum.per_second is supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1261] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id NvLinkTxBytes, lopTag NULL is not supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1245] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id NvLinkRxBytes, lopTag NULL is not supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1245] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PipeTensorImmaActive, lopTag NULL is not supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1245] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 3, id PipeTensorHmmaActive, lopTag NULL is not supported. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1245] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::BuildLopMetricList]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1001} is supported for gpuId 3 as metricIndex GrActive [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric GrActive entity level 5 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1002} is supported for gpuId 3 as metricIndex SmActive [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric SmActive entity level 5 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1003} is supported for gpuId 3 as metricIndex SmOccupancy [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric SmOccupancy entity level 5 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1004} is supported for gpuId 3 as metricIndex PipeTensorActive [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric PipeTensorActive entity level 5 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1005} is supported for gpuId 3 as metricIndex DramActive [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric DramActive entity level 4 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] FieldId FieldId {1006} is not supported for gpuId 3 due to metricIndex PipeFp64Active [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1157] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1007} is supported for gpuId 3 as metricIndex PipeFp32Active [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric PipeFp32Active entity level 5 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1008} is supported for gpuId 3 as metricIndex PipeFp16Active [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric PipeFp16Active entity level 5 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1009} is supported for gpuId 3 as metricIndex PcieTxBytes [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric PcieTxBytes entity level 1 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 INFO [2141:2143] [[Profiling]] fieldId FieldId {1010} is supported for gpuId 3 as metricIndex PcieRxBytes [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1162] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Assigning metric PcieRxBytes entity level 1 [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1178] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] FieldId FieldId {1011} is not supported for gpuId 3 due to metricIndex NvLinkTxBytes [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1157] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] FieldId FieldId {1012} is not supported for gpuId 3 due to metricIndex NvLinkRxBytes [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1157] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] FieldId FieldId {1013} is not supported for gpuId 3 due to metricIndex PipeTensorImmaActive [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1157] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] FieldId FieldId {1014} is not supported for gpuId 3 due to metricIndex PipeTensorHmmaActive [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1157] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::TryAddingFieldIdWithMetric]
2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] Unsupported GPU architecture: 8(device: 0, bus: 181, domain: 0) [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1489] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Ignoring StopSampling for pwDeviceIndex 3 since we haven't started sampling. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmLopGpu.cpp:464] [DcgmLopGpu::StopSampling]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 4 chip arch 9 is not Volta, Turing, Ampere or Ada. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2508] [DcgmNs::Modules::Profiling::ValidateGpuArch]
2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] DCP metrics are supported for Volta, Turing, Ampere or Ada GPUs architectures only [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2509] [DcgmNs::Modules::Profiling::ValidateGpuArch]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Skipping GPU that cannot be used for DCP monitoring. IsGpuSuitableForDcpWatching returned -36 Profiling is not supported for this group of GPUs or GPU [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1445] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 5 chip arch 9 is not Volta, Turing, Ampere or Ada. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2508] [DcgmNs::Modules::Profiling::ValidateGpuArch]
2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] DCP metrics are supported for Volta, Turing, Ampere or Ada GPUs architectures only [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2509] [DcgmNs::Modules::Profiling::ValidateGpuArch]
2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] Skipping GPU that cannot be used for DCP monitoring. IsGpuSuitableForDcpWatching returned -36 Profiling is not supported for this group of GPUs or GPU [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1445] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] No GPU with LOP support were found. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1548] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:499] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/builds/dcgm/dcgm/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
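In case it helps anyone triaging a longer capture, here is a rough Python sketch for pulling just the ERROR entries out of a log like the one above. The regular expression is my own assumption, inferred from the line layout in this paste (timestamp, level, pid:tid, module, message, source location, function); it is not an official DCGM log format definition, and `summarize` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the lines pasted above:
# <date> <time> <LEVEL> [pid:tid] [[Module]] <message> [<file>:<line>] [<function>]
LINE_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) \[\d+:\d+\] "
    r"\[\[(?P<module>[^\]]+)\]\] (?P<msg>.*?) "
    r"\[(?P<src>[^\[\]]+:\d+)\] \[(?P<func>[^\[\]]+)\]$"
)


def summarize(lines):
    """Count entries per log level and collect (function, message) for ERRORs."""
    levels = Counter()
    errors = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m is None:
            continue  # skip wrapped or non-log lines
        levels[m["level"]] += 1
        if m["level"] == "ERROR":
            # Keep only the last component of the C++ qualified function name.
            errors.append((m["func"].rsplit("::", 1)[-1], m["msg"]))
    return levels, errors


# Two lines copied from the log above, used as sample input.
SAMPLE = [
    "2025-01-31 09:26:06.613 DEBUG [2141:2143] [[Profiling]] gpuId 4 chip arch 9 "
    "is not Volta, Turing, Ampere or Ada. "
    "[/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2508] "
    "[DcgmNs::Modules::Profiling::ValidateGpuArch]",
    "2025-01-31 09:26:06.613 ERROR [2141:2143] [[Profiling]] No GPU with LOP support "
    "were found. "
    "[/builds/dcgm/dcgm/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1548] "
    "[DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]",
]

if __name__ == "__main__":
    levels, errors = summarize(SAMPLE)
    print(dict(levels))
    for func, msg in errors:
        print(f"{func}: {msg}")
```

Lines that do not match the assumed layout (for example, entries that wrapped when pasted) are skipped rather than guessed at, so the counts are a lower bound.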