
Grouped Quantized Transposed Conv is rejected by trtexec. #1934

Closed
tzskp1 opened this issue Apr 19, 2022 · 6 comments
Labels: Feature Request (Request for new functionality), triaged (Issue has been triaged by maintainers)

Comments


tzskp1 commented Apr 19, 2022

Description

I tried quantization via the pytorch-quantization library.

I then hit the following error while attempting to convert a grouped quantized transposed convolution layer.

Thank you.

&&&& RUNNING TensorRT.trtexec [TensorRT v8003] # trtexec --onnx=./bug_on.onnx --int8
[04/19/2022-09:59:10] [I] === Model Options ===
[04/19/2022-09:59:10] [I] Format: ONNX
[04/19/2022-09:59:10] [I] Model: ./bug_on.onnx
[04/19/2022-09:59:10] [I] Output:
[04/19/2022-09:59:10] [I] === Build Options ===
[04/19/2022-09:59:10] [I] Max batch: explicit
[04/19/2022-09:59:10] [I] Workspace: 16 MiB
[04/19/2022-09:59:10] [I] minTiming: 1
[04/19/2022-09:59:10] [I] avgTiming: 8
[04/19/2022-09:59:10] [I] Precision: FP32+INT8
[04/19/2022-09:59:10] [I] Calibration: Dynamic
[04/19/2022-09:59:10] [I] Refit: Disabled
[04/19/2022-09:59:10] [I] Sparsity: Disabled
[04/19/2022-09:59:10] [I] Safe mode: Disabled
[04/19/2022-09:59:10] [I] Restricted mode: Disabled
[04/19/2022-09:59:10] [I] Save engine:
[04/19/2022-09:59:10] [I] Load engine:
[04/19/2022-09:59:10] [I] NVTX verbosity: 0
[04/19/2022-09:59:10] [I] Tactic sources: Using default tactic sources
[04/19/2022-09:59:10] [I] timingCacheMode: local
[04/19/2022-09:59:10] [I] timingCacheFile:
[04/19/2022-09:59:10] [I] Input(s)s format: fp32:CHW
[04/19/2022-09:59:10] [I] Output(s)s format: fp32:CHW
[04/19/2022-09:59:10] [I] Input build shapes: model
[04/19/2022-09:59:10] [I] Input calibration shapes: model
[04/19/2022-09:59:10] [I] === System Options ===
[04/19/2022-09:59:10] [I] Device: 0
[04/19/2022-09:59:10] [I] DLACore:
[04/19/2022-09:59:10] [I] Plugins:
[04/19/2022-09:59:10] [I] === Inference Options ===
[04/19/2022-09:59:10] [I] Batch: Explicit
[04/19/2022-09:59:10] [I] Input inference shapes: model
[04/19/2022-09:59:10] [I] Iterations: 10
[04/19/2022-09:59:10] [I] Duration: 3s (+ 200ms warm up)
[04/19/2022-09:59:10] [I] Sleep time: 0ms
[04/19/2022-09:59:10] [I] Streams: 1
[04/19/2022-09:59:10] [I] ExposeDMA: Disabled
[04/19/2022-09:59:10] [I] Data transfers: Enabled
[04/19/2022-09:59:10] [I] Spin-wait: Disabled
[04/19/2022-09:59:10] [I] Multithreading: Disabled
[04/19/2022-09:59:10] [I] CUDA Graph: Disabled
[04/19/2022-09:59:10] [I] Separate profiling: Disabled
[04/19/2022-09:59:10] [I] Time Deserialize: Disabled
[04/19/2022-09:59:10] [I] Time Refit: Disabled
[04/19/2022-09:59:10] [I] Skip inference: Disabled
[04/19/2022-09:59:10] [I] Inputs:
[04/19/2022-09:59:10] [I] === Reporting Options ===
[04/19/2022-09:59:10] [I] Verbose: Disabled
[04/19/2022-09:59:10] [I] Averages: 10 inferences
[04/19/2022-09:59:10] [I] Percentile: 99
[04/19/2022-09:59:10] [I] Dump refittable layers:Disabled
[04/19/2022-09:59:10] [I] Dump output: Disabled
[04/19/2022-09:59:10] [I] Profile: Disabled
[04/19/2022-09:59:10] [I] Export timing to JSON file:
[04/19/2022-09:59:10] [I] Export output to JSON file:
[04/19/2022-09:59:10] [I] Export profile to JSON file:
[04/19/2022-09:59:10] [I]
[04/19/2022-09:59:10] [I] === Device Information ===
[04/19/2022-09:59:10] [I] Selected Device: NVIDIA GeForce GTX 1650
[04/19/2022-09:59:10] [I] Compute Capability: 7.5
[04/19/2022-09:59:10] [I] SMs: 14
[04/19/2022-09:59:10] [I] Compute Clock Rate: 1.74 GHz
[04/19/2022-09:59:10] [I] Device Global Memory: 3912 MiB
[04/19/2022-09:59:10] [I] Shared Memory per SM: 64 KiB
[04/19/2022-09:59:10] [I] Memory Bus Width: 128 bits (ECC disabled)
[04/19/2022-09:59:10] [I] Memory Clock Rate: 4.001 GHz
[04/19/2022-09:59:10] [I]
[04/19/2022-09:59:10] [I] TensorRT version: 8003
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 335, GPU 326 (MiB)
[04/19/2022-09:59:10] [I] Start parsing network model
[04/19/2022-09:59:10] [I] [TRT] ----------------------------------------------------------------
[04/19/2022-09:59:10] [I] [TRT] Input filename:   ./bug_on.onnx
[04/19/2022-09:59:10] [I] [TRT] ONNX IR version:  0.0.7
[04/19/2022-09:59:10] [I] [TRT] Opset version:    13
[04/19/2022-09:59:10] [I] [TRT] Producer name:    pytorch
[04/19/2022-09:59:10] [I] [TRT] Producer version: 1.11
[04/19/2022-09:59:10] [I] [TRT] Domain:
[04/19/2022-09:59:10] [I] [TRT] Model version:    0
[04/19/2022-09:59:10] [I] [TRT] Doc string:
[04/19/2022-09:59:10] [I] [TRT] ----------------------------------------------------------------
[04/19/2022-09:59:10] [I] Finish parsing network model
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 335, GPU 326 (MiB)
[04/19/2022-09:59:10] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[04/19/2022-09:59:10] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 335 MiB, GPU 328 MiB
[04/19/2022-09:59:10] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +508, GPU +220, now: CPU 843, GPU 548 (MiB)
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +52, now: CPU 957, GPU 600 (MiB)
[04/19/2022-09:59:10] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[04/19/2022-09:59:11] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 957, GPU 584 (MiB)
[04/19/2022-09:59:11] [E] Error[10]: [optimizer.cpp::computeCosts::1855] Error Code 10: Internal Error (Could not find any implementation for node weight + QuantizeLinear_8_quantize_scale_node + ConvTranspose_12.)
[04/19/2022-09:59:11] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

Environment

TensorRT Version: 8003
NVIDIA GPU: GTX 1650
NVIDIA Driver Version: NVIDIA UNIX x86_64 Kernel Module 510.54
CUDA Version: V11.5.50
CUDNN Version:
Operating System: Ubuntu 20.04.3
Python Version (if applicable): 3.8.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): '1.11.0a0+b6df043'
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:21.11-py3

Relevant Files

Steps To Reproduce

  1. Create the ONNX file of the transposed conv layer with the following script:
import torch
import onnx
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import quant_modules

# Use histogram calibration for the inputs of all quantized layers.
quant_desc_input = QuantDescriptor(calib_method='histogram')
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantConvTranspose2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

# Monkey-patch torch.nn so that ConvTranspose2d below resolves to the
# quantized QuantConvTranspose2d.
quant_modules.initialize()

# Depthwise (groups == channels) transposed convolution.
model = torch.nn.ConvTranspose2d(
    64, 64, 4, stride=2, padding=1,
    output_padding=0, groups=64, bias=False)
model.cuda()

# Export fake quantization as ONNX QuantizeLinear/DequantizeLinear nodes.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy_input = torch.randn(1, 64, 64, 64).cuda()
torch.onnx.export(model, dummy_input, './bug_on.onnx', verbose=True, opset_version=13)

model = onnx.load('./bug_on.onnx')
onnx.checker.check_model(model)
  2. Run: trtexec --onnx=./bug_on.onnx --int8
zerollzeng (Collaborator) commented:

I can reproduce this on 8.4. @ttyio do we support grouped convtranspose in QAT?

ttyio (Collaborator) commented Apr 25, 2022

@tzskp1, currently we have only enabled loop-based convtranspose. The limitation is that for INT8 I/O, the channel count must be at least 4 times the number of groups; in your case, you would have to modify the layer to:

model = torch.nn.ConvTranspose2d(
    256, 64, 4, stride=2, padding=1,
    output_padding=0, groups=64, bias=False)

I will create an RFE to enable a non-loop-based convtranspose implementation. Thanks!
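
For reference, here is a minimal sketch of that constraint as a checkable rule. The helper function is hypothetical (it is not a TensorRT API); it simply encodes the requirement stated above:

def int8_grouped_deconv_supported(in_channels: int, groups: int) -> bool:
    # Hypothetical helper: encodes the INT8 I/O limitation described
    # above, i.e. the channel count must be at least 4x the group count.
    return in_channels >= 4 * groups

# The original repro (64 channels, 64 groups, i.e. depthwise) fails it:
assert not int8_grouped_deconv_supported(64, 64)
# The suggested workaround (256 channels, 64 groups) satisfies it:
assert int8_grouped_deconv_supported(256, 64)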

ttyio added the Feature Request and triaged labels on Apr 25, 2022
tzskp1 (Author) commented Apr 28, 2022

Thank you for your response.
I tried exporting a conv layer with the number of groups set to 1/4 of the channel count.
However, its performance is worse than that of a non-grouped conv layer.
Is this caused by the loop-based implementation?

ttyio (Collaborator) commented May 6, 2022

Hello @tzskp1, sorry for the delayed response. It is partially because of the loop-based implementation, but if you try a group count of 1/32 of the channels, you can get more speedup from Tensor Cores. Thanks!
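
For illustration, a minimal sketch of such a configuration, assuming 256 input channels (an arbitrary choice, not from the thread) so that groups = 256 / 32 = 8:

import torch

# Hypothetical configuration: groups is 1/32 of the input channel count,
# so each group is 32 channels wide, which maps onto Tensor Cores far
# better than the depthwise (groups == channels) layout.
model = torch.nn.ConvTranspose2d(
    256, 64, 4, stride=2, padding=1,
    output_padding=0, groups=8, bias=False)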

nvpohanh (Collaborator) commented Jul 1, 2022

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks.

nvpohanh closed this as completed on Jul 1, 2022
oxana-nvidia (Collaborator) commented:

I've tested with TRT 10.9 and the issue is resolved. I'm not sure what the earliest version is in which it was fixed.

[03/03/2025-13:36:52] [I] === Performance summary ===
[03/03/2025-13:36:52] [I] Throughput: 2975.87 qps
[03/03/2025-13:36:52] [I] Latency: min = 0.44873 ms, max = 0.50415 ms, mean = 0.462956 ms, median = 0.462646 ms, percentile(90%) = 0.463745 ms, percentile(95%) = 0.468536 ms, percentile(99%) = 0.471313 ms
[03/03/2025-13:36:52] [I] Enqueue Time: min = 0.0231934 ms, max = 0.0812988 ms, mean = 0.0372808 ms, median = 0.0368042 ms, percentile(90%) = 0.0385742 ms, percentile(95%) = 0.0395355 ms, percentile(99%) = 0.0512695 ms
[03/03/2025-13:36:52] [I] H2D Latency: min = 0.0942383 ms, max = 0.141357 ms, mean = 0.099128 ms, median = 0.098877 ms, percentile(90%) = 0.0994873 ms, percentile(95%) = 0.0998535 ms, percentile(99%) = 0.106812 ms
[03/03/2025-13:36:52] [I] GPU Compute Time: min = 0.0285645 ms, max = 0.0583496 ms, mean = 0.0292459 ms, median = 0.029541 ms, percentile(90%) = 0.0297852 ms, percentile(95%) = 0.0297852 ms, percentile(99%) = 0.0297852 ms
[03/03/2025-13:36:52] [I] D2H Latency: min = 0.320801 ms, max = 0.351318 ms, mean = 0.334587 ms, median = 0.334229 ms, percentile(90%) = 0.334473 ms, percentile(95%) = 0.335266 ms, percentile(99%) = 0.342651 ms
[03/03/2025-13:36:52] [I] Total Host Walltime: 3.00081 s
[03/03/2025-13:36:52] [I] Total GPU Compute Time: 0.261166 s
[03/03/2025-13:36:52] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/03/2025-13:36:52] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/03/2025-13:36:52] [W] * Throughput may be bound by host-to-device transfers for the inputs rather than GPU Compute and the GPU may be under-utilized.
[03/03/2025-13:36:52] [W]   Add --noDataTransfers flag to disable data transfers.
[03/03/2025-13:36:52] [W] * Throughput may be bound by device-to-host transfers for the outputs rather than GPU Compute and the GPU may be under-utilized.
[03/03/2025-13:36:52] [W]   Add --noDataTransfers flag to disable data transfers.
[03/03/2025-13:36:52] [W] * GPU compute time is unstable, with coefficient of variance = 2.53877%.
[03/03/2025-13:36:52] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[03/03/2025-13:36:52] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/03/2025-13:36:52] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100900] [b32] # trtexec --onnx=/tmp/bug_on.onnx --int8
