
Grouped Quantized Transposed Conv is rejected by trtexec. #1934

Closed
tzskp1 opened this issue Apr 19, 2022 · 6 comments
Labels: Feature Request (Request for new functionality), triaged (Issue has been triaged by maintainers)

Comments


tzskp1 commented Apr 19, 2022

Description

I tried quantization via the pytorch-quantization library.

I then hit the following error while attempting to convert a grouped quantized transposed convolution layer.

Thank you.

&&&& RUNNING TensorRT.trtexec [TensorRT v8003] # trtexec --onnx=./bug_on.onnx --int8
[04/19/2022-09:59:10] [I] === Model Options ===
[04/19/2022-09:59:10] [I] Format: ONNX
[04/19/2022-09:59:10] [I] Model: ./bug_on.onnx
[04/19/2022-09:59:10] [I] Output:
[04/19/2022-09:59:10] [I] === Build Options ===
[04/19/2022-09:59:10] [I] Max batch: explicit
[04/19/2022-09:59:10] [I] Workspace: 16 MiB
[04/19/2022-09:59:10] [I] minTiming: 1
[04/19/2022-09:59:10] [I] avgTiming: 8
[04/19/2022-09:59:10] [I] Precision: FP32+INT8
[04/19/2022-09:59:10] [I] Calibration: Dynamic
[04/19/2022-09:59:10] [I] Refit: Disabled
[04/19/2022-09:59:10] [I] Sparsity: Disabled
[04/19/2022-09:59:10] [I] Safe mode: Disabled
[04/19/2022-09:59:10] [I] Restricted mode: Disabled
[04/19/2022-09:59:10] [I] Save engine:
[04/19/2022-09:59:10] [I] Load engine:
[04/19/2022-09:59:10] [I] NVTX verbosity: 0
[04/19/2022-09:59:10] [I] Tactic sources: Using default tactic sources
[04/19/2022-09:59:10] [I] timingCacheMode: local
[04/19/2022-09:59:10] [I] timingCacheFile:
[04/19/2022-09:59:10] [I] Input(s)s format: fp32:CHW
[04/19/2022-09:59:10] [I] Output(s)s format: fp32:CHW
[04/19/2022-09:59:10] [I] Input build shapes: model
[04/19/2022-09:59:10] [I] Input calibration shapes: model
[04/19/2022-09:59:10] [I] === System Options ===
[04/19/2022-09:59:10] [I] Device: 0
[04/19/2022-09:59:10] [I] DLACore:
[04/19/2022-09:59:10] [I] Plugins:
[04/19/2022-09:59:10] [I] === Inference Options ===
[04/19/2022-09:59:10] [I] Batch: Explicit
[04/19/2022-09:59:10] [I] Input inference shapes: model
[04/19/2022-09:59:10] [I] Iterations: 10
[04/19/2022-09:59:10] [I] Duration: 3s (+ 200ms warm up)
[04/19/2022-09:59:10] [I] Sleep time: 0ms
[04/19/2022-09:59:10] [I] Streams: 1
[04/19/2022-09:59:10] [I] ExposeDMA: Disabled
[04/19/2022-09:59:10] [I] Data transfers: Enabled
[04/19/2022-09:59:10] [I] Spin-wait: Disabled
[04/19/2022-09:59:10] [I] Multithreading: Disabled
[04/19/2022-09:59:10] [I] CUDA Graph: Disabled
[04/19/2022-09:59:10] [I] Separate profiling: Disabled
[04/19/2022-09:59:10] [I] Time Deserialize: Disabled
[04/19/2022-09:59:10] [I] Time Refit: Disabled
[04/19/2022-09:59:10] [I] Skip inference: Disabled
[04/19/2022-09:59:10] [I] Inputs:
[04/19/2022-09:59:10] [I] === Reporting Options ===
[04/19/2022-09:59:10] [I] Verbose: Disabled
[04/19/2022-09:59:10] [I] Averages: 10 inferences
[04/19/2022-09:59:10] [I] Percentile: 99
[04/19/2022-09:59:10] [I] Dump refittable layers:Disabled
[04/19/2022-09:59:10] [I] Dump output: Disabled
[04/19/2022-09:59:10] [I] Profile: Disabled
[04/19/2022-09:59:10] [I] Export timing to JSON file:
[04/19/2022-09:59:10] [I] Export output to JSON file:
[04/19/2022-09:59:10] [I] Export profile to JSON file:
[04/19/2022-09:59:10] [I]
[04/19/2022-09:59:10] [I] === Device Information ===
[04/19/2022-09:59:10] [I] Selected Device: NVIDIA GeForce GTX 1650
[04/19/2022-09:59:10] [I] Compute Capability: 7.5
[04/19/2022-09:59:10] [I] SMs: 14
[04/19/2022-09:59:10] [I] Compute Clock Rate: 1.74 GHz
[04/19/2022-09:59:10] [I] Device Global Memory: 3912 MiB
[04/19/2022-09:59:10] [I] Shared Memory per SM: 64 KiB
[04/19/2022-09:59:10] [I] Memory Bus Width: 128 bits (ECC disabled)
[04/19/2022-09:59:10] [I] Memory Clock Rate: 4.001 GHz
[04/19/2022-09:59:10] [I]
[04/19/2022-09:59:10] [I] TensorRT version: 8003
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 335, GPU 326 (MiB)
[04/19/2022-09:59:10] [I] Start parsing network model
[04/19/2022-09:59:10] [I] [TRT] ----------------------------------------------------------------
[04/19/2022-09:59:10] [I] [TRT] Input filename:   ./bug_on.onnx
[04/19/2022-09:59:10] [I] [TRT] ONNX IR version:  0.0.7
[04/19/2022-09:59:10] [I] [TRT] Opset version:    13
[04/19/2022-09:59:10] [I] [TRT] Producer name:    pytorch
[04/19/2022-09:59:10] [I] [TRT] Producer version: 1.11
[04/19/2022-09:59:10] [I] [TRT] Domain:
[04/19/2022-09:59:10] [I] [TRT] Model version:    0
[04/19/2022-09:59:10] [I] [TRT] Doc string:
[04/19/2022-09:59:10] [I] [TRT] ----------------------------------------------------------------
[04/19/2022-09:59:10] [I] Finish parsing network model
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 335, GPU 326 (MiB)
[04/19/2022-09:59:10] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[04/19/2022-09:59:10] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 335 MiB, GPU 328 MiB
[04/19/2022-09:59:10] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +508, GPU +220, now: CPU 843, GPU 548 (MiB)
[04/19/2022-09:59:10] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +52, now: CPU 957, GPU 600 (MiB)
[04/19/2022-09:59:10] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[04/19/2022-09:59:11] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 957, GPU 584 (MiB)
[04/19/2022-09:59:11] [E] Error[10]: [optimizer.cpp::computeCosts::1855] Error Code 10: Internal Error (Could not find any implementation for node weight + QuantizeLinear_8_quantize_scale_node + ConvTranspose_12.)
[04/19/2022-09:59:11] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

Environment

TensorRT Version: 8003
NVIDIA GPU: GTX 1650
NVIDIA Driver Version: NVIDIA UNIX x86_64 Kernel Module 510.54
CUDA Version: V11.5.50
CUDNN Version:
Operating System: Ubuntu 20.04.3
Python Version (if applicable): 3.8.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): '1.11.0a0+b6df043'
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:21.11-py3

Relevant Files

Steps To Reproduce

  1. Create the ONNX file of the transposed conv layer with the following script:
import torch
import onnx
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import quant_modules

# Use histogram calibration for the inputs of all quantized layers.
quant_desc_input = QuantDescriptor(calib_method='histogram')
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantConvTranspose2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

# Monkey-patch torch.nn so that ConvTranspose2d below resolves to the
# quantized QuantConvTranspose2d.
quant_modules.initialize()

# Depthwise (groups == channels) transposed convolution.
model = torch.nn.ConvTranspose2d(
    64, 64, 4, stride=2, padding=1,
    output_padding=0, groups=64, bias=False)
model.cuda()

# Export fake quantization as ONNX QuantizeLinear/DequantizeLinear nodes.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy_input = torch.randn(1, 64, 64, 64).cuda()
torch.onnx.export(model, dummy_input, './bug_on.onnx', verbose=True, opset_version=13)

model = onnx.load('./bug_on.onnx')
onnx.checker.check_model(model)
  2. Run: trtexec --onnx=./bug_on.onnx --int8
zerollzeng (Collaborator) commented:

I can reproduce this on 8.4. @ttyio do we support grouped convtranspose in QAT?

ttyio (Collaborator) commented Apr 25, 2022

@tzskp1, currently we have only enabled loop-based convtranspose. The limitation is that for INT8 I/O, the channel count must be at least 4 times the number of groups; in your case, you would have to modify the layer to:

model = torch.nn.ConvTranspose2d(
    256, 64, 4, stride=2, padding=1,
    output_padding=0, groups=64, bias=False)

I will create an RFE to enable a non-loop-based convtranspose implementation. Thanks!
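
For reference, here is a minimal sketch of that constraint as a checkable rule. The helper function is hypothetical (it is not a TensorRT API); it simply encodes the requirement stated above:

def int8_grouped_deconv_supported(in_channels: int, groups: int) -> bool:
    # Hypothetical helper: encodes the INT8 I/O limitation described
    # above, i.e. the channel count must be at least 4x the group count.
    return in_channels >= 4 * groups

# The original repro (64 channels, 64 groups, i.e. depthwise) fails it:
assert not int8_grouped_deconv_supported(64, 64)
# The suggested workaround (256 channels, 64 groups) satisfies it:
assert int8_grouped_deconv_supported(256, 64)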

ttyio added the Feature Request and triaged labels on Apr 25, 2022
tzskp1 (Author) commented Apr 28, 2022

Thank you for your response.
I tried exporting a conv layer with the number of groups set to 1/4 of the channel count.
However, its performance is worse than that of a non-grouped conv layer.
Is this caused by the loop-based implementation?

ttyio (Collaborator) commented May 6, 2022

Hello @tzskp1, sorry for the delayed response. It is partially because of the loop-based implementation, but if you try a group count of 1/32 of the channels, you can get more speedup from Tensor Cores. Thanks!
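
For illustration, a minimal sketch of such a configuration, assuming 256 input channels (an arbitrary choice, not from the thread) so that groups = 256 / 32 = 8:

import torch

# Hypothetical configuration: groups is 1/32 of the input channel count,
# so each group is 32 channels wide, which maps onto Tensor Cores far
# better than the depthwise (groups == channels) layout.
model = torch.nn.ConvTranspose2d(
    256, 64, 4, stride=2, padding=1,
    output_padding=0, groups=8, bias=False)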

nvpohanh (Collaborator) commented Jul 1, 2022

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks.

nvpohanh closed this as completed on Jul 1, 2022
oxana-nvidia (Collaborator) commented:

I've tested with TRT 10.9 and the issue is resolved. I'm not sure what the earliest version is in which it was fixed.

[03/03/2025-13:36:52] [I] === Performance summary ===
[03/03/2025-13:36:52] [I] Throughput: 2975.87 qps
[03/03/2025-13:36:52] [I] Latency: min = 0.44873 ms, max = 0.50415 ms, mean = 0.462956 ms, median = 0.462646 ms, percentile(90%) = 0.463745 ms, percentile(95%) = 0.468536 ms, percentile(99%) = 0.471313 ms
[03/03/2025-13:36:52] [I] Enqueue Time: min = 0.0231934 ms, max = 0.0812988 ms, mean = 0.0372808 ms, median = 0.0368042 ms, percentile(90%) = 0.0385742 ms, percentile(95%) = 0.0395355 ms, percentile(99%) = 0.0512695 ms
[03/03/2025-13:36:52] [I] H2D Latency: min = 0.0942383 ms, max = 0.141357 ms, mean = 0.099128 ms, median = 0.098877 ms, percentile(90%) = 0.0994873 ms, percentile(95%) = 0.0998535 ms, percentile(99%) = 0.106812 ms
[03/03/2025-13:36:52] [I] GPU Compute Time: min = 0.0285645 ms, max = 0.0583496 ms, mean = 0.0292459 ms, median = 0.029541 ms, percentile(90%) = 0.0297852 ms, percentile(95%) = 0.0297852 ms, percentile(99%) = 0.0297852 ms
[03/03/2025-13:36:52] [I] D2H Latency: min = 0.320801 ms, max = 0.351318 ms, mean = 0.334587 ms, median = 0.334229 ms, percentile(90%) = 0.334473 ms, percentile(95%) = 0.335266 ms, percentile(99%) = 0.342651 ms
[03/03/2025-13:36:52] [I] Total Host Walltime: 3.00081 s
[03/03/2025-13:36:52] [I] Total GPU Compute Time: 0.261166 s
[03/03/2025-13:36:52] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/03/2025-13:36:52] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/03/2025-13:36:52] [W] * Throughput may be bound by host-to-device transfers for the inputs rather than GPU Compute and the GPU may be under-utilized.
[03/03/2025-13:36:52] [W]   Add --noDataTransfers flag to disable data transfers.
[03/03/2025-13:36:52] [W] * Throughput may be bound by device-to-host transfers for the outputs rather than GPU Compute and the GPU may be under-utilized.
[03/03/2025-13:36:52] [W]   Add --noDataTransfers flag to disable data transfers.
[03/03/2025-13:36:52] [W] * GPU compute time is unstable, with coefficient of variance = 2.53877%.
[03/03/2025-13:36:52] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[03/03/2025-13:36:52] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/03/2025-13:36:52] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100900] [b32] # trtexec --onnx=/tmp/bug_on.onnx --int8
