TRT inference poor performance vs. PyTorch with DINO model #3398

Open
chenrui17 opened this issue Oct 23, 2023 · 7 comments
Assignees: zerollzeng
Labels: triaged (Issue has been triaged by maintainers)

chenrui17 commented Oct 23, 2023

Trained model: DINO (link)

First, I use mmdeploy to convert the PyTorch model to ONNX format.
Second, I use the TRT builder to generate an engine.
Finally, I run inference with the execute_async_v2 method, but the resulting performance is much worse than PyTorch's.

Nsight profiling is shown below: the TRT forward time is about 420 ms+, while the PyTorch inference time is about 180 ms (nsys files attached).
[image: Nsight Systems timeline of the TRT run]
[image: Nsight Systems timeline of the PyTorch run]

My question is: what is the problem, and how can I further analyze and optimize the performance?

By the way, my TRT inference code is below; please check it. Thanks.

import numpy as np
import pycuda.autoinit  # creates and activates a CUDA context for the pycuda calls below
import pycuda.driver as cuda
import tensorrt as trt
import cv2
import ctypes

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def allocate_buffers(engine):
    # Page-locked host buffers for the input (binding 0) and the five outputs (bindings 1-5);
    # pinned memory allows truly asynchronous H2D/D2H copies.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output1 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    h_output2 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(2)), dtype=trt.nptype(trt.float32))
    h_output3 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(3)), dtype=trt.nptype(trt.float32))
    h_output4 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(4)), dtype=trt.nptype(trt.float32))
    h_output5 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(5)), dtype=trt.nptype(trt.float32))
    # Matching device buffers.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output1 = cuda.mem_alloc(h_output1.nbytes)
    d_output2 = cuda.mem_alloc(h_output2.nbytes)
    d_output3 = cuda.mem_alloc(h_output3.nbytes)
    d_output4 = cuda.mem_alloc(h_output4.nbytes)
    d_output5 = cuda.mem_alloc(h_output5.nbytes)
    stream = cuda.Stream()
    return h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream

def do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream):
    # Copy the input to the device, enqueue inference, copy the outputs back, then wait for the stream.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2(bindings=[int(d_input), int(d_output1), int(d_output2), int(d_output3), int(d_output4), int(d_output5)], stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output1, d_output1, stream)
    cuda.memcpy_dtoh_async(h_output2, d_output2, stream)
    cuda.memcpy_dtoh_async(h_output3, d_output3, stream)
    cuda.memcpy_dtoh_async(h_output4, d_output4, stream)
    cuda.memcpy_dtoh_async(h_output5, d_output5, stream)
    stream.synchronize()

def load_normalized_test_case(test_image, pagelocked_buffer):
    def normalize_image(image):
        img_src = cv2.imread(image)
        # cv2.resize takes (width, height), so this produces a 750-wide by 1333-tall image.
        resized = cv2.resize(img_src, (750, 1333), interpolation=cv2.INTER_LINEAR)
        img_in = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        # HWC -> CHW, add a batch dimension, scale to [0, 1].
        img_in = np.transpose(img_in, (2, 0, 1)).astype(np.float32)
        img_in = np.expand_dims(img_in, axis=0)
        img_in /= 255.0
        return img_in.flatten()
    np.copyto(pagelocked_buffer, normalize_image(test_image))

def load_engine(engine_path):
    # Deserialize a previously built engine; plugins must already be registered.
    with open(engine_path, 'rb') as f:
        runtime = trt.Runtime(TRT_LOGGER)
        runtime.max_threads = 10
        return runtime.deserialize_cuda_engine(f.read())
    
def build_engine():
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1 GiB of builder workspace
        config.set_flag(trt.BuilderFlag.FP16)

        with open("./end2end.onnx", 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None  # abort instead of building from a partially parsed network

        engine = builder.build_engine(network, config)

        # Serialize the engine so later runs can skip the (slow) build step.
        with open("./end2end.engine", 'wb') as f:
            f.write(engine.serialize())

        return engine

def main():
    test_image = "./1.jpg"
    # build_engine()  # uncomment to (re)build the engine from end2end.onnx
    with load_engine("./end2end.engine") as engine:
        h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream = allocate_buffers(engine)
        import torch.cuda.nvtx as nvtx  # NVTX ranges make these phases visible in the Nsight timeline
        nvtx.range_push("prepare Data")
        load_normalized_test_case(test_image, h_input)
        nvtx.range_pop()
        with engine.create_execution_context() as context:
            for i in range(100):
                # Note: this range covers the H2D copy, the enqueue, the five D2H copies,
                # and the final stream.synchronize(), not just GPU compute time.
                nvtx.range_push("Forward")
                do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream)
                nvtx.range_pop()

if __name__ == '__main__':
    # Load the mmdeploy custom-op library and register all TRT plugins
    # before deserializing an engine that uses them.
    lib_path = "./libmmdeploy_tensorrt_ops.so"
    ctypes.CDLL(lib_path)
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")
    main()
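
For what it's worth, the "Forward" NVTX range above times the H2D copy, the enqueue, the five D2H copies, and the synchronize together. A minimal sketch for isolating the mean enqueue-plus-compute latency (the benchmark helper is my own addition, not part of the original script; it assumes the context, bindings list, and stream set up above):

import time

def benchmark(context, bindings, stream, warmup=10, iters=100):
    # Warm-up runs let CUDA/TRT lazy initialization finish before the clock starts.
    for _ in range(warmup):
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()  # drain the stream before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters  # mean latency in ms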

@zerollzeng
Collaborator

  1. Can you share the ONNX file and plugin .so here for a quick reproduction?
  2. Which GPU are you using? Please also provide the NVIDIA driver version, CUDA version, etc., following our bug template.

@zerollzeng
Collaborator

If possible, please use trtexec to benchmark the TRT performance. A sample command would be: trtexec --onnx=model.onnx --plugins=./libmmdeploy_tensorrt_ops.so --fp16
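
For a steadier measurement, warm-up and profiling flags can be added as well, e.g. (illustrative values): trtexec --onnx=model.onnx --plugins=./libmmdeploy_tensorrt_ops.so --fp16 --warmUp=500 --iterations=100 --dumpProfile --separateProfileRun. Here --warmUp keeps start-up cost out of the reported latency, and --dumpProfile with --separateProfileRun collects per-layer timings in a separate run so they don't skew the end-to-end numbers.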

@zerollzeng zerollzeng self-assigned this Oct 25, 2023
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Oct 25, 2023
@chenrui17
Author

> 1. Can you share the ONNX file and plugin .so here for a quick reproduction?
> 2. Which GPU are you using? Please also provide the NVIDIA driver version, CUDA version, etc., following our bug template.

@zerollzeng

  1. I uploaded my ONNX file and plugin .so; you can download them from https://drive.google.com/file/d/11woAWMIUNf3VYO2-hdZtmIk7udQ_lA18/view?usp=drive_link
  2. I'm using an A100 with CUDA 12.2.

thanks.

@zerollzeng
Collaborator

I've requested access.

@zerollzeng
Collaborator

Checked with TRT 8.6 (TRT docker 23.10) on A100: the mean GPU compute time is 102.96 ms, so this doesn't look like a bug in TRT.

[11/14/2023-11:22:42] [I] H2D Latency: min = 0.717041 ms, max = 0.864014 ms, mean = 0.803251 ms, median = 0.814941 ms, percentile(90%) = 0.8479 ms, percentile(95%) = 0.861694 ms, percentile(99%) = 0.864014 ms
[11/14/2023-11:22:42] [I] GPU Compute Time: min = 102.208 ms, max = 103.756 ms, mean = 102.96 ms, median = 102.987 ms, percentile(90%) = 103.37 ms, percentile(95%) = 103.458 ms, percentile(99%) = 103.756 ms
[11/14/2023-11:22:42] [I] D2H Latency: min = 0.0119629 ms, max = 0.0164185 ms, mean = 0.0146327 ms, median = 0.0146484 ms, percentile(90%) = 0.0161133 ms, percentile(95%) = 0.0163574 ms, percentile(99%) = 0.0164185 ms
...
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=end2end.onnx --plugins=./libmmdeploy_tensorrt_ops_cuda12.so --dumpProfile --separateProfileRun

@chenrui17
Author

chenrui17 commented Nov 21, 2023

> Checked with TRT 8.6 (TRT docker 23.10) on A100: the mean GPU compute time is 102.96 ms, so this doesn't look like a bug in TRT. [...]

But the performance with FP16 and TF32 is basically the same; is this normal? It doesn't seem to meet expectations. @zerollzeng
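
(One plausible explanation, offered here as an assumption rather than something confirmed in this thread: layers around the custom mmdeploy plugin ops may fall back to FP32/TF32 even with the FP16 flag set, leaving the two builds running largely the same kernels. A short sketch to inspect per-layer precision, which assumes the engine was built with detailed profiling verbosity:)

import tensorrt as trt

def dump_layer_info(engine):
    # Per-layer details (precision, tactic) are only available if the engine was
    # built with config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED.
    inspector = engine.create_engine_inspector()
    print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))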

@Desperado721

Hi @chenrui17, did you figure this out by any chance? I'm running into the same problem, although this is an old issue.
