TRT inference poor performance vs. PyTorch with DINO model #3398

Open
chenrui17 opened this issue Oct 23, 2023 · 7 comments
Assignees: zerollzeng
Labels: triaged (Issue has been triaged by maintainers)

chenrui17 commented Oct 23, 2023

Trained model: DINO (link)

First, I use mmdeploy to convert the PyTorch model to ONNX format.
Second, I use the TRT builder to generate an engine.
Finally, I run inference with the execute_async_v2 method, but the resulting performance is much worse than PyTorch's.

Nsight profiling is shown below: the TRT forward time is about 420 ms+, while the PyTorch inference time is about 180 ms (nsys files attached).
[image: Nsight Systems timeline of the TRT run]
[image: Nsight Systems timeline of the PyTorch run]

My question is: what is the problem, and how can I further analyze and optimize the performance?

By the way, my TRT inference code is below; please check it. Thanks.

import numpy as np
import pycuda.autoinit  # creates and activates a CUDA context for the pycuda calls below
import pycuda.driver as cuda
import tensorrt as trt
import cv2
import ctypes

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def allocate_buffers(engine):
    # Page-locked host buffers for the input (binding 0) and the five outputs (bindings 1-5);
    # pinned memory allows truly asynchronous H2D/D2H copies.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output1 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    h_output2 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(2)), dtype=trt.nptype(trt.float32))
    h_output3 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(3)), dtype=trt.nptype(trt.float32))
    h_output4 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(4)), dtype=trt.nptype(trt.float32))
    h_output5 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(5)), dtype=trt.nptype(trt.float32))
    # Matching device buffers.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output1 = cuda.mem_alloc(h_output1.nbytes)
    d_output2 = cuda.mem_alloc(h_output2.nbytes)
    d_output3 = cuda.mem_alloc(h_output3.nbytes)
    d_output4 = cuda.mem_alloc(h_output4.nbytes)
    d_output5 = cuda.mem_alloc(h_output5.nbytes)
    stream = cuda.Stream()
    return h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream

def do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream):
    # Copy the input to the device, enqueue inference, copy the outputs back, then wait for the stream.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2(bindings=[int(d_input), int(d_output1), int(d_output2), int(d_output3), int(d_output4), int(d_output5)], stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output1, d_output1, stream)
    cuda.memcpy_dtoh_async(h_output2, d_output2, stream)
    cuda.memcpy_dtoh_async(h_output3, d_output3, stream)
    cuda.memcpy_dtoh_async(h_output4, d_output4, stream)
    cuda.memcpy_dtoh_async(h_output5, d_output5, stream)
    stream.synchronize()

def load_normalized_test_case(test_image, pagelocked_buffer):
    def normalize_image(image):
        img_src = cv2.imread(image)
        # cv2.resize takes (width, height), so this produces a 750-wide by 1333-tall image.
        resized = cv2.resize(img_src, (750, 1333), interpolation=cv2.INTER_LINEAR)
        img_in = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        # HWC -> CHW, add a batch dimension, scale to [0, 1].
        img_in = np.transpose(img_in, (2, 0, 1)).astype(np.float32)
        img_in = np.expand_dims(img_in, axis=0)
        img_in /= 255.0
        return img_in.flatten()
    np.copyto(pagelocked_buffer, normalize_image(test_image))

def load_engine(engine_path):
    # Deserialize a previously built engine; plugins must already be registered.
    with open(engine_path, 'rb') as f:
        runtime = trt.Runtime(TRT_LOGGER)
        runtime.max_threads = 10
        return runtime.deserialize_cuda_engine(f.read())
    
def build_engine():
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1 GiB of builder workspace
        config.set_flag(trt.BuilderFlag.FP16)

        with open("./end2end.onnx", 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None  # abort instead of building from a partially parsed network

        engine = builder.build_engine(network, config)

        # Serialize the engine so later runs can skip the (slow) build step.
        with open("./end2end.engine", 'wb') as f:
            f.write(engine.serialize())

        return engine

def main():
    test_image = "./1.jpg"
    # build_engine()  # uncomment to (re)build the engine from end2end.onnx
    with load_engine("./end2end.engine") as engine:
        h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream = allocate_buffers(engine)
        import torch.cuda.nvtx as nvtx  # NVTX ranges make these phases visible in the Nsight timeline
        nvtx.range_push("prepare Data")
        load_normalized_test_case(test_image, h_input)
        nvtx.range_pop()
        with engine.create_execution_context() as context:
            for i in range(100):
                # Note: this range covers the H2D copy, the enqueue, the five D2H copies,
                # and the final stream.synchronize(), not just GPU compute time.
                nvtx.range_push("Forward")
                do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2, h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream)
                nvtx.range_pop()

if __name__ == '__main__':
    # Load the mmdeploy custom-op library and register all TRT plugins
    # before deserializing an engine that uses them.
    lib_path = "./libmmdeploy_tensorrt_ops.so"
    ctypes.CDLL(lib_path)
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")
    main()
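
For what it's worth, the "Forward" NVTX range above times the H2D copy, the enqueue, the five D2H copies, and the synchronize together. A minimal sketch for isolating the mean enqueue-plus-compute latency (the benchmark helper is my own addition, not part of the original script; it assumes the context, bindings list, and stream set up above):

import time

def benchmark(context, bindings, stream, warmup=10, iters=100):
    # Warm-up runs let CUDA/TRT lazy initialization finish before the clock starts.
    for _ in range(warmup):
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()  # drain the stream before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters  # mean latency in ms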

@zerollzeng
Collaborator

  1. Can you share the ONNX file and plugin .so here for a quick reproduction?
  2. Which GPU are you using? Please also provide the NVIDIA driver version, CUDA version, etc., following our bug template.

@zerollzeng
Collaborator

If possible, please use trtexec to benchmark the TRT performance. A sample command would be: trtexec --onnx=model.onnx --plugins=./libmmdeploy_tensorrt_ops.so --fp16
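
For a steadier measurement, warm-up and profiling flags can be added as well, e.g. (illustrative values): trtexec --onnx=model.onnx --plugins=./libmmdeploy_tensorrt_ops.so --fp16 --warmUp=500 --iterations=100 --dumpProfile --separateProfileRun. Here --warmUp keeps start-up cost out of the reported latency, and --dumpProfile with --separateProfileRun collects per-layer timings in a separate run so they don't skew the end-to-end numbers.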

@zerollzeng zerollzeng self-assigned this Oct 25, 2023
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Oct 25, 2023
@chenrui17
Author

> 1. Can you share the ONNX file and plugin .so here for a quick reproduction?
> 2. Which GPU are you using? Please also provide the NVIDIA driver version, CUDA version, etc., following our bug template.

@zerollzeng

  1. I uploaded my ONNX file and plugin .so; you can download them from https://drive.google.com/file/d/11woAWMIUNf3VYO2-hdZtmIk7udQ_lA18/view?usp=drive_link
  2. I'm using an A100 with CUDA 12.2.

thanks.

@zerollzeng
Collaborator

I've requested access.

@zerollzeng
Collaborator

Checked with TRT 8.6 (TRT docker 23.10) on A100: the mean GPU compute time is 102.96 ms, so this doesn't look like a bug in TRT.

[11/14/2023-11:22:42] [I] H2D Latency: min = 0.717041 ms, max = 0.864014 ms, mean = 0.803251 ms, median = 0.814941 ms, percentile(90%) = 0.8479 ms, percentile(95%) = 0.861694 ms, percentile(99%) = 0.864014 ms
[11/14/2023-11:22:42] [I] GPU Compute Time: min = 102.208 ms, max = 103.756 ms, mean = 102.96 ms, median = 102.987 ms, percentile(90%) = 103.37 ms, percentile(95%) = 103.458 ms, percentile(99%) = 103.756 ms
[11/14/2023-11:22:42] [I] D2H Latency: min = 0.0119629 ms, max = 0.0164185 ms, mean = 0.0146327 ms, median = 0.0146484 ms, percentile(90%) = 0.0161133 ms, percentile(95%) = 0.0163574 ms, percentile(99%) = 0.0164185 ms
...
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=end2end.onnx --plugins=./libmmdeploy_tensorrt_ops_cuda12.so --dumpProfile --separateProfileRun

@chenrui17
Author

chenrui17 commented Nov 21, 2023

> Checked with TRT 8.6 (TRT docker 23.10) on A100: the mean GPU compute time is 102.96 ms, so this doesn't look like a bug in TRT. [...]

But the performance with FP16 and TF32 is basically the same; is this normal? It doesn't seem to meet expectations. @zerollzeng
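
(One plausible explanation, offered here as an assumption rather than something confirmed in this thread: layers around the custom mmdeploy plugin ops may fall back to FP32/TF32 even with the FP16 flag set, leaving the two builds running largely the same kernels. A short sketch to inspect per-layer precision, which assumes the engine was built with detailed profiling verbosity:)

import tensorrt as trt

def dump_layer_info(engine):
    # Per-layer details (precision, tactic) are only available if the engine was
    # built with config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED.
    inspector = engine.create_engine_inspector()
    print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))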

@Desperado721

Hi @chenrui17, did you figure this out by any chance? I'm running into the same problem, although this is an old issue.
