
CUDA runtime error in cublasLtMatmul, CUBLAS_STATUS_EXECUTION_FAILED #700

Closed
WhiteDoveBuct opened this issue Dec 19, 2023 · 3 comments
Labels
triaged Issue has been triaged by maintainers

WhiteDoveBuct commented Dec 19, 2023

build

python build.py \
--model_dir /AIED-data/xxx/Llama-2-70b-hf/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir /AIED-data/xxx/trt_engines/Llama-2-70b-hf-32/ \
--world_size 8 \
--tp_size 4 \
--pp_size 2 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 3072 \
--parallel_build \
--use_rmsnorm_plugin float16 \
--use_inflight_batching \
--use_fused_mlp \
--paged_kv_cache
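
A quick consistency check on the parallel layout, as a minimal sketch: it assumes the usual requirement that `--world_size` equals `--tp_size` times `--pp_size`, and that the same number is later passed to `mpirun -n`. The function name `check_parallel_layout` is hypothetical and not part of TensorRT-LLM.

```shell
# Hypothetical helper (not part of TensorRT-LLM): verify that
# tp_size * pp_size matches the world size used for the build and for mpirun.
check_parallel_layout() {
    world_size=$1
    tp_size=$2
    pp_size=$3
    if [ $((tp_size * pp_size)) -eq "$world_size" ]; then
        echo "ok: ${tp_size} x ${pp_size} = ${world_size} ranks"
    else
        echo "mismatch: ${tp_size} x ${pp_size} != ${world_size}"
        return 1
    fi
}

# Values from the build command above: world_size 8, tp_size 4, pp_size 2.
check_parallel_layout 8 4 2
```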

benchmark

# Each entry is batch_size:input_len,output_len (the log below prints "ISL/OSL: 1024,3072").
in_out_sizes=("1:1024,3072" "2:1024,3072" "4:1024,3072" "8:1024,3072" "16:1024,3072" "32:1024,3072")
for in_out in "${in_out_sizes[@]}"
do
    batch_size=$(echo "$in_out" | awk -F':' '{ print $1 }')
    in_out_dims=$(echo "$in_out" | awk -F':' '{ print $2 }')
    echo "BS: $batch_size, ISL/OSL: $in_out_dims"

    # No trailing whitespace after the backslashes, otherwise the line continuation breaks.
    mpirun -n 8 --allow-run-as-root --oversubscribe \
        ./cpp/build/benchmarks/gptSessionBenchmark \
        --model llama \
        --engine_dir /AIED-data/xxx/trt_engines/Llama-2-70b-hf-32 \
        --warm_up 1 \
        --batch_size "$batch_size" \
        --duration 0 \
        --num_runs 5 \
        --input_output_len "$in_out_dims"
done
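
Before launching a sweep like this, it may help to confirm that each request fits the limits the engine was built with. The sketch below assumes specs of the form `BS:ISL,OSL` (as the `ISL/OSL: 1024,3072` line in the log suggests) and defaults to the build-time values `--max_batch_size 32`, `--max_input_len 1024`, `--max_output_len 3072`. The function name `fits_engine_limits` is hypothetical.

```shell
# Hypothetical helper: succeed only if a "BS:ISL,OSL" spec fits the
# build-time limits. Defaults mirror the build command (32 / 1024 / 3072).
fits_engine_limits() {
    spec=$1
    max_bs=${2:-32}
    max_isl=${3:-1024}
    max_osl=${4:-3072}
    bs=${spec%%:*}      # text before the first ':'
    lens=${spec#*:}     # text after the first ':'
    isl=${lens%%,*}     # text before the ','
    osl=${lens#*,}      # text after the ','
    [ "$bs" -le "$max_bs" ] && [ "$isl" -le "$max_isl" ] && [ "$osl" -le "$max_osl" ]
}

for spec in "1:1024,3072" "32:1024,3072" "64:1024,3072"; do
    if fits_engine_limits "$spec"; then
        echo "$spec fits"
    else
        echo "$spec exceeds limits"
    fi
done
```

All six sizes in the loop above pass this check, so the build limits alone do not explain the failure at batch size 2.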

error log

[1702983381.525127] [AI-99-141-release:95101:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525298] [AI-99-141-release:95102:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525336] [AI-99-141-release:95098:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525344] [AI-99-141-release:95097:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525349] [AI-99-141-release:95095:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525352] [AI-99-141-release:95100:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525353] [AI-99-141-release:95099:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702983381.525355] [AI-99-141-release:95096:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
Benchmarking done. Iteration: 5, duration: 694.92 sec.
[BENCHMARK] batch_size 1 input_length 1024 output_length 3072 latency(ms) 138983.22 tokensPerSec 22.10
BS: 2, ISL/OSL: 1024,3072
[1702984696.791064] [AI-99-141-release:13206:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.791721] [AI-99-141-release:13205:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.791813] [AI-99-141-release:13209:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.792377] [AI-99-141-release:13204:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.792585] [AI-99-141-release:13208:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.792633] [AI-99-141-release:13203:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.792756] [AI-99-141-release:13210:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1702984696.792766] [AI-99-141-release:13207:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_EXECUTION_FAILED (/code/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:140)
1 0x7f5e902009ce /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xac9ce) [0x7f5e902009ce]
2 0x7f5e90254dc6 /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0x100dc6) [0x7f5e90254dc6]
3 0x7f5e9025519b /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0x10119b) [0x7f5e9025519b]
4 0x7f5e902262d1 /code/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9(+0xd22d1) [0x7f5e902262d1]
5 0x7f5e90226bba tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 266
6 0x7f5e46d3cba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f5e46d3cba9]
7 0x7f5e46d126af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f5e46d126af]
8 0x7f5e46d14320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f5e46d14320]
9 0x7f5ed5ee787f tensorrt_llm::runtime::GptSession::executeGenerationStep(int, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, std::vector<bool, std::allocator<bool> >&) + 1903
10 0x7f5ed5ee912e tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3070
11 0x7f5ed5eeb18b tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 7003
12 0x5556de223dff ./cpp/build/benchmarks/gptSessionBenchmark(+0x19dff) [0x5556de223dff]
13 0x7f5e8fcfad90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f5e8fcfad90]
14 0x7f5e8fcfae40 __libc_start_main + 128
15 0x5556de225ef5 ./cpp/build/benchmarks/gptSessionBenchmark(+0x1bef5) [0x5556de225ef5]
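
The `[BENCHMARK]` line for batch size 1 can be sanity-checked by hand. The sketch below assumes throughput is computed as batch_size × output_length divided by latency in seconds; that formula is an assumption, not taken from the benchmark source, but it reproduces the reported figure. The function name `tokens_per_sec` is hypothetical.

```shell
# Hypothetical helper: tokens/s = batch_size * output_length / (latency_ms / 1000).
tokens_per_sec() {
    awk -v bs="$1" -v osl="$2" -v ms="$3" \
        'BEGIN { printf "%.2f\n", bs * osl * 1000 / ms }'
}

# Values from the batch size 1 line: output_length 3072, latency 138983.22 ms.
tokens_per_sec 1 3072 138983.22   # prints 22.10
```

The latency is also consistent with the reported duration: 5 runs × 138983.22 ms is about 694.92 s.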

Collaborator

byshiue commented Dec 25, 2023

From the error

[1702984696.792585] [AI-99-141-release:13208:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device

it looks like an issue with your device. Could you try on another device?
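
"No space left on device" from `inotify_add_watch` can mean two different things: the filesystem is full, or the per-user inotify watch limit has been reached (the call reports ENOSPC in that case too). A minimal sketch for telling them apart on a Linux host follows; the function name `report_tmp_and_inotify` is hypothetical.

```shell
# Hypothetical helper: show free space for a directory and, where the
# Linux /proc entries exist, the current inotify limits.
report_tmp_and_inotify() {
    dir=${1:-/tmp}
    echo "free space on $dir:"
    df -h "$dir"
    for f in /proc/sys/fs/inotify/max_user_watches \
             /proc/sys/fs/inotify/max_user_instances; do
        if [ -r "$f" ]; then
            echo "$f = $(cat "$f")"
        else
            echo "$f not available on this system"
        fi
    done
}

report_tmp_and_inotify /tmp
```

Note that the same UCX message also appears before the batch size 1 run in the log above, and that run completed, so the message may be unrelated to the `CUBLAS_STATUS_EXECUTION_FAILED` error.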

byshiue self-assigned this Dec 25, 2023
byshiue added the triaged (Issue has been triaged by maintainers) label Dec 25, 2023
Author

WhiteDoveBuct commented Dec 25, 2023 via email

Author

WhiteDoveBuct commented Nov 18, 2024 via email

3 participants