
Consistently getting CUDNN_STATUS_INTERNAL_ERROR during second_stage training after diff_epoch #319

Open
hanshounsu opened this issue Mar 8, 2025 · 0 comments

I'm training the first_stage and second_stage on two RTX 4090 GPUs.
The following error occurs at random during second_stage training, once epoch > diff_epoch:

Traceback (most recent call last):
  File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 842, in <module>
    main()
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 488, in main
    g_loss.backward()
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Is anyone else experiencing this problem?
Thanks in advance.
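For reference, a minimal sketch of the diagnostics commonly suggested for intermittent CUDNN_STATUS_INTERNAL_ERROR raised from backward(). These are generic PyTorch/CUDA settings, not anything specific to train_second_faster.py, and may or may not apply here:

import os
import torch

# Report CUDA errors synchronously at the failing kernel instead of later,
# so the Python traceback points at the actual offending op. This must be
# set before CUDA is initialized (i.e. before the first CUDA call).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Disable cuDNN autotuning; under memory pressure the benchmark heuristics
# can select kernels whose workspace allocation fails and surfaces as
# CUDNN_STATUS_INTERNAL_ERROR rather than a plain OOM.
torch.backends.cudnn.benchmark = False

# As a last resort, bypass cuDNN entirely to rule it out (slower).
# torch.backends.cudnn.enabled = False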
