
Consistently getting CUDNN_STATUS_INTERNAL_ERROR during second_stage training after diff_epoch #319

Open
hanshounsu opened this issue Mar 8, 2025 · 0 comments

I'm training the first_stage and second_stage on two RTX 4090 GPUs.
The following error occurs at random during second_stage training, once epoch > diff_epoch:

Traceback (most recent call last):
  File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 842, in <module>
    main()
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 488, in main
    g_loss.backward()
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Is anyone else experiencing this problem?
Thanks in advance.
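For reference, a minimal sketch of the diagnostics commonly suggested for intermittent CUDNN_STATUS_INTERNAL_ERROR raised from backward(). These are generic PyTorch/CUDA settings, not anything specific to train_second_faster.py, and may or may not apply here:

import os
import torch

# Report CUDA errors synchronously at the failing kernel instead of later,
# so the Python traceback points at the actual offending op. This must be
# set before CUDA is initialized (i.e. before the first CUDA call).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Disable cuDNN autotuning; under memory pressure the benchmark heuristics
# can select kernels whose workspace allocation fails and surfaces as
# CUDNN_STATUS_INTERNAL_ERROR rather than a plain OOM.
torch.backends.cudnn.benchmark = False

# As a last resort, bypass cuDNN entirely to rule it out (slower).
# torch.backends.cudnn.enabled = False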
