First step training very slow and high GPU memory #318
Comments
Did you do this? Yesterday I ran it on 8xA100 GPUs with batch_size=16 and got a memory error. The max_len was high, though. Still, I think something's wrong.
@kadirnar No, I didn't load a pretrained model.
I set the batch size here to 16 and each GPU uses 70GB; it gives a memory error at epoch 50. Now I've set the batch size to 8, which uses 50GB of memory. However, a batch size of 8 on 8xH100 is really poor. https://github.com/yl4579/StyleTTS2/blob/main/Configs/config.yml#L8
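If it helps, a minimal sketch of lowering both values programmatically; the key names `batch_size` and `max_len` are assumed from the linked config.yml, and a plain text edit of the YAML works just as well:

```python
import yaml

# Sketch only (assumed key names from Configs/config.yml): lower batch_size
# and max_len before launching first-stage training.
with open("Configs/config.yml") as f:
    config = yaml.safe_load(f)

# Activation memory grows roughly linearly with batch_size and with max_len
# (segment length in mel frames), so these are the usual first levers for OOM.
config["batch_size"] = 8
config["max_len"] = 300

with open("Configs/config.yml", "w") as f:
    yaml.safe_dump(config, f)
```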
@zaidato I got the same error as you when I set the batch size to 8 😆
You got an error at epoch 50. I think that's because you set `TMA_epoch: 50 # TMA starting epoch (1st stage)`: the extra alignment losses that switch on at that epoch increase per-step memory. You need to decrease the batch size to fix that.
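For context, a simplified schematic of how a first-stage loop of this kind typically gates the extra losses (dummy stand-in loss functions, not the actual train_first.py code); the point is that memory per step jumps exactly when the gate opens:

```python
import torch

# Hypothetical stand-ins for the real loss terms; only the gating matters here.
def mel_loss(batch): return batch.pow(2).mean()
def mono_loss(batch): return batch.abs().mean()
def s2s_loss(batch): return batch.mean().abs()

config = {"TMA_epoch": 50, "epochs": 200}
param = torch.nn.Parameter(torch.randn(4))
optimizer = torch.optim.AdamW([param], lr=1e-4)

for epoch in range(config["epochs"]):
    joint_tma = epoch >= config["TMA_epoch"]   # gate opens at TMA_epoch
    batch = param * torch.randn(4)             # dummy batch through the parameter
    loss = mel_loss(batch)
    if joint_tma:
        # Extra terms mean extra graph state is retained for backprop,
        # which is why memory usage rises at exactly this epoch.
        loss = loss + mono_loss(batch) + s2s_loss(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```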
I added FP16 support to the StyleTTS2 model and made a few adjustments. I trained it with a batch size of 32. Currently, at epoch 86, the loss has started returning NaN. Do you know anything about this issue? How many GPUs are you training on?
I faced the same problem when using FP16.
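For anyone hitting this: a minimal `torch.cuda.amp` pattern that usually keeps FP16 stable (generic PyTorch, not the poster's code; the stand-in model and shapes are made up). The `GradScaler` skips optimizer steps whose gradients contain inf/NaN instead of applying them:

```python
import torch

model = torch.nn.Linear(80, 80).cuda()     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(1000):
    x = torch.randn(8, 80, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = (model(x) - x).pow(2).mean()
    scaler.scale(loss).backward()
    # Unscale first so clipping sees the true gradient magnitudes.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)   # silently skips the step on inf/NaN grads
    scaler.update()
```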
@zaidato I managed to train using this repo. There was only a bug with context_length, which I fixed by updating the mel_dataset. If training produces successful results, I will create a new repo and explain it in detail. I just need to experiment with epoch values.
Thank you for your great work.
I am using 3x40GB A100 GPUs to train the first stage. I set max_len=300 frames. These GPUs are only just enough for batch_size=5, and training is very slow with both fp16 mixed precision and fp32. Is this normal for StyleTTS2?
Another problem is that the loss becomes NaN after more than 10k steps:
INFO:2025-03-05 18:39:06,022: Epoch [1/10], Step [13340/394954], Mel Loss: 0.73000, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:18,793: Epoch [1/10], Step [13350/394954], Mel Loss: 0.71747, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:30,779: Epoch [1/10], Step [13360/394954], Mel Loss: 0.72086, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:43,441: Epoch [1/10], Step [13370/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:55,656: Epoch [1/10], Step [13380/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
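Given the log above, the NaN first appears in the mel loss while the other terms are still zero. A generic PyTorch debugging sketch (not StyleTTS2 code; `guarded_step` is a hypothetical helper): skip non-finite losses so one bad batch doesn't poison the weights, and turn on anomaly detection once to locate the op that first produces the NaN:

```python
import torch

# One-off debugging aid: anomaly detection pinpoints the op that produced the
# first NaN in the backward pass (slow; don't leave it on for full training).
torch.autograd.set_detect_anomaly(True)

def guarded_step(loss, optimizer, model, step):
    # Skip the update entirely when the loss is already non-finite, so one
    # bad batch doesn't corrupt the weights for every step that follows.
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}, skipping update")
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True

# Example: a loss poisoned by NaN input is skipped instead of applied.
model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
bad = model(torch.tensor([[float("nan"), 0.0, 0.0, 0.0]])).mean()
guarded_step(bad, opt, model, step=13370)
```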