
First step training very slow and high GPU memory #318

Open
zaidato opened this issue Mar 7, 2025 · 8 comments

Comments

@zaidato

zaidato commented Mar 7, 2025

Thank you for your great work.
I am using 3×40GB A100 GPUs to train the first stage, with max_len=300 frames. These GPUs are just enough for batch_size=5, and training is very slow with both mixed_precision=fp16 and fp32. Is this normal for StyleTTS2?

Another problem is that the loss becomes NaN after more than 10k steps:

INFO:2025-03-05 18:39:06,022: Epoch [1/10], Step [13340/394954], Mel Loss: 0.73000, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:18,793: Epoch [1/10], Step [13350/394954], Mel Loss: 0.71747, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:30,779: Epoch [1/10], Step [13360/394954], Mel Loss: 0.72086, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:43,441: Epoch [1/10], Step [13370/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:55,656: Epoch [1/10], Step [13380/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
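(Editor's note: a common mitigation for a training run going NaN like this, independent of the StyleTTS2 codebase, is to guard the optimizer step against non-finite losses so one bad batch does not corrupt the weights. A minimal, framework-agnostic sketch of that check:)

```python
import math


def should_skip_step(loss_value: float) -> bool:
    """Return True when the scalar loss is NaN or Inf, so the caller
    can skip the optimizer step instead of corrupting the weights."""
    return not math.isfinite(loss_value)


# In a PyTorch-style training loop (illustrative pseudocode):
#   if should_skip_step(loss.item()):
#       optimizer.zero_grad(set_to_none=True)
#       continue  # drop this step entirely
```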

@kadirnar

kadirnar commented Mar 7, 2025

Did you do this?

#253

Yesterday I ran it with 8×A100 GPUs using batch_size=16 and got a memory error. The max_len was high, though. Still, I think something's wrong.

@zaidato
Author

zaidato commented Mar 9, 2025

@kadirnar No, I didn't load a pretrained model.
Do you mean a total batch_size=16 (across 8 GPUs), or batch_size=16 per GPU (total batch size = 16×8)?

@kadirnar

kadirnar commented Mar 9, 2025

@kadirnar No, I didn't load a pretrained model. Do you mean a total batch_size=16 (across 8 GPUs), or batch_size=16 per GPU (total batch size = 16×8)?

I set the batch size here to 16, and each GPU uses 70GB. It gave a memory error at epoch 50. Now I've set the batch size to 8, which uses 50GB of memory. However, a batch size of 8 on 8×H100s is really poor.

https://github.com/yl4579/StyleTTS2/blob/main/Configs/config.yml#L8
Dataset: https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech
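(Editor's note: the config linked above exposes the knobs discussed in this thread. The fragment below is illustrative only; the key names follow the repo's config, but the values shown here are examples, not the repo defaults:)

```yaml
batch_size: 16   # per-GPU; lower this on OOM
max_len: 300     # max mel frames per sample; memory grows with this
TMA_epoch: 50    # TMA losses switch on here, raising memory use
```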

@kadirnar

kadirnar commented Mar 9, 2025

@zaidato I got the same error as you when I set the batch size to 8 😆

@zaidato
Author

zaidato commented Mar 11, 2025

You got the error at epoch 50. I think that's because you set `TMA_epoch: 50` (the TMA starting epoch for the 1st stage), so the TMA losses switch on at that point and increase memory use. You need to decrease the batch size to fix that.
In my case, on 40GB A100s I set batch_size=3 and max_len=300. That is a very small batch size for an A100 GPU.

@kadirnar

You got an error at epoch 50. I think it's because you set TMA_epoch: 50 # TMA starting epoch (1st stage). You need to decrease batch size to fix that In my case, A100x40GB I set batch_size=3 and max_len = 300. Very small batch size for A100 GPU

I added FP16 support to the StyleTTS2 model and made a few adjustments. I trained it with a batch size of 32. Currently at epoch 86, the loss has started returning NaN. Do you have any insight into this issue? How many GPUs are you training on?
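(Editor's note: NaNs appearing mid-run under FP16 usually point at gradient overflow, which dynamic loss scaling is designed to absorb. PyTorch provides this as `torch.cuda.amp.GradScaler`; the sketch below is not the StyleTTS2 or PyTorch implementation, just a minimal plain-Python illustration of the idea: back the scale off on overflow, grow it back after a run of clean steps:)

```python
class DynamicLossScaler:
    """Minimal dynamic loss-scaling logic (the idea behind GradScaler):
    halve the scale when gradients overflow, grow it back after a run
    of clean steps."""

    def __init__(self, init_scale=2.0**15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # Overflow: the step was skipped, so shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
```

If the scale keeps shrinking without recovering, the usual culprits are an unstable loss term or a learning rate too high for FP16, not the scaler itself.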

@zaidato
Author

zaidato commented Mar 11, 2025

I faced the same problem when using FP16.

@kadirnar

@zaidato I managed to train using this repo. There was only a bug with context_length, which I fixed by updating the mel_dataset. If the results after training are good, I will create a new repo and document it in detail. I just need to experiment with epoch values.

https://github.com/Respaired/Tsukasa-Speech
