
First step training very slow and high GPU memory #318

Open
zaidato opened this issue Mar 7, 2025 · 8 comments

Comments

@zaidato

zaidato commented Mar 7, 2025

Thank you for your great work.
I am using 3×40GB A100 GPUs to train the first stage, with max_len=300 frames. These GPUs are just enough for batch_size=5, and training is very slow with both mixed_precision=fp16 and fp32. Is this normal for StyleTTS2?

Another problem is that the loss becomes NaN after more than 10k steps:

INFO:2025-03-05 18:39:06,022: Epoch [1/10], Step [13340/394954], Mel Loss: 0.73000, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:18,793: Epoch [1/10], Step [13350/394954], Mel Loss: 0.71747, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:30,779: Epoch [1/10], Step [13360/394954], Mel Loss: 0.72086, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:43,441: Epoch [1/10], Step [13370/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:55,656: Epoch [1/10], Step [13380/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
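(Editor's note: a common mitigation for a training run going NaN like this, independent of the StyleTTS2 codebase, is to guard the optimizer step against non-finite losses so one bad batch does not corrupt the weights. A minimal, framework-agnostic sketch of that check:)

```python
import math


def should_skip_step(loss_value: float) -> bool:
    """Return True when the scalar loss is NaN or Inf, so the caller
    can skip the optimizer step instead of corrupting the weights."""
    return not math.isfinite(loss_value)


# In a PyTorch-style training loop (illustrative pseudocode):
#   if should_skip_step(loss.item()):
#       optimizer.zero_grad(set_to_none=True)
#       continue  # drop this step entirely
```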

@kadirnar

kadirnar commented Mar 7, 2025

Did you do this?

#253

Yesterday I ran it with 8×A100 GPUs using batch_size=16 and got a memory error. The max_len was high, though. Still, I think something's wrong.

@zaidato
Author

zaidato commented Mar 9, 2025

@kadirnar No, I didn't load a pretrained model.
Do you mean a total batch_size=16 (across 8 GPUs), or batch_size=16 per GPU (total batch size = 16×8)?

@kadirnar

kadirnar commented Mar 9, 2025

@kadirnar No, I didn't load a pretrained model. Do you mean a total batch_size=16 (across 8 GPUs), or batch_size=16 per GPU (total batch size = 16×8)?

I set the batch size here to 16, and each GPU uses 70GB. It gave a memory error at epoch 50. Now I've set the batch size to 8, which uses 50GB of memory. However, a batch size of 8 on 8×H100s is really poor.

https://github.com/yl4579/StyleTTS2/blob/main/Configs/config.yml#L8
Dataset: https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech
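(Editor's note: the config linked above exposes the knobs discussed in this thread. The fragment below is illustrative only; the key names follow the repo's config, but the values shown here are examples, not the repo defaults:)

```yaml
batch_size: 16   # per-GPU; lower this on OOM
max_len: 300     # max mel frames per sample; memory grows with this
TMA_epoch: 50    # TMA losses switch on here, raising memory use
```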

@kadirnar

kadirnar commented Mar 9, 2025

@zaidato I got the same error as you when I set the batch size to 8 😆

@zaidato
Author

zaidato commented Mar 11, 2025

You got the error at epoch 50. I think that's because you set `TMA_epoch: 50` (the TMA starting epoch for the 1st stage), so the TMA losses switch on at that point and increase memory use. You need to decrease the batch size to fix that.
In my case, on 40GB A100s I set batch_size=3 and max_len=300. That is a very small batch size for an A100 GPU.

@kadirnar

You got an error at epoch 50. I think it's because you set TMA_epoch: 50 # TMA starting epoch (1st stage). You need to decrease batch size to fix that In my case, A100x40GB I set batch_size=3 and max_len = 300. Very small batch size for A100 GPU

I added FP16 support to the StyleTTS2 model and made a few adjustments. I trained it with a batch size of 32. Currently at epoch 86, the loss has started returning NaN. Do you have any insight into this issue? How many GPUs are you training on?
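(Editor's note: NaNs appearing mid-run under FP16 usually point at gradient overflow, which dynamic loss scaling is designed to absorb. PyTorch provides this as `torch.cuda.amp.GradScaler`; the sketch below is not the StyleTTS2 or PyTorch implementation, just a minimal plain-Python illustration of the idea: back the scale off on overflow, grow it back after a run of clean steps:)

```python
class DynamicLossScaler:
    """Minimal dynamic loss-scaling logic (the idea behind GradScaler):
    halve the scale when gradients overflow, grow it back after a run
    of clean steps."""

    def __init__(self, init_scale=2.0**15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # Overflow: the step was skipped, so shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
```

If the scale keeps shrinking without recovering, the usual culprits are an unstable loss term or a learning rate too high for FP16, not the scaler itself.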

@zaidato
Author

zaidato commented Mar 11, 2025

I faced the same problem when using FP16.

@kadirnar

@zaidato I managed to train using this repo. There was only a bug with context_length, which I fixed by updating the mel_dataset. If the results after training are good, I will create a new repo and document it in detail. I just need to experiment with epoch values.

https://github.com/Respaired/Tsukasa-Speech
