Predict raises an error when computing BLEU and ROUGE scores #6952

Closed
1 task done
ac-automata opened this issue Feb 16, 2025 · 2 comments · Fixed by #6972
Labels
solved This problem has been already solved

Comments

@ac-automata

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.35
  • Python version: 3.10.16
  • PyTorch version: 2.5.1+cu121 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 3.2.0
  • Accelerate version: 1.2.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3090
  • GPU number: 4
  • GPU memory: 23.69GB
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.43.1
  • vLLM version: 0.6.6

Reproduction

llamafactory-cli train myscripts/llama3_lora_predict.yaml

YAML configuration:

### model
model_name_or_path: /home/ly/model
adapter_name_or_path: saves/Llama-3.1-8B-Instruct/lora/train_2025-02-14-14-23-47
trust_remote_code: true

### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: alpaca_en_demo
template: llama3
cutoff_len: 2048
max_samples: 100
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 4
predict_with_generate: true
ddp_timeout: 180000000
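
With `predict_with_generate: true`, the predict stage decodes the generated sequences and the labels and scores them with BLEU and ROUGE. As a rough illustration of that scoring step (a minimal sketch assuming the `nltk` and `rouge-score` packages; LLaMA-Factory's own metric code may tokenize and compute these differently):

```python
# Minimal sketch of BLEU-4 / ROUGE scoring over decoded predictions vs. references.
# Package choices (nltk, rouge-score) and whitespace tokenization are assumptions
# for illustration only, not LLaMA-Factory's actual implementation.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer


def score_predictions(predictions: list[str], references: list[str]) -> dict[str, float]:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    smoothing = SmoothingFunction().method3
    totals = {"bleu-4": 0.0, "rouge-1": 0.0, "rouge-2": 0.0, "rouge-l": 0.0}
    for pred, ref in zip(predictions, references):
        # BLEU expects token lists; a whitespace split is a simplification.
        totals["bleu-4"] += sentence_bleu([ref.split()], pred.split(), smoothing_function=smoothing)
        rouge = scorer.score(ref, pred)  # score(target, prediction)
        totals["rouge-1"] += rouge["rouge1"].fmeasure
        totals["rouge-2"] += rouge["rouge2"].fmeasure
        totals["rouge-l"] += rouge["rougeL"].fmeasure
    n = max(len(predictions), 1)
    return {key: round(value / n * 100, 2) for key, value in totals.items()}


print(score_predictions(["the cat sat on the mat"], ["the cat is on the mat"]))
```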

Error log:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ly/train/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/home/ly/train/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/home/ly/train/src/llamafactory/train/tuner.py", line 93, in run_exp
[rank3]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank3]:   File "/home/ly/train/src/llamafactory/train/tuner.py", line 67, in _training_function
[rank3]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]:   File "/home/ly/train/src/llamafactory/train/sft/workflow.py", line 127, in run_sft
[rank3]:     predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 244, in predict
[rank3]:     return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 3946, in predict
[rank3]:     output = eval_loop(
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluation_loop
[rank3]:     for step, inputs in enumerate(dataloader):
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank3]:     current_batch = next(dataloader_iter)
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank3]:     data = self._next_data()
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank3]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank3]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank3]:   File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank3]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank3]: KeyError: 0
[INFO|2025-02-16 13:22:01] llamafactory.model.adapter:157 >> Merged 1 adapter(s).
[INFO|2025-02-16 13:22:01] llamafactory.model.adapter:157 >> Loaded adapter(s): saves/Llama-3.1-8B-Instruct/lora/train_2025-02-14-14-23-47
[INFO|2025-02-16 13:22:01] llamafactory.model.loader:157 >> all params: 8,030,261,248
[rank1]: (traceback identical to [rank3] above, ending in KeyError: 0)
[WARNING|2025-02-16 13:22:01] llamafactory.train.sft.workflow:168 >> Batch generation can be very slow. Consider using `scripts/vllm_infer.py` instead.
[INFO|trainer.py:4021] 2025-02-16 13:22:01,870 >> 
***** Running Prediction *****
[INFO|trainer.py:4023] 2025-02-16 13:22:01,870 >>   Num examples = 1
[INFO|trainer.py:4026] 2025-02-16 13:22:01,871 >>   Batch size = 4
[rank0]: (traceback identical to [rank3] above, ending in KeyError: 0)
[rank2]: (traceback identical to [rank3] above, ending in KeyError: 0)
[rank0]:[W216 13:22:02.832660849 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0216 13:22:03.658000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3110522 closing signal SIGTERM
W0216 13:22:03.659000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3110523 closing signal SIGTERM
W0216 13:22:03.660000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3110524 closing signal SIGTERM
E0216 13:22:04.254000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 3110525) of binary: /home/ly/.conda/envs/train/bin/python
Traceback (most recent call last):
  File "/home/ly/.conda/envs/train/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/ly/train/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-16_13:22:03
  host      : gzz-SYS-7049GP-TRT
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3110525)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
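
The `KeyError: 0` together with `Num examples = 1` looks like an integer-indexable dataset was expected but a mapping keyed by dataset name was handed to `trainer.predict`, so the DataLoader's `self.dataset[idx]` lookup fails on index 0. That is only a guess at the cause (the actual fix is in #6972), but the failure mode is easy to reproduce in isolation:

```python
# Hedged reproduction of the KeyError: 0 pattern above: a dict of datasets
# (keyed by name) passed where a map-style, integer-indexed dataset is expected.
# This only models the suspected cause; it is not taken from PR #6972.
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]


eval_sets = {"alpaca_en_demo": ToyDataset(["a", "b", "c", "d"])}

# Passing the dict itself: len(eval_sets) == 1 ("Num examples = 1"), and the
# fetcher's eval_sets[0] raises KeyError: 0, matching the traceback.
try:
    next(iter(DataLoader(eval_sets, batch_size=4)))
except KeyError as err:
    print("KeyError:", err)

# Passing the dataset value behaves as intended.
print(next(iter(DataLoader(eval_sets["alpaca_en_demo"], batch_size=4))))
```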

Others

No response

@ac-automata ac-automata added bug Something isn't working pending This problem is yet to be addressed labels Feb 16, 2025
@Roxanne527

Same error here, waiting for a fix.

@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels Feb 17, 2025
@hiyouga
Owner

hiyouga commented Feb 17, 2025

fixed
