[rank3]: Traceback (most recent call last):
[rank3]: File "/home/ly/train/src/llamafactory/launcher.py", line 23, in <module>
[rank3]: launch()
[rank3]: File "/home/ly/train/src/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 93, in run_exp
[rank3]: _training_function(config={"args": args, "callbacks": callbacks})
[rank3]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 67, in _training_function
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/home/ly/train/src/llamafactory/train/sft/workflow.py", line 127, in run_sft
[rank3]: predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 244, in predict
[rank3]: return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 3946, in predict
[rank3]: output = eval_loop(
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluation_loop
[rank3]: for step, inputs in enumerate(dataloader):
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank3]: current_batch = next(dataloader_iter)
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank3]: data = self._next_data()
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank3]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank3]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank3]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank3]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank3]: KeyError: 0
[INFO|2025-02-16 13:22:01] llamafactory.model.adapter:157 >> Merged 1 adapter(s).
[INFO|2025-02-16 13:22:01] llamafactory.model.adapter:157 >> Loaded adapter(s): saves/Llama-3.1-8B-Instruct/lora/train_2025-02-14-14-23-47
[INFO|2025-02-16 13:22:01] llamafactory.model.loader:157 >> all params: 8,030,261,248
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/ly/train/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/home/ly/train/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 93, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 67, in _training_function
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/home/ly/train/src/llamafactory/train/sft/workflow.py", line 127, in run_sft
[rank1]: predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 244, in predict
[rank1]: return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 3946, in predict
[rank1]: output = eval_loop(
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluation_loop
[rank1]: for step, inputs in enumerate(dataloader):
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank1]: current_batch = next(dataloader_iter)
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank1]: data = self._next_data()
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank1]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank1]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank1]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]: KeyError: 0
[WARNING|2025-02-16 13:22:01] llamafactory.train.sft.workflow:168 >> Batch generation can be very slow. Consider using `scripts/vllm_infer.py` instead.
[INFO|trainer.py:4021] 2025-02-16 13:22:01,870 >>
***** Running Prediction *****
[INFO|trainer.py:4023] 2025-02-16 13:22:01,870 >> Num examples = 1
[INFO|trainer.py:4026] 2025-02-16 13:22:01,871 >> Batch size = 4
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ly/train/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/home/ly/train/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 93, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 67, in _training_function
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/home/ly/train/src/llamafactory/train/sft/workflow.py", line 127, in run_sft
[rank0]: predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 244, in predict
[rank0]: return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 3946, in predict
[rank0]: output = eval_loop(
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluation_loop
[rank0]: for step, inputs in enumerate(dataloader):
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank0]: current_batch = next(dataloader_iter)
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank0]: data = self._next_data()
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank0]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: KeyError: 0
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/ly/train/src/llamafactory/launcher.py", line 23, in <module>
[rank2]: launch()
[rank2]: File "/home/ly/train/src/llamafactory/launcher.py", line 19, in launch
[rank2]: run_exp()
[rank2]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 93, in run_exp
[rank2]: _training_function(config={"args": args, "callbacks": callbacks})
[rank2]: File "/home/ly/train/src/llamafactory/train/tuner.py", line 67, in _training_function
[rank2]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]: File "/home/ly/train/src/llamafactory/train/sft/workflow.py", line 127, in run_sft
[rank2]: predict_results = trainer.predict(dataset_module["eval_dataset"], metric_key_prefix="predict", **gen_kwargs)
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 244, in predict
[rank2]: return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 3946, in predict
[rank2]: output = eval_loop(
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluation_loop
[rank2]: for step, inputs in enumerate(dataloader):
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/accelerate/data_loader.py", line 552, in __iter__
[rank2]: current_batch = next(dataloader_iter)
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank2]: data = self._next_data()
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank2]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank2]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank2]: File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank2]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank2]: KeyError: 0
[rank0]:[W216 13:22:02.832660849 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0216 13:22:03.658000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3110522 closing signal SIGTERM
W0216 13:22:03.659000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3110523 closing signal SIGTERM
W0216 13:22:03.660000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3110524 closing signal SIGTERM
E0216 13:22:04.254000 3110423 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 3110525) of binary: /home/ly/.conda/envs/train/bin/python
Traceback (most recent call last):
File "/home/ly/.conda/envs/train/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ly/.conda/envs/train/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ly/train/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-16_13:22:03
host : gzz-SYS-7049GP-TRT
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3110525)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
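For context, the `KeyError: 0` raised on every rank comes from PyTorch's map-style dataset fetcher, which indexes the dataset with integer row positions (`self.dataset[idx]` in `_utils/fetch.py`). The sketch below is a purely hypothetical stand-in, not the LLaMA-Factory code path; it only shows that an object keyed by column names rather than row indices fails in exactly this way.

```python
# Illustration only (hypothetical stand-in, not the actual eval_dataset):
# torch's default fetcher calls `self.dataset[idx]` with integer indices, so a
# mapping keyed by column names instead of row positions raises KeyError: 0.
from torch.utils.data import DataLoader

column_keyed = {"input_ids": [[1, 2, 3]], "labels": [[1, 2, 3]]}

loader = DataLoader(column_keyed, batch_size=4)
next(iter(loader))  # KeyError: 0, matching the traceback above
```

Whatever object reaches the prediction DataLoader here evidently no longer answers integer indices, which is what all four ranks report.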
Reminder
System Info
llamafactory version: 0.9.2.dev0
Reproduction
llamafactory-cli train myscripts/llama3_lora_predict.yaml
YAML configuration:
```yaml
### model
model_name_or_path: /home/ly/model
adapter_name_or_path: saves/Llama-3.1-8B-Instruct/lora/train_2025-02-14-14-23-47
trust_remote_code: true

### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: alpaca_en_demo
template: llama3
cutoff_len: 2048
max_samples: 100
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 4
predict_with_generate: true
ddp_timeout: 180000000
```
Error message: see the full traceback at the top of this issue.
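One hypothetical sanity check (not part of LLaMA-Factory, and the file path is an assumption based on the repo's bundled demo data): confirm that the `alpaca_en_demo` split loads and supports integer indexing outside the trainer, since the prediction log above reports `Num examples = 1` even though `max_samples: 100` is set.

```python
# Hypothetical check, assuming data/alpaca_en_demo.json is the file behind the
# alpaca_en_demo dataset alias; adjust the path if your dataset_info.json differs.
from datasets import load_dataset

eval_ds = load_dataset("json", data_files="data/alpaca_en_demo.json", split="train")
print(len(eval_ds))   # expected to be well above 1
print(eval_ds[0])     # integer indexing should work on a healthy split
```

If this prints the expected row count and a normal example, the data file itself is probably not the issue.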
Others
No response