I followed the README except that I used Python 3.10. After launching my single-node job, everything seemed fine while generating rollouts, but afterwards the job failed with the following output:
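For reference, the job was submitted roughly as follows. This is only a minimal sketch reconstructed from the Hydra overrides in the error output below: it assumes the README's `ray job submit` entrypoint into `verl.trainer.main_ppo`, the dashboard address and runtime-env flags are placeholders, and the override list is abbreviated.

```bash
# Single node: start a local Ray head, then submit the training job to it.
ray start --head

# Sketch of the submission command; the full override list is the one shown in the error output below.
ray job submit \
  --address http://127.0.0.1:8265 \
  --runtime-env-json '{"working_dir": "."}' \
  -- python3 -m verl.trainer.main_ppo \
       algorithm.adv_estimator=grpo \
       actor_rollout_ref.rollout.name=vllm \
       actor_rollout_ref.rollout.n=8 \
       actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
       trainer.n_gpus_per_node=8 \
       trainer.nnodes=1
       # ...plus the remaining data/actor/critic overrides listed below
```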
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/nas/wangshumin.wsm/rlspace/data/simplerl/simplelr_qwen_level3to5/train.parquet', 'data.val_files=/nas/wangshumin.wsm/rlspace/data/simplerl/simplelr_qwen_level3to5/test.parquet', 'data.train_batch_size=1024', 'data.val_batch_size=500', 'data.max_prompt_length=1024', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=/nas/wangshumin.wsm/rlspace/model/qwen2.5-7B', 'actor_rollout_ref.actor.optim.lr=5e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.0001', 'actor_rollout_ref.actor.entropy_coeff=0.001', 'actor_rollout_ref.actor.clip_ratio=0.2', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.temperature=1.0', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=160', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.rollout.n=8', 'actor_rollout_ref.rollout.enable_chunked_prefill=False', 'actor_rollout_ref.rollout.max_num_batched_tokens=4072', 'actor_rollout_ref.rollout.micro_rollout_batch_size=1024', 'actor_rollout_ref.ref.log_prob_micro_batch_size=160', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'critic.ppo_micro_batch_size_per_gpu=4', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=verl_train_shumin', 'trainer.remove_previous_ckpt=False', 'trainer.experiment_name=verl-grpo_qwen2.5-7B_lv35_test_qwen2.5-7B_max_response2048_batch1024_rollout8_klcoef0.0001_entcoef0.001_simplelr_qwen_level3to5', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.remove_clip=False', 'trainer.save_freq=5', 'trainer.test_freq=5', 'trainer.default_local_dir=/nas/wangshumin.wsm/rlspace/ckpt/verl-grpo_qwen2.5-7B_lv35_test_qwen2.5-7B_max_response2048_batch1024_rollout8_klcoef0.0001_entcoef0.001_simplelr_qwen_level3to5', 'trainer.total_epochs=20']
Traceback (most recent call last):
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 111, in main
run_ppo(config)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 119, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayActorError): ray::main_task() (pid=1165872, ip=0.0.0.0)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 205, in main_task
trainer.fit()
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/ppo/ray_trainer.py", line 819, in fit
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name:
actor_id: 19982502f97c12af80f3d83503000000
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=1166399) /root/anaconda3/envs/simplerl/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 7x across cluster]
(WorkerDict pid=1166399)   with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined] [repeated 7x across cluster]
---------------------------------------
Job 'raysubmit_CjpGi1eu54kTJMWn' failed
---------------------------------------
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name:
actor_id: 19982502f97c12af80f3d83503000000
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=1166399) /root/anaconda3/envs/simplerl/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 7x across cluster]
(WorkerDict pid=1166399)   with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined] [repeated 7x across cluster]
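As the trace suggests, setting `HYDRA_FULL_ERROR=1` yields the complete stack trace, and the logs of the actor that died (actor id `19982502f97c12af80f3d83503000000` above) are where the underlying failure reason should appear. A sketch, assuming Ray 2.x's state CLI is available and using the same placeholder dashboard address as above:

```bash
# Resubmit with HYDRA_FULL_ERROR=1 exported into the job's runtime environment.
ray job submit \
  --address http://127.0.0.1:8265 \
  --runtime-env-json '{"working_dir": ".", "env_vars": {"HYDRA_FULL_ERROR": "1"}}' \
  -- python3 -m verl.trainer.main_ppo  # ...same overrides as in the original run

# Pull the dead actor's own logs (requires the Ray state CLI in this Ray version).
ray logs actor --id 19982502f97c12af80f3d83503000000

# Fallback: search the session's raw worker/raylet logs for that actor id.
grep -ri 19982502f97c12af80f3d83503000000 \
  /tmp/ray/session_2025-04-16_19-35-10_385103_1143568/logs/
```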