I followed the README except that I used Python 3.10. After launching my single-node job, everything seemed fine while generating rollouts, but afterwards the job failed with the following output:
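For reference, the job was submitted roughly as follows. This is only a minimal sketch reconstructed from the Hydra overrides in the error output below: it assumes the README's `ray job submit` entrypoint into `verl.trainer.main_ppo`, the dashboard address and runtime-env flags are placeholders, and the override list is abbreviated.

```bash
# Single node: start a local Ray head, then submit the training job to it.
ray start --head

# Sketch of the submission command; the full override list is the one shown in the error output below.
ray job submit \
  --address http://127.0.0.1:8265 \
  --runtime-env-json '{"working_dir": "."}' \
  -- python3 -m verl.trainer.main_ppo \
       algorithm.adv_estimator=grpo \
       actor_rollout_ref.rollout.name=vllm \
       actor_rollout_ref.rollout.n=8 \
       actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
       trainer.n_gpus_per_node=8 \
       trainer.nnodes=1
       # ...plus the remaining data/actor/critic overrides listed below
```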
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/nas/wangshumin.wsm/rlspace/data/simplerl/simplelr_qwen_level3to5/train.parquet', 'data.val_files=/nas/wangshumin.wsm/rlspace/data/simplerl/simplelr_qwen_level3to5/test.parquet', 'data.train_batch_size=1024', 'data.val_batch_size=500', 'data.max_prompt_length=1024', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=/nas/wangshumin.wsm/rlspace/model/qwen2.5-7B', 'actor_rollout_ref.actor.optim.lr=5e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.0001', 'actor_rollout_ref.actor.entropy_coeff=0.001', 'actor_rollout_ref.actor.clip_ratio=0.2', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.temperature=1.0', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=160', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.rollout.n=8', 'actor_rollout_ref.rollout.enable_chunked_prefill=False', 'actor_rollout_ref.rollout.max_num_batched_tokens=4072', 'actor_rollout_ref.rollout.micro_rollout_batch_size=1024', 'actor_rollout_ref.ref.log_prob_micro_batch_size=160', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'critic.ppo_micro_batch_size_per_gpu=4', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=verl_train_shumin', 'trainer.remove_previous_ckpt=False', 'trainer.experiment_name=verl-grpo_qwen2.5-7B_lv35_test_qwen2.5-7B_max_response2048_batch1024_rollout8_klcoef0.0001_entcoef0.001_simplelr_qwen_level3to5', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.remove_clip=False', 'trainer.save_freq=5', 'trainer.test_freq=5', 'trainer.default_local_dir=/nas/wangshumin.wsm/rlspace/ckpt/verl-grpo_qwen2.5-7B_lv35_test_qwen2.5-7B_max_response2048_batch1024_rollout8_klcoef0.0001_entcoef0.001_simplelr_qwen_level3to5', 'trainer.total_epochs=20']
Traceback (most recent call last):
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 111, in main
run_ppo(config)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 119, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayActorError): ray::main_task() (pid=1165872, ip=0.0.0.0)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 205, in main_task
trainer.fit()
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/ppo/ray_trainer.py", line 819, in fit
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name:
actor_id: 19982502f97c12af80f3d83503000000
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=1166399) /root/anaconda3/envs/simplerl/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 7x across cluster]
(WorkerDict pid=1166399)   with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined] [repeated 7x across cluster]
---------------------------------------
Job 'raysubmit_CjpGi1eu54kTJMWn' failed
---------------------------------------
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name:
actor_id: 19982502f97c12af80f3d83503000000
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=1166399) /root/anaconda3/envs/simplerl/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 7x across cluster]
(WorkerDict pid=1166399)   with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined] [repeated 7x across cluster]
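As the trace suggests, setting `HYDRA_FULL_ERROR=1` yields the complete stack trace, and the logs of the actor that died (actor id `19982502f97c12af80f3d83503000000` above) are where the underlying failure reason should appear. A sketch, assuming Ray 2.x's state CLI is available and using the same placeholder dashboard address as above:

```bash
# Resubmit with HYDRA_FULL_ERROR=1 exported into the job's runtime environment.
ray job submit \
  --address http://127.0.0.1:8265 \
  --runtime-env-json '{"working_dir": ".", "env_vars": {"HYDRA_FULL_ERROR": "1"}}' \
  -- python3 -m verl.trainer.main_ppo  # ...same overrides as in the original run

# Pull the dead actor's own logs (requires the Ray state CLI in this Ray version).
ray logs actor --id 19982502f97c12af80f3d83503000000

# Fallback: search the session's raw worker/raylet logs for that actor id.
grep -ri 19982502f97c12af80f3d83503000000 \
  /tmp/ray/session_2025-04-16_19-35-10_385103_1143568/logs/
```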