Actor died without any reason #80

shuminwang-ai opened this issue Apr 16, 2025 · 1 comment

I followed the README except for using Python 3.10. After launching my single-node job, everything looked fine while generating rollouts, but the job then failed with the following output:

Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/nas/wangshumin.wsm/rlspace/data/simplerl/simplelr_qwen_level3to5/train.parquet', 'data.val_files=/nas/wangshumin.wsm/rlspace/data/simplerl/simplelr_qwen_level3to5/test.parquet', 'data.train_batch_size=1024', 'data.val_batch_size=500', 'data.max_prompt_length=1024', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=/nas/wangshumin.wsm/rlspace/model/qwen2.5-7B', 'actor_rollout_ref.actor.optim.lr=5e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.0001', 'actor_rollout_ref.actor.entropy_coeff=0.001', 'actor_rollout_ref.actor.clip_ratio=0.2', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.temperature=1.0', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=160', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.rollout.n=8', 'actor_rollout_ref.rollout.enable_chunked_prefill=False', 'actor_rollout_ref.rollout.max_num_batched_tokens=4072', 'actor_rollout_ref.rollout.micro_rollout_batch_size=1024', 'actor_rollout_ref.ref.log_prob_micro_batch_size=160', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'critic.ppo_micro_batch_size_per_gpu=4', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=verl_train_shumin', 'trainer.remove_previous_ckpt=False', 'trainer.experiment_name=verl-grpo_qwen2.5-7B_lv35_test_qwen2.5-7B_max_response2048_batch1024_rollout8_klcoef0.0001_entcoef0.001_simplelr_qwen_level3to5', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.remove_clip=False', 'trainer.save_freq=5', 'trainer.test_freq=5', 'trainer.default_local_dir=/nas/wangshumin.wsm/rlspace/ckpt/verl-grpo_qwen2.5-7B_lv35_test_qwen2.5-7B_max_response2048_batch1024_rollout8_klcoef0.0001_entcoef0.001_simplelr_qwen_level3to5', 'trainer.total_epochs=20']
Traceback (most recent call last):
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 111, in main
run_ppo(config)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 119, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/root/anaconda3/envs/simplerl/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayActorError): ray::main_task() (pid=1165872, ip=0.0.0.0)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/main_ppo.py", line 205, in main_task
trainer.fit()
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/trainer/ppo/ray_trainer.py", line 819, in fit
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name:
actor_id: 19982502f97c12af80f3d83503000000

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=1166399) /root/anaconda3/envs/simplerl/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 7x across cluster]
(WorkerDict pid=1166399) with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] [repeated 7x across cluster]

---------------------------------------
Job 'raysubmit_CjpGi1eu54kTJMWn' failed
---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/tmp/ray/session_2025-04-16_19-35-10_385103_1143568/runtime_resources/working_dir_files/_ray_pkg_17121a184816c556/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name:
actor_id: 19982502f97c12af80f3d83503000000

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=1166399) /root/anaconda3/envs/simplerl/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 7x across cluster]
(WorkerDict pid=1166399) with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] [repeated 7x across cluster]
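
For reference, ray.exceptions.RayActorError only reports that the worker process exited; the underlying cause (often an out-of-memory kill during rollout) is usually recorded in the per-worker logs under /tmp/ray/session_*/logs, in dmesg if the Linux OOM killer was involved, or in Ray's actor state. A minimal diagnostic sketch, assuming a Ray version recent enough to ship the ray.util.state API (the printed actor_id can be matched against the one in the traceback above):

import ray
from ray.util.state import list_actors

# Attach to the already-running cluster rather than starting a new one.
ray.init(address="auto")

# Request detail=True so the death_cause field is populated, then print
# why each dead actor exited (e.g. OOM kill, node failure, ...).
for actor in list_actors(detail=True):
    if actor.state == "DEAD":
        print(actor.actor_id, actor.class_name, actor.death_cause)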

@Zeng-WH (Collaborator) commented Apr 22, 2025

Hi. There can be many reasons for this; most of the time it is related to a Ray issue.
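
If the actor died from a GPU out-of-memory kill during the vLLM rollout (a frequent culprit behind this exact error), reducing the rollout's memory pressure sometimes helps. A hedged example reusing override keys already present in the launch command above; the values are only illustrative and not tuned for this model/GPU setup:

actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
actor_rollout_ref.rollout.micro_rollout_batch_size=256 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=32

Checking the worker logs or actor state first (see the sketch above) should confirm whether memory is actually the problem before changing the configuration.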
