Wait method for jobs / higher level job API #240

Closed
JonaOtto opened this issue Jul 26, 2022 · 2 comments

@JonaOtto
Contributor

Hello pyslurm developers,
I work on an HPC performance tool for my university. We want to enable the tool to dispatch measurement executions of a target code to our cluster, which uses SLURM. Ideally, we want to use pyslurm for this.
What we need is a way to:

  1. Dispatch jobs to the cluster: Already possible with job.submit_batch_job.
  2. Wait for a job to finish, so that we can examine the results. So ideally something like a blocking method job.wait(job_id) would be nice, which you could call to wait for a job (referenced by the job_id) to finish.
    I'm a pyslurm newbie, but as far as I understand, there is no such thing in pyslurm at the moment. As far as I understand there would be several possibilities building such behavior with some combinations of the find, find_id and get methods from the job class.

What do you think would be the best approach here? Would it make sense to build this behavior into pyslurm, or is this something our tool should take care of itself?

I have to dive deeper into the code, but if there is anything on this topic I can help with, I would be happy to do so. Generally, we would like to contribute back whatever knowledge we gain during the process, whether as code or not. Another possibility would be to see how it turns out on our side first and then contribute back the code/interface we developed, or even just some comments for others on how we did it.

Thanks for doing this great project, I'm excited to hear your thoughts!

Best,
Jonathan

@tazend
Member

tazend commented Jul 26, 2022

Hi @JonaOtto

For the Job API, I'm currently working in #224 to rework the whole API structure a bit, to support more features and to hopefully make it easier to interact with the job interface, i.e. supporting more methods like cancel, update, suspend, hold and so on... But there are still some things to do until it's done :)

Anyway, for your specific problem right now with the current codebase: sbatch also has the --wait flag which blocks until the job terminates. So I just had a look at how they do this here
They basically continuously fetch the data for a specific job ID and check whether the job is in a finished state.

This could easily be replicated in pyslurm I guess, making a function as you said wait (or wait_finished) which wraps around the functionality of find_id (which does slurm_load_job) and simply stays in a while-loop until it is determined that the job has actually finished (using the IS_JOB_FINISHED(job) macro) - at which point the blocking is released.
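
A rough sketch of that polling loop, written independently of pyslurm so it can be tried without a cluster. The names `wait_finished`, `is_finished`, `load_job` and `UNFINISHED_STATES` are made up for illustration and are not the actual pyslurm implementation. It assumes the loader returns a mapping with a `job_state` key holding a Slurm base state name such as "RUNNING" or "COMPLETED", and it approximates IS_JOB_FINISHED by treating every state other than PENDING, RUNNING and SUSPENDED as finished.

```python
import time
from typing import Any, Callable, Mapping

# Base states in which a job has NOT finished yet. Assumption: the state is
# reported as a base state name; flag-style names such as "CONFIGURING"
# would need extra handling and are not covered by this sketch.
UNFINISHED_STATES = frozenset({"PENDING", "RUNNING", "SUSPENDED"})


def is_finished(state: str) -> bool:
    """Return True if the state name is past PENDING/RUNNING/SUSPENDED."""
    return state.upper() not in UNFINISHED_STATES


def wait_finished(
    job_id: int,
    load_job: Callable[[int], Mapping[str, Any]],
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Block until load_job(job_id) reports a finished state.

    load_job is a hypothetical callable that fetches fresh data for one
    job on every call. The last fetched record is returned.
    """
    while True:
        info = load_job(job_id)
        if is_finished(info["job_state"]):
            return info
        # Not finished yet: pause before asking the controller again.
        sleep(poll_interval)
```

With pyslurm, the loader could probably be something like `lambda jid: pyslurm.job().find_id(jid)[0]`, assuming find_id returns a list with one dictionary per job; that shape should be checked against the installed version.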

I could take a look at this when I have the time, otherwise if you want to give it a try and do a PR afterwards, go ahead :)

@JonaOtto
Contributor Author

Hi @tazend,
Thanks for the input! I did not know that sbatch can do this. I will think about it and see what I end up doing. That's either doing it with pyslurm or, given that we really only need this small fraction of the whole API, just calling sbatch with the --wait flag, I guess. In case I do it in pyslurm, you will get my PR (probably within the next week). In case we decide to use the flag, I will close this issue.
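
For reference, the sbatch route could look roughly like the sketch below. The helper names `build_sbatch_wait_command` and `submit_and_wait` are hypothetical; only the `sbatch` command and its `--wait` flag come from the discussion above. With `--wait`, sbatch does not exit until the job terminates and its exit code matches that of the job, so the return code is passed through to the caller.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple


def build_sbatch_wait_command(
    script_path: str, extra_args: Sequence[str] = ()
) -> List[str]:
    """Build the argument list for a blocking sbatch call."""
    return ["sbatch", "--wait", *extra_args, script_path]


def submit_and_wait(
    script_path: str,
    extra_args: Sequence[str] = (),
    run: Callable = subprocess.run,
) -> Tuple[int, str]:
    """Submit a batch script and block until the job has terminated.

    Returns sbatch's exit code and its standard output. The `run`
    parameter exists only so the call can be replaced in tests.
    """
    cmd = build_sbatch_wait_command(script_path, extra_args)
    completed = run(cmd, capture_output=True, text=True)
    return completed.returncode, completed.stdout
```

This blocks the calling process for the whole lifetime of the job, so a tool that dispatches several measurements at once would need to run it in a thread or a separate process.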

tazend pushed a commit that referenced this issue Sep 9, 2022
* Fix introduced typo in partition information dictionary key. (#241)

* Added wait_finished method to job class (#240).

* Added test method for wait_finished method of the job class.

* Added _load_single_job method to the job class to extract the slurm_load_job functionality.

* Updated find_id and wait_finished to use _load_single_job.

Co-authored-by: Jonathan Goodson <jonathan.goodson@gmail.com>
@tazend tazend closed this as completed Sep 9, 2022
tazend pushed a commit that referenced this issue Sep 11, 2022