Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a dynamic number of threads instead of 4 for parsing #23

Closed
AlbanSagouis opened this issue Feb 13, 2023 · 3 comments · Fixed by #32
Closed

Use a dynamic number of threads instead of 4 for parsing #23

AlbanSagouis opened this issue Feb 13, 2023 · 3 comments · Fixed by #32
Assignees

Comments

@AlbanSagouis
Copy link
Contributor

On the user side there are these two functionsrgnparser::gn_parse() and rgnparser::gn_parse_tidy(). Both call the non-exported functionrgnparser::parse_one() that calls rgnparser::gnparser_cmd() that calls gnparser.

Both exported functions have an argument threads with a default value of 4 and it would make more sense to me if this number was higher.

We could dynamically detect the number of available cores with parallel::detectCores(), I believe it is installed by default with R, and set threads = number of cores - 1.

@AlbanSagouis
Copy link
Contributor Author

Well actually, gnparser by default runs on as many cores as there are on the system, see here:

-j, --jobs (positive integer, default is a number of CPUs on a machine)

The number of jobs running concurrently. This flag is ignored when parsing one name:

gnparser -j 200 names.txt

So I think we should offer the user the possibility to change this value but, if possible, we should let gnparser determine the number of threads by itself. Maybe rgnparser::gnparser_cmd() has to pass this argument?

I see two use cases where the users might want to set the number of core themselves:

  • they are already running the function on a distributed pipeline and want to use only one thread
  • they don't want to saturate their system

Additional point: whether we implement this or not, running several parallel calls to rgnparser::gn_parse(x, threads = 4) at the same time on any system seems wrong. Maybe we should warn users that gnparser runs in parallel by default? Especially if we detect that the system is an HPC?

AlbanSagouis added a commit that referenced this issue Feb 13, 2023
Instead of using 4 by default in `gn_parse_tidy()`
@AlbanSagouis
Copy link
Contributor Author

Oh... actually threads has a default value of 4 in gn_parse_tidy() but not in gn_parse() where it's NULL and works correctly of course so that solves most of my concerns.

Now remains the question of using several cores when running parallel calls to rgnparser::gn_parse(): is there something to worry about? should we try to detect these situations and warn users?

AlbanSagouis added a commit that referenced this issue Feb 13, 2023
@joelnitta
Copy link
Contributor

joelnitta commented Sep 19, 2023

I think the correct course of action here is to set the default number of threads for gn_parse() and gn_parse_tidy() to 1. Setting to NULL could correspond to the default value of gnparser (i.e., the number of threads available on the machine).

Reasons:

  • gnparser is already quite fast (approx 9,000 names / sec on one thread), which is probably fine for most use cases.
  • It can be problematic for software to automatically use all threads especially on shared machines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants