
Compare performance of CWL implementations #103

Open
dleehr opened this issue Dec 1, 2017 · 19 comments
Labels
cwl-engine Issue related to a specific CWL engine

Comments

@dleehr
Member

dleehr commented Dec 1, 2017

Toil adds a lot of complexity to running the workflows and makes them harder to debug. We assume it is the best CWL implementation to use because of its parallelism, but we don't have data to back that up. Even if it is faster, it's not clear that the performance improvements over cwl-runner, for example, outweigh the complexity at this point.

@dleehr dleehr added this to the Beta milestone Dec 1, 2017
@dleehr
Member Author

dleehr commented Dec 1, 2017

It's also worth considering how many of the open issues here are specifically related to toil.

@dleehr
Member Author

dleehr commented Dec 4, 2017

Work in progress on this right now. Reconfiguring dev to use cwltool, and I will rerun a job that succeeded with cwltoil.

@dleehr
Member Author

dleehr commented Dec 6, 2017

cwl-airflow summary:

  • Apache Incubator project. Uses a relational database for persistence: SQLite by default, but Postgres/MySQL are also supported

  • Out of the box, Airflow includes several "workflows" that can be run, aka DAGs. These are registered with the service, and jobs can be submitted against them

  • cwl-airflow aims to make CWL workflows available as DAGs.

  • Requires some configuration changes to the airflow.cfg file to indicate where workflows are stored

  • Jobs can be run automatically by placing a JSON file in the jobs/new directory, provided the file name matches a workflow name

  • This is not directly compatible with our workflow directory structure, but jobs can be manually submitted:

      cwl-airflow-runner workflows/bespin-cwl/workflows/exomeseq.cwl exomeseq.json 
    
  • This command requested a job to start and parsed the workflow, but it failed with the error Scatter is not supported

  • Could not get packed workflows to run

  • No obvious location for input data files, but having them in the directory with the job JSON file would cause the scheduler to think they were all job files, so they would need to be elsewhere and likely referenced by absolute path.

  • Airflow can run multiple concurrent jobs, but it's not yet clear how a workflow and a job are related. And since scatter is not supported, there wouldn't be much benefit from concurrency.
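The drop-directory convention above can be sketched as follows. This is an illustration only: the directory layout and job content are placeholders, not our actual structure.

```shell
# Hedged sketch of cwl-airflow's drop-directory convention: a job JSON placed
# in jobs/new is picked up by the scheduler when its file name matches a
# registered workflow. All paths and the job content here are made up.
mkdir -p jobs/new
printf '{"threads": 4}\n' > /tmp/exomeseq-job.json
# The file name must match the workflow name (exomeseq here):
cp /tmp/exomeseq-job.json jobs/new/exomeseq.json
ls jobs/new
```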

@johnbradley
Collaborator

johnbradley commented Dec 6, 2017

Rabix Bunny summary:

  • Installed on Ubuntu with:

      sudo apt-get update
      sudo apt install docker.io
      sudo apt-get install default-jdk
      wget https://github.com/rabix/bunny/releases/download/v1.0.3/rabix-1.0.3.tar.gz -O rabix-1.0.3.tar.gz && tar -xvf rabix-1.0.3.tar.gz

  • Run commands similar to cwltool:

      ./rabix <cwl-workflow> <job-order>

@dleehr
Member Author

dleehr commented Dec 6, 2017

@johnbradley what about performance for rabix bunny? Does it run scatter tasks concurrently?

@dleehr
Member Author

dleehr commented Dec 6, 2017

Since I already had an idle VM for airflow testing with the data on it, I installed rabix bunny.

Initially it failed due to relative file paths in the input job order, but a quick sed script fixed that:

$ sed -e 's|"path": "SA0|"path": "/work/data_for_job_9/SA0|g' job-9.json > job-9-abspaths.json
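A slightly more careful version of that rewrite only prefixes paths that are actually relative. This is a sketch; the data directory and the sample job file are assumptions for illustration.

```shell
# Sketch: make every relative "path" entry in a CWL job order absolute by
# prefixing a data directory. DATA_DIR and the sample job file are made up.
DATA_DIR=/work/data_for_job_9
printf '{"reads": {"class": "File", "path": "SA001.fastq.gz"}}\n' > job.json
# Only touch paths that do not already start with a slash:
sed -e "s|\"path\": \"\([^/]\)|\"path\": \"$DATA_DIR/\1|g" job.json > job-abspaths.json
cat job-abspaths.json
```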
It started up lots of tasks under a single Java process, with something like 2000% CPU usage, so it's definitely doing parallelism. But it failed on the first step:
# rabix-cli-1.0.3/rabix workflows/bespin-cwl/workflows/exomeseq.cwl job-9-abspaths.json 
[2017-12-06 16:16:08.599] [INFO] Job root.preprocessing.3.file_pair_details has started
[2017-12-06 16:16:08.602] [INFO] Job root.preprocessing.10.file_pair_details has started
[2017-12-06 16:16:08.603] [INFO] Job root.preprocessing.22.file_pair_details has started
[2017-12-06 16:16:08.604] [INFO] Job root.preprocessing.17.file_pair_details has started
[2017-12-06 16:16:08.604] [INFO] Job root.preprocessing.5.file_pair_details has started
[2017-12-06 16:16:08.604] [INFO] Job root.preprocessing.7.file_pair_details has started
[2017-12-06 16:16:08.604] [INFO] Job root.preprocessing.16.file_pair_details has started
[2017-12-06 16:16:08.604] [INFO] Job root.preprocessing.15.file_pair_details has started
[2017-12-06 16:16:08.605] [INFO] Job root.preprocessing.19.file_pair_details has started
[2017-12-06 16:16:08.605] [INFO] Job root.preprocessing.24.file_pair_details has started
[2017-12-06 16:16:08.606] [INFO] Job root.preprocessing.11.file_pair_details has started
[2017-12-06 16:16:08.606] [INFO] Job root.preprocessing.2.file_pair_details has started
[2017-12-06 16:16:08.607] [INFO] Job root.preprocessing.22.make_bait_interval_list has started
[2017-12-06 16:16:08.607] [INFO] Job root.preprocessing.24.make_bait_interval_list has started
[2017-12-06 16:16:08.607] [INFO] Job root.preprocessing.2.make_target_interval_list has started
[2017-12-06 16:16:08.608] [INFO] Job root.preprocessing.9.file_pair_details has started
[2017-12-06 16:16:08.608] [INFO] Job root.preprocessing.1.make_target_interval_list has started
[2017-12-06 16:16:08.608] [INFO] Job root.preprocessing.11.make_target_interval_list has started
[2017-12-06 16:16:08.609] [INFO] Job root.preprocessing.22.make_target_interval_list has started
[2017-12-06 16:16:08.609] [INFO] Job root.preprocessing.14.make_target_interval_list has started
[2017-12-06 16:16:08.608] [INFO] Job root.preprocessing.6.file_pair_details has started
[2017-12-06 16:16:08.609] [INFO] Job root.preprocessing.6.make_bait_interval_list has started
[2017-12-06 16:16:08.611] [INFO] Job root.preprocessing.12.file_pair_details has started
[2017-12-06 16:16:08.611] [INFO] Job root.preprocessing.16.make_bait_interval_list has started
[2017-12-06 16:16:08.611] [INFO] Job root.preprocessing.18.file_pair_details has started
[2017-12-06 16:16:08.612] [INFO] Job root.preprocessing.14.make_bait_interval_list has started
[2017-12-06 16:16:08.612] [INFO] Job root.preprocessing.20.file_pair_details has started
[2017-12-06 16:16:08.612] [INFO] Job root.preprocessing.13.make_bait_interval_list has started
[2017-12-06 16:16:08.612] [INFO] Job root.preprocessing.6.make_target_interval_list has started
[2017-12-06 16:16:08.613] [INFO] Job root.preprocessing.1.file_pair_details has started
[2017-12-06 16:16:08.934] [INFO] Job root.preprocessing.8.file_pair_details has started
[2017-12-06 16:16:09.074] [INFO] Job root.preprocessing.14.file_pair_details has started
[2017-12-06 16:16:09.156] [INFO] Job root.preprocessing.13.file_pair_details has started
[2017-12-06 16:16:09.265] [INFO] Job root.preprocessing.23.file_pair_details has started
[2017-12-06 16:16:09.703] [INFO] Job root.preprocessing.4.file_pair_details has started
[2017-12-06 16:16:09.817] [INFO] Job root.preprocessing.21.file_pair_details has started
[2017-12-06 16:16:34.520] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:34.540] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:34.557] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:34.600] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:34.636] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:34.787] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:34.790] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:35.876] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:35.877] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:36.611] [INFO] Pulling docker image dukegcb/picard:2.10.7
[2017-12-06 16:16:50.308] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.318] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.324] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.329] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.333] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.341] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_primary.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.345] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_primary.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.352] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_primary.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.354] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.361] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_primary.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.367] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_primary.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:50.373] [INFO] Running command line: java -Xmx4g -jar /opt/picard/picard.jar BedToIntervalList I= /data/exome-seq/capture/SeqCap_EZ_Exome_v3_primary.noChr.bed O= list.interval_list SD= /data/exome-seq/b37/human_g1k_v37.fasta
[2017-12-06 16:16:58.078] [ERROR] Job root.preprocessing.11.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.149] [ERROR] Job root.preprocessing.1.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.149] [ERROR] Job root.preprocessing.16.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.149] [ERROR] Job root.preprocessing.6.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.149] [ERROR] Job root.preprocessing.13.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.149] [ERROR] Job root.preprocessing.24.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.149] [ERROR] Job root.preprocessing.2.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.151] [INFO] Job root.preprocessing.11.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.151] [INFO] Job root.preprocessing.16.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.151] [INFO] Job root.preprocessing.13.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.151] [INFO] Job root.preprocessing.24.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.151] [INFO] Job root.preprocessing.2.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.152] [INFO] Job root.preprocessing.6.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.152] [INFO] Job root.preprocessing.1.make_target_interval_list failed with exit code 1.
[2017-12-06 16:16:58.163] [ERROR] Job root.preprocessing.22.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.163] [INFO] Job root.preprocessing.22.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.164] [WARN] Job root.preprocessing.16.make_bait_interval_list, rootId: 1c683c18-9114-4714-9699-879341cf2426 failed: Job root.preprocessing.16.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.179] [WARN] Root job 1c683c18-9114-4714-9699-879341cf2426 failed.
[2017-12-06 16:16:58.238] [ERROR] Job root.preprocessing.6.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.238] [INFO] Job root.preprocessing.4.make_bait_interval_list has started
[2017-12-06 16:16:58.239] [INFO] Job root.preprocessing.6.make_bait_interval_list failed with exit code 1.
[2017-12-06 16:16:58.239] [INFO] Job root.preprocessing.12.make_target_interval_list has started
[2017-12-06 16:16:58.239] [INFO] Job root.preprocessing.21.make_target_interval_list has started
[2017-12-06 16:16:58.240] [INFO] Job root.preprocessing.5.make_target_interval_list has started
[2017-12-06 16:16:58.240] [INFO] Job root.preprocessing.4.make_target_interval_list has started
[2017-12-06 16:16:58.240] [INFO] Job root.preprocessing.21.make_bait_interval_list has started
[2017-12-06 16:16:58.240] [INFO] Job root.preprocessing.16.make_target_interval_list has started
[2017-12-06 16:16:58.308] [INFO] Job root.preprocessing.9.make_bait_interval_list has started

@johnbradley
Collaborator

There is also an issue on rabix bunny with scatter and TES/funnel: rabix/bunny#382.

@dleehr
Copy link
Member Author

dleehr commented Dec 6, 2017

I haven't had much luck finding logs from the failed rabix jobs, but I was able to get docker logs from the failed job's container:

root@dan-cwl-test-airflow:/work/data_for_job_9/rabix-cli-1.0.3# docker logs nifty_perlman
16:16:54.477 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 06 16:16:54 UTC 2017] picard.util.BedToIntervalList INPUT=/data/exome-seq/capture/SeqCap_EZ_Exome_v3_capture.noChr.bed SEQUENCE_DICTIONARY=/data/exome-seq/b37/human_g1k_v37.fasta OUTPUT=list.interval_list    SORT=true UNIQUE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Dec 06 16:16:54 UTC 2017] Executing as root@032518a4b888 on Linux 4.4.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12; Deflater: Intel; Inflater: Intel; Picard version: 2.10.7-SNAPSHOT
[Wed Dec 06 16:16:54 UTC 2017] picard.util.BedToIntervalList done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1012924416
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Could not find dictionary next to reference file /data/exome-seq/b37/human_g1k_v37.fasta
	at htsjdk.variant.utils.SAMSequenceDictionaryExtractor$TYPE$1.extractDictionary(SAMSequenceDictionaryExtractor.java:59)
	at htsjdk.variant.utils.SAMSequenceDictionaryExtractor.extractDictionary(SAMSequenceDictionaryExtractor.java:134)
	at picard.util.BedToIntervalList.doWork(BedToIntervalList.java:118)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:228)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Picard's BedToIntervalList expects a secondary human_g1k_v37.dict file next to the reference_genome .fasta file. The exomeseq workflow does request this as a secondaryFile on reference_genome, but rabix does not mount it into the container the way cwltool and cwltoil do.
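For reference, the relevant input declaration looks roughly like this. This is an illustrative reconstruction of the pattern, not the exact exomeseq.cwl text:

```yaml
inputs:
  reference_genome:
    type: File
    secondaryFiles:
      - ^.dict   # human_g1k_v37.dict, the file Picard could not find
      - .fai
```

A conformant runner must stage the declared secondaryFiles next to the FASTA inside the container, which is what cwltool and cwltoil do and rabix bunny apparently does not.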

@dleehr
Member Author

dleehr commented Dec 6, 2017

The bunny secondaryFiles-as-inputs bug is discussed in rabix/bunny#211, but the state described there (working locally but not with TES) isn't consistent with what I'm seeing.

@johnbradley
Collaborator

Quote from that rabix/bunny issue above:

We have confirmed that this issue is not specific to the TES backend.

@johnbradley
Collaborator

Looking into Arvados.
There is a version that is supposed to run entirely within docker: https://doc.arvados.org/install/arvbox.html

However, running this on Ubuntu 16.04, it gets stuck starting up and keeps printing this:

Waiting for keepstore0 keepstore1 keepproxy vm ...
Waiting for keepstore0 keepstore1 keepproxy vm ...

I let it run for an hour and it didn't get past this step.

There are separate manual installation instructions that I will look into: https://doc.arvados.org/install/index.html

@dleehr
Member Author

dleehr commented Jan 3, 2018

Some promising initial results with cwl-tes and funnel

  1. Requires Funnel, the TES server. It's written in Go, and a single Linux binary is downloadable. The server starts with funnel server run and listens on an HTTP port
  2. Instead of cwl-runner workflow input, the command-line is cwl-tes --tes http://localhost:8000 workflow input
  3. cwl-tes worked fine with our packed workflow and relative paths in our job input file.
  4. Funnel is very chatty, sending messages back and forth for stdout/stderr
  5. The CWL task steps run in parallel, but they were all submitted at the same time: 48 samples meant 48 concurrent fastqc processes were kicked off.

Funnel does appear to support some clusters/schedulers, so there may be a way to limit this.

@dleehr
Member Author

dleehr commented Jan 3, 2018

Actually, it does appear that cwl-tes extracts ResourceRequirements (CPU/RAM/disk) from the CWL and provides them to the TES server:

https://github.com/common-workflow-language/cwl-tes/blob/735eb5b2533d997dc54bf6be15f60da7170a915b/cwl_tes/tes.py#L298-L341
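So a step carrying a ResourceRequirement like the following should have its CPU, RAM, and disk needs forwarded into the TES task. The values below are made up for illustration:

```yaml
requirements:
  ResourceRequirement:
    coresMin: 4
    ramMin: 16384      # MiB in CWL; cwl-tes converts these for the TES resources
    outdirMin: 51200
```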

@dleehr
Member Author

dleehr commented Jan 3, 2018

Funnel's built-in web interface is pretty handy, and the tasks are named based on the CWL step name:

[Screenshot, 2018-01-03: Funnel web UI listing tasks named after CWL workflow steps]

There isn't a lot of info about an EXECUTOR_ERROR compared to a SYSTEM_ERROR. The executor errors have exit codes (137) that seem to correspond to Java messages I saw fly by about insufficient memory. All of the make_x_interval_list jobs failed, and I'd guess that's related to secondaryFiles (e.g. .idx, .dict) not being pulled in, which is also an issue with rabix bunny and its TES interface.

These tasks do have the memory and CPU requirements from the workflow annotated on them, and I do believe Funnel is doing its best to schedule them.

@adamstruck

Hi everyone! I am the author of cwl-tes and a lead developer of Funnel. Regarding Funnel's chattiness: that is a configurable option in the worker and can be turned off. We just cut a new Funnel release today with breaking changes (release notes).

Please let me know if you have any questions/comments regarding either project. My colleagues and I would be happy to help.

@dleehr
Member Author

dleehr commented Sep 13, 2018

Currently running a workflow under cwltool --parallel. It seems to be working well on the single VM; however, there is no logging output while the workflow is running.

@dleehr dleehr added the cwl-engine Issue related to a specific CWL engine label Sep 13, 2018
@dleehr
Member Author

dleehr commented Sep 24, 2018

Happy with the performance on a single VM using cwltool --parallel in https://github.com/Duke-GCB/gcb-ansible/pull/90. The next advancement would be to distribute steps across more than one VM, but this is resolved for single VMs.

@dleehr dleehr closed this as completed Sep 24, 2018
@dleehr dleehr reopened this Sep 25, 2018
@dleehr
Member Author

dleehr commented Sep 25, 2018

Reopening since this will be useful for future enhancements/scalability.

@dleehr
Member Author

dleehr commented Mar 6, 2019

Hey, https://github.com/duke-gcb/calrissian looks pretty good 😆

@dleehr dleehr removed this from the Beta milestone Mar 15, 2019