Consensus job failed repeatedly #2374


Open
weihanlau opened this issue Mar 29, 2025 · 5 comments

Comments

@weihanlau

weihanlau commented Mar 29, 2025

Hi everyone, I'm trying to run hicanu (canu 2.2) with some PacBio HiFi reads, but I repeatedly run into a failure at the consensus job stage. Does anyone know how this can be resolved?

Here's my command for it:

canu -p hicanu -d "$out_dir" genomeSize=1.05g -pacbio-hifi "$hifi_reads"

And here's what is shown in the canu.out log:

--   Stages to run:
--     assemble HiFi reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--
-- BEGIN ASSEMBLY
-- Using slow alignment for consensus (iteration '2').
-- Configured 83 consensus jobs.
--
-- Grid:  cns        3.125 GB    8 CPUs  (consensus)
--
--
-- Consensus jobs failed, tried 2 times, giving up.
--   job ctgcns/0079.cns FAILED.
--   job ctgcns/0080.cns FAILED.
--   job ctgcns/0081.cns FAILED.
--   job ctgcns/0082.cns FAILED.

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

If it helps, my input file to the main command is a .fq.gz, which canu should be able to work with.

Thank you so much for your help!

Wei Han

@skoren
Member

skoren commented Mar 31, 2025

You're running an older version, and there have been several fixes to consensus since v2.2. I'd suggest trying v2.3; you should be able to restart this assembly with v2.3, but make a backup of the assembly folder to be safe. If that still doesn't work, post the logs of the failed jobs, e.g. <assembly dir>/unitigging/5-consensus/*_79.out.
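A minimal sketch of the backup-and-restart step (the `canu-2.3` path and the `$out_dir`/`$hifi_reads` variables are placeholders matching the original command, not exact paths):

```shell
# Back up the existing assembly directory before restarting (path is a placeholder).
cp -r "$out_dir" "${out_dir}.bak"

# Re-run the same command with the v2.3 binary; canu resumes from the
# existing state in the assembly directory rather than starting over.
/path/to/canu-2.3/build/bin/canu -p hicanu -d "$out_dir" \
    genomeSize=1.05g -pacbio-hifi "$hifi_reads"
```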

@weihanlau
Author

weihanlau commented Apr 1, 2025

Hello! Thank you so much for the reply and the suggestions. Really appreciate it! I ran canu 2.3 and ran into the same problems.

Here's a relevant part of the canu.out file. It looks like there's no memory associated with the Grid: cns tag. Could this be a problem?

Found PacBio HiFi reads in 'hicanu.seqStore':
--   Libraries:
--     PacBio HiFi:           1
--   Reads:
--     Corrected:             36280148066
--     Corrected and Trimmed: 36280148066
--
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     25.000 GB    7 CPUs  (k-mer counting)
-- Grid:  cormhap   25.000 GB   11 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB   11 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB   11 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       16.000 GB    4 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Generating assembly 'hicanu' in '/scratch/weihan99/Cidindela_formosa_v2/results/assembly_hicanu':
--   genomeSize:
--     1050000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.0000 (  0.00%)
--     obtOvlErrorRate 0.0250 (  2.50%)
--     utgOvlErrorRate 0.0100 (  1.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.0000 (  0.00%)
--     obtErrorRate    0.0250 (  2.50%)
--     utgErrorRate    0.0003 (  0.03%)
--     cnsErrorRate    0.0500 (  5.00%)
--
--   Stages to run:
--     assemble HiFi reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--
-- BEGIN ASSEMBLY
-- Using slow alignment for consensus (iteration '2').
-- Configured 83 consensus jobs.
--
-- Grid:  cns        4.500 GB    8 CPUs  (consensus)
--
--
-- Consensus jobs failed, tried 2 times, giving up.
--   job ctgcns/0079.cns FAILED.
--   job ctgcns/0080.cns FAILED.
--   job ctgcns/0081.cns FAILED.
--   job ctgcns/0082.cns FAILED.
--

ABORT:
ABORT: canu 2.3
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

As requested, here is the consensus.28289772_80.out logs of one of the failed jobs in /unitigging/5-consensus (the other logs are almost identical):

Found perl (from '/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/perl/5.36.1/bin/perl'):
  /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Compiler/gcccore/perl/5.36.1/bin/perl
  This is perl 5, version 36, subversion 1 (v5.36.1) built for x86_64-linux-thread-multi

Found java (from '/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/java/17.0.6/bin/java'):
  /cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v3/Core/java/17.0.6/bin/java
  Picked up JAVA_TOOL_OPTIONS: -Xmx2g

Found canu (from '/scratch/weihan99/Cidindela_formosa_v2/canu-2.3/build/bin/canu'):
  /scratch/weihan99/Cidindela_formosa_v2/canu-2.3/build/bin/canu
  canu 2.3

Running job 80 based on SLURM_ARRAY_TASK_ID=80 and offset=0.
-- Using seqFile '../hicanu.ctgStore/partition.0080'.
-- Opening tigStore '../hicanu.ctgStore' version 1.
-- Opening output results file './ctgcns/0080.cns.WORKING'.
-- Loading corrected-trimmed reads from seqFile '../hicanu.ctgStore/partition.0080'
/local/var/spool/slurmd/job28290063/slurm_script: line 104: 15815 Killed                  $bin/utgcns -R ../hicanu.ctgStore/partition.$jobid -T ../hicanu.ctgStore 1 -P $jobid -O ./ctgcns/$jobid.cns.WORKING -maxcoverage 40 -e 0.05 -em 0.05 -pbdagcon -edlib -threads 8
slurmstepd: error: Detected 1 oom_kill event in StepId=28290063.batch. Some of the step tasks have been OOM Killed.

It looks like an OOM error killed the process. In Job 79, the consensus job actually begins but fails with the same sort of error. I've included a snippet of this below. Any ideas what I can do about this?

   4803     62379      78        54    4.41x        0    0.00x        24    3.15x
   4805     55176      84     /local/var/spool/slurmd/job28290815/slurm_script: line 104: 198643 Killed                  $bin/utgcns -R ../hicanu.ctgStore/partition.$jobid -T ../hicanu.ctgStore 1 -P $jobid -O ./ctgcns/$jobid.cns.WORKING -maxcoverage 40 -e 0.05 -em 0.05 -pbdagcon -edlib -threads 8
slurmstepd: error: Detected 1 oom_kill event in StepId=28290815.batch. Some of the step tasks have been OOM Killed.

I am launching canu on the head node with the basic command:

canu -p hicanu -d "$out_dir" genomeSize=1.05g -pacbio-hifi "$hifi_reads"

Thank you so much for your help!

Wei Han

@skoren
Member

skoren commented Apr 1, 2025

Looks like your grid is killing these jobs for exceeding their memory limit. Canu estimates the memory needed (it requested 4.5 GB in this case), but the estimate was apparently too low here. You can increase it by adding the option cnsMemory=20, which raises each job's allocation from 4.5 GB to 20 GB.
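For example, the original command with the extra option added (a sketch; cnsMemory is given in gigabytes, and `$out_dir`/`$hifi_reads` are the same placeholders as before):

```shell
# Same invocation as before, with the consensus memory request raised to 20 GB.
canu -p hicanu -d "$out_dir" genomeSize=1.05g \
    cnsMemory=20 \
    -pacbio-hifi "$hifi_reads"
```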

@weihanlau
Author

weihanlau commented Apr 2, 2025

Hello! Thanks again for that.

The consensus jobs appear to have run after bumping up the memory as you suggested. However, the pipeline still fails. Here is what is shown in the canu.out file:

-- BEGIN ASSEMBLY
-- Using slow alignment for consensus (iteration '1').
-- Configured 83 consensus jobs.
-- All 83 consensus jobs finished successfully.
-- Finished stage 'consensusCheck', reset canuIteration.
-- Using slow alignment for consensus (iteration '0').
-- Configured 83 consensus jobs.
-- Using slow alignment for consensus (iteration '0').
-- Configured 83 consensus jobs.
----------------------------------------
-- Starting command on Wed Apr  2 00:01:39 2025 with 2535986.137 GB free disk space

    cd unitigging
    /scratch/weihan99/Cidindela_formosa_v2/canu-2.3/build/bin/tgStoreLoad \
      -S ../hicanu.seqStore \
      -T  ./hicanu.ctgStore 2 \
      -L ./5-consensus/ctgcns.files \
    > ./5-consensus/ctgcns.files.ctgStoreLoad.err 2>&1

-- Finished on Wed Apr  2 00:02:54 2025 (75 seconds) with 2535956.186 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Wed Apr  2 00:02:54 2025 with 2535956.186 GB free disk space

    cd unitigging
    /scratch/weihan99/Cidindela_formosa_v2/canu-2.3/build/bin/tgTigDisplay \
      -S ../hicanu.seqStore \
      -b \
      -o ../hicanu.contigs.unsorted.bam \
      -L ./5-consensus/ctgcns.files \
    > ./5-consensus/ctgcns.files.contigs.bam.err 2>&1

-- Finished on Wed Apr  2 00:44:21 2025 (2487 seconds) with 2535771.967 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Wed Apr  2 00:44:21 2025 with 2535771.967 GB free disk space

    cd .
    samtools sort \
      --write-index \
      --threads 2 \
      -o ./hicanu.contigs.bam ./hicanu.contigs.unsorted.bam \
    > ./ctgcns.bam.err 2>&1
slurmstepd: error: *** JOB 28302961 ON gra151 CANCELLED AT 2025-04-02T01:01:44 DUE TO TIME LIMIT ***

I'm assuming this is an OOM issue as well? I can't seem to find the corresponding log file for the failed job under /unitigging/.

Should I try to bump up the memory of each of the tags for each corresponding module (e.g. canu -p hicanu -d "$out_dir" genomeSize=1.05g merylMemory=30 cormhapMemory=30 obtovlMemory=30 utgovlMemory=30 corMemory=30 ovbMemory=30 ovsMemory=30 redMemory=30 oeaMemory=30 batMemory=30 cnsMemory=30 -pacbio-hifi "$hifi_reads")?

Or, would simply using the masterMemory=30 option be good enough?

Thank you so much! Really appreciate your help here.

Wei Han

@skoren
Member

skoren commented Apr 4, 2025

This is the canu executor script, and it's not memory, it's time: the job was cancelled for hitting its time limit. You should add a default time limit to canu's submit commands via gridOptions=--time=24:00:00 or a similar reasonable value within the maximum allowed by your grid.
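Concretely, something like the following (a sketch; the 24-hour limit is an example value to adjust to your cluster's policy, and cnsMemory=20 is carried over from the earlier fix):

```shell
# Ask SLURM for up to 24 hours per canu-submitted job, so long-running
# steps like the samtools sort aren't cancelled at the default time limit.
canu -p hicanu -d "$out_dir" genomeSize=1.05g \
    cnsMemory=20 \
    gridOptions=--time=24:00:00 \
    -pacbio-hifi "$hifi_reads"
```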
