Workflow "out of memory" issue

Dear AnVIL Community,
I’d like to hear about your experience with the WholeGenomeGermlineSingleSample v3.3.4 workflow. I consistently encounter “out of memory” errors at the MarkDuplicates stage.

From my understanding, the memory_multiplier parameter controls this step. I have experimented with several values:

  • 34, 68, 70, 80 → all returned out-of-memory errors.

  • 100, 250, 300 → returned the error “Invalid value for field ‘resource.properties.machineType’”, which I believe indicates that GCP rejected the request due to excessive resource allocation.
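The pattern in these results can be reproduced with a bit of arithmetic. This is only a sketch under two assumptions: a MarkDuplicates base memory of ~7.5 GB (a figure that comes up later in this thread), and a ~624 GiB ceiling as my guess at the largest machine the backend will provision before rejecting the machineType:

```python
# Back-of-envelope: what each memory_multiplier value actually requests.
# ASSUMPTIONS (not confirmed values): BASE_GIB is the MarkDuplicates base
# memory, and MACHINE_CEILING_GIB is the largest machine the backend will
# provision before returning a machineType error.
BASE_GIB = 7.5
MACHINE_CEILING_GIB = 624

def requested_gib(memory_multiplier: float) -> float:
    """Memory the task would ask for at a given multiplier."""
    return memory_multiplier * BASE_GIB

for m in (34, 68, 70, 80, 100, 250, 300):
    r = requested_gib(m)
    status = "machineType rejected" if r > MACHINE_CEILING_GIB else "machine provisioned"
    print(f"multiplier {m:>3} -> {r:7.1f} GiB  ({status})")
```

Under those assumptions, multipliers 34–80 fit on a real machine (which then OOMs inside the task), while 100+ exceed anything the backend can provision — matching the split between the two error types above.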

Since I am working with large uBAM files (400 samples, total size is about 30 TB), I am unsure how best to configure these parameters to complete the workflow successfully. I have attached my current inputs.json file for reference.

Please advise on how to properly set the parameters (particularly memory and disk sizing) so that the workflow can run successfully on large inputs. I’d also be happy to provide any additional details that would help in troubleshooting.

I greatly appreciate any insight you can share.

# inputs.json:
{
  "WholeGenomeGermlineSingleSample.CollectRawWgsMetrics.read_length": "${151}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.ApplyBQSR.gatk_docker": "${}",
  "WholeGenomeGermlineSingleSample.BamToGvcf.make_bamout": "${false}",
  "WholeGenomeGermlineSingleSample.fingerprint_genotypes_file": "gs://dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf",
  "WholeGenomeGermlineSingleSample.CollectRawWgsMetrics.memory_multiplier": "${4}",
  "WholeGenomeGermlineSingleSample.references": "${{
    "contamination_sites_ud": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.UD",
    "contamination_sites_bed": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.bed",
    "contamination_sites_mu": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.contam.mu",
    "calling_interval_list": "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list",
    "reference_fasta": {
      "ref_dict": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
      "ref_fasta": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
      "ref_fasta_index": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
      "ref_alt": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt",
      "ref_sa": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa",
      "ref_amb": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb",
      "ref_bwt": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt",
      "ref_ann": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann",
      "ref_pac": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac"
    },
    "known_indels_sites_vcfs": [
      "gs://gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
      "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
    ],
    "known_indels_sites_indices": [
      "gs://gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
      "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
    ],
    "dbsnp_vcf": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
    "dbsnp_vcf_index": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
    "evaluation_interval_list": "gs://gcp-public-data--broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
    "haplotype_database_file": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.haplotype_database.txt"
  }}",
  "WholeGenomeGermlineSingleSample.sample_and_unmapped_bams": "${{
    "sample_name": this.sample_name_id,
    "base_file_name": this.base_file_name,
    "flowcell_unmapped_bams": this.flowcell_unmapped_bams,
    "final_gvcf_base_name": this.final_gvcf_base_name,
    "unmapped_bam_suffix": ".bam"
  }}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.SortSampleBam.memory_multiplier": "${34}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.GatherBamFiles.additional_disk": "${1000}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.MarkDuplicates.read_name_regex": "${null}",
  "WholeGenomeGermlineSingleSample.cloud_provider": "gcp",
  "WholeGenomeGermlineSingleSample.BamToGvcf.SortBamout.additional_disk": "${1000}",
  "WholeGenomeGermlineSingleSample.CollectRawWgsMetrics.additional_disk": "${1000}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.ApplyBQSR.memory_multiplier": "${8}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.BaseRecalibrator.gatk_docker": "${}",
  "WholeGenomeGermlineSingleSample.BamToGvcf.HaplotypeCallerGATK4.memory_multiplier": "${8}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.ApplyBQSR.additional_disk": "${1000}",
  "WholeGenomeGermlineSingleSample.BamToGvcf.make_gvcf": "${true}",
  "WholeGenomeGermlineSingleSample.wgs_coverage_interval_list": "gs://gcp-public-data--broad-references/hg38/v0/wgs_coverage_regions.hg38.interval_list",
  "WholeGenomeGermlineSingleSample.BamToGvcf.SortBamout.memory_multiplier": "${20}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.MarkDuplicates.additional_disk": "${1500}",
  "WholeGenomeGermlineSingleSample.papi_settings": "${{"preemptible_tries": 3, "agg_preemptible_tries": 3}}",
  "WholeGenomeGermlineSingleSample.BamToCram.ValidateCram.memory_multiplier": "${4}",
  "WholeGenomeGermlineSingleSample.scatter_settings": "${{"haplotype_scatter_count": 50, "break_bands_at_multiples_of": 1000000}}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.MarkDuplicates.memory_multiplier": "${80}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.GatherBamFiles.memory_multiplier": "${4}",
  "WholeGenomeGermlineSingleSample.UnmappedBamToAlignedBam.GatherBqsrReports.gatk_docker": "${}",
  "WholeGenomeGermlineSingleSample.AggregatedBamQC.CheckFingerprintTask.memory_size": "${1000}"
}

Hi @yxhan ,

I’m not sure if this is a Terra issue or perhaps an issue/something going on with the workflow itself. I’d recommend reaching out to Terra support via email (support@terra.bio) or through the AnVIL Menu > Support > Contact Us.

Thanks!
Ava

This is their response: “Unfortunately we can’t advise on specific memory values to use, but it may help to know that the amount of memory it’s possible to configure a machine with depends on its CPU count - so you may be able to access higher amounts of memory by increasing your core count.”

I have already tried many times, and I have requested more memory than GCP/Terra/WARP can provide. Are there any other teams I can reach out to?
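For context, the CPU-memory coupling the support team mentions can be made concrete. On GCP, n1 custom machine types allow roughly 0.9–6.5 GiB of memory per vCPU (without the extended-memory option), and the vCPU count must be 1 or an even number. A rough sketch of the minimum core count for a given memory request, assuming those n1 limits are what applies here:

```python
import math

# ASSUMPTION: the n1 custom machine-type cap of ~6.5 GiB of memory per vCPU
# (without extended memory) is what constrains the backend's machine choice.
MAX_GIB_PER_VCPU = 6.5

def min_vcpus(memory_gib: float) -> int:
    """Smallest valid vCPU count that satisfies the per-vCPU memory cap."""
    n = math.ceil(memory_gib / MAX_GIB_PER_VCPU)
    if n > 1 and n % 2 == 1:  # custom types need 1 vCPU or an even count
        n += 1
    return n

print(min_vcpus(600))  # a 600 GiB request needs at least 94 vCPUs
```

In other words, “increase your core count” only helps if the workflow exposes a CPU parameter for the task; raising memory alone can push the request outside any valid machine shape.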

Thanks!

Hi @yxhan,

This is a tricky problem; it’s not immediately clear to me where the memory_multiplier parameter is being used. Perhaps in this sub-workflow? If that’s correct, then 80*7.5 is already 600 GB. Is it possible to subset the .bam file? Said another way, are there any .bam files that are succeeding?

I saw you opened an issue on GitHub, which should hopefully get you closer to a solution!

Thanks!
Ava

Hi Ava,

We are resuming from the shutdown and apologize for the delayed response.

I submitted the job using a data table, from which I can select the workflow’s input parameters; memory_multiplier is one of them.

Based on the file sizes (400 samples, about 30 TB total in FASTQ.gz format), our uBAMs are about 500–600 GB each, converted from 75–100 GB FASTQ.gz files. If I’m interpreting your answer correctly, then regardless of how we adjust the memory multiplier, these files are too large for the available cloud machine types to handle. Is that correct?
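As a sanity check on these sizes, and on the disk side of the question: the ~6–7x expansion from FASTQ.gz to uBAM follows directly from the numbers quoted, while the 2.25x scratch factor below is purely my assumption (input + output + ~25% sort spill), not a WARP value:

```python
def estimated_disk_gb(ubam_gb: float, scratch_factor: float = 2.25) -> int:
    """Hypothetical disk estimate: input + output + ~25% spill.
    The 2.25 factor is an illustrative assumption, not a WARP default."""
    return int(ubam_gb * scratch_factor)

for fastq_gb, ubam_gb in ((75, 500), (100, 600)):
    ratio = ubam_gb / fastq_gb
    print(f"{fastq_gb} GB fastq.gz -> {ubam_gb} GB uBAM (~{ratio:.1f}x), "
          f"disk >= {estimated_disk_gb(ubam_gb)} GB")
```

For what it’s worth, the MarkDuplicates.additional_disk of 1500 in our inputs is in the same ballpark as that estimate, so disk sizing is probably not the limiting factor here — memory is.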

Unfortunately, none of the uBAMs succeeded, although some progressed further than others.

Thanks,
Yixing

Hi @yxhan ,

I’m sorry you are still having trouble with this. Again, I’m not sure exactly what’s happening, but here are a few other thoughts that might help you find the solution:

  • Is there a maxRetries parameter that can be set or changed?
  • Do any Java-based tools need to be told explicitly how much RAM is available?
  • Are there any additional considerations in the switch from Cromwell to Batch?
  • Does the workflow succeed with a smaller, subsetted file? Are the files that get further along smaller ones?
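On the Java point in particular: a JVM tool such as Picard MarkDuplicates will not grow its heap beyond the -Xmx it was launched with, so if the task’s -Xmx is not scaled along with memory_multiplier, extra container memory simply goes unused. A sketch of the usual pattern — give the JVM most of the container’s memory, minus some off-heap headroom; the 15% headroom here is an illustrative assumption, not a value from the WARP WDLs:

```python
def jvm_heap_gib(container_gib: float, headroom_fraction: float = 0.15) -> int:
    """Heap to pass as -Xmx: container memory minus off-heap headroom.
    The 15% headroom default is an assumption for illustration only."""
    return max(1, int(container_gib * (1 - headroom_fraction)))

# e.g. a 600 GiB container would then run roughly: java -Xmx510g ... MarkDuplicates
print(jvm_heap_gib(600))
```

It may be worth checking in the task’s stderr whether the -Xmx actually used grew with the multiplier, or stayed fixed while the container got bigger.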

Ultimately I think you will get more helpful information from the GATK/WARP community.

Thanks!
Ava