How to process GVCFs from partially masked genomic data using Haplotypecaller
This work is being done using a WDL produced by WARP pipelines for use on app.terra.bio.
I am attempting to generate GVCF data from WGS CRAM data. Parts of the WGS were masked (using PrintReads) by request of the participants. Unfortunately, it appears the loss of some of these regions may be preventing me from using publicly available interval_lists when processing this data.
Steve
REQUIRED for all errors and issues:
a) GATK version used: 3.5
b) Exact command used, from the stderr:
[Sun Mar 03 00:09:14 GMT 2024] MergeBamAlignment ADD_PG_TAG_TO_READS=false UNMAPPED_BAM=/cromwell_root/fc-secure-c28b1d42-52a4-4332-9fac-e666bca4db63/submissions/deba4bd1-edb9-4cb3-8cac-73b607ca6d5a/BamToUnmappedBams/bb28e1dc-c6b9-4e97-95c9-9d0be709ea4e/call-SortSam/shard-4/H5JGT.2.bam.unmapped.bam ALIGNED_BAM=[/dev/stdin] OUTPUT=H5JGT.2.bam.unmapped.aligned.unsorted.bam PROGRAM_RECORD_ID=bwamem PROGRAM_GROUP_VERSION=0.7.15-r1140 PROGRAM_GROUP_COMMAND_LINE=bwa mem -K 100000000 -p -v 3 -t 16 -Y /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta PROGRAM_GROUP_NAME=bwamem CLIP_ADAPTERS=false IS_BISULFITE_SEQUENCE=false ALIGNED_READS_ONLY=false MAX_INSERTIONS_OR_DELETIONS=-1 ATTRIBUTES_TO_RETAIN=[X0] ATTRIBUTES_TO_REMOVE=[NM, MD] EXPECTED_ORIENTATIONS=[FR] ALIGNER_PROPER_PAIR_FLAGS=true SORT_ORDER=unsorted PRIMARY_ALIGNMENT_STRATEGY=MostDistant ADD_MATE_CIGAR=true UNMAP_CONTAMINANT_READS=true UNMAPPED_READ_STRATEGY=COPY_TO_TAG VALIDATION_STRINGENCY=SILENT MAX_RECORDS_IN_RAM=2000000 REFERENCE_SEQUENCE=/cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta PAIRED_RUN=true ATTRIBUTES_TO_REVERSE=[OQ, U2] ATTRIBUTES_TO_REVERSE_COMPLEMENT=[E2, SQ] READ1_TRIM=0 READ2_TRIM=0 CLIP_OVERLAPPING_READS=true HARD_CLIP_OVERLAPPING_READS=false INCLUDE_SECONDARY_ALIGNMENTS=true MIN_UNCLIPPED_BASES=32 MATCHING_DICTIONARY_TAGS=[M5, LN] VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=2 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=falsec)
Entire program log:
[Sun Mar 03 00:09:14 GMT 2024] Executing as root@63a2f2d18944 on Linux 6.1.75+ amd64; OpenJDK 64-Bit Server VM 1.8.0_275-b01; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.26.10 INFO 2024-03-03 00:09:14 SamAlignmentMerger Processing SAM file(s): [/dev/stdin] [M::bwa_idx_load_from_disk] read 3171 ALT contigs [W::main_mem] when '-p' is in use, the second query file is ignored. [M::process] read 662252 sequences (100000052 bp)... [M::process] 0 single-end sequences; 662252 paired-end sequences INFO 2024-03-03 00:09:40 SamToFastq Processed 1,000,000 records. Elapsed time: 00:00:26s. Time for last 1,000,000: 26s. Last read position: */* [M::process] read 662252 sequences (100000052 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (38, 268054, 22, 56) [M::mem_pestat] analyzing insert size distribution for orientation FF... [M::mem_pestat] (25, 50, 75) percentile: (826, 1330, 2235) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 5053) [M::mem_pestat] mean and std.dev: (1457.06, 939.79) [M::mem_pestat] low and high boundaries for proper pairs: (1, 6462) [M::mem_pestat] analyzing insert size distribution for orientation FR... [M::mem_pestat] (25, 50, 75) percentile: (304, 417, 572) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1108) [M::mem_pestat] mean and std.dev: (452.89, 192.79) [M::mem_pestat] low and high boundaries for proper pairs: (1, 1376) [M::mem_pestat] analyzing insert size distribution for orientation RF... [M::mem_pestat] (25, 50, 75) percentile: (27, 53, 1336) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3954) [M::mem_pestat] mean and std.dev: (671.09, 1056.41) [M::mem_pestat] low and high boundaries for proper pairs: (1, 5263) [M::mem_pestat] analyzing insert size distribution for orientation RR... [M::mem_pestat] (25, 50, 75) percentile: (1121, 1403, 2205) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 4373) [M::mem_pestat] mean and std.dev: (1577.06, 892.90) [M::mem_pestat] low and high boundaries for proper pairs: (1, 5457) [M::mem_pestat] skip orientation FF [M::mem_pestat] skip orientation RF [M::mem_pestat] skip orientation RR [M::mem_process_seqs] Processed 662252 reads in 855.618 CPU sec, 54.952 real sec [M::process] 0 single-end sequences; 662252 paired-end sequences INFO 2024-03-03 00:10:32 AbstractAlignmentMerger Seen many non-increasing record positions. Printing Read-names as well. [Sun Mar 03 00:10:32 GMT 2024] picard.sam.MergeBamAlignment done. Elapsed time: 1.31 minutes. Runtime.totalMemory()=1005060096 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" picard.PicardException: Second read from pair not found in unmapped bam: H5JGTDSX7230707:2:1101:11180:33505, H5JGTDSX7230707:2:1101:11180:33536 at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:432) at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:181) at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:368) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
-
Hi Steven Gage
Looks like when the cram file was sliced to remove unauthorized regions some of the reads seem to have lost their pair therefore processing fails to complete without them. In this case there are a bunch of things that need to be performed such as
1- Removing all singleton reads that have lost their pairs
2- Realigning and merging files with the new set of properly paired reads.
On the other hand if those cram files were already aligned and just had removed some of the regions then you don't need to realign all files again and just continue with variant calling steps.
I hope this helps.
-
I was using WARP pipelines to run variant calling, which requires unmapped BAM files for arguments, but I can circumvent this by using a different variant calling and passing it the aligned BAM files. Thank you!
Steve
Please sign in to leave a comment.
2 comments