Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Setting SPARK resources when using the gatk release version

0

15 comments

  • Avatar
    Tiffany Miller

    Hi Ury Alon , I just wanted to let you know that we are looking into your question. I will report back soon. 

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Ok, this is how you want to run this:

    gatk ApplyBQSRSpark \

    --input input.bam \

    --output output.bam \

    –bsqr-recal-file output.baserecalibrationtable.txt \

    --spark-master local[$NUM_THREADS]

    If spark master is not specified, the tool will use as many threads as there are available cores.

      --spark-runner LOCAL doesn’t actually do anything the way you originally wrote it and is equivalent to running the given tool without the argument.  That’s why the spark specific arguments (like --num-executors ) don’t work here, because they are spark specific and so gatk doesn’t recognize them. The --num-executors and other spark arguments can be specified (separated from the GATK tool arguments by -- as shown here)  if running on a spark cluster using spark-submit (--spark-runner SPARK) or using a  Google cloud dataproc (--spark-runner GCS). I hope this helps!

    2
    Comment actions Permalink
  • Avatar
    Ury Alon

    Thank you Tiffany,

    The main takeaway for me from you answer is that if I would like to control the number of threads I should use the Gatk commands with the Spark postfix, which I didn't, and then the --spark-runner[x] parameter may be specified.  Is that true?

    My use case is running Gatk within docker containers on an EC2 instance in an AWS environment.  The same instance is used to run multiple containers.  I must control the thread allocation manually, as each container "sees" all of the instances physical and logical CPUs and may "think" they are all available.  Is there a guide for using Gatk in such scenario (AWS/ECS)?

    Lastly, and this is a new issue resulting directly from changing the command from MarkDuplicates (for example) to MarkDuplicatesSpark: while the MarkDuplicates is completed successfully and all the output files are generated, the MarkDuplicateSpark (with the exact same input) command fails with the following error:

    htsjdk.samtools.util.RuntimeIOException: Invalid file header in BAM index ./normal/bam/case6.normal.1.bai: 
    at htsjdk.samtools.AbstractBAMFileIndex.verifyIndexMagicNumber(AbstractBAMFileIndex.java:378)
    at htsjdk.samtools.AbstractBAMFileIndex.<init>(AbstractBAMFileIndex.java:70)
    at htsjdk.samtools.AbstractBAMFileIndex.<init>(AbstractBAMFileIndex.java:56)
    at htsjdk.samtools.CachingBAMFileIndex.<init>(CachingBAMFileIndex.java:52)
    at htsjdk.samtools.BAMFileReader.getIndex(BAMFileReader.java:415)
    at org.disq_bio.disq.impl.formats.sam.AbstractSamSource.ensureIndexWillBeClosed(AbstractSamSource.java:132)
    at org.disq_bio.disq.impl.formats.sam.AbstractSamSource.createSamReader(AbstractSamSource.java:104)
    at org.disq_bio.disq.impl.formats.sam.AbstractSamSource.getFileHeader(AbstractSamSource.java:68)
    at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:163)
    at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:127)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:180)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:570)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:549)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:539)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)

    For reference, these are the two commands:

    gatk MarkDuplicates --INPUT ./normal/bam/case6.normal.1.bam --INPUT ./normal/bam/case6.normal.2.bam --OUTPUT mark-duplicates.bam --METRICS_FILE mark-duplicates-metrics.txt

    gatk MarkDuplicatesSpark --input ./normal/bam/case6.normal.1.bam --input ./normal/bam/case6.normal.2.bam --output mark-duplicates.bam --metrics-file mark-duplicates-metrics.txt --spark-master local[4]


    thanks,

      Ury

     

     

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi Ury Alon

    I am sorry but are you completely sure that the index file is intact? Have you tried to regenerate the index via `samtools index` and see whether the either of MarkDuplicates or MarkDuplicatesSpark fails again? If positive then it is really quite a problem to be looked into.

     

    Also it might be helpful for other users to know that --spark-master local[$NUM_THREADS] does not work in zsh environment throwing something like

    zsh: no matches found: local[12]

    It is necessary to switch to bash to execute the command successfully. I suppose it would be a good idea (though not of high priority) to change the CLI syntax to be compatible with multiple shells.

    0
    Comment actions Permalink
  • Avatar
    Ury Alon

    Hi,

    1. samtools index (version 1.9) is used to index the bam files.
    2. The index is intact, please see reproduction steps below.  Whether it is intact or not, there is an issue here: if it's intact, why is MarkDuplicateSpark failing?  If it's not intact, why is MarkDuplicates completed successfully?
    3. I executed the commands again, locally (sam files were generated using minimap2):
    samtools sort normal/sam/case6.normal.1.sam -o case6.normal.1.bam

    samtools sort normal/sam/case6.normal.2.sam -o case6.normal.2.bam

    samtools index case6.normal.1.bam case6.normal.1.bai

    samtools index case6.normal.2.bam case6.normal.2.bai

    gatk MarkDuplicates --INPUT ./case6.normal.1.bam --INPUT ./case6.normal.2.bam --OUTPUT mark-duplicates.bam --METRICS_FILE mark-duplicates-metrics.txt

    gatk MarkDuplicatesSpark --input ./case6.normal.1.bam --input ./case6.normal.2.bam --output mark-duplicates.bam --metrics-file mark-duplicates-metrics.txt --spark-master local[4]

    Once again, MarkDuplicates was completed successfully with output files generated.  MarkDuplicatesSpark still failed, but this time produced a different error message:

    A USER ERROR has occurred: Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs to be queryname sorted or querygroup-sorted for multi-input processing but input ./case6.normal.1.bam was sorted in coordinate order

    This is strange - the input files for samtools are the same ones and the samtools sort & index commands are identical to those executed previously (same software versions for all tools).

    So now I'm seeing two problems: the different results of MarkDuplicate and MarkDuplicateSpark, and the different error results when executing MarkDuplicateSpark on what is supposed to be the same input.

     

    If you'd like me to run the toolchain from the beginning with specific inputs (fastq, reference file) and with specific tools, I will do this gladly, just let me know.

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Look, MarkDuplicatesSpark can work both with coordinate-sorted and queryname-sorted BAMs unless there is multiple input, which happens in your case,

    --input ./case6.normal.1.bam --input ./case6.normal.2.bam

    so the exception MarkDuplicatesSpark throws (about queryname-sorting requirements) is perfectly fine since multiple input requires queryname-sorted input files.

    If I were you, I would take case6.normal.1.sam and case6.normal.2.sam, sort them by queryname, index, then pass to MarkDuplicates and MarkDuplicatesSpark (basically, exactly what you did previously except for sorting order). Please report here what happens once you completed these steps.

    By the way, when running both commands

    gatk MarkDuplicates --INPUT ./case6.normal.1.bam --INPUT ./case6.normal.2.bam --OUTPUT mark-duplicates.bam --METRICS_FILE mark-duplicates-metrics.txt

    gatk MarkDuplicatesSpark --input ./case6.normal.1.bam --input ./case6.normal.2.bam --output mark-duplicates.bam --metrics-file mark-duplicates-metrics.txt --spark-master local[4]

    I do hope you change the output filename and the metrics filename (or the directory to store the output) to be able to compare the outputs further. Otherwise, the latter command overwrites the output of the former one.

    0
    Comment actions Permalink
  • Avatar
    Ury Alon

    Thanks @danilovkiri

    I followed your instructions and made a few additional tests.  Here are the results I got:

    I started off by sorting the sam files by query names:

    samtools sort -n normal/sam/case6.normal.1.sam -o case6.normal.1.bam

    Then attempted to index the sorted BAM file:

    samtools index case6.normal.1.BAM case6.normal.1.bai

    and received an error:

    [E::hts_idx_push] NO_COOR reads not in a single block at the end 4 -1
    samtools index: failed to create index for "case6.normal.1.bam": No such file or directory

    My understanding is that samtools index fails if all the reads with no coordinates are NOT grouped together, so it will fail if sorting was not coordinate-based.

    I also tried:

    1. using gatk SamSort/SamSortSpark and gatk BuildBAMIndex instead of samtools, and got (from BuildBamIndex):

    Error: htsjdk.samtools.SAMException: Input BAM file must be sorted by coordinate"

    2. running gatk MarkDuplicate/MarkDuplicateSpark on QNAME-sorted BAM files without creating indexes: got:

    Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///scratch/mark-duplicates.bam. Sort order is queryname. Offending records are at ...

    and:

    A USER ERROR has occurred: Failed to load reads from ./case6.normal.1.BAM Caused by:Unsupported class file major version 55

    respectively.

    It seems like I'm stuck, and probably not following a good practice.

    So I went back to tutorials #6483/6484, which provides a completely different way for getting a merged BAM file with marked duplicates.  Unfortunately, the tutorial also results with errors from MergeBamAlignment (with the reference inputs provided in the tutorial) , but this is a different issue.

    Just to re-focus on my original issue:

    1. I'm developing a production-grade genomic pipeline built upon AWS infrastructure
    2. For the sake of discussion, I'm implementing a tumor/normal variation calling pipeline
    3. I would like to use Gatk 4.1.7.0 as the main library in my pipeline, currently using the gatk script rather than manually running via the java command
    4. Due to the way AWS is executing jobs, I must have a way of specifying CPUs and/or threads and or memory available for the various Gatk commands
    5. As detailed above, using the spark "flavor" of the Gatk commands results with failures.

    How would you suggest I proceed?

    Thank you again for your attention.

    Ury

     

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Sorry for a delay Ury Alon

    First, my apologies for asking you to index a coordinate-sorted BAM file, it is obviously impossible. I tried to reproduce your problem and here is what I have understood.

    First, I forgot to mention that MarkDuplicates can be run under `gatk` executable script (supplied in gatk 4+ releases). It can be run under picard.jar which is a native way (I guess running under gatk executable is absolutely equal to using picard.jar judging by the log output and tool mechanics).

    I took a merged bam alignment BAM file (uBAM + BWA MEM aligned BAM), produced via

    java -jar -Xmx4G -Xms1G picard.jar MergeBamAlignment \
    R=path_to_hg38.fa CREATE_INDEX=true ADD_MATE_CIGAR=true CLIP_OVERLAPPING_READS=true \
    UNMAPPED_BAM=path_to_fastqtosam.bam \
    ALIGNED_BAM=path_to_raw.bam \
    O=path_to_mergebamalignment.bam \
    TMP_DIR=path_to_TMP_DIR INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=BestMapq ATTRIBUTES_TO_RETAIN=XS

    The default sorting order output for MergeBamAlignment is coordinate-sorted. Then I created two BAM files from a merged BAM for chromosomes 21 and 22 (to make two input files). These files were sorted by coordinate via samtools (just in case, not necessary) and by query name.

    I ran MarkDuplicates under picard.jar using coordinate-sorted files via

    java -jar -Xmx30G -Xms1G /home/wgs/Tools/Picard/picard.jar MarkDuplicates \
    I=./example.chr21.coord_sorted.bam I=./example.chr22.coord_sorted.bam \
    O=./example.chr21-22.coord_sorted.dupmarked.bam TMP_DIR=/mnt/disk2/TMP \
    M=./coord_sorted.metrics.txt

    and using query name-sorted input files via

    java -jar -Xmx30G -Xms1G /home/wgs/Tools/Picard/picard.jar MarkDuplicates \
    I=./example.chr21.name_sorted.bam I=./example.chr22.name_sorted.bam \
    O=./example.chr21-22.name_sorted.dupmarked.bam ASSUME_SORT_ORDER=queryname \
    TMP_DIR=/mnt/disk2/TMP M=./name_sorted.metrics.txt

    Both runs completed successfully. Note that it is compulsory to specify ASSUME_SORT_ORDER=queryname if you are working with such data, otherwise, it will throw an exception when trying to create an output.

    I also tried to run the same analysis under gatk4 executable (not Spark) for coordinate-sorted input files

    /home/wgs/Tools/gatk-4.1.7.0/gatk --java-options "-Xmx30G -Xms1G" \
    MarkDuplicates --INPUT ./example.chr21.coord_sorted.bam \
    --INPUT ./example.chr22.coord_sorted.bam \
    --OUTPUT ./gatk4_markduplicates/example.chr21-22.coord_sorted.dupmarked.bam \
    --METRICS_FILE ./gatk4_markduplicates/coord_sorted.metrics.txt

    and query name-sorted input files

    /home/wgs/Tools/gatk-4.1.7.0/gatk --java-options "-Xmx30G -Xms1G" \
    MarkDuplicates --INPUT ./example.chr21.name_sorted.bam \
    --INPUT ./example.chr22.name_sorted.bam \
    --OUTPUT ./gatk4_markduplicates/example.chr21-22.name_sorted.dupmarked.bam \
    --METRICS_FILE ./gatk4_markduplicates/name_sorted.metrics.txt \
    --ASSUME_SORT_ORDER queryname

    Again, you have to specify --ASSUME_SORT_ORDER queryname argument to process query name-sorted input files. Both runs completed successfully. 

    The first conclusion is that non-Spark Markuplicates can work with both coordinate-sorted and query name-sorted files regardless of the number of input files (Spark MarkDuplicates requires query name-sorted input files if using multiple inputs).

    Then I tried to run Spark MarkDuplicates on the query name-sorted input files, and it threw an exception 

    Caused by: org.broadinstitute.hellbender.exceptions.GATKException:
    Detected multiple mark duplicate records objects corresponding to read with name
    'CL100121543L1C004R035_150796', this could be the result of the file sort order
    being incorrect or that a previous tool has let readnames span multiple partitions

    I looked up the above-mentioned readname in the input file set and it contained the following entries:

    CL100121543L1C004R035_150796 73 chr21 28544409 60 72M28S = 28544409 0 ACATCTATAGGGTACAATGTGATGTTTTATGTATGTGTGTGTATGTATACATGTGTATATATATATATATATATATATATATATATATATATATATTTTT DCBCED>EECDFDFEDD=FEE37DB8BDECFAF7DDEEF=F8DFF?DDFFCFFCFCFDFFFDFEFFFFFFFFFDFDFEFFEEEF5DDFFEFDFEFD?FFE SA:Z:chr3,134259959,+,50S50M,0,0; MD:Z:72 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:72 XS:i:50
    CL100121543L1C004R035_150796 101 chr21 28544523 0 * = 28544523 0 ACATCTATAGGGTACAATGTGATGTTTTATGTATGTGTGTGTATGTATACATGTGTATATATATATATATATATATATATATATATATATATATATTTTT DCBCED>EECDFDFEDD=FEE37DB8BDECFAF7DDEEF=F8DFF?DDFFCFFCFCFDFFFDFEFFFFFFFFFDFDFEFFEEEF5DDFFEFDFEFD?FFE MC:Z:100M PG:Z:bwa RG:Z:barcode MQ:i:60
    CL100121543L1C004R035_150796 133 chr21 28544409 0 * = 28544409 0 CAGCTACTCGGGAGGCTGAGGCAGGAGAATGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCTGAGATAGTGCCACTGCACTCCAGCATGGGCGACAG EDFECEDEEF@D7FEFFE@EEEEFDDFD>CFEDE<F@9EDAACD@DEEEC>EF?=FEEDDF6FCDFEF(DFFAFEEEBDFDE8FFFDECEEFCFEF=FEF MC:Z:72M28S PG:Z:bwa RG:Z:barcode MQ:i:60
    CL100121543L1C004R035_150796 153 chr21 28544523 60 100M = 28544523 0 CTGTCGCCCATGCTGGAGTGCAGTGGCACTATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACACCATTCTCCTGCCTCAGCCTCCCGAGTAGCTG FEF=FEFCFEECEDFFF8EDFDBEEEFAFFD(FEFDCF6FDDEEF=?FE>CEEED@DCAADE9@F<EDEFC>DFDDFEEEE@EFFEF7D@FEEDECEFDE MD:Z:100 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:100 XS:i:76

    The multiple unmapped entries are OK since this is the very essence of MergeBamAlignment to combine mapped and unmapped data. Here comes my question to Bhanu Gandham: why does MarkDuplicatesSpark get angry with multiple mappings while non-Spark version does not. By the way, non-Spark MarkDuplciates marks both mapped reads from the snippet above (with flags 73 and 153) as duplicates anyway (adds PG:Z:MarkDuplicates field).

    I then tried to remove unmapped reads from a MergeBamAlignment BAM and run MarkDuplicatesSpark in the same manner as above. Again, it threw the same exception, but this time it was complaining about another readname CL100121543L1C001R006_144243.

    The input file set contained the following entries for that readname:

    CL100121543L1C001R006_144243 73 chr21 18345741 51 31S69M = 18345741 0 CGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATCCACTGAATATCATGAATAAATACTCAAATGGTTAGCCTATTCAGA FFFFBFFFFEFFFFFFFF<FFFFGFFGFFFFFF@GFFFFFFFFFFFFFFFFFFFGFGFFFFFGFFFGFGFFGF<FFFFGFFGGAGFFFFGFFFGGGGEFF SA:Z:chr17,13680319,-,45S55M,0,0; MD:Z:69 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:69 XS:i:52
    CL100121543L1C001R006_144243 153 chr21 18345791 60 100M = 18345791 0 AATGGTTAGCCTATTCAGAAAAATTTCTAATTAGCTACAATTATATAAACACACATCATGCTCCTATTTATGTTTGGCCATGTTCACAGTCAATAGTTAT ;FFFFFFFGFFFFFFFFFFFFFFFFFFGFFFFGFFFFFFFFFFGFEFFFFFFFFFFFFGEFFFFF@FFGFFF?F@GFFFFFFFFFFGFFFFF>FFFFFFF MD:Z:100 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:100 XS:i:0

    Judging on SAM flag values, both 73 and 153 mean that the corresponding read is mapped, paired, and its mate is unmapped. So we can see two different alignments here, which cause an exception to be thrown.

    Ury Alon try to run MarkDuplicatesSpark with query name-sorted input files, do not try to create indices (remove indices if they have the same basename as the query name-sorted input files). You might not face my problem here depending on your preprocessing steps.

    0
    Comment actions Permalink
  • Avatar
    Ury Alon

    Hi @danilovkiri
    Thanks for getting back to me and performing all these tests.

    First, I confirm that it is my experience that non-SPARK MarkDuplicates works without any issues.
    I also got the same errors as you did from the MarkDuplicatesSpark.

    Now for running MarkDuplicatesSpark without indexing: as I mentioned in my previous post, I already tried that, and received the following error:

    A USER ERROR has occurred: Failed to load reads from ./case6.normal.1.BAM Caused by:Unsupported class file major version 55

    For the sake of clarity, these are the commands I executed:

    READ1=normal/fastq/case6.normal.1.READ1.fastq
    READ2=normal/fastq/case6.normal.1.READ2.fastq
    bwa mem -M hg38.fa ${READ1} ${READ2} > ${SAMPLE_ID}.${LANE_NUMBER}.sam
    samtools sort -n ${SAMPLE_ID}.${LANE_NUMBER}.sam > ${SAMPLE_ID}.${LANE_NUMBER}.bam

    READ1=normal/fastq/case6.normal.2.READ1.fastq
    READ2=normal/fastq/case6.normal.2.READ2.fastq
    bwa mem -M hg38.fa ${READ1} ${READ2} > ${SAMPLE_ID}.${LANE_NUMBER}.sam
    samtools sort -n ${SAMPLE_ID}.${LANE_NUMBER}.sam > ${SAMPLE_ID}.${LANE_NUMBER}.bam

    gatk MarkDuplicatesSpark --input ${SAMPLE_ID}.1.bam --input ${SAMPLE_ID}.2.bam --output ${SAMPLE_ID}.mark-duplicates.bam --metrics-file ${SAMPLE_ID}.mark-duplicates-metrics.txt --spark-master local[4]

    If, by any chance, you think the problem is with the FASTQ files I use, please let me know which FASTQ files to use (such as the 6484_snippet* files found in the tutorial), but please keep in mind the above issue occurrs for multi-lane duplicate marking, so I'll probably need precise instructions for having multiple FASTQ file pairs.

    Thanks,

      Ury

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi Ury Alon

    Sorry for a delay. I just googled the exception

    Caused by:Unsupported class file major version 55

    and found that it might be related to the java version you are using. Please have a look at this https://stackoverflow.com/questions/53583199/pyspark-error-unsupported-class-file-major-version and report back.

     

    0
    Comment actions Permalink
  • Avatar
    Ury Alon

    Hi @danilovkiri,

    Thanks for finding the cause for this error.

    I've been taking my time setting up a fully reproducible environment for running the discussed pipeline over test data, based on the supported library (java, python etc.) versions.

    I think I have the docker and the pipeline commands in place, but I can't find a multi-lane FASTQ sample (probably a small one) to work with.  The ones I have fail and I think the problem is with the data:

    htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGHWI-ST1353:302:C3GEKACXX:7:1101:18641:5023

    Can you please refer me to a 2/4/x-lane FASTQ file set that can be used for testing?

    0
    Comment actions Permalink
  • Avatar
    danilovkiri

    Hi.

    Unfortunately, I cannot provide you with any file sets.

    When did you get the above-mentioned error? Which tool did you run? This error was not referenced here before. 

    Googling this error results in the following:

    The input BAM file contains more than one read with the same name (look here). You can fix that by running samtools view -f 0x2 (or -f 2) and pass the filtered by -f 0x2 criterion BAM output to MarcDuplicates. Look at https://broadinstitute.github.io/picard/explain-flags.html for flag designation. 

    Look for the read name from the exception in your both panel-specific bam files. It must not be present in both of them, otherwise, it is another strange problem.

    You have mentioned an issue with MergeBamAlignment. Could you please elaborate?

    "So I went back to tutorials #6483/6484, which provides a completely different way for getting a merged BAM file with marked duplicates.  Unfortunately, the tutorial also results with errors from MergeBamAlignment (with the reference inputs provided in the tutorial), but this is a different issue."

    0
    Comment actions Permalink
  • Avatar
    Ury Alon

    I ended up taking "real" 4-lane FASTQ files (huge ones), truncating them and hoping for the best.

    It turns out the problem before was in the data, as I suspected, and this time I succeeded to go from FASTQ to MarkDuplicates.

    Then I moved to MarkDuplicatesSpark, and it worked too!  SUCCESS!!

    Finally, this is the script I used:

    #!/bin/bash
    set -x
    set -e

    SAMPLE_ID=P01
    for l in {1..4}
    do
      READ1=L${l}-P01-READ1.fastq.gz
      READ2=L${l}-P01-READ2.fastq.gz
      LANE_NUMBER=${l}
      gatk --spark-runner LOCAL FastqToSam --FASTQ ${READ1} --FASTQ2 ${READ2} --OUTPUT ${LANE_NUMBER}.fastqtosam.bam --READ_GROUP_NAME ${SAMPLE_ID}${LANE_NUMBER} --SAMPLE_NAME ${SAMPLE_ID} --LIBRARY_NAME lib${SAMPLE_ID} --SORT_ORDER queryname
      gatk --spark-runner LOCAL MarkIlluminaAdapters --INPUT ${LANE_NUMBER}.fastqtosam.bam --OUTPUT ${LANE_NUMBER}.markilluminaadapters.bam --METRICS ${LANE_NUMBER}.markilluminaadapters_metrics.txt
      gatk --spark-runner LOCAL SamToFastq --INPUT ${LANE_NUMBER}.markilluminaadapters.bam --FASTQ ${LANE_NUMBER}.interleaved.fastq --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 --INTERLEAVE true --NON_PF true
      bwa mem -M -t 16 -p hg38.fa ${LANE_NUMBER}.interleaved.fastq > ${LANE_NUMBER}.sam
      samtools sort -@ 4 -n ${LANE_NUMBER}.sam -o ${LANE_NUMBER}.querysorted.bam
      gatk MergeBamAlignment --REFERENCE_SEQUENCE hg38.fa --UNMAPPED_BAM ${LANE_NUMBER}.fastqtosam.bam --ALIGNED_BAM ${LANE_NUMBER}.querysorted.bam --OUTPUT ${LANE_NUMBER}.mergebamalignment.bam --CREATE_INDEX false --ADD_MATE_CIGAR true --CLIP_ADAPTERS false --CLIP_OVERLAPPING_READS true --INCLUDE_SECONDARY_ALIGNMENTS true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY MostDistant --ATTRIBUTES_TO_RETAIN XS
      samtools sort -@ 4 -n ${LANE_NUMBER}.mergebamalignment.bam -o ${LANE_NUMBER}.bam
    done

    gatk MarkDuplicatesSpark --input 1.bam --input 2.bam --input 3.bam --input 4.bam --output ${SAMPLE_ID}.mark-duplicates.bam --metrics-file ${SAMPLE_ID}.mark-duplicates-metrics.txt --spark-master local[4]

    I'm not sure that the first sorting by queryname is required - will have to check it.

    BTW, it also worked without sorting the BAM file after MergeBamAlignment when I specified the following option:

    --allow-multiple-sort-orders-in-input true

    but I'm not sure it produces the desired output.

     

    Regarding MergeBamAlignment - once I got the commands and their parameters in order - no more errors.  I can't even remember what was the problem back then.

     

    0
    Comment actions Permalink
  • Avatar
    Amin

    Hello Ury Alon,

    You said:

    I'm not sure that the first sorting by queryname is required - will have to check it.

    I checked and I can confirm that the first sorting is not required. I left the complete BASH script here: https://gatk.broadinstitute.org/hc/en-us/articles/360039568932/comments/360010691471

    0
    Comment actions Permalink
  • Avatar
    Ury Alon

    @Amin, thanks for verifying this.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk