Setting SPARK resources when using the gatk release version
I'm using GATK 4.1.7.0 downloaded from the release page.
I'm trying to configure the local SPARK engine's executors, CPUs and memory.
Following the syntax specified in various posts produces parameter syntax errors, whether for the local[n] master setting or for the --conf 'spark.num_executors=n' / --num_executors=n parameters.
-
Hi Ury Alon, I just wanted to let you know that we are looking into your question. I will report back soon.
-
Ok, this is how you want to run this:
gatk ApplyBQSRSpark \
--input input.bam \
--output output.bam \
--bqsr-recal-file output.baserecalibrationtable.txt \
--spark-master local[$NUM_THREADS]
If spark master is not specified, the tool will use as many threads as there are available cores.
--spark-runner LOCAL doesn't actually do anything the way you originally wrote it and is equivalent to running the given tool without the argument. That's why the Spark-specific arguments (like --num-executors) don't work here: they are Spark-specific, so gatk doesn't recognize them. The --num-executors and other Spark arguments can be specified (separated from the GATK tool arguments by --, as shown here) when running on a Spark cluster using spark-submit (--spark-runner SPARK) or on Google Cloud Dataproc (--spark-runner GCS).
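For example, a cluster invocation would look roughly like this (the master URL and the resource values are placeholders for your own setup):
gatk MarkDuplicatesSpark \
--input input.bam \
--output output.bam \
-- \
--spark-runner SPARK \
--spark-master spark://<master-host>:7077 \
--num-executors 5 \
--executor-cores 2 \
--executor-memory 4g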
I hope this helps!
-
Thank you Tiffany,
The main takeaway for me from your answer is that if I would like to control the number of threads, I should use the GATK commands with the Spark suffix (which I didn't), and then the --spark-master local[x] parameter may be specified. Is that right?
My use case is running GATK within Docker containers on an EC2 instance in an AWS environment. The same instance is used to run multiple containers. I must control the thread allocation manually, as each container "sees" all of the instance's physical and logical CPUs and may "think" they are all available. Is there a guide for using GATK in such a scenario (AWS/ECS)?
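For illustration, this is roughly what I have in mind (the image name and the resource values are placeholders):
# Cap the container at 4 CPUs and tell Spark the same number,
# instead of letting it discover all of the host's cores.
docker run --cpus=4 --memory=16g my-gatk-image \
gatk MarkDuplicatesSpark --input in.bam --output out.bam --spark-master local[4]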
Lastly, and this is a new issue resulting directly from changing the command from MarkDuplicates (for example) to MarkDuplicatesSpark: while MarkDuplicates completes successfully and all the output files are generated, the MarkDuplicatesSpark command (with the exact same input) fails with the following error:
htsjdk.samtools.util.RuntimeIOException: Invalid file header in BAM index ./normal/bam/case6.normal.1.bai:
at htsjdk.samtools.AbstractBAMFileIndex.verifyIndexMagicNumber(AbstractBAMFileIndex.java:378)
at htsjdk.samtools.AbstractBAMFileIndex.<init>(AbstractBAMFileIndex.java:70)
at htsjdk.samtools.AbstractBAMFileIndex.<init>(AbstractBAMFileIndex.java:56)
at htsjdk.samtools.CachingBAMFileIndex.<init>(CachingBAMFileIndex.java:52)
at htsjdk.samtools.BAMFileReader.getIndex(BAMFileReader.java:415)
at org.disq_bio.disq.impl.formats.sam.AbstractSamSource.ensureIndexWillBeClosed(AbstractSamSource.java:132)
at org.disq_bio.disq.impl.formats.sam.AbstractSamSource.createSamReader(AbstractSamSource.java:104)
at org.disq_bio.disq.impl.formats.sam.AbstractSamSource.getFileHeader(AbstractSamSource.java:68)
at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:163)
at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:127)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:180)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:570)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:549)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:539)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
For reference, these are the two commands:
gatk MarkDuplicates --INPUT ./normal/bam/case6.normal.1.bam --INPUT ./normal/bam/case6.normal.2.bam --OUTPUT mark-duplicates.bam --METRICS_FILE mark-duplicates-metrics.txt
gatk MarkDuplicatesSpark --input ./normal/bam/case6.normal.1.bam --input ./normal/bam/case6.normal.2.bam --output mark-duplicates.bam --metrics-file mark-duplicates-metrics.txt --spark-master local[4]
Thanks,
Ury
-
Hi Ury Alon
I am sorry, but are you completely sure that the index file is intact? Have you tried regenerating the index via `samtools index` to see whether either MarkDuplicates or MarkDuplicatesSpark fails again? If so, then it is really quite a problem to be looked into.
Also, it might be helpful for other users to know that --spark-master local[$NUM_THREADS] does not work in a zsh environment, throwing something like
zsh: no matches found: local[12]
It is necessary to switch to bash to execute the command successfully. I suppose it would be a good idea (though not of high priority) to make the CLI syntax compatible with multiple shells.
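That said, quoting the argument should also work, since the failure comes from zsh expanding the square brackets as a glob pattern before gatk ever sees them:
# Quoting prevents zsh from treating [12] as a glob pattern:
gatk MarkDuplicatesSpark --input input.bam --output output.bam --spark-master 'local[12]'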
-
Hi,
- samtools index (version 1.9) is used to index the bam files.
- The index is intact; please see the reproduction steps below. Whether it is intact or not, there is an issue here: if it is intact, why does MarkDuplicatesSpark fail? If it is not intact, why does MarkDuplicates complete successfully?
- I executed the commands again, locally (sam files were generated using minimap2):
samtools sort normal/sam/case6.normal.1.sam -o case6.normal.1.bam
samtools sort normal/sam/case6.normal.2.sam -o case6.normal.2.bam
samtools index case6.normal.1.bam case6.normal.1.bai
samtools index case6.normal.2.bam case6.normal.2.bai
gatk MarkDuplicates --INPUT ./case6.normal.1.bam --INPUT ./case6.normal.2.bam --OUTPUT mark-duplicates.bam --METRICS_FILE mark-duplicates-metrics.txt
gatk MarkDuplicatesSpark --input ./case6.normal.1.bam --input ./case6.normal.2.bam --output mark-duplicates.bam --metrics-file mark-duplicates-metrics.txt --spark-master local[4]
Once again, MarkDuplicates completed successfully with output files generated. MarkDuplicatesSpark still failed, but this time produced a different error message:
A USER ERROR has occurred: Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs to be queryname sorted or querygroup-sorted for multi-input processing but input ./case6.normal.1.bam was sorted in coordinate order
This is strange - the input files for samtools are the same ones and the samtools sort & index commands are identical to those executed previously (same software versions for all tools).
So now I'm seeing two problems: the different results of MarkDuplicates and MarkDuplicatesSpark, and the different errors when executing MarkDuplicatesSpark on what is supposed to be the same input.
If you'd like me to run the toolchain from the beginning with specific inputs (fastq, reference file) and with specific tools, I will do this gladly, just let me know.
-
Look, MarkDuplicatesSpark can work with both coordinate-sorted and queryname-sorted BAMs unless there are multiple inputs, which is the case here:
--input ./case6.normal.1.bam --input ./case6.normal.2.bam
so the exception MarkDuplicatesSpark throws (about the queryname-sorting requirement) is expected, since multiple inputs require queryname-sorted input files.
If I were you, I would take case6.normal.1.sam and case6.normal.2.sam, sort them by queryname, index them, and then pass them to MarkDuplicates and MarkDuplicatesSpark (basically, exactly what you did previously except for the sorting order). Please report here what happens once you have completed these steps.
By the way, when running both commands
gatk MarkDuplicates --INPUT ./case6.normal.1.bam --INPUT ./case6.normal.2.bam --OUTPUT mark-duplicates.bam --METRICS_FILE mark-duplicates-metrics.txt
gatk MarkDuplicatesSpark --input ./case6.normal.1.bam --input ./case6.normal.2.bam --output mark-duplicates.bam --metrics-file mark-duplicates-metrics.txt --spark-master local[4]
I do hope you change the output filename and the metrics filename (or the directory where the output is stored) so that you can compare the outputs further. Otherwise, the latter command overwrites the output of the former.
-
Thanks @danilovkiri
I followed your instructions and made a few additional tests. Here are the results I got:
I started off by sorting the sam files by query names:
samtools sort -n normal/sam/case6.normal.1.sam -o case6.normal.1.bam
Then attempted to index the sorted BAM file:
samtools index case6.normal.1.bam case6.normal.1.bai
and received an error:
[E::hts_idx_push] NO_COOR reads not in a single block at the end 4 -1
samtools index: failed to create index for "case6.normal.1.bam": No such file or directory
My understanding is that samtools index fails if the reads with no coordinates are NOT grouped together at the end, so it will fail whenever the sorting is not coordinate-based.
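In other words (the .coord/.qname filenames here are just for illustration):
samtools sort normal/sam/case6.normal.1.sam -o case6.normal.1.coord.bam  # coordinate order: can be indexed
samtools index case6.normal.1.coord.bam
samtools sort -n normal/sam/case6.normal.1.sam -o case6.normal.1.qname.bam  # queryname order
# indexing case6.normal.1.qname.bam fails with the NO_COOR error above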
I also tried:
1. using gatk SortSam/SortSamSpark and gatk BuildBamIndex instead of samtools, and got (from BuildBamIndex):
Error: htsjdk.samtools.SAMException: Input BAM file must be sorted by coordinate
2. running gatk MarkDuplicates/MarkDuplicatesSpark on QNAME-sorted BAM files without creating indexes, and got:
Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///scratch/mark-duplicates.bam. Sort order is queryname. Offending records are at ...
and:
A USER ERROR has occurred: Failed to load reads from ./case6.normal.1.BAM Caused by:Unsupported class file major version 55
respectively.
It seems like I'm stuck, and probably not following good practice.
So I went back to tutorials #6483/6484, which provide a completely different way of getting a merged BAM file with marked duplicates. Unfortunately, the tutorial also results in errors from MergeBamAlignment (with the reference inputs provided in the tutorial), but this is a different issue.
Just to re-focus on my original issue:
- I'm developing a production-grade genomic pipeline built upon AWS infrastructure
- For the sake of discussion, I'm implementing a tumor/normal variation calling pipeline
- I would like to use GATK 4.1.7.0 as the main library in my pipeline; I'm currently using the gatk script rather than running it manually via the java command
- Due to the way AWS executes jobs, I must have a way of specifying the CPUs and/or threads and/or memory available to the various GATK commands (a sketch of the memory side follows this list)
- As detailed above, using the Spark "flavor" of the GATK commands results in failures.
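For the memory side I can already use the gatk wrapper's JVM options (filenames are placeholders); it is the CPU/thread side that I have no non-Spark equivalent for:
# Bound the JVM heap via the gatk wrapper script:
gatk --java-options "-Xmx8G" MarkDuplicates --INPUT in.bam --OUTPUT out.bam --METRICS_FILE metrics.txt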
How would you suggest I proceed?
Thank you again for your attention.
Ury
-
Sorry for the delay, Ury Alon.
First, my apologies for asking you to index a queryname-sorted BAM file; it is obviously impossible. I tried to reproduce your problem, and here is what I have understood.
I also forgot to mention that MarkDuplicates can be run via the `gatk` executable script (supplied in GATK 4+ releases) as well as via picard.jar, which is the native way (judging by the log output and tool mechanics, running it under the gatk executable is equivalent to using picard.jar).
I took a merged BAM file (uBAM + BWA-MEM-aligned BAM), produced via
java -jar -Xmx4G -Xms1G picard.jar MergeBamAlignment \
R=path_to_hg38.fa CREATE_INDEX=true ADD_MATE_CIGAR=true CLIP_OVERLAPPING_READS=true \
UNMAPPED_BAM=path_to_fastqtosam.bam \
ALIGNED_BAM=path_to_raw.bam \
O=path_to_mergebamalignment.bam \
TMP_DIR=path_to_TMP_DIR INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=BestMapq ATTRIBUTES_TO_RETAIN=XS
The default output sort order for MergeBamAlignment is coordinate. Then I created two BAM files from the merged BAM for chromosomes 21 and 22 (to make two input files). These files were sorted by coordinate via samtools (just in case, not strictly necessary) and by query name.
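The per-chromosome subsets can be extracted with samtools view, roughly like this (this assumes the merged BAM is coordinate-sorted and indexed; filenames are placeholders):
samtools index path_to_mergebamalignment.bam
samtools view -b path_to_mergebamalignment.bam chr21 -o example.chr21.bam
samtools view -b path_to_mergebamalignment.bam chr22 -o example.chr22.bam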
I ran MarkDuplicates under picard.jar using coordinate-sorted files via
java -jar -Xmx30G -Xms1G /home/wgs/Tools/Picard/picard.jar MarkDuplicates \
I=./example.chr21.coord_sorted.bam I=./example.chr22.coord_sorted.bam \
O=./example.chr21-22.coord_sorted.dupmarked.bam TMP_DIR=/mnt/disk2/TMP \
M=./coord_sorted.metrics.txt
and using query name-sorted input files via
java -jar -Xmx30G -Xms1G /home/wgs/Tools/Picard/picard.jar MarkDuplicates \
I=./example.chr21.name_sorted.bam I=./example.chr22.name_sorted.bam \
O=./example.chr21-22.name_sorted.dupmarked.bam ASSUME_SORT_ORDER=queryname \
TMP_DIR=/mnt/disk2/TMP M=./name_sorted.metrics.txt
Both runs completed successfully. Note that it is compulsory to specify ASSUME_SORT_ORDER=queryname if you are working with such data; otherwise, it will throw an exception when trying to create the output.
I also tried to run the same analysis under gatk4 executable (not Spark) for coordinate-sorted input files
/home/wgs/Tools/gatk-4.1.7.0/gatk --java-options "-Xmx30G -Xms1G" \
MarkDuplicates --INPUT ./example.chr21.coord_sorted.bam \
--INPUT ./example.chr22.coord_sorted.bam \
--OUTPUT ./gatk4_markduplicates/example.chr21-22.coord_sorted.dupmarked.bam \
--METRICS_FILE ./gatk4_markduplicates/coord_sorted.metrics.txt
and query name-sorted input files
/home/wgs/Tools/gatk-4.1.7.0/gatk --java-options "-Xmx30G -Xms1G" \
MarkDuplicates --INPUT ./example.chr21.name_sorted.bam \
--INPUT ./example.chr22.name_sorted.bam \
--OUTPUT ./gatk4_markduplicates/example.chr21-22.name_sorted.dupmarked.bam \
--METRICS_FILE ./gatk4_markduplicates/name_sorted.metrics.txt \
--ASSUME_SORT_ORDER queryname
Again, you have to specify the --ASSUME_SORT_ORDER queryname argument to process query name-sorted input files. Both runs completed successfully.
The first conclusion is that non-Spark MarkDuplicates can work with both coordinate-sorted and query name-sorted files regardless of the number of input files (Spark MarkDuplicates requires query name-sorted input files when using multiple inputs).
Then I tried to run Spark MarkDuplicates on the query name-sorted input files, and it threw an exception
Caused by: org.broadinstitute.hellbender.exceptions.GATKException:
Detected multiple mark duplicate records objects corresponding to read with name
'CL100121543L1C004R035_150796', this could be the result of the file sort order
being incorrect or that a previous tool has let readnames span multiple partitions
I looked up the above-mentioned readname in the input file set, and it contained the following entries:
CL100121543L1C004R035_150796 73 chr21 28544409 60 72M28S = 28544409 0 ACATCTATAGGGTACAATGTGATGTTTTATGTATGTGTGTGTATGTATACATGTGTATATATATATATATATATATATATATATATATATATATATTTTT DCBCED>EECDFDFEDD=FEE37DB8BDECFAF7DDEEF=F8DFF?DDFFCFFCFCFDFFFDFEFFFFFFFFFDFDFEFFEEEF5DDFFEFDFEFD?FFE SA:Z:chr3,134259959,+,50S50M,0,0; MD:Z:72 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:72 XS:i:50
CL100121543L1C004R035_150796 101 chr21 28544523 0 * = 28544523 0 ACATCTATAGGGTACAATGTGATGTTTTATGTATGTGTGTGTATGTATACATGTGTATATATATATATATATATATATATATATATATATATATATTTTT DCBCED>EECDFDFEDD=FEE37DB8BDECFAF7DDEEF=F8DFF?DDFFCFFCFCFDFFFDFEFFFFFFFFFDFDFEFFEEEF5DDFFEFDFEFD?FFE MC:Z:100M PG:Z:bwa RG:Z:barcode MQ:i:60
CL100121543L1C004R035_150796 133 chr21 28544409 0 * = 28544409 0 CAGCTACTCGGGAGGCTGAGGCAGGAGAATGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCTGAGATAGTGCCACTGCACTCCAGCATGGGCGACAG EDFECEDEEF@D7FEFFE@EEEEFDDFD>CFEDE<F@9EDAACD@DEEEC>EF?=FEEDDF6FCDFEF(DFFAFEEEBDFDE8FFFDECEEFCFEF=FEF MC:Z:72M28S PG:Z:bwa RG:Z:barcode MQ:i:60
CL100121543L1C004R035_150796 153 chr21 28544523 60 100M = 28544523 0 CTGTCGCCCATGCTGGAGTGCAGTGGCACTATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACACCATTCTCCTGCCTCAGCCTCCCGAGTAGCTG FEF=FEFCFEECEDFFF8EDFDBEEEFAFFD(FEFDCF6FDDEEF=?FE>CEEED@DCAADE9@F<EDEFC>DFDDFEEEE@EFFEF7D@FEEDECEFDE MD:Z:100 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:100 XS:i:76
The multiple unmapped entries are OK, since combining mapped and unmapped data is the very essence of MergeBamAlignment. Here comes my question to Bhanu Gandham: why does MarkDuplicatesSpark get angry about multiple mappings while the non-Spark version does not? By the way, non-Spark MarkDuplicates marks both mapped reads from the snippet above (with flags 73 and 153) as duplicates anyway (adds a PG:Z:MarkDuplicates field).
I then tried to remove unmapped reads from a MergeBamAlignment BAM and run MarkDuplicatesSpark in the same manner as above. Again, it threw the same exception, but this time it was complaining about another readname CL100121543L1C001R006_144243.
The input file set contained the following entries for that readname:
CL100121543L1C001R006_144243 73 chr21 18345741 51 31S69M = 18345741 0 CGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATCCACTGAATATCATGAATAAATACTCAAATGGTTAGCCTATTCAGA FFFFBFFFFEFFFFFFFF<FFFFGFFGFFFFFF@GFFFFFFFFFFFFFFFFFFFGFGFFFFFGFFFGFGFFGF<FFFFGFFGGAGFFFFGFFFGGGGEFF SA:Z:chr17,13680319,-,45S55M,0,0; MD:Z:69 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:69 XS:i:52
CL100121543L1C001R006_144243 153 chr21 18345791 60 100M = 18345791 0 AATGGTTAGCCTATTCAGAAAAATTTCTAATTAGCTACAATTATATAAACACACATCATGCTCCTATTTATGTTTGGCCATGTTCACAGTCAATAGTTAT ;FFFFFFFGFFFFFFFFFFFFFFFFFFGFFFFGFFFFFFFFFFGFEFFFFFFFFFFFFGEFFFFF@FFGFFF?F@GFFFFFFFFFFGFFFFF>FFFFFFF MD:Z:100 PG:Z:bwa RG:Z:barcode NM:i:0 UQ:i:0 AS:i:100 XS:i:0
Judging by the SAM flag values, both 73 and 153 mean that the corresponding read is mapped, paired, and its mate is unmapped. So we can see two different alignments here, which cause the exception to be thrown.
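The flags can be decomposed bit by bit (see https://broadinstitute.github.io/picard/explain-flags.html) or decoded directly with samtools:
# 73 = 1 (paired) + 8 (mate unmapped) + 64 (first in pair)
# 153 = 1 (paired) + 8 (mate unmapped) + 16 (reverse strand) + 128 (second in pair)
samtools flags 73
samtools flags 153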
Ury Alon, try to run MarkDuplicatesSpark with query name-sorted input files and do not try to create indices (remove indices if they have the same basename as the query name-sorted input files). You might not face my problem here, depending on your preprocessing steps.
-
Hi @danilovkiri
Thanks for getting back to me and performing all these tests. First, I confirm that in my experience non-Spark MarkDuplicates works without any issues.
I also got the same errors as you did from MarkDuplicatesSpark. Now for running MarkDuplicatesSpark without indexing: as I mentioned in my previous post, I already tried that, and received the following error:
A USER ERROR has occurred: Failed to load reads from ./case6.normal.1.BAM Caused by:Unsupported class file major version 55
For the sake of clarity, these are the commands I executed:
READ1=normal/fastq/case6.normal.1.READ1.fastq
READ2=normal/fastq/case6.normal.1.READ2.fastq
bwa mem -M hg38.fa ${READ1} ${READ2} > ${SAMPLE_ID}.${LANE_NUMBER}.sam
samtools sort -n ${SAMPLE_ID}.${LANE_NUMBER}.sam > ${SAMPLE_ID}.${LANE_NUMBER}.bam
READ1=normal/fastq/case6.normal.2.READ1.fastq
READ2=normal/fastq/case6.normal.2.READ2.fastq
bwa mem -M hg38.fa ${READ1} ${READ2} > ${SAMPLE_ID}.${LANE_NUMBER}.sam
samtools sort -n ${SAMPLE_ID}.${LANE_NUMBER}.sam > ${SAMPLE_ID}.${LANE_NUMBER}.bam
gatk MarkDuplicatesSpark --input ${SAMPLE_ID}.1.bam --input ${SAMPLE_ID}.2.bam --output ${SAMPLE_ID}.mark-duplicates.bam --metrics-file ${SAMPLE_ID}.mark-duplicates-metrics.txt --spark-master local[4]
If, by any chance, you think the problem is with the FASTQ files I use, please let me know which FASTQ files to use (such as the 6484_snippet* files found in the tutorial), but please keep in mind that the above issue occurs for multi-lane duplicate marking, so I'll probably need precise instructions for having multiple FASTQ file pairs.
Thanks,
Ury
-
Hi Ury Alon
Sorry for the delay. I just googled the exception
Caused by:Unsupported class file major version 55
and found that it might be related to the Java version you are using. Please have a look at this: https://stackoverflow.com/questions/53583199/pyspark-error-unsupported-class-file-major-version and report back.
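For context: class file major version 55 corresponds to Java 11, while GATK 4.1.x expects a Java 8 runtime, so checking which JVM your environment actually uses is a good first step:
# Major version 52 = Java 8, 55 = Java 11
java -version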
-
Hi @danilovkiri,
Thanks for finding the cause for this error.
I've been taking my time setting up a fully reproducible environment for running the discussed pipeline over test data, based on the supported library versions (Java, Python, etc.).
I think I have the Docker image and the pipeline commands in place, but I can't find a multi-lane FASTQ sample (preferably a small one) to work with. The ones I have fail, and I think the problem is with the data:
htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGHWI-ST1353:302:C3GEKACXX:7:1101:18641:5023
Can you please refer me to a 2/4/x-lane FASTQ file set that can be used for testing?
-
Hi.
Unfortunately, I cannot provide you with any file sets.
When did you get the above-mentioned error? Which tool did you run? This error was not referenced here before.
Googling this error results in the following:
The input BAM file contains more than one read with the same name (look here). You can fix that by running samtools view -f 0x2 (or -f 2) and passing the filtered BAM output to MarkDuplicates. Look at https://broadinstitute.github.io/picard/explain-flags.html for flag designations.
Look for the read name from the exception in both of your panel-specific BAM files. It must not be present in both of them; otherwise, it is another strange problem.
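A sketch of the filtering step mentioned above (the filename is a placeholder):
# Keep only reads mapped in a proper pair (flag 0x2), then pass the result to MarkDuplicates:
samtools view -b -f 0x2 input.bam -o input.filtered.bam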
You have mentioned an issue with MergeBamAlignment. Could you please elaborate?
"So I went back to tutorials #6483/6484, which provides a completely different way for getting a merged BAM file with marked duplicates. Unfortunately, the tutorial also results with errors from MergeBamAlignment (with the reference inputs provided in the tutorial), but this is a different issue."
-
I ended up taking "real" 4-lane FASTQ files (huge ones), truncating them and hoping for the best.
It turns out the problem before was in the data, as I suspected, and this time I succeeded in going from FASTQ to MarkDuplicates.
Then I moved to MarkDuplicatesSpark, and it worked too! SUCCESS!!
Finally, this is the script I used:
#!/bin/bash
set -x
set -e
SAMPLE_ID=P01
for l in {1..4}
do
READ1=L${l}-P01-READ1.fastq.gz
READ2=L${l}-P01-READ2.fastq.gz
LANE_NUMBER=${l}
gatk --spark-runner LOCAL FastqToSam --FASTQ ${READ1} --FASTQ2 ${READ2} --OUTPUT ${LANE_NUMBER}.fastqtosam.bam --READ_GROUP_NAME ${SAMPLE_ID}${LANE_NUMBER} --SAMPLE_NAME ${SAMPLE_ID} --LIBRARY_NAME lib${SAMPLE_ID} --SORT_ORDER queryname
gatk --spark-runner LOCAL MarkIlluminaAdapters --INPUT ${LANE_NUMBER}.fastqtosam.bam --OUTPUT ${LANE_NUMBER}.markilluminaadapters.bam --METRICS ${LANE_NUMBER}.markilluminaadapters_metrics.txt
gatk --spark-runner LOCAL SamToFastq --INPUT ${LANE_NUMBER}.markilluminaadapters.bam --FASTQ ${LANE_NUMBER}.interleaved.fastq --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 --INTERLEAVE true --NON_PF true
bwa mem -M -t 16 -p hg38.fa ${LANE_NUMBER}.interleaved.fastq > ${LANE_NUMBER}.sam
samtools sort -@ 4 -n ${LANE_NUMBER}.sam -o ${LANE_NUMBER}.querysorted.bam
gatk MergeBamAlignment --REFERENCE_SEQUENCE hg38.fa --UNMAPPED_BAM ${LANE_NUMBER}.fastqtosam.bam --ALIGNED_BAM ${LANE_NUMBER}.querysorted.bam --OUTPUT ${LANE_NUMBER}.mergebamalignment.bam --CREATE_INDEX false --ADD_MATE_CIGAR true --CLIP_ADAPTERS false --CLIP_OVERLAPPING_READS true --INCLUDE_SECONDARY_ALIGNMENTS true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY MostDistant --ATTRIBUTES_TO_RETAIN XS
samtools sort -@ 4 -n ${LANE_NUMBER}.mergebamalignment.bam -o ${LANE_NUMBER}.bam
done
gatk MarkDuplicatesSpark --input 1.bam --input 2.bam --input 3.bam --input 4.bam --output ${SAMPLE_ID}.mark-duplicates.bam --metrics-file ${SAMPLE_ID}.mark-duplicates-metrics.txt --spark-master local[4]
I'm not sure that the first sorting by queryname is required - will have to check it.
BTW, it also worked without sorting the BAM file after MergeBamAlignment when I specified the following option:
--allow-multiple-sort-orders-in-input true
but I'm not sure it produces the desired output.
Regarding MergeBamAlignment: once I got the commands and their parameters in order, there were no more errors. I can't even remember what the problem was back then.
-
Hello Ury Alon,
You said:
I'm not sure that the first sorting by queryname is required - will have to check it.
I checked and I can confirm that the first sorting is not required. I left the complete BASH script here: https://gatk.broadinstitute.org/hc/en-us/articles/360039568932/comments/360010691471
-
@Amin, thanks for verifying this.