Multi-library designs: error proceeding from pre-processing to variant calling
Hi: I'm analyzing data from a number of samples (different individuals). Some individuals were sequenced twice from different library preps, to increase coverage.
I've followed the documentation here for processing these samples. Specifically, I mapped and sorted each library separately, and assigned read group information. Then I marked duplicates while merging the BAMs from all libraries belonging to the same individual. Example here:
java -Xmx16g -jar /programs/picard-tools-2.8.2/picard.jar MarkDuplicates \
INPUT= SRR1607505.sorted.bam I= SRR1607504.sorted.bam \
OUTPUT= PARV1.sorted.marked.bam \
METRICS_FILE=PARV1.duplicate.metrics.txt \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
That produced a single BAM file, on which I ran indel realignment (following an older pipeline). From there I attempted to call variants using HaplotypeCaller:
java -Xmx20g -jar /programs/bin/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller \
-R GCA_902806625.1.fa \
-I lamich_PARV1.sorted.marked.realigned.fixemate.bam \
-nct 10 \
--emitRefConfidence GVCF \
-o lamich_PARV1.g.vcf >&log_lamich_PARV1
But I get an error that HaplotypeCaller can only be used on one sample at a time.
ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently. Use the sample_name argument to run on a single sample out of a multi-sample BAM file.
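A note on what this error usually points to: GATK takes sample identity from the SM field of each @RG line in the BAM header, so a merged BAM whose read groups carry different SM values is treated as multi-sample. The header of the merged file isn't shown in this post, so this is only a likely cause, not a confirmed one. The sketch below is one way to check it. `list_rg_samples` is a hypothetical helper name (not part of samtools or GATK), and the usage line assumes samtools is installed.

```shell
# Sketch only: list_rg_samples is a hypothetical helper, not a samtools/GATK command.
# Reads SAM header text on stdin and prints each distinct SM (sample) value
# found on @RG lines, one per line.
list_rg_samples() {
  awk -F'\t' '
    /^@RG/ {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}

# Example usage (assumes samtools is on the PATH):
#   samtools view -H PARV1.sorted.marked.bam | list_rg_samples
```

If this prints more than one name for a BAM that should hold a single individual, the read groups were probably assigned per-library sample names. Libraries from the same individual would normally share one SM value and differ in LB (and ID).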
At what stage should I merge libraries from the same individual and what is the best way to do it? Is there an argument I am missing in MarkDuplicates?
I also tried running HaplotypeCaller on the individual libraries and then merging the resulting .g.vcf.gz files with MergeVcfs. However, I got an error that my sample entries did not match.
Exception in thread "main" java.lang.IllegalArgumentException: Input file /home/lc736_0001/sm983/ref_genome/SRR1607505.g.vcf.gz has sample entries that don't match the other files.
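For context on this second error: MergeVcfs expects every input file to contain the same set of samples (it is meant for joining shards of one callset, such as per-chromosome files), so per-library gVCFs with different sample names are rejected. A quick way to see which sample names each file declares is to read the #CHROM header line. The sketch below assumes gzip is available; `vcf_samples` is a hypothetical helper name.

```shell
# Sketch only: vcf_samples is a hypothetical helper, not a Picard/GATK command.
# Reads uncompressed VCF text on stdin and prints the sample column names
# (fields 10 onward of the #CHROM header line), one per line.
vcf_samples() {
  awk -F'\t' '
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) print $i
      exit
    }'
}

# Example usage (assumes gzip is installed; repeat for each per-library file):
#   gzip -dc SRR1607505.g.vcf.gz | vcf_samples
```

If the per-library files report different names, that matches the "sample entries that don't match" message above.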
In addition, the documentation states that I should now run base recalibration (BQSR) instead of indel realignment. However, when I try to run that tool I get an error that I need to provide a VCF file containing known sites of genetic variation. I don't understand how to input such a file when my goal is to identify sites of genetic variation through this pipeline.
java -Xmx16g -jar /programs/bin/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator \
-R GCA_902806625.1.fa \
-I lamich_PARV2.sorted.marked.bam \
-o lamich_PARV2.merged.dedup.recal.bam
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to mask out known variant sites. Please provide a VCF file containing known sites of genetic variation.
I'm using GATK v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
Thanks for any advice
-
Hi Sabrina McNew! Our GATK support team only provides support for GATK4 at this point. We really encourage you to upgrade to GATK4 because many bugs have been solved since GATK3.
If you need to use GATK3, you are welcome to keep this post on our discussion page here for feedback from other users. You can also check out our legacy support site for GATK3 advice: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/