Hi: I'm analyzing data from a number of samples (different individuals). Some individuals were sequenced twice from different library preps, to increase coverage.
I've followed the documentation here for processing these samples. Specifically, I mapped and sorted each library separately, and assigned read group information. Then I marked duplicates while merging bams from all libraries belonging to the same individual. Example here:
java -Xmx16g -jar /programs/picard-tools-2.8.2/picard.jar MarkDuplicates \
INPUT= SRR1607505.sorted.bam I= SRR1607504.sorted.bam \
OUTPUT= PARV1.sorted.marked.bam \
That produced a single bam file, on which I ran indel realignment (following an older pipeline). From there I attempted to call variants using HaplotypeCaller:
java -Xmx20g -jar /programs/bin/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller \
-R GCA_902806625.1.fa \
-I lamich_PARV1.sorted.marked.realigned.fixemate.bam \
-nct 10 \
--emitRefConfidence GVCF \
-o lamich_PARV1.g.vcf >&log_lamich_PARV1
But I get an error saying that HaplotypeCaller in GVCF mode can only be used on one sample at a time:
ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently. Use the sample_name argument to run on a single sample out of a multi-sample BAM file.
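Since the message refers to a multi-sample BAM file, I suspect the two libraries ended up with different SM values in their @RG header lines, although I have not confirmed this. Below is a minimal sketch of how the header could be checked. The function name distinct_samples is my own, and it assumes the header text is piped in, for example from samtools view -H (assuming samtools is installed).

```shell
# List the distinct SM (sample) values found in the @RG lines of a SAM/BAM header.
# Reads header text on stdin; distinct_samples is a hypothetical helper name.
distinct_samples() {
  grep '^@RG' \
    | tr '\t' '\n' \
    | sed -n 's/^SM://p' \
    | sort -u
}

# Example use (assumes samtools is available on the PATH):
#   samtools view -H PARV1.sorted.marked.bam | distinct_samples
```

If this prints more than one name, the merged file is being treated as multi-sample, which would be consistent with the error above.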
At what stage should I merge libraries from the same individual and what is the best way to do it? Is there an argument I am missing in MarkDuplicates?
I also tried running HaplotypeCaller on each library separately and then using MergeVcfs to merge the resulting .g.vcf.gz files; however, I got an error that my sample entries did not match:
Exception in thread "main" java.lang.IllegalArgumentException: Input file /home/lc736_0001/sm983/ref_genome/SRR1607505.g.vcf.gz has sample entries that don't match the other files.
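As I understand it, MergeVcfs expects every input file to carry the same sample names, so the per-library files may simply be labeled with different samples. Here is a rough sketch for listing the sample columns of a VCF header. The function name vcf_samples is my own, and it assumes uncompressed VCF text is piped in, for example from gzip -dc.

```shell
# Print the sample column names from a VCF header (columns 10 and up of the #CHROM line).
# Reads uncompressed VCF text on stdin; vcf_samples is a hypothetical helper name.
vcf_samples() {
  awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Example use:
#   gzip -dc SRR1607505.g.vcf.gz | vcf_samples
```

Running this on each .g.vcf.gz and comparing the output should show whether the sample names differ between the files.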
In addition, the documentation states that I should now run base recalibration (BQSR) instead of indel realignment. However, when I try to run that tool I get an error that I need to provide a VCF file containing known sites of genetic variation. I don't understand how to input such a file when my goal is to identify sites of genetic variation through this pipeline.
java -Xmx16g -jar /programs/bin/GATK/GenomeAnalysisTK.jar -T BaseRecalibrator \
-R GCA_902806625.1.fa \
-I lamich_PARV2.sorted.marked.bam \
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to mask out known variant sites. Please provide a VCF file containing known sites of genetic variation.
I'm using GATK v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
Thanks for any advice