GetPileupSummaries- Badly formed genome unclippedLoc error - Contig 1 given as location, but this contig isn't present in the Fasta sequence dictionary
I'm on the second step of an SNV pipeline, where I want to calculate contamination, but GetPileupSummaries gives the error below.
REQUIRED for all errors and issues:
a) GATK version used: 4.2.5.0
b) Exact command used: java -jar gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar GetPileupSummaries -I /P4-Tumor.recal.bam -V /home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz -L /home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz -O /home/lab4/P4_Tumor-pileups.table
c) Entire program log:
lab4@lab4-Vostro-3800:~$ java -jar gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar GetPileupSummaries -I /P4-TD.recal.bam -V /home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz -L /home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz -O /home/lab4/P4-pileups.table
14:22:55.591 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lab4/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 12, 2022 2:22:55 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:22:55.719 INFO GetPileupSummaries - ------------------------------------------------------------
14:22:55.719 INFO GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.2.5.0
14:22:55.719 INFO GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
14:22:55.719 INFO GetPileupSummaries - Executing as lab4@lab4-Vostro-3800 on Linux v5.15.0-50-generic amd64
14:22:55.719 INFO GetPileupSummaries - Java runtime: OpenJDK 64-Bit Server VM v11.0.16+8-post-Ubuntu-0ubuntu122.04
14:22:55.719 INFO GetPileupSummaries - Start Date/Time: 12 October 2022 at 2:22:55 PM IST
14:22:55.720 INFO GetPileupSummaries - ------------------------------------------------------------
14:22:55.720 INFO GetPileupSummaries - ------------------------------------------------------------
14:22:55.720 INFO GetPileupSummaries - HTSJDK Version: 2.24.1
14:22:55.720 INFO GetPileupSummaries - Picard Version: 2.25.4
14:22:55.720 INFO GetPileupSummaries - Built for Spark Version: 2.4.5
14:22:55.720 INFO GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:22:55.721 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:22:55.721 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:22:55.721 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:22:55.721 INFO GetPileupSummaries - Deflater: IntelDeflater
14:22:55.721 INFO GetPileupSummaries - Inflater: IntelInflater
14:22:55.721 INFO GetPileupSummaries - GCS max retries/reopens: 20
14:22:55.721 INFO GetPileupSummaries - Requester pays: disabled
14:22:55.721 INFO GetPileupSummaries - Initializing engine
14:22:56.047 INFO FeatureManager - Using codec VCFCodec to read file file:///home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz
14:22:56.147 INFO FeatureManager - Using codec VCFCodec to read file file:///home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz
14:22:56.236 INFO GetPileupSummaries - Shutting down engine
[12 October 2022 at 2:22:56 PM IST] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=166723584
***********************************************************************
A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig 1 given as location, but this contig isn't present in the Fasta sequence dictionary
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
lab4@lab4-Vostro-3800:~$
Please let me know what I should do.
Thanks.
-
Hi Tanay Biswas,
Thank you for writing to the GATK forum! I hope that we can help you sort this out.
I brought your issue to our developers and received some feedback and next steps for you. In your input command, you are passing -L /home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz. We searched for gnomad.exomes.r2.1.1.sites.vcf.bgz and found it here, where the summary given is:
The gnomAD v2.1.1 data set contains data from 125,748 exomes and 15,708 whole genomes, all mapped to the GRCh37/hg19 reference sequence.
The contigs in that file are named 1, 2, 3, etc., so GetPileupSummaries is complaining because contig 1 is present in the VCF but not in the sequence dictionary of the reference the BAM was aligned to (maybe GRCh38). If that is the case, you should try downloading your sites VCF from the GRCh38 liftover, or you could try using gnomAD v3.
I hope this helps! Please let me know if this leads you to success. If you have any questions in the meantime, please do not hesitate to reach out.
Best,
Anthony
-
Hi Anthony,
The BAM files were generated by aligning to hg19 (all chromosomes), which is why I used the gnomAD hg19 VCF file. I don't understand why this error occurs. Please let me know how to deal with it.
Thank you.
-
Hi Tanay Biswas,
Thank you for that information! Could you please try running the following command and providing us with the output?
samtools view -H [BAMFILE]
This output will give us information about the contigs present and allow us to figure out what’s going on.
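As a rough illustration of the check being asked for here, the Python sketch below pulls the contig names out of header text and lists the VCF contigs that the BAM does not know about. The function names and the shortened example header are made up for illustration; this is not part of GATK or samtools.

```python
def contig_names(header_text):
    """Return the SN: values from the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD, @RG, @PG and anything else
        for field in line.split()[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


def missing_from_bam(vcf_contigs, bam_contigs):
    """Contigs used by the VCF that the BAM sequence dictionary lacks."""
    known = set(bam_contigs)
    return [c for c in vcf_contigs if c not in known]


# Hand-shortened, illustrative header text (real headers are tab-separated).
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr2\tLN:243199373\n"
)

# VCF contigs named "1" and "2" are both absent from a "chr"-style BAM.
print(missing_from_bam(["1", "2"], contig_names(example_header)))
# prints ['1', '2']
```

Any contig reported as missing is one that GetPileupSummaries would reject with the "isn't present in the Fasta sequence dictionary" message.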
I look forward to hearing back from you!
Best,
Anthony
-
Hi Anthony,
Please see the below output for samtools view:
lab4@lab4-Vostro-3800:~$ samtools view -H /home/lab4/Seq_Data/WES/IITK-P4-TD/IITK-P4-TD.recal.bam
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr4_ctg9_hap1 LN:590426
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr6_apd_hap1 LN:4622290
@SQ SN:chr6_cox_hap2 LN:4795371
@SQ SN:chr6_dbb_hap3 LN:4610396
@SQ SN:chr6_mann_hap4 LN:4683263
@SQ SN:chr6_mcf_hap5 LN:4833398
@SQ SN:chr6_qbl_hap6 LN:4611984
@SQ SN:chr6_ssto_hap7 LN:4928567
@SQ SN:chr7_gl000195_random LN:182896
@SQ SN:chr8_gl000196_random LN:38914
@SQ SN:chr8_gl000197_random LN:37175
@SQ SN:chr9_gl000198_random LN:90085
@SQ SN:chr9_gl000199_random LN:169874
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chr9_gl000201_random LN:36148
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr17_ctg5_hap1 LN:1680828
@SQ SN:chr17_gl000203_random LN:37498
@SQ SN:chr17_gl000204_random LN:81310
@SQ SN:chr17_gl000205_random LN:174588
@SQ SN:chr17_gl000206_random LN:41001
@SQ SN:chr18_gl000207_random LN:4262
@SQ SN:chr19_gl000208_random LN:92689
@SQ SN:chr19_gl000209_random LN:159169
@SQ SN:chr21_gl000210_random LN:27682
@SQ SN:chrUn_gl000211 LN:166566
@SQ SN:chrUn_gl000212 LN:186858
@SQ SN:chrUn_gl000213 LN:164239
@SQ SN:chrUn_gl000214 LN:137718
@SQ SN:chrUn_gl000215 LN:172545
@SQ SN:chrUn_gl000216 LN:172294
@SQ SN:chrUn_gl000217 LN:172149
@SQ SN:chrUn_gl000218 LN:161147
@SQ SN:chrUn_gl000219 LN:179198
@SQ SN:chrUn_gl000220 LN:161802
@SQ SN:chrUn_gl000221 LN:155397
@SQ SN:chrUn_gl000222 LN:186861
@SQ SN:chrUn_gl000223 LN:180455
@SQ SN:chrUn_gl000224 LN:179693
@SQ SN:chrUn_gl000225 LN:211173
@SQ SN:chrUn_gl000226 LN:15008
@SQ SN:chrUn_gl000227 LN:128374
@SQ SN:chrUn_gl000228 LN:129120
@SQ SN:chrUn_gl000229 LN:19913
@SQ SN:chrUn_gl000230 LN:43691
@SQ SN:chrUn_gl000231 LN:27386
@SQ SN:chrUn_gl000232 LN:40652
@SQ SN:chrUn_gl000233 LN:45941
@SQ SN:chrUn_gl000234 LN:40531
@SQ SN:chrUn_gl000235 LN:34474
@SQ SN:chrUn_gl000236 LN:41934
@SQ SN:chrUn_gl000237 LN:45867
@SQ SN:chrUn_gl000238 LN:39939
@SQ SN:chrUn_gl000239 LN:33824
@SQ SN:chrUn_gl000240 LN:41933
@SQ SN:chrUn_gl000241 LN:42152
@SQ SN:chrUn_gl000242 LN:43523
@SQ SN:chrUn_gl000243 LN:43341
@SQ SN:chrUn_gl000244 LN:39929
@SQ SN:chrUn_gl000245 LN:36651
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@RG ID:A00718 PL:Illumina LB:SS6 SM:IITK-P4-TD
@PG ID:GATK IndelRealigner VN:2015.1-3.4.0-1-ga5ca3fc CL:knownAlleles=[(RodBinding name=knownAlleles source=/garnet/Tools/WES_Analysis/GATK_Analysis/gatk_bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf), (RodBinding name=knownAlleles2 source=/garnet/Tools/WES_Analysis/GATK_Analysis/gatk_bundle/2.8/hg19/1000G_phase1.indels.hg19.sites.vcf)] targetIntervals=/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/VARIANT/tmp/chr1.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:MarkDuplicates PN:MarkDuplicates VN:1.130(8b3e8abe25f920f5aa569db482bb999f29cc447b_1427207353) CL:picard.sam.markduplicates.MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=100000 INPUT=[/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/ALIGN/IITK-P4-TD.sorted.bam] OUTPUT=/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/ALIGN/IITK-P4-TD.remdup.bam METRICS_FILE=picard_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=[/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/ALIGN/tmp/IITK-P4-TD.picard] VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=40000000 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false CREATE_INDEX=false CREATE_MD5_FILE=false
@PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:/garnet/Tools/WES_Analysis/GATK_Analysis/bwa-0.7.12/bwa mem -t 52 -M -R @RG\tPL:Illumina\tID:A00718\tSM:IITK-P4-TD\tLB:SS6 /garnet/Tools/WES_Analysis/GATK_Analysis/reference/ucsc.hg19.fasta /garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/FASTQ/IITK-P4-TD_1.fastq /garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/FASTQ/IITK-P4-TD_2.fastq
@PG ID:GATK PrintReads VN:2015.1-3.4.0-1-ga5ca3fc CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools PN:samtools PP:GATK PrintReads VN:1.13 CL:samtools view -H /home/lab4/Seq_Data/WES/IITK-P4-TD/IITK-P4-TD.recal.bam
lab4@lab4-Vostro-3800:~$
Thank you.
Regards,
Tanay
-
Hi Tanay Biswas,
Thank you for including this log!
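Your samtools view output shows that the BAM was aligned to UCSC hg19 (the @PG line for bwa references ucsc.hg19.fasta), whose contigs use the chr1, chr2, … naming convention, whereas the gnomAD v2.1.1 VCF names them 1, 2, …. That naming mismatch is what triggers the error. In this case, you can try renaming the contigs in the VCF to match the BAM using bcftools annotate --rename-chrs.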
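bcftools annotate --rename-chrs reads a map file with one whitespace-separated "old_name new_name" pair per line. A minimal Python sketch for producing such a map, assuming only the autosomes and X/Y need a "chr" prefix, could look like the following. The function name is hypothetical, and the mitochondrial contig is deliberately left out because UCSC hg19 chrM (16,571 bp in the header above) is not the same sequence as GRCh37 MT (16,569 bp).

```python
def rename_map_lines():
    """One 'old new' pair per line: 1..22, X and Y gain a 'chr' prefix."""
    old_names = [str(n) for n in range(1, 23)] + ["X", "Y"]
    # MT is intentionally not mapped to chrM: the two references use
    # different mitochondrial sequences, so renaming would be misleading.
    return [f"{old} chr{old}" for old in old_names]


# Print the map; redirect the output to a file to use it with bcftools.
print("\n".join(rename_map_lines()))
```

Saved to a file (say rename_chrs.txt, a name chosen here only for illustration), the map would be passed as bcftools annotate --rename-chrs rename_chrs.txt on the gnomAD VCF. The renamed VCF would then need to be re-indexed before GATK can read it.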
I hope this helps! Please let me know if this leads you to success. If you have any further questions, please let me know.
Best,
Anthony
-
Hi Tanay Biswas,
We haven't heard from you in a while so we're going to close out this ticket. If you still require assistance, simply respond to this email and we'll be happy to pick up where we left off!
Kind regards,
Anthony
-
In my case, I fixed this error by making the -L and -V parameters the same. Can you clarify what changes when this happens?
Thanks!
-
Hi Manuel Sérgio Sokolov Ravasqueira,
This is discussed in the tool documentation, https://gatk.broadinstitute.org/hc/en-us/articles/13832749845403-GetPileupSummaries
Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. For example,
gatk GetPileupSummaries \
    -I normal.bam \
    -V gnomad.vcf.gz \
    -L common_snps.interval_list \
    -O pileups.table
attempts to get pileups at a list of common SNPs and emits output for those sites that are present in gnomAD, using the allele frequencies from gnomAD. Note that the sites may be a subset of the variants, the variants may be a subset of the sites, or they may overlap partially. In all cases pileup summaries are emitted for the overlap and nowhere else. The most common use case in which sites and variants differ is when the variants resource is a large file and the sites resource is an interval list subset from that file.
So, having the -L and -V parameters be the same just tells the tool to output pileup summaries for all sites in the VCF.
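The overlap rule can be pictured as a plain set intersection. The Python sketch below is only a toy model with made-up coordinates and a hypothetical function name. It ignores everything else the real tool does, such as allele-frequency filtering and the actual read counting.

```python
def emitted_sites(sites, variants):
    """Positions present in both the -L sites and the -V variants."""
    return sorted(set(sites) & set(variants))


# Made-up (contig, position) pairs for illustration only.
sites = [("chr1", 100), ("chr1", 200), ("chr2", 50)]
variants = [("chr1", 200), ("chr2", 50), ("chr2", 75)]

# Partial overlap: only the shared positions survive.
print(emitted_sites(sites, variants))
# prints [('chr1', 200), ('chr2', 50)]

# Same resource for -L and -V: every variant position is kept.
print(emitted_sites(variants, variants) == sorted(variants))
# prints True
```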
Regards,
David