Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GetPileupSummaries - Badly formed genome unclippedLoc error - Contig 1 given as location, but this contig isn't present in the Fasta sequence dictionary

8 comments

  • Anthony DiCi

    Hi Tanay Biswas,

    Thank you for writing to the GATK forum! I hope that we can help you sort this out.

    I brought your issue to our developers and received some feedback and next steps for you. In your command, you are passing -L /home/lab4/Downloads/gnomad.exomes.r2.1.1.sites.vcf.bgz. We searched for gnomad.exomes.r2.1.1.sites.vcf.bgz and found it here, where the summary given is:

    The gnomAD v2.1.1 data set contains data from 125,748 exomes and 15,708 whole genomes, all mapped to the GRCh37/hg19 reference sequence.

    The contigs in this gnomAD VCF are named 1, 2, 3, etc., so GetPileupSummaries is complaining because contig 1 is present in the VCF but not in the sequence dictionary of the reference that the BAM was aligned to (maybe GRCh38). If that is the case, you should try downloading your sites VCF from the GRCh38 liftover, or you could try using gnomAD v3.

    I hope this helps! Please let me know if this leads you to success. If you have any questions in the meantime, please do not hesitate to reach out.

    Best,
    Anthony
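
The mismatch described above can be checked directly by comparing the two sets of contig names. The following is a minimal sketch, not part of the original reply: the function name contigs_missing_from and the file names are hypothetical, and it assumes the names have already been written one per line (for example with tabix -l for an indexed VCF and from the @SQ lines of samtools view -H for the BAM).

```shell
#!/bin/sh
# Print every contig name in the first list that is absent from the second.
# Usage: contigs_missing_from VCF_NAMES_FILE BAM_NAMES_FILE
# (hypothetical helper; both files hold one contig name per line)
contigs_missing_from() {
    # -F: fixed strings, -x: whole-line match (so "1" does not match "chr1"),
    # -v: keep non-matching lines, -f: read the names to exclude from a file.
    # "|| true" keeps the exit status at 0 when nothing is missing.
    grep -Fxv -f "$2" "$1" || true
}

# Small inline demo that mimics the situation in this thread.
demo_dir=$(mktemp -d)
printf '%s\n' 1 2 X > "$demo_dir/vcf_contigs.txt"
printf '%s\n' chr1 chr2 chrX > "$demo_dir/bam_contigs.txt"
contigs_missing_from "$demo_dir/vcf_contigs.txt" "$demo_dir/bam_contigs.txt"
rm -r "$demo_dir"
```

Any name printed by the function is a contig that the tool would be asked to visit but could not find in the BAM's sequence dictionary.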

  • Tanay Biswas

    Hi Anthony,

    The BAM files were generated by aligning to hg19 (all chromosomes), which is why I used the gnomAD hg19 VCF file. I don't understand why this error occurs. Please let me know how to deal with it.

     

    Thank you.

  • Anthony DiCi

    Hi Tanay Biswas,

    Thank you for that information! Could you please try running the following command and providing us with the output?

    samtools view -H [BAMFILE]

    This output will give us information about the contigs present and allow us to figure out what’s going on.

    I look forward to hearing back from you!

    Best,
    Anthony
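
Since only the contig names matter for this diagnosis, the header can be reduced to the SN values of its @SQ lines. This is a small sketch under the assumption that the header is tab-separated, as SAM headers are; the function name sq_names and the file name sample.bam are hypothetical.

```shell
#!/bin/sh
# Read a SAM header on stdin and print the SN (sequence name) of each @SQ line.
sq_names() {
    awk -F '\t' '$1 == "@SQ" {
        # Fields can appear in any order, so scan for the one tagged SN:
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# Demo on two header lines typed inline.
printf '@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:249250621\n' | sq_names

# With a real file (hypothetical name), the call might look like:
#   samtools view -H sample.bam | sq_names
```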

  • Tanay Biswas

    Hi Anthony,

    Please see the below output for samtools view:

    lab4@lab4-Vostro-3800:~$ samtools view -H /home/lab4/Seq_Data/WES/IITK-P4-TD/IITK-P4-TD.recal.bam
    @HD    VN:1.4    GO:none    SO:coordinate
    @SQ    SN:chrM    LN:16571
    @SQ    SN:chr1    LN:249250621
    @SQ    SN:chr2    LN:243199373
    @SQ    SN:chr3    LN:198022430
    @SQ    SN:chr4    LN:191154276
    @SQ    SN:chr5    LN:180915260
    @SQ    SN:chr6    LN:171115067
    @SQ    SN:chr7    LN:159138663
    @SQ    SN:chr8    LN:146364022
    @SQ    SN:chr9    LN:141213431
    @SQ    SN:chr10    LN:135534747
    @SQ    SN:chr11    LN:135006516
    @SQ    SN:chr12    LN:133851895
    @SQ    SN:chr13    LN:115169878
    @SQ    SN:chr14    LN:107349540
    @SQ    SN:chr15    LN:102531392
    @SQ    SN:chr16    LN:90354753
    @SQ    SN:chr17    LN:81195210
    @SQ    SN:chr18    LN:78077248
    @SQ    SN:chr19    LN:59128983
    @SQ    SN:chr20    LN:63025520
    @SQ    SN:chr21    LN:48129895
    @SQ    SN:chr22    LN:51304566
    @SQ    SN:chrX    LN:155270560
    @SQ    SN:chrY    LN:59373566
    @SQ    SN:chr1_gl000191_random    LN:106433
    @SQ    SN:chr1_gl000192_random    LN:547496
    @SQ    SN:chr4_ctg9_hap1    LN:590426
    @SQ    SN:chr4_gl000193_random    LN:189789
    @SQ    SN:chr4_gl000194_random    LN:191469
    @SQ    SN:chr6_apd_hap1    LN:4622290
    @SQ    SN:chr6_cox_hap2    LN:4795371
    @SQ    SN:chr6_dbb_hap3    LN:4610396
    @SQ    SN:chr6_mann_hap4    LN:4683263
    @SQ    SN:chr6_mcf_hap5    LN:4833398
    @SQ    SN:chr6_qbl_hap6    LN:4611984
    @SQ    SN:chr6_ssto_hap7    LN:4928567
    @SQ    SN:chr7_gl000195_random    LN:182896
    @SQ    SN:chr8_gl000196_random    LN:38914
    @SQ    SN:chr8_gl000197_random    LN:37175
    @SQ    SN:chr9_gl000198_random    LN:90085
    @SQ    SN:chr9_gl000199_random    LN:169874
    @SQ    SN:chr9_gl000200_random    LN:187035
    @SQ    SN:chr9_gl000201_random    LN:36148
    @SQ    SN:chr11_gl000202_random    LN:40103
    @SQ    SN:chr17_ctg5_hap1    LN:1680828
    @SQ    SN:chr17_gl000203_random    LN:37498
    @SQ    SN:chr17_gl000204_random    LN:81310
    @SQ    SN:chr17_gl000205_random    LN:174588
    @SQ    SN:chr17_gl000206_random    LN:41001
    @SQ    SN:chr18_gl000207_random    LN:4262
    @SQ    SN:chr19_gl000208_random    LN:92689
    @SQ    SN:chr19_gl000209_random    LN:159169
    @SQ    SN:chr21_gl000210_random    LN:27682
    @SQ    SN:chrUn_gl000211    LN:166566
    @SQ    SN:chrUn_gl000212    LN:186858
    @SQ    SN:chrUn_gl000213    LN:164239
    @SQ    SN:chrUn_gl000214    LN:137718
    @SQ    SN:chrUn_gl000215    LN:172545
    @SQ    SN:chrUn_gl000216    LN:172294
    @SQ    SN:chrUn_gl000217    LN:172149
    @SQ    SN:chrUn_gl000218    LN:161147
    @SQ    SN:chrUn_gl000219    LN:179198
    @SQ    SN:chrUn_gl000220    LN:161802
    @SQ    SN:chrUn_gl000221    LN:155397
    @SQ    SN:chrUn_gl000222    LN:186861
    @SQ    SN:chrUn_gl000223    LN:180455
    @SQ    SN:chrUn_gl000224    LN:179693
    @SQ    SN:chrUn_gl000225    LN:211173
    @SQ    SN:chrUn_gl000226    LN:15008
    @SQ    SN:chrUn_gl000227    LN:128374
    @SQ    SN:chrUn_gl000228    LN:129120
    @SQ    SN:chrUn_gl000229    LN:19913
    @SQ    SN:chrUn_gl000230    LN:43691
    @SQ    SN:chrUn_gl000231    LN:27386
    @SQ    SN:chrUn_gl000232    LN:40652
    @SQ    SN:chrUn_gl000233    LN:45941
    @SQ    SN:chrUn_gl000234    LN:40531
    @SQ    SN:chrUn_gl000235    LN:34474
    @SQ    SN:chrUn_gl000236    LN:41934
    @SQ    SN:chrUn_gl000237    LN:45867
    @SQ    SN:chrUn_gl000238    LN:39939
    @SQ    SN:chrUn_gl000239    LN:33824
    @SQ    SN:chrUn_gl000240    LN:41933
    @SQ    SN:chrUn_gl000241    LN:42152
    @SQ    SN:chrUn_gl000242    LN:43523
    @SQ    SN:chrUn_gl000243    LN:43341
    @SQ    SN:chrUn_gl000244    LN:39929
    @SQ    SN:chrUn_gl000245    LN:36651
    @SQ    SN:chrUn_gl000246    LN:38154
    @SQ    SN:chrUn_gl000247    LN:36422
    @SQ    SN:chrUn_gl000248    LN:39786
    @SQ    SN:chrUn_gl000249    LN:38502
    @RG    ID:A00718    PL:Illumina    LB:SS6    SM:IITK-P4-TD
    @PG    ID:GATK IndelRealigner    VN:2015.1-3.4.0-1-ga5ca3fc    CL:knownAlleles=[(RodBinding name=knownAlleles source=/garnet/Tools/WES_Analysis/GATK_Analysis/gatk_bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf), (RodBinding name=knownAlleles2 source=/garnet/Tools/WES_Analysis/GATK_Analysis/gatk_bundle/2.8/hg19/1000G_phase1.indels.hg19.sites.vcf)] targetIntervals=/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/VARIANT/tmp/chr1.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
    @PG    ID:MarkDuplicates    PN:MarkDuplicates    VN:1.130(8b3e8abe25f920f5aa569db482bb999f29cc447b_1427207353)    CL:picard.sam.markduplicates.MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=100000 INPUT=[/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/ALIGN/IITK-P4-TD.sorted.bam] OUTPUT=/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/ALIGN/IITK-P4-TD.remdup.bam METRICS_FILE=picard_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=[/garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/ALIGN/tmp/IITK-P4-TD.picard] VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=0 MAX_RECORDS_IN_RAM=40000000    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false CREATE_INDEX=false CREATE_MD5_FILE=false
    @PG    ID:bwa    PN:bwa    VN:0.7.12-r1039    CL:/garnet/Tools/WES_Analysis/GATK_Analysis/bwa-0.7.12/bwa mem -t 52 -M -R @RG\tPL:Illumina\tID:A00718\tSM:IITK-P4-TD\tLB:SS6 /garnet/Tools/WES_Analysis/GATK_Analysis/reference/ucsc.hg19.fasta /garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/FASTQ/IITK-P4-TD_1.fastq /garnet/Analysis/BI/Exome/HN00101026/Analysis/IITK-P4-TD/FASTQ/IITK-P4-TD_2.fastq
    @PG    ID:GATK PrintReads    VN:2015.1-3.4.0-1-ga5ca3fc    CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
    @PG    ID:samtools    PN:samtools    PP:GATK PrintReads    VN:1.13    CL:samtools view -H /home/lab4/Seq_Data/WES/IITK-P4-TD/IITK-P4-TD.recal.bam
    lab4@lab4-Vostro-3800:~$ 

    Thank you.

     

    Regards,

    Tanay

  • Anthony DiCi

    Hi Tanay Biswas,

    Thank you for including this log!

    Your samtools view output shows that you aligned to UCSC hg19, whose contigs are named using the chr1, chr2, ... convention, whereas the gnomAD v2.1.1 VCF names them 1, 2, ... without the chr prefix. In this case, you can try renaming the contigs in the VCF using bcftools annotate --rename-chrs.

    I hope this helps! Please let me know if this leads you to success. If you have any further questions, please let me know.

    Best,
    Anthony
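
As a rough illustration of the renaming suggested above, bcftools annotate --rename-chrs reads a file of "old_name new_name" pairs, one per line. The sketch below builds such a map for the autosomes plus X and Y; the function name make_chr_map and all file names are hypothetical, and the bcftools step only runs if bcftools and the input file are actually present. The mitochondrial contig is left out on purpose, because a matching name alone does not guarantee that the two references carry the same sequence.

```shell
#!/bin/sh
# Emit "old<TAB>new" pairs: 1 -> chr1, ..., 22 -> chr22, X -> chrX, Y -> chrY.
make_chr_map() {
    for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
        printf '%s\tchr%s\n' "$c" "$c"
    done
}

# Hypothetical input and output names; adjust to your own files.
in_vcf=gnomad.exomes.r2.1.1.sites.vcf.bgz
out_vcf=gnomad.exomes.r2.1.1.sites.chr.vcf.gz

if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    make_chr_map > chr_map.txt
    # Rename contigs and write a bgzip-compressed VCF
    bcftools annotate --rename-chrs chr_map.txt -O z -o "$out_vcf" "$in_vcf"
    # Build a tabix (.tbi) index for the renamed file
    bcftools index -t "$out_vcf"
else
    # Nothing to rename here, so just show the map that would be used.
    make_chr_map
fi
```

Renaming changes the contig names only, so it is worth checking afterwards that the names in the new VCF all appear in the BAM header.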

  • Anthony DiCi

    Hi Tanay Biswas,

    We haven't heard from you in a while so we're going to close out this ticket. If you still require assistance, simply respond to this email and we'll be happy to pick up where we left off!

    Kind regards,

    Anthony

  • Hi Anthony DiCi.

    In my case, I fixed this error by making the -L and -V parameters the same. Can you clarify what changes when this happens?

    Thanks!

  • David Roazen

    Hi Manuel Sérgio Sokolov Ravasqueira,

    This is discussed in the tool documentation, https://gatk.broadinstitute.org/hc/en-us/articles/13832749845403-GetPileupSummaries 

    Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. For example,

     gatk GetPileupSummaries \
       -I normal.bam \
       -V gnomad.vcf.gz \
       -L common_snps.interval_list \
       -O pileups.table
     

    attempts to get pileups at a list of common SNPs and emits output for those sites that are present in gnomAD, using the allele frequencies from gnomAD. Note that the sites may be a subset of the variants, the variants may be a subset of the sites, or they may overlap partially. In all cases pileup summaries are emitted for the overlap and nowhere else. The most common use case in which sites and variants differ is when the variants resource is a large file and the sites resource is an interval list subset from that file.

    So, having the -L and -V parameters be the same just tells the tool to output pileup summaries for all sites in the VCF.

    Regards,

    David
