Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Picard CrosscheckFingerprints with multi-sample VCF

5 comments

  • Gökalp Çelik

    Hi Devin McCabe

    The CrosscheckFingerprints tool requires GT fields to be present in the FORMAT column. If you add GT fields to your VCF files, the tool may start working again.

    I hope this helps. 

  • Devin McCabe

    Sorry, I should've quoted the VCF header, too:

    ##fileformat=VCFv4.2
    ##FILTER=<ID=PASS,Description="All filters passed">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
    ##INFO=<ID=ReverseComplementedAlleles,Number=0,Type=Flag,Description="The REF and the ALT alleles have been reverse complemented in liftover since the mapping from the previous reference to the current one was on the negative strand.">
    ##INFO=<ID=SwappedAlleles,Number=0,Type=Flag,Description="The REF and the ALT alleles have been swapped in liftover due to changes in the reference. It is possible that not all INFO annotations reflect this swap, and in the genotypes, only the GT, PL, and AD fields have been modified. You should check the TAGS_TO_REVERSE parameter that was used during the LiftOver to be sure.">

    The docs state that it's sufficient to have just the PL field, though:

    When provided a VCF, the identity check looks at the PL, GL and GT fields (in that order) and uses the first one that it finds.

    So I think there's something else that's going wrong with my usage of the tool.

  • Gökalp Çelik

    Hi Devin McCabe

    It looks like there may be an issue with one of your VCF files or with the map file. The VCF files are the easier of the two to check, so can you run ValidateVariants on them?

    Also, if they validate without any issues, can you try the map file hosted at the link below?

    https://github.com/naumanjaved/fingerprint_maps/blob/master/map_files/hg38_chr.map 

  • Devin McCabe

    I was able to successfully validate my merged VCF with:

    docker run \
        --rm \
        --volume $PWD:/usr/working \
        --volume ~/.config/gcloud:/root/.config/gcloud \
        broadinstitute/gatk:latest \
        ./gatk \
        ValidateVariants \
        --gcs-project-for-requester-pays ... \
        -V /usr/working/data/vcfs/all.vcf \
        -R gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
        --validation-type-to-exclude ALL

    I can already do a 1-vs-all comparison like this:

    docker run \
        --rm \
        --volume $PWD:/usr/working \
        broadinstitute/picard:latest \
        java "-Xmx1g" \
        -jar /usr/picard/picard.jar \
        CrosscheckFingerprints \
        --HAPLOTYPE_MAP /usr/working/data/ccle_29k_coding_snps_hg38.map \
        --CALCULATE_TUMOR_AWARE_RESULTS false \
        --INPUT /usr/working/data/vcfs/CDS-jTW85u.vcf \
        --SECOND_INPUT /usr/working/data/vcfs/CDS-p4tZuU.vcf \
        --SECOND_INPUT /usr/working/data/vcfs/CDS-fhuadb.vcf \
        --SECOND_INPUT /usr/working/data/vcfs/CDS-000dBy.vcf \
        --SECOND_INPUT /usr/working/data/vcfs/CDS-jTW85u.vcf \
        --CROSSCHECK_BY SAMPLE \
        --CROSSCHECK_MODE CHECK_ALL_OTHERS \
        --OUTPUT /usr/working/data/crosscheck.tsv

    This works, but rather than repeat SECOND_INPUT many times (there are far more than 4 target samples in reality), I have a merged version created like this:

    bcftools merge --no-index CDS-p4tZuU.vcf CDS-fhuadb.vcf CDS-000dBy.vcf CDS-jTW85u.vcf > all.bcf
    bcftools view all.bcf -O v > all.vcf

    This is a valid VCF, but apparently it can't replace all of the SECOND_INPUT args.

    So maybe a better way to phrase my question is whether there's a way to use mapping args to avoid the SECOND_INPUT repetition.

  • Gökalp Çelik

    Actually there is. 

    --SECOND_INPUT,-SI <String>   A second set of input files (or lists of files) with which to compare fingerprints.

    This parameter also accepts a list of files (a text file containing one file path per line), so you don't have to repeat the parameter for every input.

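The list-file suggestion in the last reply could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the helper name `write_vcf_list` and the file name `second_inputs.list` are hypothetical, the `.list` extension is an assumption about what Picard treats as a file list, and the container-side directory is taken from the `--volume $PWD:/usr/working` mount used in the commands above.

```shell
# Hypothetical helper: write one container-side path per line to a list file.
# $1 = list file to create, $2 = directory as seen inside the container,
# remaining arguments = VCF files on the host.
write_vcf_list() {
    local list_file="$1"
    local container_dir="$2"
    shift 2
    local vcf
    # Start from an empty list file so reruns do not append duplicates.
    : > "$list_file"
    for vcf in "$@"; do
        # Keep only the file name and prefix it with the mounted directory.
        printf '%s/%s\n' "$container_dir" "$(basename "$vcf")" >> "$list_file"
    done
}
```

Run from the directory that is mounted as /usr/working, a call such as `write_vcf_list data/second_inputs.list /usr/working/data/vcfs data/vcfs/CDS-*.vcf` would list every per-sample VCF while skipping the merged all.vcf. The 1-vs-all command would then presumably need a single `--SECOND_INPUT /usr/working/data/second_inputs.list` in place of one argument per VCF.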
