Picard CrosscheckFingerprints with multi-sample VCF
I'm trying to run CrosscheckFingerprints so that I'm doing a 1-vs-all comparison of samples. The VCF I'm using as first input is the standard output from ExtractFingerprint:
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CDS-jTW85u-2
chr1 14599 . T A . . . AD:DP:PL 15,0:15:530,45,0
...
The second input is the merged VCF output from ExtractFingerprint on several other samples:
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CDS-000dBy CDS-fhuadb CDS-nPdpKC CDS-p4tZuU
chr1 14599 . T A . PASS . AD:DP:PL 0,0:0:0,0,0 16,5:21:487,0,108 15,0:15:530,45,0 0,0:0:0,0,0
...
I thought that I could get a crosscheck matrix for my input VCF compared with each sample in the second VCF (I don't need samples in the second VCF compared to each other, though), but I get a generic java.lang.IndexOutOfBoundsException error for my command:
docker run \
--rm \
--volume $PWD:/usr/working \
broadinstitute/picard:latest \
java "-Xmx2g" \
-jar /usr/picard/picard.jar \
CrosscheckFingerprints \
--HAPLOTYPE_MAP /usr/working/data/ccle_29k_coding_snps_hg38.map \
--CALCULATE_TUMOR_AWARE_RESULTS false \
--INPUT /usr/working/data/vcfs/CDS-jTW85u-2.vcf \
--SECOND_INPUT /usr/working/data/vcfs/all.vcf \
--CROSSCHECK_BY SAMPLE \
--CROSSCHECK_MODE CHECK_ALL_OTHERS \
--OUTPUT /usr/working/data/crosscheck.tsv
I'm guessing that the tool is not designed to work in this way, and that the SECOND_INPUT option is only to be used when comparing to an INPUT with the same sample ID. Is there some way to either prepare my VCFs differently or use the various mapping args to accomplish what I want?
a) GATK version used: Picard 3.1.1-14-g6f2c61a90-SNAPSHOT
b) Exact command used:
c) Entire program log:
[Fri Feb 16 17:23:40 UTC 2024] CrosscheckFingerprints --INPUT /usr/working/data/vcfs/CDS-jTW85u-2.vcf --SECOND_INPUT /usr/working/data/vcfs/all.vcf --CROSSCHECK_MODE CHECK_ALL_OTHERS --OUTPUT /usr/working/data/crosscheck.tsv --HAPLOTYPE_MAP /usr/working/data/ccle_29k_coding_snps_hg38.map --CROSSCHECK_BY SAMPLE --CALCULATE_TUMOR_AWARE_RESULTS false --REQUIRE_INDEX_FILES false --LOD_THRESHOLD 0.0 --NUM_THREADS 1 --ALLOW_DUPLICATE_READS false --GENOTYPING_ERROR_RATE 0.01 --OUTPUT_ERRORS_ONLY false --LOSS_OF_HET_RATE 0.5 --EXPECT_ALL_GROUPS_TO_MATCH false --EXIT_CODE_WHEN_MISMATCH 1 --EXIT_CODE_WHEN_NO_VALID_CHECKS 1 --MAX_EFFECT_OF_EACH_HAPLOTYPE_BLOCK 3.0 --TEST_INPUT_READABILITY true --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri Feb 16 17:23:41 UTC 2024] Executing as root@b4b95808cb61 on Linux 6.6.12-linuxkit amd64; OpenJDK 64-Bit Server VM 17.0.10+7; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:3.1.1-14-g6f2c61a90-SNAPSHOT
INFO 2024-02-16 17:23:41 CrosscheckFingerprints SECOND_INPUT is not empty, and CROSSCHECK_MODE==CHECK_ALL_OTHERS. Will compare fingerprints from INPUT against all the fingerprints in SECOND_INPUT.
INFO 2024-02-16 17:23:41 CrosscheckFingerprints Fingerprinting 1 INPUT files.
WARNING 2024-02-16 17:23:41 FingerprintChecker Couldn't find index for file /usr/working/data/vcfs/CDS-jTW85u-2.vcf going to read through it all.
INFO 2024-02-16 17:23:41 FingerprintChecker Processed files. 1 fingerprints found in map.
INFO 2024-02-16 17:23:41 CrosscheckFingerprints Fingerprinting 1 SECOND_INPUT files.
WARNING 2024-02-16 17:23:41 FingerprintChecker Couldn't find index for file /usr/working/data/vcfs/all.vcf going to read through it all.
[Fri Feb 16 17:23:41 UTC 2024] picard.fingerprint.CrosscheckFingerprints done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=344981504
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Failed to fingerprint
at picard.fingerprint.FingerprintChecker.fingerprintFiles(FingerprintChecker.java:737)
at picard.fingerprint.CrosscheckFingerprints.doWork(CrosscheckFingerprints.java:559)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:105)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:115)
Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at picard.fingerprint.FingerprintChecker.fingerprintFiles(FingerprintChecker.java:735)
... 4 more
Caused by: java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
at java.base/java.util.Objects.checkIndex(Objects.java:361)
at java.base/java.util.ArrayList.get(ArrayList.java:427)
at htsjdk.variant.variantcontext.FastGenotype.getAllele(FastGenotype.java:128)
at picard.fingerprint.FingerprintChecker.getFingerprintFromVc(FingerprintChecker.java:420)
at picard.fingerprint.FingerprintChecker.loadFingerprintsFromVariantContexts(FingerprintChecker.java:301)
at picard.fingerprint.FingerprintChecker.loadFingerprints(FingerprintChecker.java:214)
at picard.fingerprint.FingerprintChecker.fingerprintVcf(FingerprintChecker.java:463)
at picard.fingerprint.FingerprintChecker.lambda$fingerprintFiles$10(FingerprintChecker.java:712)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
-
Hi Devin McCabe
CrosscheckFingerprints tool requires GT fields in the format area to be present. If you can add GT fields to your VCF files you may get the tool working again.
I hope this helps.
-
Sorry, I should've quoted the VCF header, too:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=ReverseComplementedAlleles,Number=0,Type=Flag,Description="The REF and the ALT alleles have been reverse complemented in liftover since the mapping from the previous reference to the current one was on the negative strand.">
##INFO=<ID=SwappedAlleles,Number=0,Type=Flag,Description="The REF and the ALT alleles have been swapped in liftover due to changes in the reference. It is possible that not all INFO annotations reflect this swap, and in the genotypes, only the GT, PL, and AD fields have been modified. You should check the TAGS_TO_REVERSE parameter that was used during the LiftOver to be sure.">The docs state that it's sufficient to have just the PL field, though:
When provided a VCF, the identity check looks at the PL, GL and GT fields (in that order) and uses the first one that it finds.
So I think there's something else that's going wrong with my usage of the tool.
-
Hi Devin McCabe
Looks like there may be an issue with one of your VCF files or the map file but the easier one to check is the VCF files. Can you run ValidateVariants on your VCF files?
Also if they validate without any issues can you try the map file hosted from the link below?
https://github.com/naumanjaved/fingerprint_maps/blob/master/map_files/hg38_chr.map
-
I was able to successfully validate my merged VCF with:
docker run \
--rm \
--volume $PWD:/usr/working \
--volume ~/.config/gcloud:/root/.config/gcloud \
broadinstitute/gatk:latest \
./gatk \
ValidateVariants \
--gcs-project-for-requester-pays ... \
-V /usr/working/data/vcfs/all.vcf \
-R gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
--validation-type-to-exclude ALLI can already do a 1-vs-all comparison like this:
docker run \
--rm \
--volume $PWD:/usr/working \
broadinstitute/picard:latest \
java "-Xmx1g" \
-jar /usr/picard/picard.jar \
CrosscheckFingerprints \
--HAPLOTYPE_MAP /usr/working/data/ccle_29k_coding_snps_hg38.map \
--CALCULATE_TUMOR_AWARE_RESULTS false \
--INPUT /usr/working/data/vcfs/CDS-jTW85u.vcf \
--SECOND_INPUT /usr/working/data/vcfs/CDS-p4tZuU.vcf \
--SECOND_INPUT /usr/working/data/vcfs/CDS-fhuadb.vcf \
--SECOND_INPUT /usr/working/data/vcfs/CDS-000dBy.vcf \
--SECOND_INPUT /usr/working/data/vcfs/CDS-jTW85u.vcf \
--CROSSCHECK_BY SAMPLE \
--CROSSCHECK_MODE CHECK_ALL_OTHERS \
--OUTPUT /usr/working/data/crosscheck.tsvThis works, but rather than repeat SECOND_INPUT many times (there are far more than 4 target samples in reality), I have a merged version created like this:
bcftools merge --no-index CDS-p4tZuU.vcf CDS-fhuadb.vcf CDS-000dBy.vcf CDS-jTW85u.vcf > all.bcf
bcftools view all.bcf -O v > all.vcfThis is a valid VCF but can't apparently replace all of the SECOND_INPUT args.
So maybe a better way to phrase my question is whether there's a way to use mapping args to avoid the SECOND_INPUT repetition.
-
Actually there is.
--SECOND_INPUT,-SI <String> A second set of input files (or lists of files) with which to compare fingerprints.
This parameter also accepts a list of files (file paths in a text file one file per line). Therefore you don't have to reiterate the parameter input multiple times.
Please sign in to leave a comment.
5 comments