Picard CheckFingerprint output for mismatches
Hi,
I am running CheckFingerprint to compare RNA-seq BAM files against the expected genotypes from WGS provided in a VCF file. The VCF file contains several samples. CheckFingerprint works fine for the vast majority of my checks and returns large LOD scores. However, when I run the command using a mismatched sample, I receive an error message (and an LOD score of 0), which concerns me:
java -Xmx8G -jar ${picard}/picard.jar CheckFingerprint \
> -VALIDATION_STRINGENCY SILENT \
> -INPUT ${bamfull} \
> -IGNORE_READ_GROUPS true \
> -GENOTYPES ${wgsFingerprints} \
> -EXPECTED_SAMPLE_ALIAS ${wgsSample} \
> -HAPLOTYPE_MAP ${haplotypeMap} \
> -GENOTYPE_LOD_THRESHOLD -100000000 \
> -OUTPUT testMisMatch
12:26:40.109 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/mfs/ctcn/tools/picard_2.23.6/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Mar 25 12:26:40 EDT 2024] CheckFingerprint --INPUT 76755449_SMA.bam --OUTPUT testMisMatch --GENOTYPES rosmapWgsFingerprints.vcf.gz --EXPECTED_SAMPLE_ALIAS SM-CJGLP --HAPLOTYPE_MAP hg38_haplotype_db.map --GENOTYPE_LOD_THRESHOLD -1.0E8 --IGNORE_READ_GROUPS true --VALIDATION_STRINGENCY SILENT --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Mon Mar 25 12:26:40 EDT 2024] Executing as XXX@ctcnlogin on Linux 4.19.0-23-amd64 amd64; OpenJDK 64-Bit Server VM 11.0.18+10-post-Debian-1deb10u1; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.23.6
INFO 2024-03-25 12:29:37 FingerprintChecker Reading an indexed file (76755449_SMA.bam)
INFO 2024-03-25 12:33:52 CheckFingerprint Read Group: null / null vs. SM-CJGLP: LOD = 0.0
ERROR 2024-03-25 12:33:52 CheckFingerprint No non-zero results found. This is likely an error. Probable cause: EXPECTED_SAMPLE (if provided) or the sample name from INPUT (if EXPECTED_SAMPLE isn't provided)isn't a sample in GENOTYPES file.
[Mon Mar 25 12:33:52 EDT 2024] picard.fingerprint.CheckFingerprint done. Elapsed time: 7.20 minutes.
Runtime.totalMemory()=4557111296
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
The error message suggests that the sample "SM-CJGLP" does not exist in the VCF file, but that is not true. I ran the same command with only the BAM file replaced by the correct matching BAM file, and it worked (LOD > 1000). I also ran the same command with the VCF sample name replaced by the name matching the original BAM file, and that worked too.
Why does CheckFingerprint throw an error instead of returning a small LOD?
-
Hi, thanks for posting. I tried recreating your issue with some of my own data but couldn't reproduce it. In particular, I used the same flags as you with a BAM I knew mismatched my VCF sample and tried running. It produced a large negative LOD (i.e. fingerprint mismatch) as expected.
Can you share a bit more information about your data and environment? In particular:
- Version of Picard/GATK
- The output of `bcftools query -l` on your VCF
- The RG info, via `samtools view -H <bam> | grep "@RG"`
Thanks!
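The two checks requested above can be sketched as small shell helpers. This is only an illustration, assuming `bcftools` and `samtools` are on the PATH; the helper names `has_sample` and `rg_lines` are hypothetical, and the variable names in the usage comments are taken from the original command.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the diagnostics requested above.

# has_sample NAME: succeed if NAME appears as a whole line on stdin.
# -F treats NAME as a fixed string, -x requires a full-line match,
# so "SM-CJGLP" does not match "SM-CJGLP2".
has_sample() {
    grep -Fxq -- "$1"
}

# rg_lines: print only the @RG lines of a SAM header read from stdin.
# "|| true" keeps the exit status zero when the header has no @RG lines.
rg_lines() {
    grep '^@RG' || true
}

# Usage (not run here; requires bcftools and samtools):
#   bcftools query -l "$wgsFingerprints" | has_sample "$wgsSample" && echo "sample present"
#   samtools view -H "$bamfull" | rg_lines
```

Matching the whole line is a deliberate choice: a plain `grep SM-CJGLP` would also succeed if the VCF only contained a longer name that happens to include that string.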
-
Hi Ricky,
Thank you for your reply. Initially, I used an older version of Picard, but I observed the exact same result with Picard 3.1.1.
The given EXPECTED_SAMPLE_ALIAS is a valid name in my VCF file. There are 1200 samples in the VCF file, so I won't list all of them here, but everything appears fine to me:
> query -l rosmapWgsFingerprints.vcf.gz | grep SM-CJGLP
SM-CJGLP
The bam file header might be the problem since it does not have the SM tag:
samtools view -H $bamfull | grep "@RG"
@RG ID:76755449_SMA CN:BI DT:2024-03-11T18:21:48:-0400
However, it works when I specify the correct sample in the VCF file (high LOD). It also works when I specify the correct bam file belonging to the sample "SM-CJGLP" in the VCF file.
Best,
Hans
-
Hi,
I'm still trying to test this a bit, but can you confirm:
- If you add a "SM" field to the RG line, does it produce the same error?
- What do you mean when you say it "works" when specifying the correct bam for the sample in the VCF file? You mean it runs properly when using a different bam but the same VCF? If so, can you also print the RG lines from that bam so I can see?
Thanks
-
I did take a look at the code, and there does seem to be a step early on where the SM tag value is extracted from the RG and passed to a number of functions later on. That is most likely what leads to your `null / null` output. For comparison, my log (for my files) showed `Read Group: null / SM-MOKH1 vs. HG002`. In that sense I'm optimistic that adding an SM value to your bam RGs would "fix" this issue, but if it does work, it would certainly be a bug in the tool that we'd have to look into.
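One way to try the SM suggestion is to rewrite the header and apply it with `samtools reheader`. The following is a rough sketch and not a confirmed fix: the helper name `add_sm_to_rg`, the variable `rnaSample`, and the output file names are all hypothetical, and the SM value should be whatever name identifies the sample the BAM actually came from.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append an SM field to every @RG header line that lacks one.
# Reads a SAM header on stdin and writes the modified header to stdout.
# $1 = sample name to insert (an assumption; choose the BAM's real sample name).
add_sm_to_rg() {
    awk -v sm="$1" '
        BEGIN { FS = OFS = "\t" }
        /^@RG/ {
            has_sm = 0
            for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) has_sm = 1
            if (!has_sm) $0 = $0 OFS "SM:" sm
        }
        { print }
    '
}

# Usage (not run here; requires samtools):
#   samtools view -H "$bamfull" | add_sm_to_rg "$rnaSample" > header_with_sm.sam
#   samtools reheader header_with_sm.sam "$bamfull" > with_sm.bam
#   samtools index with_sm.bam
```

Read groups that already carry an SM field are left unchanged, so the helper is safe to run on BAMs with mixed headers. Rerunning CheckFingerprint on the reheadered BAM would then show whether the `null / null` line and the error go away.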