Checks the sample identity of the sequence/genotype data in the provided file (SAM/BAM or VCF) against a set of known genotypes in the supplied genotype file (in VCF format).
Category: Diagnostics and Quality Control
Overview
Checks the sample identity of the sequence/genotype data in the provided file (SAM/BAM or VCF) against a set of known genotypes in the supplied genotype file (in VCF format).
Summary
Computes a fingerprint (essentially, genotype information from different parts of the genome) from the supplied input file (SAM/BAM or VCF) and compares it to the expected fingerprint genotypes provided. The key output is a LOD score, which represents the relative likelihood of the sequence data originating from the same sample as the genotypes vs. from a random sample.
Two outputs are produced:
- A summary metrics file that gives metrics of the fingerprint matches when comparing the input to a set of genotypes for the expected sample. Metrics are reported at the single-sample level (if the input was a VCF) or at the read level (lane or index within a lane) (if the input was a SAM/BAM).
- A detail metrics file that contains the individual SNP/haplotype comparisons within a fingerprint comparison.
Example comparing a BAM against known genotypes:
java -jar picard.jar CheckFingerprint \
     INPUT=sample.bam \
     GENOTYPES=sample_genotypes.vcf \
     HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
     OUTPUT=sample_fingerprinting
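Once the tool has run, the summary metrics file can be read programmatically. The following is a minimal sketch, not part of Picard itself. It assumes the usual Picard metrics layout (comment lines starting with '#', followed by a tab-separated table that ends at a blank line). The column names READ_GROUP, SAMPLE and LOD_EXPECTED_SAMPLE and all values in the embedded example are assumptions made for illustration; check them against a real output file.

```python
import csv
import io
from typing import Dict, List


def read_picard_metrics(text: str) -> List[Dict[str, str]]:
    """Parse the first tab-separated table found in Picard-style metrics text."""
    table_lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip header/comment lines
        if not line.strip():
            if table_lines:
                break  # a blank line after the table ends it
            continue  # blank lines before the table are ignored
        table_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    return list(reader)


# Made-up example content; column names and values are illustrative assumptions.
example = (
    "# illustrative header line\n"
    "\n"
    "READ_GROUP\tSAMPLE\tLOD_EXPECTED_SAMPLE\n"
    "rg1\tsampleA\t12.5\n"
    "rg2\tsampleA\t-3.1\n"
    "\n"
)

rows = read_picard_metrics(example)
for row in rows:
    print(row["READ_GROUP"], float(row["LOD_EXPECTED_SAMPLE"]))
```

In practice the text would come from the file written for the OUTPUT prefix, for example by reading it with `open(path).read()`.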
Detailed Explanation
This tool calculates a single number that reports the LOD score for the identity check between the INPUT and the GENOTYPES. A positive value indicates that the data seems to have come from the same individual; in other words, the identity checks out. The scale is logarithmic (base 10), so a LOD of 6 indicates that it is 1,000,000 times more likely that the data matches the genotypes than not. A negative value indicates that the data do not match. A score that is near zero is inconclusive and can result from low coverage or non-informative genotypes.
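The arithmetic behind the LOD scale can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the `inconclusive_band` cutoff is an arbitrary value chosen for the example, since the tool does not define how close to zero counts as inconclusive.

```python
def lod_to_likelihood_ratio(lod: float) -> float:
    """Convert a base-10 LOD score into a likelihood ratio (match vs. random sample)."""
    return 10.0 ** lod


def interpret_lod(lod: float, inconclusive_band: float = 1.0) -> str:
    """Label a LOD score; inconclusive_band is a hypothetical cutoff, not a Picard setting."""
    if abs(lod) < inconclusive_band:
        return "inconclusive"
    return "match" if lod > 0 else "mismatch"


print(lod_to_likelihood_ratio(6))  # ratio for a LOD of 6
print(interpret_lod(6), interpret_lod(-4), interpret_lod(0.2))
```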
The identity check uses haplotype blocks defined in the HAPLOTYPE_MAP file to gain statistical power for detecting identity or swaps by aggregating data from several SNPs in each haplotype block. This enables an identity check of samples with very low coverage (e.g. ~1x mean coverage).
When provided a VCF, the identity check looks at the PL, GL and GT fields (in that order) and uses the first one that it finds.
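The field precedence described above can be sketched as follows. This is not Picard's implementation; it assumes the FORMAT fields of one sample at one site are already available as a dictionary, and the function name `pick_genotype_evidence` is hypothetical.

```python
from typing import Any, Mapping, Optional, Tuple

# Order in which the fields are consulted, as described in the text.
FIELD_PRIORITY = ("PL", "GL", "GT")


def pick_genotype_evidence(format_fields: Mapping[str, Any]) -> Optional[Tuple[str, Any]]:
    """Return (field name, value) for the first of PL, GL, GT that is present."""
    for key in FIELD_PRIORITY:
        value = format_fields.get(key)
        if value is not None:
            return key, value
    return None  # none of the three fields is available


print(pick_genotype_evidence({"GT": "0/1", "PL": [30, 0, 40]}))
print(pick_genotype_evidence({"GT": "1/1"}))
```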
CheckFingerprint (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--DETAIL_OUTPUT -D | null | The text file to which to write detail metrics.
--GENOTYPES -G | null | File of genotypes (VCF) to be used in comparison. May contain any number of genotypes; CheckFingerprint will use only those that are usable for fingerprinting.
--HAPLOTYPE_MAP -H | null | The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details.
--INPUT -I | null | Input file SAM/BAM or VCF. If a VCF is used, it must have at least one sample. If there is more than one sample in the VCF, the parameter OBSERVED_SAMPLE_ALIAS must be provided in order to indicate which sample's data to use. If there are no samples in the VCF, an exception will be thrown.
--OUTPUT -O | null | The base prefix of output files to write. The summary metrics will have the file extension 'fingerprinting_summary_metrics' and the detail metrics will have the extension 'fingerprinting_detail_metrics'.
--SUMMARY_OUTPUT -S | null | The text file to which to write summary metrics.
Optional Tool Arguments | |
--arguments_file | [] | read one or more arguments files and add them to the command line
--EXPECTED_SAMPLE_ALIAS -SAMPLE_ALIAS | null | This parameter can be used to specify which sample's genotypes to use from the expected VCF file (the GENOTYPES file). If it is not supplied, the sample name from the input (VCF or BAM read group header) will be used.
--GENOTYPE_LOD_THRESHOLD -LOD | 5.0 | When counting haplotypes checked and matching, count only haplotypes where the most likely haplotype achieves at least this LOD.
--help -h | false | display the help message
--IGNORE_READ_GROUPS -IGNORE_RG | false | If the input is a SAM/BAM, and this parameter is true, treat the entire input BAM as one single read group in the calculation, ignoring RG annotations, and producing a single fingerprint metric for the entire BAM.
--OBSERVED_SAMPLE_ALIAS | null | If the input is a VCF, this parameter is used to select which sample's data in the VCF to use.
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--COMPRESSION_LEVEL / NA
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX / NA
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false
--CREATE_MD5_FILE / NA
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--DETAIL_OUTPUT / -D
The text file to which to write detail metrics.
Exclusion: This argument cannot be used at the same time as OUTPUT.
R File null
--EXPECTED_SAMPLE_ALIAS / -SAMPLE_ALIAS
This parameter can be used to specify which sample's genotypes to use from the expected VCF file (the GENOTYPES file). If it is not supplied, the sample name from the input (VCF or BAM read group header) will be used.
String null
--GA4GH_CLIENT_SECRETS / NA
Google Genomics API client_secrets.json file path.
String client_secrets.json
--GENOTYPE_LOD_THRESHOLD / -LOD
When counting haplotypes checked and matching, count only haplotypes where the most likely haplotype achieves at least this LOD.
double 5.0 [ [ -∞ ∞ ] ]
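The effect of the threshold on the two counts can be illustrated with a small sketch. It is not Picard's code: the `HaplotypeResult` record, its field names and the example values are hypothetical, and only the filtering rule (count a haplotype only when its most likely genotype reaches the threshold LOD) comes from the description above.

```python
from typing import Iterable, NamedTuple, Tuple


class HaplotypeResult(NamedTuple):
    """Hypothetical per-haplotype record used only for this illustration."""
    name: str
    best_genotype_lod: float  # LOD achieved by the most likely haplotype
    matches_expected: bool    # whether it agrees with the expected genotype


def count_confident(results: Iterable[HaplotypeResult],
                    lod_threshold: float = 5.0) -> Tuple[int, int]:
    """Return (haplotypes checked, haplotypes matching) at the given LOD threshold."""
    confident = [r for r in results if r.best_genotype_lod >= lod_threshold]
    matching = sum(1 for r in confident if r.matches_expected)
    return len(confident), matching


example_results = [
    HaplotypeResult("block1", 7.2, True),
    HaplotypeResult("block2", 5.0, False),
    HaplotypeResult("block3", 1.3, True),   # below threshold, not counted
]
print(count_confident(example_results))
```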
--GENOTYPES / -G
File of genotypes (VCF) to be used in comparison. May contain any number of genotypes; CheckFingerprint will use only those that are usable for fingerprinting.
R String null
--HAPLOTYPE_MAP / -H
The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details.
R File null
--help / -h
display the help message
boolean false
--IGNORE_READ_GROUPS / -IGNORE_RG
If the input is a SAM/BAM, and this parameter is true, treat the entire input BAM as one single read group in the calculation, ignoring RG annotations, and producing a single fingerprint metric for the entire BAM.
boolean false
--INPUT / -I
Input file SAM/BAM or VCF. If a VCF is used, it must have at least one sample. If there is more than one sample in the VCF, the parameter OBSERVED_SAMPLE_ALIAS must be provided in order to indicate which sample's data to use. If there are no samples in the VCF, an exception will be thrown.
R String null
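The sample-selection rules for a VCF input can be written out as a short sketch. This is an illustration, not the tool's source code: the function name is hypothetical, and raising an error when the alias is not present in the VCF is an assumption that the description does not state.

```python
from typing import Optional, Sequence


def select_observed_sample(vcf_samples: Sequence[str],
                           observed_sample_alias: Optional[str] = None) -> str:
    """Pick the sample whose data would be fingerprinted from a VCF input."""
    if not vcf_samples:
        raise ValueError("Input VCF contains no samples")
    if observed_sample_alias is not None:
        # Assumption: an alias that is absent from the VCF is treated as an error.
        if observed_sample_alias not in vcf_samples:
            raise ValueError(f"Sample {observed_sample_alias!r} not found in VCF")
        return observed_sample_alias
    if len(vcf_samples) > 1:
        raise ValueError("VCF has multiple samples; OBSERVED_SAMPLE_ALIAS must be provided")
    return vcf_samples[0]


print(select_observed_sample(["sampleA"]))
print(select_observed_sample(["sampleA", "sampleB"], "sampleB"))
```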
--MAX_RECORDS_IN_RAM / NA
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--OBSERVED_SAMPLE_ALIAS / NA
If the input is a VCF, this parameter is used to select which sample's data in the VCF to use.
String null
--OUTPUT / -O
The base prefix of output files to write. The summary metrics will have the file extension 'fingerprinting_summary_metrics' and the detail metrics will have the extension 'fingerprinting_detail_metrics'.
Exclusion: This argument cannot be used at the same time as SUMMARY_OUTPUT (-S) or DETAIL_OUTPUT (-D).
R String null
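A wrapper script can enforce the exclusion between OUTPUT and the two explicit output arguments before launching the tool. The sketch below is hypothetical helper code, not part of Picard. It uses the NAME=value syntax from the example command, assumes the prefix and extension are joined with a dot, and assumes that SUMMARY_OUTPUT and DETAIL_OUTPUT must both be given when OUTPUT is omitted.

```python
from typing import List, Optional, Tuple

SUMMARY_EXT = "fingerprinting_summary_metrics"
DETAIL_EXT = "fingerprinting_detail_metrics"


def expected_output_files(prefix: str) -> Tuple[str, str]:
    """File names derived from an OUTPUT prefix (assumes a '.' separator)."""
    return f"{prefix}.{SUMMARY_EXT}", f"{prefix}.{DETAIL_EXT}"


def build_command(input_path: str, genotypes: str, haplotype_map: str,
                  output: Optional[str] = None,
                  summary_output: Optional[str] = None,
                  detail_output: Optional[str] = None) -> List[str]:
    """Assemble a CheckFingerprint command line, rejecting excluded combinations."""
    explicit = [summary_output, detail_output]
    if output is not None and any(p is not None for p in explicit):
        raise ValueError("OUTPUT cannot be combined with SUMMARY_OUTPUT or DETAIL_OUTPUT")
    if output is None and any(p is None for p in explicit):
        # Assumption: without OUTPUT, both explicit files are needed.
        raise ValueError("Provide OUTPUT, or both SUMMARY_OUTPUT and DETAIL_OUTPUT")
    cmd = ["java", "-jar", "picard.jar", "CheckFingerprint",
           f"INPUT={input_path}", f"GENOTYPES={genotypes}",
           f"HAPLOTYPE_MAP={haplotype_map}"]
    if output is not None:
        cmd.append(f"OUTPUT={output}")
    else:
        cmd += [f"SUMMARY_OUTPUT={summary_output}", f"DETAIL_OUTPUT={detail_output}"]
    return cmd


print(" ".join(build_command("sample.bam", "sample_genotypes.vcf",
                             "fingerprinting_haplotype_database.txt",
                             output="sample_fingerprinting")))
print(expected_output_files("sample_fingerprinting"))
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues with file paths.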
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--REFERENCE_SEQUENCE / -R
Reference sequence file.
File null
--showHidden / -showHidden
display hidden arguments
boolean false
--SUMMARY_OUTPUT / -S
The text file to which to write summary metrics.
Exclusion: This argument cannot be used at the same time as OUTPUT.
R File null
--TMP_DIR / NA
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY / NA
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY / NA
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.1.6.0-SNAPSHOT built at Thu, 2 Apr 2020 14:54:17 -0400.