ValidateSamFile behavior
Regarding the ValidateSamFile command in picard in GATK 2.21.8, command-line:
java -jar $picard ValidateSamFile I=test.sam MODE=VERBOSE
1. I get this warning "ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur." Is there a command-line option to give a FASTA file with the reference genome? I don't see one in the --help information for ValidateSamFile.
2. Why is a missing @RG header line considered to be an error rather than a warning, and why is a missing RG:X tag in an alignment record considered to be a warning? By my reading of the SAM spec, these are optional features and are not included under "Recommended Practice". Contrast with the @HD header record and NM:i tag, both of which "should be present". Also, bwa mem, which seems to be the de facto standard, does not include @RG or RG:X.
-
Hi Robert Edgar
You are using a very old version of GATK that we don't support anymore. Please upgrade to the latest GATK4.1.1.0 version.
-
Yesterday I did a git clone and build per instructions here: https://github.com/broadinstitute/picard
Version shows as this:
java -jar $picard ValidateSamFile --version
2.21.8-1-gc5cd747-SNAPSHOT
Where do I get a supported release?
Thanks, Robert.
-
Hi Robert Edgar
Apologies. You mentioned GATK2, which got me confused. Anyway, you are using Picard v2, so that's fine.
Now let's answer your questions:
- Yes, you can provide a reference FASTA using the `-R` argument. You can see this in the --help information for ValidateSamFile under the "Optional Common Arguments" section.
- GATK requires read group data and fails without it. See this doc for more info.
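As a minimal sketch of the first answer, the snippet below assembles a ValidateSamFile command line that includes a reference. It uses the legacy KEY=value syntax that appears elsewhere in this thread (`I=`, `R=`, `MODE=`); with the newer syntax, the `-R` argument mentioned above plays the same role. The function name `build_validate_command` and the file names are hypothetical, and the command is only constructed here, not executed.

```python
from typing import List, Optional


def build_validate_command(picard_jar: str, sam_path: str,
                           reference: Optional[str] = None,
                           mode: str = "VERBOSE") -> List[str]:
    """Return an argv list for Picard ValidateSamFile (legacy KEY=value syntax).

    Hypothetical helper for illustration; it does not run Picard.
    """
    cmd = ["java", "-jar", picard_jar, "ValidateSamFile",
           f"I={sam_path}", f"MODE={mode}"]
    if reference is not None:
        # The reference is appended as a single KEY=value argument.
        cmd.append(f"R={reference}")
    return cmd


if __name__ == "__main__":
    # File names below are placeholders.
    print(" ".join(build_validate_command("picard.jar", "test.sam", "genome.fa")))
```

The resulting list can be handed to `subprocess.run` if you want to launch the validation from a script.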
-
ValidateSamFile is generating the same warning with the 4.1.7.0 Docker version. How can it be fixed? Should it be? It also generates this error:
MISSING_PLATFORM_VALUE:Read name A, A platform (PL) attribute was not found for read group
After I had run
AddOrReplaceReadGroups Created read-group ID=1 PL=ILLUMINA LB=normal_1 SM=CTGCTTCC+GATAGATC
-
Hi Robert,
Are you seeing this error even after adding the read groups to the bam file? That should not happen. Can you please share the header of the bam file using this command:
samtools view -H <bamfile>
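Once you have the header text, a quick way to see whether any read group lacks a platform is to scan the @RG lines for a PL field. The sketch below does that on a header string; it assumes tab-separated TAG:VALUE fields as in the SAM format, and the function name `read_groups_missing_platform` is hypothetical. It is only a rough check and not a substitute for ValidateSamFile.

```python
from typing import Dict, List


def read_groups_missing_platform(header_text: str) -> List[str]:
    """Return the IDs of @RG header lines that have no (or an empty) PL field."""
    missing: List[str] = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields after the record type are TAG:VALUE pairs separated by tabs.
        fields: Dict[str, str] = {}
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")
            fields[tag] = value
        if not fields.get("PL"):
            missing.append(fields.get("ID", "<no ID>"))
    return missing


if __name__ == "__main__":
    # Made-up header for illustration only.
    example = "@HD\tVN:1.6\n@RG\tID:1\tLB:normal_1\tSM:sampleA\n"
    print(read_groups_missing_platform(example))
```

An empty list means every @RG line in the header carries a PL value.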
-
Hi,
I tried the ValidateSamFile command to check whether my BAM file was appropriate. I added read groups, and it showed an NM validation warning. I then tried this command:
java -jar picard.jar ValidateSamFile R= genome.fa I=SRR314128_rg.bam MODE=SUMMARY
But it still shows this warning:
WARNING:MISSING_TAG_NM 33753200
I am also pasting the header of my bam file here:
@HD VN:1.6 SO:coordinate
@SQ SN:1 LN:30427671
@SQ SN:2 LN:19698289
@SQ SN:3 LN:23459830
@SQ SN:4 LN:18585056
@SQ SN:5 LN:26975502
@RG ID:foo LB:bar PL:illumina SM:Sample1 PU:A123.1
@PG ID:STAR PN:STAR VN:STAR_2.5.0a CL:STAR --runThreadN 12 --genomeDir genome/ --readFilesIn SRR3141288/SRR3141288_1.fastq SRR3141288/SRR3141288_2.fastq --outFileNamePrefix ../SRR3141288
@PG ID:MarkDuplicates VN:2.22.1 CL:MarkDuplicates INPUT=[SRR3141288.sorted.bam] OUTPUT=SRR3141288_md.bam METRICS_FILE=marked_dup_metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false PN:MarkDuplicates
@PG ID:samtools PN:samtools PP:STAR VN:1.10 CL:samtools view -H SRR314128_rg.bam
@PG ID:samtools.1 PN:samtools PP:MarkDuplicates VN:1.10 CL:samtools view -H SRR314128_rg.bam
@CO user command line: STAR --genomeDir genome/ --runThreadN 12 --readFilesIn SRR3141288/SRR3141288_1.fastq SRR3141288/SRR3141288_2.fastq --outFileNamePrefix ../SRR3141288
What could possibly be wrong?
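For context, the MISSING_TAG_NM count refers to alignment records rather than to the header, so the header alone will not show the cause. The sketch below counts records in SAM text that carry no NM tag among their optional fields. It is a simplified approximation: the function name `count_missing_nm` is hypothetical, and Picard's own rules (for example, how unmapped reads are treated) may differ.

```python
def count_missing_nm(sam_text: str) -> int:
    """Count alignment records in SAM text that have no NM optional tag."""
    count = 0
    for line in sam_text.splitlines():
        # Skip blank lines and header lines, which start with '@'.
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
        optional_tags = fields[11:]
        if not any(tag.startswith("NM:") for tag in optional_tags):
            count += 1
    return count


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = (
        "r1\t0\t1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tNM:i:0\n"
        "r2\t0\t1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n"
    )
    print(count_missing_nm(example))
```

To try it on a BAM file, you could feed it the text output of `samtools view` for a small region.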
-
Hello Dhara Awasthi,
Please see this resource we have for diagnosing issues that come up from ValidateSamFile: https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile
Hope this helps!
Genevieve