Dragen-gatk pipeline reference
AnsweredSome things are not very clear to me regarding the references used. According to your article de should use dragmap with a hash table generated using an alt-mask-bed. However in the read me present here (https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/README_dragen_gatk_resources.txt ) you write that you generated the hash table with the following command:
dragen-os \
--build-hash-table true \
--ht-reference reference.fasta \
--output-directory /home/data/reference/
We should use the 3 files of the hash table to run dragmap :
URIs:
1) gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/hash_table.cmp
2) gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/reference.bin
3) gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/hash_table.cfg.bin
?
And during the pipeline when we need the reference fasta we should use :
gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/Homo_sapiens_assembly38_masked.fasta
Or
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
Thanks in advance.
-
Actually what is very surprising is that if I use dragmap to create the hash table myself from the reference and the alt-mask.bed I get more than these 3 files.
-
I am going to move your post into our Community Discussions -> General Discussion topic, as the germline topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Pamela
-
Pamela BretscherBretscher thanks you for the correction.
Just to be more precise, i try this reference with these 3 files. But at the end i get a bam file with a lot of regions with read with MAPQ 0 as it was not alt aware. And more importantly these lead to a lot incorrect genotype. For exemple in the child of the trio use i get more than 60 misenses de novo. It's totally impossible to have as much de novo as that.
Best,
Quentin.
-
- I think the README is not correctly up to date. I'll get that fixed, thank you for pointing it out!
- You should use the masked reference to run DRAGMAP. Illumina is planning on releasing a blog post soon with more details about the masked reference.
- Yes, if you create the hash table yourself, you will get more than those 3 files. The 3 we have in the google bucket are the only ones needed for the next step.
- Could you provide more details about the unexpected results so our team can look into your example more closely? Thank you for bringing it up!
Best,
Genevieve
-
I will try to be as precise as possible.
I used the following tools:
Dragmap: 1.2.1
Gatk: 4.2.3.0
I have a family with parents and 2 children.
File used for some command :
Dragmap hashtable:
gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/hash_table.cmp
gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/reference.bin
gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/hash_table.cfg.binGenome reference:
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
Str table:
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.str
First command: dragmap
dragen-os \
-r ${REF_Genome} \
-b ${ubam_DIR}/${ubam} \
--interleaved=1 \
--num-threads ${SLURM_CPUS_PER_TASK} \
| \
samtools view \
-h \
-O BAM \
> ${Align_DIR}/${ID}.align.bam 2> ${Align_DIR}/logs/${ID}.trimmed.align.filtered.logCommand 2: Mergebamalignment
gatk \
--java-options "-Xms${MEMORYxmx}M -Xmx${MEMORYxmx}M" \
MergeBamAlignment \
VALIDATION_STRINGENCY=SILENT \
EXPECTED_ORIENTATIONS=FR \
ATTRIBUTES_TO_RETAIN=X0 \
ATTRIBUTES_TO_REMOVE=RG \
ATTRIBUTES_TO_REMOVE=NM \
ATTRIBUTES_TO_REMOVE=MD \
ALIGNED_BAM=${Align_DIR}/${BAM} \
UNMAPPED_BAM=${ubam_DIR}/${ubam} \
OUTPUT=${OUTPUT_DIR}/${BAM/bam/merged.bem} \
REFERENCE_SEQUENCE=${ref_fasta} \
PAIRED_RUN=true \
SORT_ORDER="unsorted" \
IS_BISULFITE_SEQUENCE=false \
ALIGNED_READS_ONLY=false \
CLIP_ADAPTERS=false \
MAX_RECORDS_IN_RAM=2000000 \
ADD_MATE_CIGAR=true \
MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
PROGRAM_RECORD_ID="dragen-os" \
PROGRAM_GROUP_VERSION="1.2.1" \
PROGRAM_GROUP_COMMAND_LINE="dragen-os -b ${ubam_DIR}/${ubam} -r ${dragen_reference} --interleaved=1" \
PROGRAM_GROUP_NAME="dragen-os" \
UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
ALIGNER_PROPER_PAIR_FLAGS=true \
UNMAP_CONTAMINANT_READS=true \
ADD_PG_TAG_TO_READS=falseCommand 3: Markduplicates
gatk \
--java-options "-Xmx${MEMORYxmx}M" \
MarkDuplicatesSpark \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${BAM_OUTPUT_DIR}/${BAM_OUTPUT} \
-M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
--tmp-dir ${tmp_dir} \
--output-shard-tmp-dir ${tmp_dir}/output \
--create-output-bam-index true \
--treat-unsorted-as-querygroup-ordered \
-- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.logCommand 4: CalibrateDragstrModel
gatk --java-options "-Xmx${MEMORYxmx}M -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.reference_fasta=${REF_genome}" \
CalibrateDragstrModel \
-I ${BAM_input_DIR}/${BAM_INPUT} \
-R ${REF_genome} \
-O ${output_DIR}/${BAM_INPUT} \
--threads ${SLURM_CPUS_PER_TASK} \
-str ${str_table}Command 5: Haplotypecaller and hardfilter
gatk --java-options "-Xmx${java_memory_size_mb}M -Xms${java_memory_size_mb}M -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
HaplotypeCaller \
-R ${REF_Genome} \
-L ${Scattered_DIR}/temp_${i}_of_${SCATTER_COUNT}/scattered.interval_list \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT} \
-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
-GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
-ERC GVCF \
--dragen-mode \
--dragstr-params-path ${dragstr_param_path}/${BAM_INPUT} \
2> ${logs_HC}/${i}.${GVCF_OUTPUT}.log &&
gatk --java-options "-Xmx${java_memory_size_mb}M -Xms${java_memory_size_mb}M" \
VariantFiltration \
-V ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT} \
--filter-expression "QUAL < 10.4139" \
--filter-name "DRAGENHardQUAL" \
-O ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT/g.vcf/hard_filtered.g.vcf} \
2> ${logs_VariantFiltration}/${i}.${GVCF_OUTPUT}.logCommand 6: GenomicsDBImport
gatk --java-options "-Xmx${memory_java}M -Xms${memory_java}M -XX:ParallelGCThreads=${SLURM_CPUS_PER_TASK}" GenomicsDBImport \
--sample-name-map ${cohort_map} \
--genomicsdb-workspace-path ${VCF_database_DIR_tmp}/Interval_${SLURM_ARRAY_TASK_ID} \
--batch-size 74 \
--reader-threads ${SLURM_CPUS_PER_TASK} \
-L ${Interval} \
--tmp-dir ${TMP_DIR} \
--genomicsdb-shared-posixfs-optimizations \
2> ${logs_DIR}/Interval_${SLURM_ARRAY_TASK_ID}.logCommand 7: GenotypeGVCFs
gatk \
--java-options "-Xmx${memory_java}M -Xms${memory_java}M -XX:ParallelGCThreads=${SLURM_CPUS_PER_TASK}" \
GenotypeGVCFs \
-R ${REF_Genome} \
-V gendb://${vcf_database_tmp} \
-O ${TMP_DIR}/gentaumix_interval_${SLURM_ARRAY_TASK_ID}_raw.vcf.gz \
-D ${DBSNP} \
--sequence-dictionary ${Dict} \
-L ${Interval} \
-G StandardAnnotation -G AS_StandardAnnotation \
--only-output-calls-starting-in-intervals \
--merge-input-intervals \
2> ${log_DIR}/Interval_${SLURM_ARRAY_TASK_ID}Then i check the result with the tools FindMendelianViolations.
I obtained these results :
FAMILY_ID MOTHER FATHER OFFSPRING OFFSPRING_SEX NUM_VARIANT_SITES NUM_DIPLOID_DENOVO NUM_HOMVAR_HOMVAR_HET NUM_HOMREF_HOMVAR_HOM NUM_HOM_HET_HOM NUM_HAPLOID_DENOVO NUM_HAPLOID_OTHER NUM_OTHER TOTAL_MENDELIAN_VIOLATIONS pologne MNM00255 MNM00257 MNM00256 Male 4760479 1419 64 1900 976 56 21 15 4451 pologne MNM00255 MNM00257 MNM00254 Male 4764068 1213 54 1937 821 47 9 15 4096 The number of de novo variant seems too high
-
Quentin Chartreux could you clarify what your ${REF_genome} variable is?
-
I see that i have two Time the same variable name in different commands that didn't mean the same things.
For the dragmap command REF_genome is the path to the directory that contain the hash table.
For other command REF_genome=/path/to/Homo_sapiens_assembly38.fasta
-
I see! Yes, it looks like you are using the correct references. At this point the DRAGEN-GATK pipeline is only for single sample analysis. HaplotypeCaller should be run with a single sample and without the GVCF mode. Could you verify if these unexpected results also exist in the output from HaplotypeCaller then hard filtering (of a single sample)?
-
Yes I'm pretty sure the problem is with the joint genotyping step.
So your recommandation is perform haplotypecaller in single sample, perform hard filtering and then merge the vcf to obtain a multi sample vcf (but in these case WE need to assume that missing genotype is 0/0) ? do you plan to implement joint genotyping? the dragen pipeline does. -
The DRAGEN-GATK 1.0 release is just for single sample analysis. So, any issues with joint genotyping could be because that is not what we intended for the use case.
You can read more in these blog posts:
-
Genevieve Brandt (she/her) this is the masked reference blog post you were referring to right? https://www.illumina.com/science/genomics-research/articles/dragen-demystifying-reference-genomes.html
-
Tiffany Miller yes, that's it!
Please sign in to leave a comment.
12 comments