Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Dragen-gatk pipeline reference

Answered
0

12 comments

  • Avatar
    Quentin Chartreux

    Actually what is very surprising is that if I use dragmap to create the hash table myself from the reference and the alt-mask.bed I get more than these 3 files.

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Quentin Chartreux,

    I am going to move your post into our Community Discussions -> General Discussion topic, as the germline topic is for reporting bugs and issues with GATK.

    You can read more about our forum guidelines and the topics here: Forum Guidelines.

    Best,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Quentin Chartreux

    Pamela BretscherBretscher thanks you for the correction. 

     

    Just to be more precise, i try this reference with these 3 files. But at the end i get a bam file with a lot of regions with read with MAPQ 0 as it was not alt aware. And more importantly these lead to a lot incorrect genotype. For exemple in the child of the trio use i get more than 60 misenses de novo. It's totally impossible to have as much de novo as that. 

     

    Best,

     

    Quentin. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Quentin Chartreux,

    • I think the README is not correctly up to date. I'll get that fixed, thank you for pointing it out!
    • You should use the masked reference to run DRAGMAP. Illumina is planning on releasing a blog post soon with more details about the masked reference.
    • Yes, if you create the hash table yourself, you will get more than those 3 files. The 3 we have in the google bucket are the only ones needed for the next step.
    • Could you provide more details about the unexpected results so our team can look into your example more closely? Thank you for bringing it up!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Quentin Chartreux

    Hi Genevieve-Brandt-she-her

     

    I will try to be as precise as possible.

    I used the following tools:

    Dragmap: 1.2.1

    Gatk: 4.2.3.0

    I have a family with parents and 2 children.

    File used for some command : 

    Dragmap hashtable:

     gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/hash_table.cmp
     gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/reference.bin
    gs://gcp-public-data--broad-references/hg38/v0/dragen_reference/hash_table.cfg.bin

    Genome reference: 

    gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta

    Str table: 

    gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.str

    First command: dragmap

    dragen-os \
    -r ${REF_Genome} \
    -b ${ubam_DIR}/${ubam} \
    --interleaved=1 \
    --num-threads ${SLURM_CPUS_PER_TASK} \
    | \
    samtools view \
    -h \
    -O BAM \
    > ${Align_DIR}/${ID}.align.bam 2> ${Align_DIR}/logs/${ID}.trimmed.align.filtered.log

     

    Command 2: Mergebamalignment

    gatk \
    --java-options "-Xms${MEMORYxmx}M -Xmx${MEMORYxmx}M" \
    MergeBamAlignment \
    VALIDATION_STRINGENCY=SILENT \
    EXPECTED_ORIENTATIONS=FR \
    ATTRIBUTES_TO_RETAIN=X0 \
    ATTRIBUTES_TO_REMOVE=RG \
    ATTRIBUTES_TO_REMOVE=NM \
    ATTRIBUTES_TO_REMOVE=MD \
    ALIGNED_BAM=${Align_DIR}/${BAM} \
    UNMAPPED_BAM=${ubam_DIR}/${ubam} \
    OUTPUT=${OUTPUT_DIR}/${BAM/bam/merged.bem} \
    REFERENCE_SEQUENCE=${ref_fasta} \
    PAIRED_RUN=true \
    SORT_ORDER="unsorted" \
    IS_BISULFITE_SEQUENCE=false \
    ALIGNED_READS_ONLY=false \
    CLIP_ADAPTERS=false \
    MAX_RECORDS_IN_RAM=2000000 \
    ADD_MATE_CIGAR=true \
    MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
    PROGRAM_RECORD_ID="dragen-os" \
    PROGRAM_GROUP_VERSION="1.2.1" \
    PROGRAM_GROUP_COMMAND_LINE="dragen-os -b ${ubam_DIR}/${ubam} -r ${dragen_reference} --interleaved=1" \
    PROGRAM_GROUP_NAME="dragen-os" \
    UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
    ALIGNER_PROPER_PAIR_FLAGS=true \
    UNMAP_CONTAMINANT_READS=true \
    ADD_PG_TAG_TO_READS=false

    Command 3: Markduplicates

    gatk \
    --java-options "-Xmx${MEMORYxmx}M" \
    MarkDuplicatesSpark \
    -I ${BAM_INPUT_DIR}/${BAM_INPUT} \
    -O ${BAM_OUTPUT_DIR}/${BAM_OUTPUT} \
    -M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
    --tmp-dir ${tmp_dir} \
    --output-shard-tmp-dir ${tmp_dir}/output \
    --create-output-bam-index true \
    --treat-unsorted-as-querygroup-ordered \
    -- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.log

     

    Command 4: CalibrateDragstrModel 

    gatk --java-options "-Xmx${MEMORYxmx}M -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.reference_fasta=${REF_genome}" \
    CalibrateDragstrModel \
    -I ${BAM_input_DIR}/${BAM_INPUT} \
    -R ${REF_genome} \
    -O ${output_DIR}/${BAM_INPUT} \
    --threads ${SLURM_CPUS_PER_TASK} \
    -str ${str_table}

    Command 5: Haplotypecaller and hardfilter

    gatk --java-options "-Xmx${java_memory_size_mb}M -Xms${java_memory_size_mb}M -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
    HaplotypeCaller \
    -R ${REF_Genome} \
    -L ${Scattered_DIR}/temp_${i}_of_${SCATTER_COUNT}/scattered.interval_list \
    -I ${BAM_INPUT_DIR}/${BAM_INPUT} \
    -O ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT} \
    -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
    -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
    -ERC GVCF \
    --dragen-mode \
    --dragstr-params-path ${dragstr_param_path}/${BAM_INPUT} \
    2> ${logs_HC}/${i}.${GVCF_OUTPUT}.log &&
    gatk --java-options "-Xmx${java_memory_size_mb}M -Xms${java_memory_size_mb}M" \
    VariantFiltration \
    -V ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT} \
    --filter-expression "QUAL < 10.4139" \
    --filter-name "DRAGENHardQUAL" \
    -O ${temp_gVCF_OUTPUT_DIR}/${i}.${GVCF_OUTPUT/g.vcf/hard_filtered.g.vcf} \
    2> ${logs_VariantFiltration}/${i}.${GVCF_OUTPUT}.log

    Command 6: GenomicsDBImport 

    gatk --java-options "-Xmx${memory_java}M -Xms${memory_java}M -XX:ParallelGCThreads=${SLURM_CPUS_PER_TASK}" GenomicsDBImport \
    --sample-name-map ${cohort_map} \
    --genomicsdb-workspace-path ${VCF_database_DIR_tmp}/Interval_${SLURM_ARRAY_TASK_ID} \
    --batch-size 74 \
    --reader-threads ${SLURM_CPUS_PER_TASK} \
    -L ${Interval} \
    --tmp-dir ${TMP_DIR} \
    --genomicsdb-shared-posixfs-optimizations \
    2> ${logs_DIR}/Interval_${SLURM_ARRAY_TASK_ID}.log

    Command 7: GenotypeGVCFs 

    gatk \
    --java-options "-Xmx${memory_java}M -Xms${memory_java}M -XX:ParallelGCThreads=${SLURM_CPUS_PER_TASK}" \
    GenotypeGVCFs \
    -R ${REF_Genome} \
    -V gendb://${vcf_database_tmp} \
    -O ${TMP_DIR}/gentaumix_interval_${SLURM_ARRAY_TASK_ID}_raw.vcf.gz \
    -D ${DBSNP} \
    --sequence-dictionary ${Dict} \
    -L ${Interval} \
    -G StandardAnnotation -G AS_StandardAnnotation \
    --only-output-calls-starting-in-intervals \
    --merge-input-intervals \
    2> ${log_DIR}/Interval_${SLURM_ARRAY_TASK_ID}

     

    Then i check the result with the tools FindMendelianViolations.

    I obtained these results : 

    FAMILY_ID MOTHER FATHER OFFSPRING OFFSPRING_SEX NUM_VARIANT_SITES NUM_DIPLOID_DENOVO NUM_HOMVAR_HOMVAR_HET NUM_HOMREF_HOMVAR_HOM NUM_HOM_HET_HOM NUM_HAPLOID_DENOVO NUM_HAPLOID_OTHER NUM_OTHER TOTAL_MENDELIAN_VIOLATIONS
    pologne MNM00255 MNM00257 MNM00256 Male 4760479 1419 64 1900 976 56 21 15 4451
    pologne MNM00255 MNM00257 MNM00254 Male 4764068 1213 54 1937 821 47 9 15 4096

    The number of de novo variant seems too high

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Quentin Chartreux could you clarify what your ${REF_genome} variable is?

    0
    Comment actions Permalink
  • Avatar
    Quentin Chartreux

    Genevieve-Brandt-she-her

     

    I see that i have two Time the same variable name in different commands that didn't mean the same things. 

     

    For the dragmap command REF_genome is the path to the directory that contain the hash table. 

    For other command REF_genome=/path/to/Homo_sapiens_assembly38.fasta

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    I see! Yes, it looks like you are using the correct references. At this point the DRAGEN-GATK pipeline is only for single sample analysis. HaplotypeCaller should be run with a single sample and without the GVCF mode. Could you verify if these unexpected results also exist in the output from HaplotypeCaller then hard filtering (of a single sample)?

    0
    Comment actions Permalink
  • Avatar
    Quentin Chartreux

    Yes I'm pretty sure the problem is with the joint genotyping step.
    So your recommandation is perform haplotypecaller in single sample, perform hard filtering and then merge the vcf to obtain a multi sample vcf (but in these case WE need to assume that missing genotype is 0/0) ? do you plan to implement joint genotyping? the dragen pipeline does.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    The DRAGEN-GATK 1.0 release is just for single sample analysis. So, any issues with joint genotyping could be because that is not what we intended for the use case.

    You can read more in these blog posts:

    0
    Comment actions Permalink
  • 0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Tiffany Miller yes, that's it!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk