Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

All intervals were filtered in Filterintervals

0

11 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi 侯旻,

    Are you trying to follow this gCNV pipeline? https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-common-and-rare-germline-copy-number-variants

    If so, I noticed that you want to select --padding 0 not --bin-length 0 for PreprocessIntervals.

    Let me know if this answers your question.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    侯旻

    Hi Genevieve Brandt

    Thank you for your comment. Yes, I followed by the gCNV pipeline and I selected both bin length and padding as 0. Actually, I tried bin length 860 which is the shortest length of all genes and the program worked well at that time. But the result seems to be too strict when I compared the ploidy results to depth result. For example, one gene showed depth is almost four or five times higher than average sequencing depth while the ploidy result showed 3n for that gene which we expected 4n or more for that gene.

    Best regards

    Min

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you Min for the update. I will follow up with my team to see if there are any parameters you can use to make the gCNV pipeline more sensitive. 

    Is this WGS or exome data? Could you also supply all the gCNV commands you ran to get the results that were too strict?

    0
    Comment actions Permalink
  • Avatar
    侯旻

    Hi Genevieve

    Thank you so much for the support. I used exome data for the analysis. I attached commands of one species as an example. Because my study species doesn't have WGS so I ran the program in cohort mode (use the same bam file to run AnnotateIntervals). The intervals bed file included taste receptor genes (around 40) and about 83 neutral regions (including 3 regions are from X chromosome: NR83-85) which are supposed to be single copy.

    Looking forward to your reply

    Best regards

    Min

     /usr/local/bin/gatk-4.2.5.0/gatk PreprocessIntervals \
              -R Sent_3mapping_final.fasta \
              -L Sent_TAS2R.bed \
              --bin-length 870 \
              --padding 0 \
              --interval-merging-rule OVERLAPPING_ONLY \
              -O pre_Sent.interval_list

    java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups INPUT=Sent_H2_rmd_bqsr.bam OUTPUT=Sent_addRG_1.bam RGID=Sent RGLB=Library1 RGPL=illumina RGPU=Sent RGSM=sample1 VALIDATION_STRINGENCY=LENIENT

    java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups INPUT=Sent_H2_rmd_bqsr.bam OUTPUT=Sent_addRG_2.bam RGID=Sent RGLB=Library2 RGPL=illumina RGPU=Sent RGSM=sample2 VALIDATION_STRINGENCY=LENIENT

    samtools index Sent_addRG_1.bam

    for i in `ls *bam`; do         
    sp=`basename ${i} _addRG_1.bam`
    /usr/local/bin/gatk-4.2.5.0/gatk CollectReadCounts \
    -I ${sp}_addRG_1.bam \
    -L pre_${sp}.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -format TSV -O ${sp}_addRG1.tsv
    done

    for i in `ls *bam`; do
    sp=`basename ${i} _addRG_1.bam`
    /usr/local/bin/gatk-4.2.5.0/gatk AnnotateIntervals -R ${sp}_3mapping_final.fasta -L pre_${sp}.interval_list --interval-merging-rule OVERLAPPING_ONLY -O ann_${sp}1.tsv
    done

    for i in `ls *bam`; do
    sp=`basename ${i} _addRG_1.bam`
    /usr/local/bin/gatk-4.2.5.0/gatk FilterIntervals -L pre_${sp}.interval_list --annotated-intervals ann_${sp}1.tsv --interval-merging-rule OVERLAPPING_ONLY -O ${sp}_filter.interval_list
    done

     

     samtools index Sent_addRG_2.bam
     
    for i in `ls *bam`; do
    sp=`basename ${i} _addRG_2.bam`
    /usr/local/bin/gatk-4.2.5.0/gatk CollectReadCounts \
    -I ${sp}_addRG_2.bam \
    -L pre_${sp}.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -format TSV -O ${sp}_addRG2.tsv
    done

    source activate gatk

    /usr/local/bin/gatk-4.2.5.0/gatk DetermineGermlineContigPloidy --input Sent_addRG1.tsv --input Sent_addRG2.tsv --contig-ploidy-priors ploidy_prior.tsv --output Sent_contig --output-prefix Sent

    for i in `ls *1.tsv`; do
    sp=`basename ${i} _addRG1.tsv`
    /usr/local/bin/gatk-4.2.5.0/gatk GermlineCNVCaller --run-mode COHORT -L ${sp}_filter.interval_list --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-calls ${sp}_contig/${sp}-calls -I ${sp}_addRG1.tsv -I ${sp}_addRG2.tsv --output gcnv_${sp} --output-prefix ${sp}
    done

    for i in `ls *1.tsv`; do
    sp=`basename ${i} _addRG1.tsv`
    /usr/local/bin/gatk-4.2.5.0/gatk PostprocessGermlineCNVCalls --calls-shard-path gcnv_${sp}/${sp}-calls --model-shard-path gcnv_${sp}/${sp}-model --sample-index 0 --autosomal-ref-copy-number 2 --allosomal-contig NR83_Can_0-1578 --allosomal-contig NR84_Can_0-2157 --allosomal-contig NR85_Can_0-2154 --output-genotyped-intervals gen_${sp}.vcf --output-genotyped-segments seg_${sp}.vcf --output-denoised-copy-ratios ${sp}_copy.tsv --contig-ploidy-calls ${sp}_contig/${sp}-calls
    done

    vcftools --vcf gen_Sent.vcf --extract-FORMAT-info CN --out Sent_CN

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you Min! I expect that I will be able to get back to you with a follow up Monday or Tuesday next week.

    0
    Comment actions Permalink
  • Avatar
    侯旻

    Thank you for being so supportive!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Min,

    I had the developers take a look at your request to see what they would recommend. They recommended to stick with the default value for --bin-length to avoid your  original issue of filtering out all of the intervals.

    For your unexpected results, I first wanted to point out that ploidy is calculated on a contig level while copy number is calculated on the gene level. So make sure you are looking at the correct metric for the gene of interest. You may be running into an issue with the quality covering that gene which is why the copy number is lower than expected. 

    You can check the segments file to see if the quality is high or low. We normally don't see issues with sensitivity when the quality is high. Depending on the event type, here are some example QS thresholds. They are human derived but should give you some basis to go off of:

    private int hetDelQSThreshold = 100;
    private int homDelQSThreshold = 400;
    private int dupeQSThreshold = 50;

    Any calls with QS < 50 are dubious.

    Let us know what you find and if this helps!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    侯旻

    Hi, Genevieve

    Sorry for the late reply and I really appreciate everything you and your team done for me. Thank you so much and thank for the developers' recommendation.

    About unexpected results, sorry for the misdiscribed issue. I wanted to know the copy number of each exome data. I also checked the segment vcf files to confirm the QS value and I found that all of them are higher than 55 which means the quality is high.

    This week I also discussed with my lab member who is using the GATK to analyze the human sequence data. We thought since I used exome gene and neutral reference region to determine the germline contig ploidy which is supposed to use the contig data or chromosome data (because I don't have the information for any of them), it may be the limitation for the rest analysis steps. What's more, many neutral reference regions are 100 even 1000 times longer than exome regions and according to the bin length I set, those long-length neutral regions will be separated into many parts. Low mapping quality of some parts will influence the average depth when DetermineGermlineContigPloidy is applied. So we decided to filter those low mapping quality reads first maybe by using the MappingQualityReadFilter.

    Best regards

    Min

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks Min! Keep in mind though that the quality threshold differs for different events. For some events, 55 is not high.

    Let us know if you have any other questions.

    0
    Comment actions Permalink
  • Avatar
    侯旻

    Thanks Genevieve and your team. You have been so helpful and supportive. I will keep in mind about the quailty thing.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    You're welcome!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk