Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Haplotype Caller General Question

Answered
0

9 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Jordi D,

    This is extra information about the phasing of your variants. We have an article about phasing here: https://gatk.broadinstitute.org/hc/en-us/articles/360050354712-What-is-physical-phasing.

    Please let me know if you have any further questions.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Jordi D

    Hi Genevieve-Brandt-she-her.

    Thank you for the answer, I have carefully read the article.

    However, now I wonder if those variants with PID and PGT information, can be succesfully filtered with 'Select Variants' and JEXL.

    I found a variant (which I used as a control) with the following 4 genotypes from a 4 sample vcf file:

    0|1:32,59:91:99:0|1:4511278_C_T:2360,0,1146:4511278     0|1:34,52:86:99:0|1:4511278_C_T:2081,0,1244:4511278     0|1:86,37:123:99:0|1:4511278_C_T:1295,0,3442:4511278    0|1:42,102:144:99:0|1:4511278_C_T:4117,0,1450:4511278

    Then I used the following 'SelectVariants' filtering options in order to obtain those variants with at least 1 called variant and at least 1 VAR allele:

    -select 'vc.getCalledChrCount() != 0 && (vc.getHetCount() >= 1 or vc.getHomVarCount() >= 1)'

    Then, the variant mentioned above was filtered out. I guess that 'SelectVariants' did not work with phased genotypes, didn't?

    Any advice will be appreciated

    Thank you in advance

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Jordi D What is the reason you want to filter the phased variants?

    0
    Comment actions Permalink
  • Avatar
    Jordi D

    Hi Genevieve-Brandt-she-her,

    in fact I want to filter out any variant (phased or unphased) that doesn't meet the parameters. I am dealing with a multi-sample vcf (~80 samples) and I previously used a DEPTH and a GQ filter using VariantFiltration. Then, I would like to get rid of positions that become "no-called" due to those filters (./.:3,0:3:Depth;GQ:9:0,9,110) and also positions without a VAR allele (I guess that this is similar to looking for AC >=1).

    I mean, after VariantFiltration I got certain position with the following genotypes:

    0/0:33,0:33:PASS:99:.:.:0,99,1176       0/0:20,0:20:PASS:60:.:.:0,60,685        0|1:8,19:27:PASS:99:0|1:139213_A_G:605,0,223:139213     ./.:11,0:11:Depth:27:.:.:0,27,405

    One of them was excluded due to Depth, two were HOMOREF and I got one phased variant as a hetereozygote. However, that variant was filtered out.

    Hope this helps to understand the purpose!

    Thank you.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks, Jordi D. SelectVariants should work on phased variants. Can you share the complete variant lines that you are looking to filter out? It would be easier for me to see that way.

    0
    Comment actions Permalink
  • Avatar
    Jordi D

    Hi again Genevieve-Brandt-she-her,

    First, I used the following code to filter the variants:

    gatk VariantFiltration \
                    -R ${refDir}/${assembly}/Homo_sapiens_assembly.fasta \
                    -V ${outDir}/prova.vcf \
                    -O ${outDir}/prova_filtered_1.vcf \
                    -genotype-filter-name "Depth" -genotype-filter-expression "DP < 20" \
                    --set-filtered-genotype-to-no-call true \
                    -genotype-filter-name "GQ" -genotype-filter-expression "GQ < 20"


            ## Getting control genotypes (at least 1 called && at least 1 VAR allele & BIALLELIC)
            gatk SelectVariants \
                    -R ${refDir}/${assembly}/Homo_sapiens_assembly.fasta \
                    -V ${outDir}/prova_filtered_1.vcf \
                    -L ${refDir}/${assembly}/Exome_V6.bed \
                    -O ${outDir}/prova_filtered_2.vcf \
                    -select 'vc.getCalledChrCount() != 0 && (vc.getHetCount() >= 1 or vc.getHomVarCount() >= 1)' \
                    --restrict-alleles-to BIALLELIC \
                    --exclude-filtered true

    Input file prova.vcf was obtained just with example purposes, I mean, I took 6 variants for just 4 samples with several features: no-called, homoref...etc

    1       17385   .       G       A       124.13  PASS    BaseQRankSum=-2.016e+00;DP=39;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.125;MQ=37.09;MQRankSum=0.142;QD=7.30;ReadPosRankSum=-2.640e-01;SOR=0.382;VQSLOD=-2.319e+00;culprit=MQ   GT:AD:DP:FT:GQ:PL       ./.:10,0:10:Depth;GQ:0:0,0,329  ./.:3,0:3:Depth;GQ:9:0,9,110    ./.:9,0:9:Depth;GQ:9:0,9,294    ./.:12,5:17:Depth:99:132,0,359
    1       139213  .       A       G       553.75  VQSRTrancheSNP99.90to100.00     AC=1;AF=0.250;AN=4;BaseQRankSum=-8.660e-01;DP=82;ExcessHet=3.0103;FS=1.603;MLEAC=1;MLEAF=0.125;MQ=28.60;MQRankSum=1.63;QD=22.15;ReadPosRankSum=1.11;SOR=0.412;VQSLOD=-3.058e+00;culprit=MQ      GT:AD:DP:FT:GQ:PGT:PID:PL:PS    0/0:28,0:28:PASS:78:.:.:0,78,1170       ./.:16,0:16:Depth:42:.:.:0,42,630       0|1:9,16:25:PASS:99:0|1:139213_A_G:563,0,240:139213     ./.:11,0:11:Depth:33:.:.:0,33,452
    1       139233  .       C       A       595.75  VQSRTrancheSNP99.90to100.00     AC=1;AF=0.167;AN=6;BaseQRankSum=-1.619e+00;DP=93;ExcessHet=3.0103;FS=1.690;MLEAC=1;MLEAF=0.125;MQ=29.16;MQRankSum=1.46;QD=22.06;ReadPosRankSum=-2.230e-01;SOR=0.503;VQSLOD=-2.956e+00;culprit=MQ        GT:AD:DP:FT:GQ:PGT:PID:PL:PS    0/0:33,0:33:PASS:99:.:.:0,99,1176       0/0:20,0:20:PASS:60:.:.:0,60,685        0|1:8,19:27:PASS:99:0|1:139213_A_G:605,0,223:139213     ./.:11,0:11:Depth:27:.:.:0,27,405
    1       762273  .       G       A       4927.29 PASS    AC=8;AF=1.00;AN=8;BaseQRankSum=2.16;DP=151;ExcessHet=3.0103;FS=2.746;MLEAC=8;MLEAF=1.00;MQ=40.64;MQRankSum=-2.252e+00;QD=25.36;ReadPosRankSum=-4.270e-01;SOR=1.751;VQSLOD=-1.602e+00;culprit=MQ GT:AD:DP:GQ:PL  1/1:2,33:35:55:1089,55,0        1/1:0,29:29:87:1091,87,0        1/1:0,40:40:99:1439,120,0       1/1:0,36:36:99:1316,108,0
    1       865628  .       G       A       1055    PASS    AC=2;AF=0.250;AN=8;BaseQRankSum=-1.537e+00;DP=149;ExcessHet=3.6798;FS=0.868;MLEAC=2;MLEAF=0.250;MQ=60.00;MQRankSum=0.00;QD=13.19;ReadPosRankSum=0.135;SOR=0.531;VQSLOD=15.66;culprit=MQRankSum  GT:AD:DP:GQ:PL  0/0:31,0:31:90:0,90,930 0/1:21,19:40:99:498,0,635       0/0:36,0:36:99:0,99,1485        0/1:20,20:40:99:569,0,597
    1       878314  .       G       C       1087    PASS    AC=2;AF=0.250;AN=8;BaseQRankSum=2.26;DP=141;ExcessHet=3.6798;FS=0.000;MLEAC=2;MLEAF=0.250;MQ=60.00;MQRankSum=0.00;POSITIVE_TRAIN_SITE;QD=15.75;ReadPosRankSum=-1.390e-01;SOR=0.676;VQSLOD=16.66;culprit=MQRankSum
           GT:AD:DP:GQ:PL  0/1:18,17:36:99:542,0,467       0/1:15,19:34:99:557,0,418       0/0:33,0:33:99:0,99,1250        0/0:36,0:36:99:0,99,1485

    Then, once I ran the code above, I got just 3 variants:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  05-045  06-084  18-017  18-018
    1       762273  .       G       A       4927.29 PASS    AC=8;AF=1.00;AN=8;BaseQRankSum=2.16;DP=151;ExcessHet=3.0103;FS=2.746;MLEAC=8;MLEAF=1.00;MQ=40.64;MQRankSum=-2.252e+00;QD=25.36;ReadPosRankSum=-4.270e-01;SOR=1.751;VQSLOD=-1.602e+00;culprit=MQ GT:AD:DP:GQ:PL  1/1:2,33:35:55:1089,55,0        1/1:0,29:29:87:1091,87,0        1/1:0,40:40:99:1439,120,0       1/1:0,36:36:99:1316,108,0
    1       865628  .       G       A       1055    PASS    AC=2;AF=0.250;AN=8;BaseQRankSum=-1.537e+00;DP=149;ExcessHet=3.6798;FS=0.868;MLEAC=2;MLEAF=0.250;MQ=60.00;MQRankSum=0.00;QD=13.19;ReadPosRankSum=0.135;SOR=0.531;VQSLOD=15.66;culprit=MQRankSum  GT:AD:DP:GQ:PL  0/0:31,0:31:90:0,90,930 0/1:21,19:40:99:498,0,635       0/0:36,0:36:99:0,99,1485        0/1:20,20:40:99:569,0,597
    1       878314  .       G       C       1087    PASS    AC=2;AF=0.250;AN=8;BaseQRankSum=2.26;DP=141;ExcessHet=3.6798;FS=0.000;MLEAC=2;MLEAF=0.250;MQ=60.00;MQRankSum=0.00;POSITIVE_TRAIN_SITE;QD=15.75;ReadPosRankSum=-1.390e-01;SOR=0.676;VQSLOD=16.66;culprit=MQRankSum
           GT:AD:DP:GQ:PL  0/1:18,17:36:99:542,0,467       0/1:15,19:34:99:557,0,418       0/0:33,0:33:99:0,99,1250        0/0:36,0:36:99:0,99,1485

    Given that the goal of the SelectVariants (see below) was to retain variants with at least 1 call and with at least 1 VAR allele according to:

    vc.getCalledChrCount() != 0 && (vc.getHetCount() >= 1 or vc.getHomVarCount() >= 1)'

    What can we say about that position?

    1       139233  .       C       A       595.75  VQSRTrancheSNP99.90to100.00     AC=1;AF=0.167;AN=6;BaseQRankSum=-1.619e+00;DP=93;ExcessHet=3.0103;FS=1.690;MLEAC=1;MLEAF=0.125;MQ=29.16;MQRankSum=1.46;QD=22.06;ReadPosRankSum=-2.230e-01;SOR=0.503;VQSLOD=-2.956e+00;culprit=MQ        GT:AD:DP:FT:GQ:PGT:PID:PL:PS    0/0:33,0:33:PASS:99:.:.:0,99,1176       0/0:20,0:20:PASS:60:.:.:0,60,685        0|1:8,19:27:PASS:99:0|1:139213_A_G:605,0,223:139213     ./.:11,0:11:Depth:27:.:.:0,27,405

    As far as I know , the variant meets the criteria, didn't?

    Hope this helps to clarify what I am triying to do.

    And thank you in advance for your help.

     

    0
    Comment actions Permalink
  • Avatar
    Jordi D

    Hi all, Genevieve-Brandt-she-her,

    just to summarize what I am looking for:

    Given the position

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  05-045  06-084  18-017  18-018
    1       139213  .       A       G       553.75  VQSRTrancheSNP99.90to100.00     AC=1;AF=0.250;AN=4;BaseQRankSum=-8.660e-01;DP=82;ExcessHet=3.0103;FS=1.603;MLEAC=1;MLEAF=0.125;MQ=28.60;MQRankSum=1.63;QD=22.15;ReadPosRankSum=1.11;SOR=0.412;VQSLOD=-3.058e+00;culprit=MQ      GT:AD:DP:FT:GQ:PGT:PID:PL:PS    0/0:28,0:28:PASS:78:.:.:0,78,1170       ./.:16,0:16:Depth:42:.:.:0,42,630       0|1:9,16:25:PASS:99:0|1:139213_A_G:563,0,240:139213     ./.:11,0:11:Depth:33:.:.:0,33,452

    The following code:

    SelectVariants    -select 'vc.getGenotype("18-017").isHet()'    --exclude-filtered true

    doesn't return the expected genotype 

    0|1:9,16:25:PASS:99:0|1:139213_A_G:563,0,240:139213

    Is this the expected behaviour of the SelectVariants function?

    Thanks

     

     

     

    0
    Comment actions Permalink
  • Avatar
    Jordi D

    Finally I got the mistake: since the SelectVariants was ran after a VQSR process, some variants were labeled with a FILTER field:

    VQSRTrancheSNP99.90to100.00

    Then the parameter "exclude-filtered" should be set to false in order to include all variants.

    Thank you and apologise for any inconvenience.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Jordi D,
    Yes, thank you for following up with these details and the solution you found! Glad you are seeing what you expect now.
    Best,
    Genevieve

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk