Calling variants on chromosome Y
Hello
I'm trying to use GATK 4.1 to analyze WGS data I have.
At this point I'm only testing and optimizing my pipeline so I am only working with 4 samples (2 males and 2 females).
I've followed the information available a the Best Practices Workflow entitled "Germline short variant discovery (SNPs + Indels)". (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932)
I have produced the Analysis-Ready BAM files and got down to the VCF files made by GenotypeGVCFs.
As a quick sanity check I went to see the variant calls on chromosome Y and I'm a little confused by what I see.
First off the for the males in my test cohort, multiple sites are listed as heterozygous. Others are reported as missed calls (ie: `./.`)
If I go over to the female samples I see that they have variant calls for the Y chromosome as well.
In both cases the VCF files exclude the pseudo-autosomal regions on chrY.
Looking at the BAM files I can clearly see that the males have read counts for known Y chromosome genes and the females do not.
I'm wondering what commands or steps I need to run on the data to get the correct gentoyping reported in the VCF file.
Right now my pipeline produces one genomicDB for each chromosome.
I'm guessing that when I generate the VCF files with GenotypeGVCFs I need to adjust the command for chromosome Y (?)
Something like this:
```
gatk GenotypeGVCFs \
-R hg38.ref/Homo_sapiens_assembly38.fasta \
-D hg38.ref/Homo_sapiens_assembly38.dbsnp138.vcf \
-V gendb://genomicDB_chrY \
-O chrY.gatk_hg38.vcf.gz \
-ploidy 1
```
Is this correct?
I'm running:
GATK v4.1.4.1
HTSJDK v2.21.0
Picard v2.21.2
On RHEL 7.7
With Oracle Java 1.8.0
Thanks in advance for any and all help you can provide
Damian
-
Hi Damian, I don't believe that HaplotypeCaller plays very nice with chrY, so it may be most expedient to just exclude them from the female analysis (if you're certain they don't actually have Y chromosomes...).
Would you mind checking the female samples to see if there are any reads that map to chrY, how many variant calls you get, and also which regions they are in? You need to make sure your reads are not in autosomal regions. If your female samples aren't getting hits, then we don't really have an explanation aside from possible contamination. -
Hi Derek
Thanks for getting back to me.
I've checked the BAM files for each patient looking at the average read depth of the gene SRY. It's on the Y chromosome and should not have read counts for the females.
I used this command:
samtools depth -r chrY:2786840-2787751 --reference hg38.ref/Homo_sapiens_assembly38.fasta sample_id.recal.bam -q 0 -Q 0 > sample_id.SRY.readdepth.txtThis was run on the "analysis-ready" BAM files produced using best practices.
The male samples had an average read depth of about 20X in that region.
The female samples had zero reads.
I checked the VCF files and can see that coordinates encompassed by the pseudo-autosomal regions are not represented in them.
-
So, if I'm understanding correctly, the depth for one of the y chromosome genes only has coverage in the male samples, which is what we should expect. That's good, but it doesn't tell us much about the state of your phantom variants, though.Could you check the depth in your BAM files specifically at the sites where you saw those stray variants?
In GATK or BWA, we do not explicitly model sex chromosome counts... it is possible that there are sections of chrY that reads will map to, since they program is agnostic towards the sex/chromosome of the individuals. -
So I looked at the BAM files for the sites with the stray variants. They are definitely getting a lot of reads supporting their call (some in excess of 100).
Extracting about 20 of these aligned reads at random I BLAST'ed them against the genome. Their single best alignment is on the Y-chromosome. So BWA is doing it's job.
I looked up the regions were many of the reads are aligning on the Y-chromosome and found that most of these reads are aligning to pseudo genes on the Y-chromosome. Many of these pseudo genes are also near the centromere.
I was thinking: should I exclude reads aligned to the pseudo genes of chromosome Y from the BAM files and then run the genotyping part of the pipeline to see if this makes a difference?
-
"I was thinking: should I exclude reads aligned to the pseudo genes of chromosome Y from the BAM files and then run the genotyping part of the pipeline to see if this makes a difference?"
I don't think it would hurt to try. This seems like one of those cases where, if you're positive that your female samples don't have a Y chromosome (and thus, the reads you're getting are confirmed as false positives), it is acceptable to exclude them from your analysis.
I would be more concerned that the similar hits you are getting from your male samples may be false positives as well.
-
This is what I see for the chrX genotypes of a randomly selected male
GT:count
1/1:1702
0/1:371
1/2:14Consider that GQ>=15 means 95% confidence
---
chrY is worse because the majority are labeled het
GT:count
0/1:141
1/1:29---
This makes me question everything. It's disappointing that there isn't even a warning issued about ploidy options. Perhaps some of these het are in the PAR region.
GATK v4.4.0.0 w default HaplotypeCaller options
-
Hi Layne Sadler
GATK HaplotypeCaller engine currently does not have direct options for specifying allosomal chromosomes. Currently not very many genotyping tools have these options available directly If you wish to get hemizygous calls from non-PAR regions our recommendation will be to use an interval-list to specify regions for calling with ploidy 1. Of course this call should be made separately from the rest of the chromosomes and should be merged later. Also keep in mind that this configuration may require additional handling on the joint genotyping side where you may need to handle male and female subjects separately and combining the whole data may need additional tinkering.
Regards.
Post is closed for comments.
7 comments