This notebook shows how to perform the analysis alluded to in this tutorial, namely a comparison of gCNV calls to 1000 Genomes Project truthset calls using the tutorial data. The above-linked tutorial illustrates with sample NA19017, as does this notebook. A straightforward exercise is to recapitulate the notebook analysis for results generated with different parameters, for differently sized data, or for another sample in the cohort.
The sections of this notebook are as follows.

1. Visualize callset overlap with IGV
2. Compare callset overlap with BedTools
3. Explore the gCNV calls

Download an HTML version and the .ipynb format notebook here.
Install and import Python packages
Prerequisites include Python 3, Jupyter, GATK4 and BedTools. Notebook results use v3.7.1, v4.4.0, v4.1.0.0 and v2.27.1, respectively, and assume a Unix environment. Installation requires root access and presumes igv is set up as detailed in this issue ticket.
! pip install numpy pandas matplotlib seaborn igv
import igv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
1. Visualize callset overlap with IGV
This exercise uses the igv.js-jupyter extension, which renders an interactive IGV browser inside the notebook.
1.1 Prepare truthset calls for comparison
This preparation omits conversion of the truthset calls to SEG format. Navigate to the directory with the tutorial data and list the vcf.gz format files.
%cd /Users/shlee/Documents/gatk4/gatk4_gcnv/tutorial
! ls *.vcf.gz
Subset the 24-sample variant callset to the sample of interest and to the regions the gCNV analysis covers. The input VCF was previously subset to the 24 samples, to chr20 and to variant CNV events.
! gatk4100 SelectVariants \
-V 1kgpSV-v2-20130502_cohort24-chr20varCNVs.vcf.gz \
--exclude-non-variants true \
--remove-unused-alternates true \
--keep-original-ac true \
--keep-original-dp true \
--sample-name NA19017 \
-L chr20sub_rmvChr.bed \
-O 1kgpSV_NA19017_chr20sub.vcf.gz
1.2 Visualize truthset sample calls alongside gCNV sample calls
Create an IGV browser in your notebook, through which you will visualize your calls.
b = igv.Browser(
    {
        "genome": "hg38",
        "locus": [
            "chr20:1-32000",
            "chr20:1,560,577-1,625,582",
            "chr20:43,641,139-43,647,469",
            "chr20:44,662,510-44,707,929",
            "chr20:55,798,693-55,875,381",
            "chr20:63,092,730-63,094,955"
        ]
    }
)
b.show()
Load data tracks.
b.load_track(
    {
        "name": "1kgp",
        "url": "1kgpSV_NA19017_chr20sub.vcf.gz",
        "format": "vcf",
        "type": "variant"
    }
)
b.load_track(
    {
        "name": "1kgp.seg",
        "displayMode": "EXPANDED",
        "url": "1kgpSV_NA19017_CN.seg",
        "format": "seg",
        "indexed": False
    }
)
b.load_track(
    {
        "name": "gCNV",
        "url": "genotyped-segments-cohort24-NA19017.vcf.gz",
        "format": "vcf",
        "type": "variant"
    }
)
b.load_track(
    {
        "name": "gCNV.seg",
        "displayMode": "EXPANDED",
        "url": "NA19017_chr20var.seg",
        "format": "seg",
        "indexed": False
    }
)
'OK'
b.get_svg()
'OK'
b.display_svg()
2. Compare callset overlap with BedTools
2.1 Prepare gCNV calls for comparison
This tutorial illustrates conversion of the gCNV workflow VCF results to SEG format. Here we subset to chr20 data where CN is non-diploid and also convert to BED format. First, navigate to the directory with the tutorial data and list the SEG format files.
%cd /Users/shlee/Documents/gatk4/gatk4_gcnv/tutorial
! ls *.seg
Subset to chr20 data where CN does not equal 2.
! cat genotyped-segments-cohort24-NA19017.vcf.gz.seg | awk '$2 == "CHROM" || $2 =="chr20"' | awk '$6 != 2' > NA19017_chr20var.seg
! head -n 3 NA19017_chr20var.seg
! tail -n 3 NA19017_chr20var.seg
! wc -l NA19017_chr20var.seg
NA19017  CHROM  POS       END       NA19017.NP  NA19017.CN
NA19017  chr20  133001    135000    2           1
NA19017  chr20  1606001   1609000   3           1
NA19017  chr20  64291001  64293000  2           1
NA19017  chr20  64295001  64296000  1           0
NA19017  chr20  64326001  64334000  8           3
52 NA19017_chr20var.seg
Convert SEG format to BED format data.
! cut -f2-4 NA19017_chr20var.seg > case.intervals
! awk '{OFS="\t"} NR>1 {print $1, $2-1, $3}' case.intervals > case.bed
! head -n 3 case.bed
! tail -n 3 case.bed
! wc -l case.bed
chr20  133000    135000
chr20  1606000   1609000
chr20  1914000   1915000
chr20  64291000  64293000
chr20  64295000  64296000
chr20  64326000  64334000
51 case.bed
2.2 Prepare truthset calls for comparison
Use the single-sample VCF prepared in section 1.1.
Examine the ALT alleles present, then examine the genotype designations.
! gzcat 1kgpSV_NA19017_chr20sub.vcf.gz | grep -v '#' | cut -f5 | sort | uniq
! gzcat 1kgpSV_NA19017_chr20sub.vcf.gz | grep -v '#' | cut -f10 | sort | uniq
0|1
1|0
1|1
Convert VCF to table.
! gatk4100 VariantsToTable \
-V 1kgpSV_NA19017_chr20sub.vcf.gz \
-O 1kgpSV_NA19017_chr20sub.table \
-F CHROM -F POS -F END
Convert the table to BED format, prepending 'chr' so the contig names match those of the gCNV calls.
! sed 's/^/chr/' 1kgpSV_NA19017_chr20sub.table | awk '{OFS="\t"} NR>1 {print $1, $2-1, $3}' > 1kgpSV_NA19017_chr20sub.bed
! head -n3 1kgpSV_NA19017_chr20sub.bed
! tail -n3 1kgpSV_NA19017_chr20sub.bed
! wc -l 1kgpSV_NA19017_chr20sub.bed
chr20  1408451   1410171
chr20  1580092   1613039
chr20  1609254   1613207
chr20  55859461  55865660
chr20  63093184  63094468
chr20  63093348  63094249
21 1kgpSV_NA19017_chr20sub.bed
2.3 Use bedtools coverage to measure overlap between the two callsets
Here we examine overlap only and ignore whether copy-number states match.
- The tutorial's smallest data, consisting of twelve regions, covers 1.4 Mb on chr20. The regions correspond to truthset events larger than 1,000 bases.
- The tutorial's chr20sub data covers a larger portion of chr20, roughly 15 Mb. A quick way to check the span of an interval list is sketched after this list.
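For example, this minimal sketch sums interval lengths with awk to report the total span of a BED-format interval list. It assumes the chr20sub_rmvChr.bed file passed to SelectVariants in section 1.1 is present in the working directory.

! awk '{sum += $3 - $2} END {print sum " bases"}' chr20sub_rmvChr.bed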
This article gives interpretation guidelines.
! bedtools coverage -a 1kgpSV_NA19017_chr20sub.bed -b case.bed > 1kgpSV_NA19017_chr20sub_vs_case.coverage
aTruthBCase = pd.read_csv('1kgpSV_NA19017_chr20sub_vs_case.coverage', sep='\t', header=None)
aTruthBCase.columns = ['chrom', 'start', 'end', 'nFeatures-B', 'bases-B', 'bases-A', 'fractionBases-B/A']
aTruthBCase
aTruthBCase.sort_values('bases-A')
index | chrom | start | end | nFeatures-B | bases-B | bases-A | fractionBases-B/A |
---|---|---|---|---|---|---|---|
8 | chr20 | 23122679 | 23122981 | 0 | 0 | 302 | 0.000000 |
4 | chr20 | 2379276 | 2379872 | 0 | 0 | 596 | 0.000000 |
5 | chr20 | 3104161 | 3104765 | 0 | 0 | 604 | 0.000000 |
10 | chr20 | 32739537 | 32740213 | 0 | 0 | 676 | 0.000000 |
9 | chr20 | 24923059 | 24923778 | 0 | 0 | 719 | 0.000000 |
6 | chr20 | 15730960 | 15731792 | 0 | 0 | 832 | 0.000000 |
20 | chr20 | 63093348 | 63094249 | 0 | 0 | 901 | 0.000000 |
19 | chr20 | 63093184 | 63094468 | 0 | 0 | 1284 | 0.000000 |
15 | chr20 | 44667865 | 44669273 | 1 | 1000 | 1408 | 0.710227 |
0 | chr20 | 1408451 | 1410171 | 0 | 0 | 1720 | 0.000000 |
17 | chr20 | 52142653 | 52144379 | 1 | 1000 | 1726 | 0.579374 |
3 | chr20 | 2014836 | 2016906 | 1 | 1000 | 2070 | 0.483092 |
16 | chr20 | 44698668 | 44700883 | 3 | 2215 | 2215 | 1.000000 |
12 | chr20 | 33698902 | 33701182 | 1 | 2000 | 2280 | 0.877193 |
7 | chr20 | 21305449 | 21308134 | 1 | 2000 | 2685 | 0.744879 |
14 | chr20 | 43643221 | 43645948 | 2 | 1779 | 2727 | 0.652365 |
13 | chr20 | 34653921 | 34656669 | 1 | 2000 | 2748 | 0.727802 |
11 | chr20 | 32937722 | 32941600 | 1 | 3000 | 3878 | 0.773595 |
2 | chr20 | 1609254 | 1613207 | 0 | 0 | 3953 | 0.000000 |
18 | chr20 | 55859461 | 55865660 | 2 | 2000 | 6199 | 0.322633 |
1 | chr20 | 1580092 | 1613039 | 1 | 3000 | 32947 | 0.091055 |
Perform the reciprocal comparison.
! bedtools coverage -b 1kgpSV_NA19017_chr20sub.bed -a case.bed > 1kgpSV_NA19017_chr20sub_vs_case_reverse.coverage
bTruthACase = pd.read_csv('1kgpSV_NA19017_chr20sub_vs_case_reverse.coverage', sep='\t', header=None)
bTruthACase.columns = ['chrom', 'start', 'end', 'nFeatures-B', 'bases-B', 'bases-A', 'fractionBases-B/A']
bTruthACase
index | chrom | start | end | nFeatures-B | bases-B | bases-A | fractionBases-B/A |
---|---|---|---|---|---|---|---|
0 | chr20 | 133000 | 135000 | 0 | 0 | 2000 | 0.000 |
1 | chr20 | 1606000 | 1609000 | 1 | 3000 | 3000 | 1.000 |
2 | chr20 | 1914000 | 1915000 | 0 | 0 | 1000 | 0.000 |
3 | chr20 | 2015000 | 2016000 | 1 | 1000 | 1000 | 1.000 |
4 | chr20 | 2824000 | 2825000 | 0 | 0 | 1000 | 0.000 |
5 | chr20 | 11633000 | 11638000 | 0 | 0 | 5000 | 0.000 |
6 | chr20 | 12802000 | 12807000 | 0 | 0 | 5000 | 0.000 |
7 | chr20 | 18602000 | 18603000 | 0 | 0 | 1000 | 0.000 |
8 | chr20 | 18951000 | 18954000 | 0 | 0 | 3000 | 0.000 |
9 | chr20 | 21306000 | 21308000 | 1 | 2000 | 2000 | 1.000 |
10 | chr20 | 21911000 | 21912000 | 0 | 0 | 1000 | 0.000 |
11 | chr20 | 23426000 | 23428000 | 0 | 0 | 2000 | 0.000 |
12 | chr20 | 23428000 | 23429000 | 0 | 0 | 1000 | 0.000 |
13 | chr20 | 23430000 | 23431000 | 0 | 0 | 1000 | 0.000 |
14 | chr20 | 24408000 | 24409000 | 0 | 0 | 1000 | 0.000 |
15 | chr20 | 32723000 | 32724000 | 0 | 0 | 1000 | 0.000 |
16 | chr20 | 32938000 | 32941000 | 1 | 3000 | 3000 | 1.000 |
17 | chr20 | 33448000 | 33456000 | 0 | 0 | 8000 | 0.000 |
18 | chr20 | 33699000 | 33701000 | 1 | 2000 | 2000 | 1.000 |
19 | chr20 | 34230000 | 34231000 | 0 | 0 | 1000 | 0.000 |
20 | chr20 | 34654000 | 34656000 | 1 | 2000 | 2000 | 1.000 |
21 | chr20 | 42615000 | 42619000 | 0 | 0 | 4000 | 0.000 |
22 | chr20 | 43643000 | 43644000 | 1 | 779 | 1000 | 0.779 |
23 | chr20 | 43644000 | 43645000 | 1 | 1000 | 1000 | 1.000 |
24 | chr20 | 44668000 | 44669000 | 1 | 1000 | 1000 | 1.000 |
25 | chr20 | 44698000 | 44699000 | 1 | 332 | 1000 | 0.332 |
26 | chr20 | 44699000 | 44700000 | 1 | 1000 | 1000 | 1.000 |
27 | chr20 | 44700000 | 44701000 | 1 | 883 | 1000 | 0.883 |
28 | chr20 | 47828000 | 47829000 | 0 | 0 | 1000 | 0.000 |
29 | chr20 | 47829000 | 47830000 | 0 | 0 | 1000 | 0.000 |
30 | chr20 | 47830000 | 47834000 | 0 | 0 | 4000 | 0.000 |
31 | chr20 | 47834000 | 47835000 | 0 | 0 | 1000 | 0.000 |
32 | chr20 | 47900000 | 47901000 | 0 | 0 | 1000 | 0.000 |
33 | chr20 | 47901000 | 47902000 | 0 | 0 | 1000 | 0.000 |
34 | chr20 | 48504000 | 48509000 | 0 | 0 | 5000 | 0.000 |
35 | chr20 | 48511000 | 48516000 | 0 | 0 | 5000 | 0.000 |
36 | chr20 | 52143000 | 52144000 | 1 | 1000 | 1000 | 1.000 |
37 | chr20 | 55808000 | 55809000 | 0 | 0 | 1000 | 0.000 |
38 | chr20 | 55863000 | 55864000 | 1 | 1000 | 1000 | 1.000 |
39 | chr20 | 55864000 | 55865000 | 1 | 1000 | 1000 | 1.000 |
40 | chr20 | 61943000 | 61944000 | 0 | 0 | 1000 | 0.000 |
41 | chr20 | 61944000 | 61945000 | 0 | 0 | 1000 | 0.000 |
42 | chr20 | 61946000 | 61947000 | 0 | 0 | 1000 | 0.000 |
43 | chr20 | 63965000 | 63966000 | 0 | 0 | 1000 | 0.000 |
44 | chr20 | 64094000 | 64095000 | 0 | 0 | 1000 | 0.000 |
45 | chr20 | 64132000 | 64134000 | 0 | 0 | 2000 | 0.000 |
46 | chr20 | 64174000 | 64175000 | 0 | 0 | 1000 | 0.000 |
47 | chr20 | 64175000 | 64176000 | 0 | 0 | 1000 | 0.000 |
48 | chr20 | 64291000 | 64293000 | 0 | 0 | 2000 | 0.000 |
49 | chr20 | 64295000 | 64296000 | 0 | 0 | 1000 | 0.000 |
50 | chr20 | 64326000 | 64334000 | 0 | 0 | 8000 | 0.000 |
The first bedtools coverage result (aTruthBCase) shows that gCNV misses some truthset events, and these tend to be the smaller ones. The resolution of events gCNV can detect depends in part on the coverage bin size; the tutorial data use 1000-base bins.
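To put a number on that impression, here is a minimal sketch that splits the aTruthBCase dataframe from section 2.3 into truthset events with and without an overlapping gCNV call and compares their sizes; the variable names are arbitrary.

# Compare truthset event sizes for events gCNV misses vs. detects.
# Assumes the aTruthBCase dataframe from section 2.3 is in memory.
missed = aTruthBCase[aTruthBCase['nFeatures-B'] == 0]
detected = aTruthBCase[aTruthBCase['nFeatures-B'] > 0]
print('missed events:', len(missed), '; median size:', missed['bases-A'].median())
print('detected events:', len(detected), '; median size:', detected['bases-A'].median())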
2.4 Create a pandas summary metrics table
The metrics are from the perspective of the truthset. Previously, we imported bedtools coverage results into a pandas dataframe called aTruthBCase.
Create metrics to summarize the bedtools coverage results.
# Total truthset bases (A) and overlapping gCNV bases (B)
summaryMetrics = [aTruthBCase['bases-A'].sum(), aTruthBCase['bases-B'].sum()]
summaryMetricsDb = pd.DataFrame(summaryMetrics).transpose()
summaryMetricsDb.columns = ['total-A', 'total-B']
summaryMetricsDb['total-B/A'] = summaryMetricsDb['total-B'] / summaryMetricsDb['total-A']
# Counts of overlapping gCNV features per truthset event
summaryMetricsDb['n-events-sum'] = aTruthBCase['nFeatures-B'].sum()
summaryMetricsDb['n-events-min'] = aTruthBCase['nFeatures-B'].min()
summaryMetricsDb['n-events-max'] = aTruthBCase['nFeatures-B'].max()
summaryMetricsDb['n-events-at-0'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 0].shape[0]
summaryMetricsDb['n-events-at-1'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 1].shape[0]
summaryMetricsDb['n-events-at-2'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 2].shape[0]
summaryMetricsDb['n-events-at-3'] = aTruthBCase[aTruthBCase['nFeatures-B'] == 3].shape[0]
# Fraction of truthset events with no overlapping gCNV call
summaryMetricsDb['fraction-events-uncalled'] = summaryMetricsDb['n-events-at-0'] / len(aTruthBCase.index)
summaryMetricsDb['fraction-events-called'] = 1 - summaryMetricsDb['fraction-events-uncalled']
# B/A base ratio restricted to truthset events with at least one overlapping call
intermed1 = aTruthBCase.loc[aTruthBCase['nFeatures-B'] > 0, ['bases-B', 'bases-A']]
intermed2 = [intermed1['bases-A'].sum(), intermed1['bases-B'].sum()]
intermed3 = intermed2[1] / intermed2[0]
summaryMetricsDb['non-zero-events-B/A'] = intermed3
# Events with at least 50% of truthset bases covered
summaryMetricsDb['n-rows'] = len(aTruthBCase.index)
summaryMetricsDb['n-min-50%-overlap'] = len(aTruthBCase.loc[aTruthBCase['fractionBases-B/A'] >= 0.5])
summaryMetricsDb['n-rows-non-0'] = summaryMetricsDb['n-rows'] - summaryMetricsDb['n-events-at-0']
summaryMetricsDb['fraction-50%-overlap-of-non-0'] = summaryMetricsDb['n-min-50%-overlap'] / summaryMetricsDb['n-rows-non-0']
summaryMetricsDb
Here, we arbitrarily define a minimum overlap of 50% of the truthset call as something of interest. The stringency of these types of concordance calculations can be refined further, e.g. by requiring 50% reciprocal overlap.
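For example, a minimal sketch of a stricter 50% reciprocal-overlap criterion uses bedtools intersect with the -f and -r options; the output file name is arbitrary.

# Keep only pairs where each interval covers at least 50% of the other.
! bedtools intersect -a 1kgpSV_NA19017_chr20sub.bed -b case.bed -f 0.5 -r -wa -wb > reciprocal50.bed
! wc -l reciprocal50.bed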
2.5 Subset the events with more than 50% truthset overlap
concordant = aTruthBCase.loc[aTruthBCase['fractionBases-B/A'] > 0.5,
                             ['chrom', 'start', 'end', 'bases-B', 'bases-A', 'fractionBases-B/A']]
concordant.sort_values('fractionBases-B/A')
index | chrom | start | end | bases-B | bases-A | fractionBases-B/A |
---|---|---|---|---|---|---|
17 | chr20 | 52142653 | 52144379 | 1000 | 1726 | 0.579374 |
14 | chr20 | 43643221 | 43645948 | 1779 | 2727 | 0.652365 |
15 | chr20 | 44667865 | 44669273 | 1000 | 1408 | 0.710227 |
13 | chr20 | 34653921 | 34656669 | 2000 | 2748 | 0.727802 |
7 | chr20 | 21305449 | 21308134 | 2000 | 2685 | 0.744879 |
11 | chr20 | 32937722 | 32941600 | 3000 | 3878 | 0.773595 |
12 | chr20 | 33698902 | 33701182 | 2000 | 2280 | 0.877193 |
16 | chr20 | 44698668 | 44700883 | 2215 | 2215 | 1.000000 |
Convert the concordant regions to a list of intervals.
concordant.to_csv('concordant.bed', columns=["chrom", "start", "end"], sep='\t', index=False, header=False)
! head concordant.bed
Subset gCNV calls to those concordant with the truthset.
! gatk4100 SelectVariants \
-V genotyped-segments-cohort24-NA19017.vcf.gz \
-L concordant.bed \
-O segments-NA19017-concordant.vcf.gz
Check the number of records before and after subsetting.
! gzcat genotyped-segments-cohort24-NA19017.vcf.gz | grep -v '#' | wc -l
! gzcat segments-NA19017-concordant.vcf.gz | grep -v '#' | wc -l
91
23
3. Explore the gCNV calls
3.1 Plot histograms of event sizes and quality annotations
Get descriptions of the gCNV workflow's FORMAT annotations. Those that start with 'Q' denote quality annotations.
! gzcat genotyped-segments-cohort24-NA19017.vcf.gz | grep '##FORMAT=<ID='
This will print out a list of format tags from the header of your file. Here's a summary of some you will see.
Annotation | Description |
---|---|
CN | Segment most-likely copy-number call |
GT | Segment genotype |
NP | Number of points (i.e. targets or bins) in the segment |
QA | Complementary Phred-scaled probability that all points (i.e. targets or bins) in the segment agree with the segment copy-number call |
QS | Complementary Phred-scaled probability that at least one point (i.e. target or bin) in the segment agrees with the segment copy-number call |
QSE | Complementary Phred-scaled probability that the segment end position is a genuine copy-number changepoint |
QSS | Complementary Phred-scaled probability that the segment start position is a genuine copy-number changepoint |
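For reference, a Phred-scaled value Q corresponds to a probability of 10^(-Q/10), so larger scores mean smaller probabilities. A minimal sketch of the conversion; the helper name phred_to_prob is our own.

# Convert a Phred-scaled quality to the probability it encodes.
def phred_to_prob(q):
    return 10 ** (-q / 10)

print(phred_to_prob(30))   # 0.001
print(phred_to_prob(100))  # 1e-10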
Prepare the callsets for use with pandas.
! gatk4100 VariantsToTable \
-V genotyped-segments-cohort24-NA19017.vcf.gz \
-O segments-NA19017.table \
-F CHROM -F POS -F END -GF CN -GF NP -GF QA -GF QS -GF QSE -GF QSS

! gatk4100 VariantsToTable \
-V segments-NA19017-concordant.vcf.gz \
-O segments-NA19017-concordant.table \
-F CHROM -F POS -F END -GF CN -GF NP -GF QA -GF QS -GF QSE -GF QSS
Create pandas dataframe from table and subset to non-diploid segments for ALL calls.
gcnv = pd.read_csv('segments-NA19017.table', sep='\t')
gcnvVar = pd.DataFrame(gcnv.loc[gcnv['NA19017.CN'] != 2])
gcnvVar
Create pandas dataframe from table and subset to non-diploid segments for CONCORDANT calls.
gcnvConcordant = pd.read_csv('segments-NA19017-concordant.table', sep='\t')
gcnvConcordantVar = pd.DataFrame(gcnvConcordant.loc[gcnvConcordant['NA19017.CN'] != 2])
gcnvConcordantVar
index | CHROM | POS | END | NA19017.CN | NA19017.NP | NA19017.QA | NA19017.QS | NA19017.QSE | NA19017.QSS |
---|---|---|---|---|---|---|---|---|---|
1 | chr20 | 21306001 | 21308000 | 0 | 2 | 286 | 497 | 322 | 286 |
4 | chr20 | 32938001 | 32941000 | 1 | 3 | 114 | 278 | 30 | 13 |
7 | chr20 | 33699001 | 33701000 | 1 | 2 | 45 | 107 | 76 | 45 |
10 | chr20 | 34654001 | 34656000 | 1 | 2 | 81 | 109 | 57 | 92 |
12 | chr20 | 43643001 | 43644000 | 1 | 1 | 30 | 30 | 30 | 30 |
13 | chr20 | 43644001 | 43645000 | 0 | 1 | 209 | 209 | 209 | 209 |
15 | chr20 | 44668001 | 44669000 | 1 | 1 | 29 | 29 | 29 | 29 |
17 | chr20 | 44698001 | 44699000 | 1 | 1 | 40 | 40 | 40 | 40 |
18 | chr20 | 44699001 | 44700000 | 0 | 1 | 309 | 309 | 309 | 309 |
19 | chr20 | 44700001 | 44701000 | 1 | 1 | 147 | 147 | 147 | 147 |
21 | chr20 | 52143001 | 52144000 | 1 | 1 | 77 | 77 | 77 | 77 |
Note the concordant calls are all CN0 or CN1 events. The truthset provides only deletions!
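A quick way to confirm this from the dataframe, assuming gcnvConcordantVar from the previous cell is in memory, is to tally the copy-number states.

# Tally copy-number states among the concordant non-diploid calls.
gcnvConcordantVar['NA19017.CN'].value_counts()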
Plot the size distribution as a histogram. Here, utilize the fact that NP represents the number of bins. An extension of this analysis would be to draw separate histograms for amplifications versus deletions. Then, overlay the event sizes of the concordant calls.
binSize = 1000
events = gcnvVar['NA19017.NP'] * binSize
sns.distplot(events, kde=False, bins=np.linspace(1000, 10000, 10))
eventsConcordant = gcnvConcordantVar['NA19017.NP'] * binSize
sns.distplot(eventsConcordant, kde=False, bins=np.linspace(1000, 10000, 10))
plt.suptitle('Histogram of gCNV event sizes', fontsize=24)
Text(0.5, 0.98, 'Histogram of gCNV event sizes')
Similarly, plot the quality scores QA and QS, then overlay the QA and QS histograms of the concordant calls.
fig, axs = plt.subplots(1, 2, figsize=(18, 6))
qa = gcnvVar['NA19017.QA']
qs = gcnvVar['NA19017.QS']
sns.distplot(qa, ax=axs[0], bins=np.linspace(0, 600, 20), rug=True)
sns.distplot(qs, ax=axs[1], bins=np.linspace(0, 600, 20), rug=True)
axs[0].set_xlabel('QA', fontsize=16)
axs[1].set_xlabel('QS', fontsize=16)
plt.suptitle('Histograms of gCNV quality scores (all vs. concordant)', fontsize=24)
axs[0].set_xlim(0, 600)
axs[1].set_xlim(0, 600)
qaConcordant = gcnvConcordantVar['NA19017.QA']
qsConcordant = gcnvConcordantVar['NA19017.QS']
sns.distplot(qaConcordant, ax=axs[0], bins=np.linspace(0, 600, 20), rug=True)
sns.distplot(qsConcordant, ax=axs[1], bins=np.linspace(0, 600, 20), rug=True)
<matplotlib.axes._subplots.AxesSubplot at 0x12c059048>
There is an ever-so-slight rightward shift in qualities for the concordant calls.
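One way to quantify the shift is to compare medians; a minimal sketch assuming the qa, qs, qaConcordant and qsConcordant series defined in the plotting cell above.

# Compare median quality scores for all vs. concordant non-diploid calls.
print('QA medians (all vs. concordant):', qa.median(), qaConcordant.median())
print('QS medians (all vs. concordant):', qs.median(), qsConcordant.median())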
Could there be a correlation between event size and quality score?
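One way to start answering that question is to compute the correlation between segment size (NP times the bin size) and QS, and to eyeball a scatter plot. A minimal sketch, assuming gcnvVar and binSize from the earlier cells.

# Correlation and scatter of event size vs. QS for non-diploid calls.
sizes = gcnvVar['NA19017.NP'] * binSize
print('Pearson r (size vs. QS):', sizes.corr(gcnvVar['NA19017.QS']))
plt.figure(figsize=(8, 6))
plt.scatter(sizes, gcnvVar['NA19017.QS'])
plt.xlabel('event size (bp)')
plt.ylabel('QS')
plt.title('gCNV event size vs. QS')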
3.2 Summarize metrics on gCNV calls by event type
Collect some metrics.
binSize = 1000
deletions = gcnvVar.loc[gcnvVar['NA19017.CN'] < 2]
insertions = gcnvVar.loc[gcnvVar['NA19017.CN'] > 2]
# Count events per copy-number state
gcnvVarSummary = pd.DataFrame([np.count_nonzero(gcnvVar['NA19017.CN'] == 0)], columns=['cn0'])
gcnvVarSummary['cn1'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 1)
gcnvVarSummary['cn3'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 3)
gcnvVarSummary['cn4'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 4)
gcnvVarSummary['cn5'] = np.count_nonzero(gcnvVar['NA19017.CN'] == 5)
gcnvVarSummary['n-total'] = len(gcnvVar.index)
gcnvVarSummary['n-dels'] = gcnvVarSummary['cn0'] + gcnvVarSummary['cn1']
gcnvVarSummary['n-amps'] = gcnvVarSummary['cn3'] + gcnvVarSummary['cn4'] + gcnvVarSummary['cn5']
# Total bases and average event size per event type
gcnvVarSummary['bases-dels'] = deletions['NA19017.NP'].sum() * binSize
gcnvVarSummary['bases-amps'] = insertions['NA19017.NP'].sum() * binSize
gcnvVarSummary['avg-del-size'] = round(gcnvVarSummary['bases-dels'] / gcnvVarSummary['n-dels']).astype(int)
gcnvVarSummary['avg-amp-size'] = round(gcnvVarSummary['bases-amps'] / gcnvVarSummary['n-amps']).astype(int)
# Events spanning exactly one coverage bin (1 kb)
gcnvVarSummary['at1kb'] = len(gcnvVar[gcnvVar['NA19017.NP'] == 1])
gcnvVarSummary['fractionAt1kb'] = gcnvVarSummary['at1kb'] / len(gcnvVar[gcnvVar['NA19017.NP'] >= 1])
gcnvVarSummary
Over half of the events span exactly 1 kbp, the coverage bin size. Most of the events are deletions. The next notebook analysis examines gCNV callset annotations using larger data.