Genome STRiP CNVDiscovery pipeline error
Hi,
I'm using the latest version of svtoolkit (r1949) to call CNVs from WGS data. It threw the following error:
##### ERROR MESSAGE: No read depth coverage file found in metadata directory
But the preprocessing pipeline ran successfully, and there is a depth directory in the metadata... I also noticed that 1000G_phase1_20101123_mdv1/metadata contains a file named depth.dat, but no such file was produced in the metadata directory generated by the preprocessing pipeline...
So I wonder whether this is a bug in the latest version?
Thank you very much.
Best,
Beilei
-
Hi Beilei,
I am adding Bob Handsaker to this thread. Bob will be able to help you out with your query.
-
You need to tell me a little more about your pipeline. There are two ways Genome STRiP stores metadata now.
One, the traditional way, creates a directory with multiple files/sub-directories. The second way, developed for more efficient cloud processing, creates a .zip file that contains the metadata files (this is not exactly a zip of the metadata directory, as it can be processed without unzipping, but it is similar in layout).
In either case, if preprocessing ran without error, there should be a depth.dat file, either in the directory or within the metadata.zip archive. If there is not, then I suspect something went wrong in preprocessing.
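For what it's worth, a quick shell check that covers both layouts might look like this (a sketch; check_depth is a hypothetical helper, and the metadata path is the one from this thread):

```shell
#!/bin/bash
# Sketch: look for depth.dat in either metadata layout
# (plain directory, or a sibling metadata .zip archive).
check_depth() {
    local md="$1"
    if [ -f "$md/depth.dat" ]; then
        echo "found depth.dat in metadata directory"
    elif [ -f "$md.zip" ] && unzip -l "$md.zip" 2>/dev/null | grep -q 'depth\.dat'; then
        echo "found depth.dat inside metadata.zip"
    else
        echo "depth.dat missing: check the preprocessing logs"
    fi
}

check_depth /30days/beileibian/CNV/test/metadata5
```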
-
Hi Bob,
Thanks for your reply. Could you help me inspect my pipeline?
SV_DIR="/30days/beileibian/CNV/svtoolkit"
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/SVPreprocess.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-R /30days/beileibian/CNV/1000G_phase1_20101123_mdv1/reference/human_g1k_v37.fasta \
-genomeMaskFile /30days/beileibian/CNV/1000G_phase1_20101123_mdv1/svmasks/human_g1k_v37.mask.36.fasta \
-copyNumberMaskFile /30days/beileibian/CNV/1000G_phase1_20101123_mdv1/cn2/cn2_mask_g1k_v37.fasta \
-ploidyMapFile /30days/beileibian/CNV/1000G_phase1_20101123_mdv1/reference/human_g1k_v37.ploidy.map \
-I /30days/beileibian/CNV/test/input_bam_files3.list \
-md /30days/beileibian/CNV/test/metadata5 \
-computeGCProfiles true \
-bamFilesAreDisjoint true \
-jobLogDir /30days/beileibian/CNV/test/logs5 \
-run
Just take a look at the structure of the metadata directory. Thanks.
-
Two comments:
First, you are using a reference metadata bundle from 1000 Genomes phase 1, which is very old and, more importantly, is designed for a data set with reads as short as 36bp (which was true of some of the original 1000 Genomes sequencing data). If you are using more modern sequencing, e.g. with 150bp paired reads, you should probably be using a different reference metadata bundle. This also depends on what reference your data is aligned to, of course.
Second, from various missing files, it is clear that preprocessing did not complete successfully. You should check the log files. But first I would suggest evaluating whether you are using the right/best reference and reference metadata bundle.
-
Hi Bob,
Thanks for pointing out that I used the wrong reference. This time I used hs37d5, but it still didn't generate depth.dat. I checked the log files and found two errors:
ERROR 19:45:27,694 FunctionEdge - Error: samtools index /30days/beileibian/CNV/test/metadata7/headers.bam
ERROR 19:45:27,765 FunctionEdge - Contents of /30days/beileibian/CNV/MND/test/SVPreprocess-4.out:ERROR 22:59:53,523 FunctionEdge - Error: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/gpfs1/homes/beileibian/.queue/tmp' '-cp' '/30days/beileibian/CNV/svtoolkit/lib/SVToolkit.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeMetadataWalker' '-R' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.fasta' '-I' '/30days/beileibian/CNV/test/alignments/1000002.dedup.realigned.recalibrated.bam' '-disableGATKTraversal' 'true' '-md' '/30days/beileibian/CNV/test/metadata7' '-computeGCProfiles' 'true' '-computeReadCounts' 'true' '-depthFile' '/30days/beileibian/CNV/test/metadata7/depth/1000002.dedup.realigned.recalibrated.depth.txt' '-spanFile' '/30days/beileibian/CNV/test/metadata7/spans/1000002.dedup.realigned.recalibrated.spans.txt' '-gcProfileFile' '/30days/beileibian/CNV/test/metadata7/gcprofile/1000002.dedup.realigned.recalibrated.bam.gcprof.zip' '-readCountFile' '/30days/beileibian/CNV/test/metadata7/rccache/1000002.dedup.realigned.recalibrated.rc.bin' '-configFile' '/30days/beileibian/CNV/svtoolkit/conf/genstrip_parameters.txt' '-ploidyMapFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.ploidymap.txt' '-genomeMaskFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.svmask.fasta' '-copyNumberMaskFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.gcmask.fasta' '-readDepthMaskFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.rdmask.bed' '-minMapQ' '10' '-maxInsertSizeStandardDeviations' '3' '-insertSizeRadius' '10.0' '-referenceProfile' '/30days/beileibian/CNV/test/metadata7/gcprofile/reference.gcprof.zip'
ERROR 22:59:53,575 FunctionEdge - Contents of /30days/beileibian/CNV/test/logs7/SVPreprocess-13.out:
But no errors were reported in either SVPreprocess-4.out or SVPreprocess-13.out. I don't know what's going on... I'm running the pipeline single-threaded and have not installed LSF on the server yet. It seems that depth.dat and spans.dat were not generated successfully.
-
This kind of symptom is usually because the job is being killed by the scheduler or by the OS for some reason.
I'm not sure what you mean by "I'm using single thread to run the pipeline". What jobRunner are you using? ParallelShell?
-
Yes, I used Shell... I don't have any cluster management software at the moment. Any solutions for that?
Here I only used 2 WGS samples, to test whether the pipeline runs successfully and to see what the results look like. Thanks.
-
Perhaps we should back up if you are using -jobRunner Shell.
Can you manually run 'samtools index /30days/beileibian/CNV/test/metadata7/headers.bam' ?
-
And when you run through Queue, is there a headers.bam.bai file produced?
-
Yes, I can, and there is a headers.bam.bai produced in the metadata directory.
But I will have a large-scale WGS dataset to run later. In this situation, should I install LSF on the cluster? Or can I just separate the data into batches and run with Shell? Could you give me some advice?
-
It depends a bit on what you mean by large scale. But in general you need either a compute cluster or to run on the cloud. If possible, my advice would be to run on Google Cloud, using our pipelines on Terra. Our Terra pipelines also support running on data that is in an external bucket outside of Terra.
-
I mean about 5000~10000 individuals. I understand... Thanks.
Let's go back to my original question... Is it possible to run the preprocessing pipeline with Shell? (If that is complicated, no worries; I will try LSF or the cloud in the near future...)
-
You should be able to run preprocessing (for example on a single sample) using -jobRunner Shell. I have occasionally seen intermittent problems with -jobRunner Shell. I have had better luck using -jobRunner ParallelShell -maxConcurrentRun N. If you have multiple cores available, and sufficient memory, this can be faster. Using -maxConcurrentRun 1 should be the same as -jobRunner Shell, but it uses a different code path which may be a little more robust.
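For example, only the tail of the SVPreprocess command from earlier in this thread changes (a sketch; the elided arguments are the ones already shown, and 4 is just an example concurrency):

```shell
java -Xmx4g -cp ${classpath} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVPreprocess.q \
    ... \
    -jobRunner ParallelShell \
    -maxConcurrentRun 4 \
    -run
```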
If your jobs are still failing, there are some things you can do to debug. You can use -jobWrapperScript wrapper.sh which will wrap this script around every command. You can use it to set special environment variables, load libraries, check exit status, etc. A trivial example of wrapper.sh:
#!/bin/bash
echo "The wrapped cmd=$@"
"$@" 2>&1
You can also try to run the failing command outside of Queue.
-
Hi Bob,
-jobRunner ParallelShell works! Preprocessing completed successfully! Thank you so much.
BTW, does the CNV pipeline require a minimum number of genomes? Does it have to be more than 20, or the pipeline cannot run successfully?
Sorry for the naive questions; it's my first time using the tool...
-
Thanks - that's good to know. I had seen reports of -jobRunner Shell being a little unstable sometimes.
The question of the minimum number of genomes is hard to answer precisely. I haven't tried to titrate calling performance against the number of genomes processed together. In our cloud-based calling pipelines, we do the calling in batches, where batch size is a tradeoff between processing cost and calling performance (small batches reduce cost, but batches that are too small call less well). Currently, we aim for batches of around 100 samples (with an ideal range of 75 to 200).
-
Thanks! But I ran into another problem, the same as https://gatkforums.broadinstitute.org/gatk/discussion/5580/one-plot-error-about-cnvpipeline-stage5
I have inspected some files as you suggested in the old post.
Two errors:
ERROR 13:37:49,922 FunctionEdge - Error: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/gpfs1/homes/beileibian/.queue/tmp' '-cp' '/30days/beileibian/CNV/svtoolkit/lib/SVToolkit.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/Queue.jar' '-cp' '/30days/beileibian/CNV/svtoolkit/lib/SVToolkit.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.gatk.queue.QCommandLine' '-cp' '/30days/beileibian/CNV/svtoolkit/lib/SVToolkit.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/30days/beileibian/CNV/svtoolkit/lib/gatk/Queue.jar' '-S' '/30days/beileibian/CNV/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStage5.q' '-S' '/30days/beileibian/CNV/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStageBase.q' '-S' '/30days/beileibian/CNV/svtoolkit/qscript/discovery/cnv/CNVDiscoveryGenotyper.q' '-S' '/30days/beileibian/CNV/svtoolkit/qscript/SVQScript.q' '-gatk' '/30days/beileibian/CNV/svtoolkit/lib/gatk/GenomeAnalysisTK.jar' '-jobLogDir' '/30days/beileibian/CNV/test/cnv_stage5/logs' '-memLimit' '2.0' '-jobRunner' 'ParallelShell' '-gatkJobRunner' 'ParallelShell' -run '-runDirectory' '/30days/beileibian/CNV/MND/cnv_stage5' '-sentinelFile' '/30days/beileibian/CNV/test/cnv_sentinel_files/stage_5.sent' --disableJobReport '-configFile' '/30days/beileibian/CNV/svtoolkit/conf/genstrip_parameters.txt' '-R' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.fasta' '-ploidyMapFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.ploidymap.txt' '-genomeMaskFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.svmask.fasta' '-genomeMaskFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.lcmask.fasta' '-copyNumberMaskFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.gcmask.fasta' '-readDepthMaskFile' 
'/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.rdmask.bed' '-genderMaskBedFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.gendermask.bed' '-vdjBedFile' '/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.vdjregions.bed' '-genderMapFile' '/30days/beileibian/CNV/test/metadata7/sample_gender.report.txt' '-md' '/30days/beileibian/CNV/test/metadata7' -disableGATKTraversal '-I' '/30days/beileibian/CNV/test/bam_headers/merged_headers.bam' '-vpsReportsDirectory' '/30days/beileibian/CNV/test/cnv_stage4' '-selectedSamplesList' '/30days/beileibian/CNV/test/cnv_stage5/eval/DiscoverySamples.list'
ERROR 13:37:49,925 FunctionEdge - Contents of /30days/beileibian/CNV/test/logs7/CNVDiscoveryPipeline-12.out:
ERROR 13:37:33,365 FunctionEdge - Error: 'Rscript' '/30days/beileibian/CNV/svtoolkit/R/discovery/plot_vps.R' '/30days/beileibian/CNV/test/cnv_stage5/eval/VariantsPerSample.report.dat' '/30days/beileibian/CNV/test/cnv_stage5/eval/VariantsPerSample.report.pdf'
ERROR 13:37:33,373 FunctionEdge - Contents of /30days/beileibian/CNV/test/cnv_stage5/logs/CNVDiscoveryStage5-2.out:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
Calls: plotVariantsPerSamplePDF ... plotVariantsPerSample -> plot -> plot.default -> xy.coords
In addition: Warning message:
In max(vpsData, na.rm = T) :
no non-missing arguments to max; returning -Inf
Execution halted
Here is my code:
SV_DIR="/30days/beileibian/CNV/svtoolkit"
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-R /30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.fasta \
-I /30days/beileibian/CNV/test/metadata7/headers.bam \
-md /30days/beileibian/CNV/test/metadata7 \
-ploidyMapFile /30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.ploidymap.txt \
-runDirectory /30days/beileibian/CNV/test \
-jobLogDir /30days/beileibian/CNV/test/logs7 \
-intervalList /30days/beileibian/CNV/test/reference_chromosomes.list \
-tilingWindowSize 1000 \
-tilingWindowOverlap 500 \
-maximumReferenceGapLength 1000 \
-boundaryPrecision 100 \
-minimumRefinedLength 500 \
-jobRunner ParallelShell \
-gatkJobRunner ParallelShell \
-maxConcurrentRun 5 \
-run
I used ParallelShell here, but got the first error again... so strange.
For the second error, I checked several files:
- headers.bam
samtools view -H headers.bam
@HD VN:1.6 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
@SQ SN:4 LN:191154276
@SQ SN:5 LN:180915260
@SQ SN:6 LN:171115067
@SQ SN:7 LN:159138663
@SQ SN:8 LN:146364022
@SQ SN:9 LN:141213431
@SQ SN:10 LN:135534747
@SQ SN:11 LN:135006516
@SQ SN:12 LN:133851895
@SQ SN:13 LN:115169878
@SQ SN:14 LN:107349540
@SQ SN:15 LN:102531392
@SQ SN:16 LN:90354753
@SQ SN:17 LN:81195210
@SQ SN:18 LN:78077248
@SQ SN:19 LN:59128983
@SQ SN:20 LN:63025520
@SQ SN:21 LN:48129895
@SQ SN:22 LN:51304566
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
@SQ SN:MT LN:16569
@SQ SN:GL000207.1 LN:4262
@SQ SN:GL000226.1 LN:15008
@SQ SN:GL000229.1 LN:19913
@SQ SN:GL000231.1 LN:27386
@SQ SN:GL000210.1 LN:27682
@SQ SN:GL000239.1 LN:33824
@SQ SN:GL000235.1 LN:34474
@SQ SN:GL000201.1 LN:36148
@SQ SN:GL000247.1 LN:36422
@SQ SN:GL000245.1 LN:36651
@SQ SN:GL000197.1 LN:37175
@SQ SN:GL000203.1 LN:37498
@SQ SN:GL000246.1 LN:38154
@SQ SN:GL000249.1 LN:38502
@SQ SN:GL000196.1 LN:38914
@SQ SN:GL000248.1 LN:39786
@SQ SN:GL000244.1 LN:39929
@SQ SN:GL000238.1 LN:39939
@SQ SN:GL000202.1 LN:40103
@SQ SN:GL000234.1 LN:40531
@SQ SN:GL000232.1 LN:40652
@SQ SN:GL000206.1 LN:41001
@SQ SN:GL000240.1 LN:41933
@SQ SN:GL000236.1 LN:41934
@SQ SN:GL000241.1 LN:42152
@SQ SN:GL000243.1 LN:43341
@SQ SN:GL000242.1 LN:43523
@SQ SN:GL000230.1 LN:43691
@SQ SN:GL000237.1 LN:45867
@SQ SN:GL000233.1 LN:45941
@SQ SN:GL000204.1 LN:81310
@SQ SN:GL000198.1 LN:90085
@SQ SN:GL000208.1 LN:92689
@SQ SN:GL000191.1 LN:106433
@SQ SN:GL000227.1 LN:128374
@SQ SN:GL000228.1 LN:129120
@SQ SN:GL000214.1 LN:137718
@SQ SN:GL000221.1 LN:155397
@SQ SN:GL000209.1 LN:159169
@SQ SN:GL000218.1 LN:161147
@SQ SN:GL000220.1 LN:161802
@SQ SN:GL000213.1 LN:164239
@SQ SN:GL000211.1 LN:166566
@SQ SN:GL000199.1 LN:169874
@SQ SN:GL000217.1 LN:172149
@SQ SN:GL000216.1 LN:172294
@SQ SN:GL000215.1 LN:172545
@SQ SN:GL000205.1 LN:174588
@SQ SN:GL000219.1 LN:179198
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000223.1 LN:180455
@SQ SN:GL000195.1 LN:182896
@SQ SN:GL000212.1 LN:186858
@SQ SN:GL000222.1 LN:186861
@SQ SN:GL000200.1 LN:187035
@SQ SN:GL000193.1 LN:189789
@SQ SN:GL000194.1 LN:191469
@SQ SN:GL000225.1 LN:211173
@SQ SN:GL000192.1 LN:547496
@SQ SN:NC_007605 LN:171823
@SQ SN:hs37d5 LN:35477943
@RG ID:HYFC5CCXY-1-None PL:ILLUMINA PU:HYFC5CCXY.1 LB:None SM:1000002 CN:kccg
@RG ID:HYGYTCCXY-5-None PL:ILLUMINA PU:HYGYTCCXY.5 LB:None SM:1000009 CN:kccg
@PG ID:GATK IndelRealigner CL:knownAlleles=[(RodBinding name=knownAlleles source=1000G_phase1.indels.b37.vcf.gz), (RodBinding name=knownAlleles2 source=Mills_and_1000G_gold_standard.indels.b37.vcf.gz)] targetIntervals=realign.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:GATK PrintReads VN:3.3-0-g37228af CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa mem -t 32 genome.fa /home/dnanexus/in/reads/HYFC5CCXY_1_190325_FD02930251_Homo-sapiens__R_180606_ANJHEN_DNA_M001_R1.fastq.gz /home/dnanexus/in/reads2/HYFC5CCXY_1_190325_FD02930251_Homo-sapiens__R_180606_ANJHEN_DNA_M001_R2.fastq.gz -M -R @RG\tID:HYFC5CCXY-1-None\tPL:ILLUMINA\tPU:HYFC5CCXY.1\tLB:None\tSM:1000002\tCN:kccg
@PG ID:bwa.2 PN:bwa VN:0.7.10-r789 CL:bwa mem -t 32 genome.fa /home/dnanexus/in/reads/HYGYTCCXY_5_190325_FD02930216_Homo-sapiens__R_180606_ANJHEN_DNA_M001_R1.fastq.gz /home/dnanexus/in/reads2/HYGYTCCXY_5_190325_FD02930216_Homo-sapiens__R_180606_ANJHEN_DNA_M001_R2.fastq.gz -M -R @RG\tID:HYGYTCCXY-5-None\tPL:ILLUMINA\tPU:HYGYTCCXY.5\tLB:None\tSM:1000009\tCN:kccg
- sample_gender.report.txt
SAMPLE DOSAGE_X DOSAGE_Y KARYOTYPE GENDER
1000002 1.94 0.00 XX Female
1000009 1.96 0.00 XX Female
- rd.dat
SAMPLE seq_1 seq_2 seq_3 seq_4 seq_5 seq_6 seq_7 seq_8 seq_9 seq_10 seq_11 seq_12 seq_13 seq_14 seq_15 seq_16 seq_17 seq_18 seq_19 seq_20 seq_21 seq_22 seq_X seq_Y
1000002 2.0396 2.0411 2.0266 2.0337 2.0200 2.0168 2.0017 2.0064 1.9882 1.9872 1.9820 1.9772 1.9941 1.9718 1.9874 2.0338 1.9522 1.9751 1.9239 1.9511 2.0469 1.9343 1.9344 0.0358
1000009 2.0333 2.0313 2.0182 2.0237 2.0145 2.0131 1.9976 2.0060 1.9927 1.9918 1.9852 1.9854 1.9935 1.9529 2.0200 2.0247 1.9602 1.9851 1.9182 1.9647 2.0483 1.9379 1.9551 0.0397
- reference_chromosomes.list
6
19
- eval directory
SelectedVariants.list is an empty file...
It looks like there are no copy number variants... I can't understand what's going on.
-
Sorry it took me a little while to get back to this. Thanks very much for the detailed info!
I would dig into the GenotypeSiteFilters.* files, especially GenotypeSiteFilters.summary.dat. They are text files. Also VariantsPerSample.report.dat.
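A quick way to tally which filters are firing across all sites (a sketch; count_filters is a hypothetical helper, and the report path is the stage 5 eval directory from this thread):

```shell
#!/bin/bash
# count_filters: tally sites per filter in GenotypeSiteFilters.report.dat
# (a two-column text file: ID FILTER, with one header line).
count_filters() {
    awk 'NR > 1 { n[$2]++ } END { for (f in n) print f, n[f] }' "$1"
}

report=/30days/beileibian/CNV/test/cnv_stage5/eval/GenotypeSiteFilters.report.dat
if [ -f "$report" ]; then
    count_filters "$report"
fi
```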
-
Thank you so much!
VariantsPerSample.report.dat is an empty file...
GenotypeSiteFilters.summary.dat
FILTER SINGLE_COUNT SINGLE_FRACTION TOTAL_COUNT TOTAL_FRACTION
CLUSTERSEP 104529 1.0000 104529 1.0000
GenotypeSiteFilters.report.dat
ID FILTER
CNV_19_60798_80848 CLUSTERSEP
CNV_19_66175_89754 CLUSTERSEP
CNV_19_80848_93020 CLUSTERSEP
CNV_19_89754_94587 CLUSTERSEP
CNV_19_93020_96226 CLUSTERSEP
CNV_19_94587_99302 CLUSTERSEP
CNV_19_96226_102275 CLUSTERSEP
CNV_19_99302_103758 CLUSTERSEP
CNV_19_102275_106401 CLUSTERSEP
CNV_19_103758_108840 CLUSTERSEP
CNV_19_106401_115446 CLUSTERSEP
CNV_19_108840_131191 CLUSTERSEP
CNV_19_115446_146959 CLUSTERSEP
CNV_19_131191_158468 CLUSTERSEP
......
-
OK. The head of CopyNumberClass.report.dat and ClusterSeparation.report.dat might be helpful.
But probably the next thing I would do is to run PlotGenotypingResults on one of the sites to see what is going on. Ideally maybe not a site near the telomere, but in a more well-behaved part of the chromosome. The CNV pipeline is prospectively running the GS read-depth genotyping model on sliding windows across the chromosome. The filters are saying that there is poor cluster separation at every site.
It may just be some corner case because you are using only two samples.
-
CopyNumberClass.report.dat
ID CALLRATE CNMIN CNMAX CNALLELES NNONREF NVARIANT CNCATEGORY CNDIST NDEL NDUP
CNV_19_60798_80848 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_66175_89754 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_80848_93020 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_89754_94587 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_93020_96226 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_94587_99302 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_96226_102275 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_99302_103758 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_102275_106401 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_103758_108840 0.000 NA NA 0 0 0 NA NA 0 0
CNV_19_106401_115446 0.000 NA NA 0 0 0 NA NA 0 0
ClusterSeparation.report.dat
ID GSM1 GSM2 GSCLUSTERSEP GSCLUSTERSEPWEIGHTEDMEAN GSCLUSTERSEPWEIGHTEDMEDIAN
CNV_19_60798_80848 NA NA NA NA NA
CNV_19_66175_89754 NA NA NA NA NA
CNV_19_80848_93020 NA NA NA NA NA
CNV_19_89754_94587 NA NA NA NA NA
CNV_19_93020_96226 NA NA NA NA NA
CNV_19_94587_99302 NA NA NA NA NA
CNV_19_96226_102275 NA NA NA NA NA
OK. Actually I have 20 samples to run together: 10 metadata directories, each containing two samples... But I can't run that successfully either.
-
To debug this two-sample run, you would have to dig into the genotyping directories (e.g. stage2) and run PlotGenotypingResults on one of the sites. I don't have an example directory to give you more explicit instructions, but in any genotyping run directory you should be able to run PlotGenotypingResults. Don't specify the -runDirectory argument, just use the -vcf argument to point to the genotyping result file and use -site to specify which site to plot.
You should also be able to run the 20 samples together by specifying 10 -md arguments. This isn't a scalable approach, because it has to merge the metadata on the fly, but you can probably get away with 10. You can also regenerate the metadata in a single batch of 20 samples, or you can use SVMergeMetadata.q to merge the 10 metadata directories into one.
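A sketch of both options (the metadata_batch* directory names are placeholders for your ten actual directories, and the elided arguments are the ones from the earlier commands in this thread):

```shell
# Option 1: pass all ten metadata directories directly to the pipeline.
java -Xmx4g -cp ${classpath} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
    ... \
    -md /30days/beileibian/CNV/test/metadata_batch1 \
    -md /30days/beileibian/CNV/test/metadata_batch2 \
    ... \
    -run

# Option 2: merge the ten directories once with SVMergeMetadata.q,
# then pass the single merged directory as one -md argument.
```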
-
Hi Bob,
I used 20 samples but got the same error... And I plotted the results as you suggested... It looks like nothing is there. I'm wondering if the first error caused the second error?
ERROR 02:13:13,838 FunctionEdge - Error: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/gpfs1/homes/beileibian/.queue/tmp'
ERROR 02:12:56,394 FunctionEdge - Error: 'Rscript' '/30days/beileibian/CNV/svtoolkit/R/discovery/plot_vps.R' '/30days/beileibian/CNV/test/cnv_stage5/eval/VariantsPerSample.report.dat' '/30days/beileibian/CNV/test/cnv_stage5/eval/VariantsPerSample.report.pdf'
ERROR 02:12:56,403 FunctionEdge - Contents of /30days/beileibian/CNV/test/cnv_stage5/logs/CNVDiscoveryStage5-2.out:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
Calls: plotVariantsPerSamplePDF ... plotVariantsPerSample -> plot -> plot.default -> xy.coords
In addition: Warning message:
In max(vpsData, na.rm = T) :
no non-missing arguments to max; returning -Inf
Execution halted
My understanding is that it should look like this in the CNV browser:
-
Yes, you are right; it should look something like what you posted. Something is not right in your run, but it's hard for me to guess what.
The symptom looks like there is no read depth data being aggregated for the samples. However the outputs from preprocessing look good, including rd.dat, which is aggregating the same data across each chromosome.
Focus on the run directories under stage2. Can you send or post some representative data from some of the files? Perhaps also check any log files you find in the stage2 run directories.
One thing that can cause symptoms like this is a mismatch in chromosome names (e.g. "19" vs. "chr19"). But it looks like you are using "19", and so is your reference genome, and the preprocessing worked, so the reference must match the bam/cram files.
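A quick check for such a name mismatch (a sketch; sq_names is a hypothetical helper, the paths are the ones from this thread, and samtools plus a .fai index for the reference are assumed to be available):

```shell
#!/bin/bash
# sq_names: print the SN: values from @SQ lines of a SAM header on stdin.
sq_names() {
    awk '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }'
}

# Compare the BAM header sequence names against the reference .fai
# (first column). Any diff output indicates a naming mismatch.
bam=/30days/beileibian/CNV/test/metadata7/headers.bam
fai=/30days/beileibian/CNV/1000G_phase3_101/human_g1k_hs37d5.fasta.fai
if [ -f "$bam" ] && [ -f "$fai" ]; then
    samtools view -H "$bam" | sq_names > /tmp/bam_names.txt
    cut -f1 "$fai" > /tmp/fai_names.txt
    diff /tmp/bam_names.txt /tmp/fai_names.txt && echo "sequence names match"
fi
```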
I presume these results are from a fresh run in a fresh, empty run directory.
-
One more thought: The logs should also have a message about opening the read count cache (the rccache.bin file). If this file was not being found for some reason, then it would cause this behavior (because headers.bam contains no reads).
-
Hi Bob,
Can I send you some files via email?
-
That's fine. Feel free to reach out directly. You can email me, or I think the Broad has public incoming FTP.
-
Did you receive my email? Don't worry. I just want to make sure I used the correct address...