How to work with large datasets with GenomicsDBImport
Please correct me if I am wrong, but following the tutorials, it seems there are two major options for working with large datasets in GenomicsDBImport:
1. Divide the data into intervals and import the intervals in parallel, using these two params:
a. --max-num-intervals-to-import-in-parallel for the number of intervals imported in parallel.
b. -L for the intervals themselves. (For example, -L 20 is given in https://gatk.broadinstitute.org/hc/en-us/articles/360036883491-GenomicsDBImport )
2. Divide the samples into batches and process them in parallel, using these two params:
a. --batch-size - the number of samples for which readers are open at once (the documentation mentions an additional flag for large batch sizes, around 100 or more).
b. --reader-threads - the number of simultaneous threads to use when opening VCFs in batches.
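For reference, the two approaches above could be combined in a single invocation along these lines. This is a sketch only: the workspace path, sample map, and interval list are placeholder names, and the thread/batch values are illustrative, not recommendations.

```shell
# Sketch: combine interval-level parallelism with sample batching.
# All file names below (my_database, cohort.sample_map, intervals.list)
# are placeholders.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --sample-name-map cohort.sample_map \
    -L intervals.list \
    --max-num-intervals-to-import-in-parallel 4 \
    --batch-size 50 \
    --reader-threads 5
```

Note that memory use scales with both the number of parallel intervals and the batch size, so these two knobs should be tuned together against available RAM.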
* Is that correct?
* Is it allowed to combine the two approaches in the same run?
* What exactly does -L 20 mean? Does it mean the genome is divided into 20 fragments?
Thank you,
Arik
-
1. Yes, what you wrote is correct. GenomicsDBImport requires at least one interval, so batching is always used alongside intervals; you can combine the two options in the same run.
2. -L 20 does not split the genome into 20 fragments; it restricts the import to the interval (contig) named "20", i.e. chromosome 20. Please see this document on intervals: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
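To illustrate the -L syntax variants from the intervals document above (the contig names and file name here are placeholders; the sequence dictionary of your own reference determines the valid contig names):

```shell
# Hedged examples of -L interval syntax:
gatk GenomicsDBImport ... -L 20                # the whole contig named "20"
gatk GenomicsDBImport ... -L chr20:1-2000000   # a sub-region of a contig
gatk GenomicsDBImport ... -L intervals.list    # a file listing many intervals
```

-L can also be given multiple times on one command line to name several intervals.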