Generate FASTQ file(s) from Illumina basecall read data.
This tool generates FASTQ files from data in an Illumina BaseCalls output directory. Separate FASTQ files are created for each template, barcode, and index (molecular barcode) read. Briefly, the template reads are the target sequence of your experiment, the barcode sequence reads facilitate sample demultiplexing, and the index reads help mitigate instrument phasing errors. For additional information on the read types, please see the following reference here.
Without sample pooling (multiplexing) and/or barcodes, an OUTPUT_PREFIX (file directory) must be provided as the sample identifier. For multiplexed samples, a MULTIPLEX_PARAMS file must be specified. The MULTIPLEX_PARAMS file contains the list of sample barcodes used to sort template, barcode, and index reads. It is essentially the same as the BARCODE_FILE used in the ExtractIlluminaBarcodes tool.
Files from this tool use the following naming format: {prefix}.{type}_{number}.fastq, where {prefix} indicates the sample barcode and {type} indicates the type of read: index, barcode, or blank (if the file contains a template read). The {number} indicates the read number, either first (1) or second (2) for paired-end sequencing.
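The naming format above can be illustrated with a small Python sketch. The helper `fastq_name` is hypothetical and not part of Picard. It assumes that a blank {type} also drops the underscore (giving `prefix.1.fastq`), and that the `.gz` extension described under COMPRESS_OUTPUTS is simply appended.

```python
def fastq_name(prefix: str, read_type: str, number: int, gzip: bool = False) -> str:
    """Build an output name following {prefix}.{type}_{number}.fastq.

    read_type is "index", "barcode", or "" for a template read.
    Assumption: a blank type also drops the underscore (prefix.1.fastq).
    """
    if read_type not in ("", "barcode", "index"):
        raise ValueError(f"unknown read type: {read_type!r}")
    # Template reads carry only the read number; other reads carry type_number.
    middle = f"{read_type}_{number}" if read_type else str(number)
    # COMPRESS_OUTPUTS appends a .gz extension to the file name.
    suffix = ".fastq.gz" if gzip else ".fastq"
    return f"{prefix}.{middle}{suffix}"


if __name__ == "__main__":
    for read_type, number in [("", 1), ("", 2), ("barcode", 1), ("index", 1)]:
        print(fastq_name("sampleA", read_type, number))
```

Listing the names this way is mainly useful for checking that every expected file exists after a run.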
Usage examples:
Example 1: Sample(s) with either no barcode or barcoded without multiplexing
java -jar picard.jar IlluminaBasecallsToFastq \
READ_STRUCTURE=25T8B25T \
BASECALLS_DIR=basecallDirectory \
LANE=001 \
OUTPUT_PREFIX=noBarcode.1 \
RUN_BARCODE=run15 \
FLOWCELL_BARCODE=abcdeACXX
Example 2: Multiplexed samples
java -jar picard.jar IlluminaBasecallsToFastq \
READ_STRUCTURE=25T8B25T \
BASECALLS_DIR=basecallDirectory \
LANE=001 \
MULTIPLEX_PARAMS=demultiplexed_output.txt \
RUN_BARCODE=run15 \
FLOWCELL_BARCODE=abcdeACXX
The FLOWCELL_BARCODE is required if emitting Casava 1.8-style read name headers.
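When the tool is launched from a script, the two usage examples differ only in whether OUTPUT_PREFIX or MULTIPLEX_PARAMS is passed. The Python sketch below assembles the argument list for either case and rejects calls that set both or neither. The function `build_command` and its parameter names are hypothetical; only the argument names and the `java -jar picard.jar` invocation come from the examples above.

```python
from typing import List, Optional


def build_command(
    read_structure: str,
    basecalls_dir: str,
    lane: int,
    run_barcode: str,
    output_prefix: Optional[str] = None,
    multiplex_params: Optional[str] = None,
    flowcell_barcode: Optional[str] = None,
    jar: str = "picard.jar",
) -> List[str]:
    """Assemble an IlluminaBasecallsToFastq command line as a list of strings."""
    # OUTPUT_PREFIX and MULTIPLEX_PARAMS are mutually exclusive, and one is required.
    if (output_prefix is None) == (multiplex_params is None):
        raise ValueError("provide exactly one of OUTPUT_PREFIX or MULTIPLEX_PARAMS")
    cmd = [
        "java", "-jar", jar, "IlluminaBasecallsToFastq",
        f"READ_STRUCTURE={read_structure}",
        f"BASECALLS_DIR={basecalls_dir}",
        f"LANE={lane}",
    ]
    if output_prefix is not None:
        cmd.append(f"OUTPUT_PREFIX={output_prefix}")
    else:
        cmd.append(f"MULTIPLEX_PARAMS={multiplex_params}")
    cmd.append(f"RUN_BARCODE={run_barcode}")
    # Needed when emitting Casava 1.8-style read name headers.
    if flowcell_barcode is not None:
        cmd.append(f"FLOWCELL_BARCODE={flowcell_barcode}")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command(
        "25T8B25T", "basecallDirectory", 1, "run15",
        multiplex_params="demultiplexed_output.txt",
        flowcell_barcode="abcdeACXX",
    )))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where Java and the Picard jar are available.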
Category Base Calling
Overview
IlluminaBasecallsToFastq (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--BASECALLS_DIR -B | null | The basecalls directory.
--LANE -L | null | Lane number.
--MULTIPLEX_PARAMS | null | Tab-separated file for creating all output FASTQs demultiplexed by barcode for a lane with single IlluminaBasecallsToFastq invocation. The columns are OUTPUT_PREFIX, and BARCODE_1, BARCODE_2 ... BARCODE_X where X = number of barcodes per cluster (optional). Row with BARCODE_1 set to 'N' is used to specify an output_prefix for no barcode match.
--OUTPUT_PREFIX -O | null | The prefix for output FASTQs. Extensions as described above are appended. Use this option for a non-barcoded run, or for a barcoded run in which it is not desired to demultiplex reads into separate files by barcode.
--READ_STRUCTURE -RS | null | A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. If the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T" then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (ex. unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein.
--RUN_BARCODE | null | The barcode of the run. Prefixed to read names.
Optional Tool Arguments | |
--APPLY_EAMSS_FILTER | true | Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.
--arguments_file | [] | read one or more arguments files and add them to the command line
--BARCODES_DIR -BCD | null | The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.
--COMPRESS_OUTPUTS -GZIP | false | Compress output FASTQ files using gzip and append a .gz extension to the file names.
--FIRST_TILE | null | If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.
--FLOWCELL_BARCODE | null | The barcode of the flowcell that was sequenced; required if emitting Casava1.8-style read name headers
--FORCE_GC | true | If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory.
--help -h | false | display the help message
--IGNORE_UNEXPECTED_BARCODES -INGORE_UNEXPECTED | false | Whether to ignore reads whose barcodes are not found in MULTIPLEX_PARAMS. Useful when outputting FASTQs for only a subset of the barcodes in a lane.
--INCLUDE_NON_PF_READS -NONPF | true | Whether to include non-PF reads
--MACHINE_NAME | null | The name of the machine on which the run was sequenced; required if emitting Casava1.8-style read name headers
--MAX_READS_IN_RAM_PER_TILE | 1200000 | Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices.
--MINIMUM_QUALITY | 2 | The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice the value has been observed lower.
--NUM_PROCESSORS | 0 | The number of threads to run in parallel. If NUM_PROCESSORS = 0, number of cores is automatically set to the number of cores available on the machine. If NUM_PROCESSORS < 0, then the number of cores used will be the number available on the machine less NUM_PROCESSORS.
--READ_NAME_FORMAT | CASAVA_1_8 | The read name header formatting to emit. Casava1.8 formatting has additional information beyond Illumina, including: the passing-filter flag value for the read, the flowcell name, and the sequencer name.
--TILE_LIMIT | null | If set, process no more than this many tiles (used for debugging).
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Deprecated Arguments | |
--ADAPTERS_TO_CHECK | [] | Deprecated (No longer used). Which adapters to look for in the read.
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--ADAPTERS_TO_CHECK / NA
Deprecated (No longer used). Which adapters to look for in the read.
List[IlluminaAdapterPair] []
--APPLY_EAMSS_FILTER / NA
Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.
boolean true
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--BARCODES_DIR / -BCD
The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.
File null
--BASECALLS_DIR / -B
The basecalls directory.
R File null
--COMPRESS_OUTPUTS / -GZIP
Compress output FASTQ files using gzip and append a .gz extension to the file names.
boolean false
--COMPRESSION_LEVEL / NA
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX / NA
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false
--CREATE_MD5_FILE / NA
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--FIRST_TILE / NA
If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.
Integer null
--FLOWCELL_BARCODE / NA
The barcode of the flowcell that was sequenced; required if emitting Casava1.8-style read name headers
String null
--FORCE_GC / NA
If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory.
Boolean true
--GA4GH_CLIENT_SECRETS / NA
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--IGNORE_UNEXPECTED_BARCODES / -INGORE_UNEXPECTED
Whether to ignore reads whose barcodes are not found in MULTIPLEX_PARAMS. Useful when outputting FASTQs for only a subset of the barcodes in a lane.
boolean false
--INCLUDE_NON_PF_READS / -NONPF
Whether to include non-PF reads
boolean true
--LANE / -L
Lane number.
R Integer null
--MACHINE_NAME / NA
The name of the machine on which the run was sequenced; required if emitting Casava1.8-style read name headers
String null
--MAX_READS_IN_RAM_PER_TILE / NA
Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices.
int 1200000 [ [ -∞ ∞ ] ]
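The "value/number of indices" rule is plain arithmetic, sketched below in Python. The function name `per_collection_limit` is hypothetical, and the use of integer division is an assumption, since the text does not say how a remainder is handled.

```python
def per_collection_limit(max_reads_in_ram_per_tile: int = 1200000,
                         num_indices: int = 1) -> int:
    """Records each SortingCollection may hold before spilling to disk.

    Assumption: the per-tile budget is split evenly using integer division.
    """
    if num_indices < 1:
        raise ValueError("num_indices must be at least 1")
    return max_reads_in_ram_per_tile // num_indices


if __name__ == "__main__":
    # With the default budget and 96 indices, each collection holds 12500 records.
    print(per_collection_limit(1200000, 96))
```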
--MAX_RECORDS_IN_RAM / NA
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MINIMUM_QUALITY / NA
The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice the value has been observed lower.
int 2 [ [ -∞ ∞ ] ]
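The check described for MINIMUM_QUALITY can be sketched in Python as follows. This illustrates the stated rule only and is not Picard's implementation; the function name `check_qualities` and the use of `ValueError` are assumptions.

```python
from typing import List, Sequence


def check_qualities(quals: Sequence[int], minimum_quality: int = 2) -> List[int]:
    """Transform quality 0 to 1, then fail if any value is below the minimum."""
    # Step 1: qualities of 0 are transformed to 1 before the comparison.
    adjusted = [1 if q == 0 else q for q in quals]
    # Step 2: anything still below the minimum is treated as an error.
    for q in adjusted:
        if q < minimum_quality:
            raise ValueError(f"quality {q} is below the minimum of {minimum_quality}")
    return adjusted


if __name__ == "__main__":
    print(check_qualities([30, 2, 40]))
    print(check_qualities([0, 5], minimum_quality=1))
```

With the default of 2, a read containing a quality of 0 still fails, because the transformed value of 1 is below the minimum. Lowering MINIMUM_QUALITY to 1 lets such reads through.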
--MULTIPLEX_PARAMS / NA
Tab-separated file for creating all output FASTQs demultiplexed by barcode for a lane with single IlluminaBasecallsToFastq invocation. The columns are OUTPUT_PREFIX, and BARCODE_1, BARCODE_2 ... BARCODE_X where X = number of barcodes per cluster (optional). Row with BARCODE_1 set to 'N' is used to specify an output_prefix for no barcode match.
Exclusion: This argument cannot be used at the same time as OUTPUT_PREFIX.
R File null
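A MULTIPLEX_PARAMS file with the columns described above can be generated with a few lines of Python. This is a minimal sketch: the function `multiplex_params_text`, the sample names, and the barcode sequences are made up for illustration. It assumes the file begins with a header row naming the columns, and that the no-match row carries 'N' in every barcode column (the text above only specifies BARCODE_1).

```python
import csv
import io
from typing import Dict, List, Optional


def multiplex_params_text(samples: Dict[str, List[str]],
                          unmatched_prefix: Optional[str] = None) -> str:
    """Render tab-separated MULTIPLEX_PARAMS content.

    samples maps an output prefix to its list of barcodes (one per barcode read).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(next(iter(samples.values())))
    if n < 1 or any(len(barcodes) != n for barcodes in samples.values()):
        raise ValueError("every sample needs the same, non-zero number of barcodes")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Assumed header row: OUTPUT_PREFIX, BARCODE_1 ... BARCODE_X.
    writer.writerow(["OUTPUT_PREFIX"] + [f"BARCODE_{i}" for i in range(1, n + 1)])
    for prefix, barcodes in samples.items():
        writer.writerow([prefix] + list(barcodes))
    if unmatched_prefix is not None:
        # Row with BARCODE_1 set to 'N' names the output for reads with no match.
        writer.writerow([unmatched_prefix] + ["N"] * n)
    return out.getvalue()


if __name__ == "__main__":
    print(multiplex_params_text(
        {"sampleA": ["ACGTACGT"], "sampleB": ["TTGCATGC"]},
        unmatched_prefix="unmatched",
    ), end="")
```

Writing the returned text to a file gives a path that can be passed as MULTIPLEX_PARAMS.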
--NUM_PROCESSORS / NA
The number of threads to run in parallel. If NUM_PROCESSORS = 0, number of cores is automatically set to the number of cores available on the machine. If NUM_PROCESSORS < 0, then the number of cores used will be the number available on the machine less NUM_PROCESSORS.
Integer 0 [ [ -∞ ∞ ] ]
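The three cases for NUM_PROCESSORS can be written out as a short Python sketch. It reads "less NUM_PROCESSORS" as subtracting the magnitude of a negative value from the number of available cores; that reading is an assumption, and the function name `effective_threads` is hypothetical.

```python
import os


def effective_threads(num_processors: int, available: int) -> int:
    """Resolve NUM_PROCESSORS to a thread count for a machine with `available` cores."""
    if num_processors == 0:
        # 0 means: use every core on the machine.
        return available
    if num_processors < 0:
        # Negative means: leave |num_processors| cores unused (assumed reading).
        return available + num_processors
    return num_processors


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    for setting in (0, -2, 4):
        print(setting, effective_threads(setting, cores))
```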
--OUTPUT_PREFIX / -O
The prefix for output FASTQs. Extensions as described above are appended. Use this option for a non-barcoded run, or for a barcoded run in which it is not desired to demultiplex reads into separate files by barcode.
Exclusion: This argument cannot be used at the same time as MULTIPLEX_PARAMS.
R File null
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--READ_NAME_FORMAT / NA
The read name header formatting to emit. Casava1.8 formatting has additional information beyond Illumina, including: the passing-filter flag value for the read, the flowcell name, and the sequencer name.
The --READ_NAME_FORMAT argument is an enumerated type (ReadNameFormat), which can have one of the following values:
- CASAVA_1_8
- ILLUMINA
ReadNameFormat CASAVA_1_8
--READ_STRUCTURE / -RS
A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. If the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T" then the sequence may be split up into four reads:
* read one with 28 cycles (bases) of template
* read two with 8 cycles (bases) of molecular barcode (ex. unique molecular barcode)
* read three with 8 cycles (bases) of sample barcode
* 8 cycles (bases) skipped.
* read four with 28 cycles (bases) of template
The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein.
R String null
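To make the integer/character pairs concrete, the following Python sketch splits a read structure string into its segments and checks the total cycle count. It is an illustration of the format described above, not Picard's own parser; the function name `parse_read_structure` is hypothetical.

```python
import re
from typing import List, Tuple

# One segment is a cycle count followed by a type: T, B, M, or S.
_SEGMENT = re.compile(r"(\d+)([TBMS])")


def parse_read_structure(read_structure: str) -> List[Tuple[int, str]]:
    """Split e.g. "28T8M8B8S28T" into [(28, "T"), (8, "M"), ...]."""
    segments = []
    pos = 0
    for match in _SEGMENT.finditer(read_structure):
        if match.start() != pos:
            # Something between segments did not match the count/type pattern.
            raise ValueError(f"unexpected text at position {pos}")
        segments.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if not segments or pos != len(read_structure):
        raise ValueError(f"invalid read structure: {read_structure!r}")
    return segments


if __name__ == "__main__":
    segments = parse_read_structure("28T8M8B8S28T")
    print(segments)
    print("total cycles:", sum(n for n, _ in segments))
    print("reads kept:", sum(1 for _, kind in segments if kind != "S"))
```

For "28T8M8B8S28T" the segments add up to 80 cycles, and four of the five segments (everything except the skip) become reads, matching the example in the argument description.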
--REFERENCE_SEQUENCE / -R
Reference sequence file.
File null
--RUN_BARCODE / NA
The barcode of the run. Prefixed to read names.
R String null
--showHidden / -showHidden
display hidden arguments
boolean false
--TILE_LIMIT / NA
If set, process no more than this many tiles (used for debugging).
Integer null
--TMP_DIR / NA
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY / NA
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY / NA
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.1.6.0-SNAPSHOT built at Thu, 2 Apr 2020 14:54:17 -0400.