File Formats

This page describes the file formats used for downloads from this site.

CRITICAL NOTE

File formats changed as of flowcell 14 (4/27/2009). In particular, quality scores are now in Phred format rather than Illumina format. Older datasets remain in the old format, which is documented here: Old File Formats

Sequence .fastq Files

Note: These fastq files contain Sanger/Phred quality scores, rather than the Illumina scores that are sometimes used in Illumina fastq files.

These files contain sequences produced by the Illumina Bustard base caller, represented as fastq-formatted files in which each set of 4 rows corresponds to a single read from the sequencing machine. The first of the four rows is in the following format:

@lane:tile:x_coordinate_on_tile:y_coordinate_on_tile:quality_filter

Here quality_filter is a single flag: "Y" means the read failed the filtering performed by the Illumina Pipeline, and "N" means it passed. The second row is the sequence of the given read, the third row is just a "+", and the fourth row is the quality string in symbolic ASCII format (ASCII character code = phred quality value + 33).
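As a rough illustration of the record layout described above, the following Python sketch splits one four-row record into its fields and decodes the quality string. The function name parse_record and the sample read are invented for this example; they are not part of any tool provided by this site.

```python
def parse_record(lines):
    """Parse one 4-row fastq record in the layout described above.

    `lines` is a list of four strings: id row, sequence, "+", quality.
    """
    header, sequence, plus, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a fastq record")
    # Id row layout: @lane:tile:x_coordinate:y_coordinate:quality_filter
    lane, tile, x, y, quality_filter = header[1:].split(":")
    return {
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        # "Y" means the read FAILED the Illumina filter, "N" means it passed
        "passed_filter": quality_filter == "N",
        "sequence": sequence,
        # ASCII character code = phred quality value + 33
        "phred": [ord(ch) - 33 for ch in quality],
    }


# Hypothetical record, invented for illustration only
example = ["@3:42:1011:587:N\n", "ACGT\n", "+\n", "II5!\n"]
record = parse_record(example)
```

Here record["phred"] holds one integer per base, so a numeric threshold can be applied directly to the decoded values.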

The fastq format is documented here: http://maq.sourceforge.net/fastq.shtml

Paired End Data

Data from a paired end run is provided in two files with names ending in "pair1.fastq" and "pair2.fastq". Both files contain the same number of reads, in the same order, and that order encodes the pair associations (the first read from each file together make the first read pair, and so on).
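Because pairing is defined purely by position, the two files can be walked in lockstep. The Python sketch below shows one way to do that; the helper names read_records and read_pairs and the in-memory sample data are assumptions made for this example, not part of the site's tooling.

```python
import io
from itertools import islice


def read_records(handle):
    """Yield fastq records as tuples of four stripped rows."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield tuple(record)


def read_pairs(handle1, handle2):
    """Yield (read1, read2) tuples; the n-th record of each file forms a pair."""
    records1 = read_records(handle1)
    records2 = read_records(handle2)
    for rec1 in records1:
        rec2 = next(records2, None)
        if rec2 is None:
            raise ValueError("pair2 file has fewer reads than pair1 file")
        yield rec1, rec2
    if next(records2, None) is not None:
        raise ValueError("pair2 file has more reads than pair1 file")


# Tiny in-memory stand-ins for the two files (invented data)
pair1 = io.StringIO("@1:1:10:20:N\nACGT\n+\nIIII\n@1:1:30:40:Y\nTTTT\n+\n!!!!\n")
pair2 = io.StringIO("@1:1:10:20:N\nGGCC\n+\nIIII\n@1:1:30:40:Y\nAAAA\n+\n!!!!\n")
pairs = list(read_pairs(pair1, pair2))
```

With real data, the two StringIO objects would be replaced by open file handles for the "pair1.fastq" and "pair2.fastq" files. The length checks guard against accidentally mixing files from different runs.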

Description of Quality Scores

Quality scores in these files are represented as Phred scores with ASCII encoding (ASCII character code = phred quality value + 33). The difference between Solexa/Illumina scores and Phred quality scores is documented here: http://illumina.ucr.edu/illumina_docs/Alignment%20Scoring%20Guide%20and%20FAQ.html and here: http://maq.sourceforge.net/qual.shtml. According to Illumina, "the highest Solexa base score and the Phred score are asymptotically identical. In English this means that for scores of about 15 and above they are so close as to be effectively the same." If you are filtering based on numeric quality scores, it is usually unnecessary to perform any quality score conversion if you set your threshold to around 15 or higher. Keep in mind, however, that the ASCII-encoded quality scores in a fastq file use different character ranges to represent the same values depending on whether they are Illumina or Phred quality scores.

Many bioinformatic tools have a command line option allowing you to select either Phred or Solexa/Illumina scaled quality scores from an input file. In addition, scripts exist which will convert between these formats.
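For readers who want to see what such a conversion involves, here is a minimal Python sketch. It assumes the Solexa-to-Phred relation given on the maq page linked above, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), and it assumes that Solexa-scaled strings use an ASCII offset of 64. The function names are invented for this example.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_ascii_to_phred_ascii(quality, solexa_offset=64, phred_offset=33):
    """Re-encode a Solexa quality string (assumed offset 64) as Phred + 33."""
    converted = []
    for ch in quality:
        # Decode, convert scale, round to the nearest integer score, re-encode
        q_phred = round(solexa_to_phred(ord(ch) - solexa_offset))
        converted.append(chr(q_phred + phred_offset))
    return "".join(converted)


# "h" encodes Solexa 40 and "@" encodes Solexa 0 under the offset-64 assumption
converted = solexa_ascii_to_phred_ascii("h@")
```

The two scales only diverge noticeably at low values, which is why a threshold of about 15 or higher behaves nearly the same on either scale.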

Quality Score Normalization

No alignment normalization has been performed on these quality scores; they have only been calibrated using the default precalculated qtable (quality table) provided by the Illumina pipeline basecaller (Bustard).

Description of Quality Filter

The default Illumina pipeline quality filter is used, which uses a threshold of CHASTITY >= 0.6. Chastity for a given base call is defined as "the ratio of the highest of the four (base type) intensities to the sum of highest two." This filter is used to identify clusters with a low signal to noise ratio, often as a result of two adjacent clusters being so physically close together that their signals cannot be measured independently.
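The chastity definition quoted above can be written out for a single base call as follows. This Python sketch uses invented intensity values, and it does not model how the Illumina pipeline combines per-base chastity values into a pass/fail decision for a whole read, since that is not described here.

```python
def chastity(intensities):
    """Ratio of the highest of the four intensities to the sum of the highest two."""
    highest, second = sorted(intensities, reverse=True)[:2]
    return highest / (highest + second)


def passes_threshold(intensities, threshold=0.6):
    """True when the base call meets CHASTITY >= threshold."""
    return chastity(intensities) >= threshold


# Invented intensities for the four base types at one cycle
clean_cluster = [900.0, 100.0, 50.0, 20.0]   # one channel clearly dominates
mixed_cluster = [500.0, 480.0, 10.0, 5.0]    # two channels nearly equal

clean_ok = passes_threshold(clean_cluster)
mixed_ok = passes_threshold(mixed_cluster)
```

A cluster whose two strongest channels are nearly equal yields a chastity close to 0.5, which is the situation the filter is meant to catch.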

All reads (both failing and passing the filter) are included in the fastq file, and the filter information is encoded in the id for each read (as described above in the .fastq section).

Filtering by Quality Filter

For many experiments it is necessary to remove reads which failed the Illumina chastity filter before further analysis. Here is a simple example of how to use R/BioConductor to import the file "unfilteredInput.fastq" and produce a file called "filteredOutput.fastq" which contains only reads passing the filter. This example uses a regular expression to keep only reads with ":N" at the end of the id line.

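library(ShortRead) # load the BioConductor ShortRead package
reads <- readFastq("unfilteredInput.fastq") # import all reads into memory
ids <- as.character(id(reads)) # read ids, returned without the leading "@"
reads <- reads[grepl(":N$", ids)] # keep reads whose id ends in ":N" (passed the filter)
writeFastq(reads, "filteredOutput.fastq", compress=FALSE) # export filtered reads as plain-text fastq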

This can be run on any computer (such as Biocluster) with a large amount of memory (as much as 16 GB, depending on the dataset), R, and the BioConductor ShortRead package. To start R from the command line, use the command "R"; to quit, use the command "q()".
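Where memory is a constraint, the same filter can be applied one record at a time instead of loading the whole file. The following Python sketch is one possible alternative to the R approach, not a replacement for it; the function name filter_passing and the sample data are invented for this example.

```python
import io
from itertools import islice


def filter_passing(in_handle, out_handle):
    """Copy only records whose id row ends in ":N" (read passed the filter).

    Returns the number of records written.
    """
    kept = 0
    while True:
        record = list(islice(in_handle, 4))  # one fastq record = 4 rows
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        if record[0].rstrip().endswith(":N"):
            out_handle.writelines(record)
            kept += 1
    return kept


# In-memory stand-ins for the input and output files (invented data)
source = io.StringIO("@1:1:10:20:N\nACGT\n+\nIIII\n@1:1:30:40:Y\nTTTT\n+\n!!!!\n")
target = io.StringIO()
kept = filter_passing(source, target)
```

With real data, source and target would be file handles opened on "unfilteredInput.fastq" (for reading) and "filteredOutput.fastq" (for writing). Only four rows are held in memory at any time.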
