Old File Formats

This page describes the file formats used for downloads from this site.

Sequence .txt Files

These files contain sequences produced by the Illumina Bustard base caller, stored as tab-separated text in which each row corresponds to a single read from the sequencing machine. The first column is in the format lane:tile:x_coordinate_on_tile:y_coordinate_on_tile. The second column is the sequence of the read. The third column is the quality string in numeric format: one numeric score per base, with scores separated by a single space. These values are non-alignment-normalized (i.e. raw) Solexa/Illumina quality scores, not Phred scores. The fourth column is a single boolean value where "Y" means the read passed quality filtering and "N" means it failed.
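As an illustration of the four-column layout described above, the following Python sketch parses one row of a sequence .txt file. The names SequenceRead and parse_sequence_txt_line, and the example row, are made up for this illustration. It also assumes that lane, tile, and the two coordinates are integers, which this page does not state explicitly.

```python
from typing import List, NamedTuple


class SequenceRead(NamedTuple):
    """One row of a sequence .txt file (hypothetical container type)."""
    lane: int
    tile: int
    x: int
    y: int
    sequence: str
    qualities: List[int]   # raw Solexa/Illumina scores, one per base
    passed_filter: bool


def parse_sequence_txt_line(line: str) -> SequenceRead:
    # Columns are tab separated: id, sequence, quality scores, filter flag.
    read_id, sequence, quality_field, filter_flag = line.rstrip("\n").split("\t")
    # Assumption: all four parts of the id are integers.
    lane, tile, x, y = (int(part) for part in read_id.split(":"))
    # Scores are separated by single spaces and may be negative.
    qualities = [int(score) for score in quality_field.split()]
    if len(qualities) != len(sequence):
        raise ValueError("expected one quality score per base")
    return SequenceRead(lane, tile, x, y, sequence, qualities, filter_flag == "Y")


# Made-up example row, not taken from a real run.
example_row = "1:23:456:789\tACGT\t40 40 -5 12\tY"
read = parse_sequence_txt_line(example_row)
print(read.sequence, read.qualities, read.passed_filter)
```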

Sequence .fastq Files

Note: These fastq files contain Illumina/Solexa quality scores rather than the Phred scores sometimes used in fastq-formatted files (see "Description of Quality Scores" below).

These files contain sequences produced by the Illumina Bustard base caller represented as fastq formatted files where each set of 4 rows corresponds to a single read from the sequencing machine. The first of the four rows is in the following format:

@lane:tile:x_coordinate_on_tile:y_coordinate_on_tile:quality_filter

Here quality_filter is a single boolean value in which "Y" means the read passed the default quality filtering performed by the Illumina Pipeline. The second row is the sequence of the read. The third row is the same as the first, except that it begins with + instead of @. The fourth row is the quality string in symbolic ASCII format (ASCII character code = non-alignment-normalized Illumina/Solexa quality value + 64).
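The four-row layout and the +64 encoding can be sketched in Python as follows. The function name read_illumina_fastq and the example record are invented for this illustration, and the sketch assumes that each sequence and each quality string occupies exactly one line.

```python
import io
from typing import Iterator, List, TextIO, Tuple

ILLUMINA_OFFSET = 64  # ASCII code = Illumina/Solexa quality value + 64


def read_illumina_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int], bool]]:
    """Yield (read_id, sequence, scores, passed_filter) for each 4-row record."""
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        if not header.strip():
            continue                   # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        handle.readline()              # third row: '+' followed by the same id
        quality = handle.readline().rstrip("\n")
        header = header.rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("record does not start with '@': " + header)
        # The last colon-separated field of the id is the quality filter flag.
        read_id, _, filter_flag = header[1:].rpartition(":")
        scores = [ord(char) - ILLUMINA_OFFSET for char in quality]
        yield read_id, sequence, scores, filter_flag == "Y"


# Made-up example record, not taken from a real run.
example_fastq = "@1:23:456:789:Y\nACGT\n+1:23:456:789:Y\nhh;L\n"
records = list(read_illumina_fastq(io.StringIO(example_fastq)))
print(records)
```

The decoded values are still raw Solexa/Illumina scores. Subtracting 64 only undoes the character encoding and does not convert them to Phred scores.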

The fastq format is documented here: http://maq.sourceforge.net/fastq.shtml

Sequence Quality Files

If the filename ends with ".bz2", the file has been compressed with bzip2 and must be decompressed before use with the Unix command "bunzip2 <filename>". This only works if the bunzip2 application is installed on your system.

These files contain quality scores produced by the Illumina Bustard base caller, stored as tab-separated text in which each row corresponds to a single read from the sequencing machine.

The first column is in the format lane:tile:x_coordinate_on_tile:y_coordinate_on_tile. The remaining columns contain four space-separated scores for each base, in the order A C G T. The highest of the four scores belongs to the called base.
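A minimal Python sketch of reading one row of such a file is shown below. It assumes the scores are integers and simply splits the row on any whitespace before grouping the values in fours, so it does not depend on exactly how the per-base groups are delimited. The function names and the example row are hypothetical.

```python
from typing import List, Tuple

BASES = "ACGT"  # order of the four scores given for each base


def parse_quality_line(line: str) -> Tuple[str, List[Tuple[int, ...]]]:
    """Return the read id and one (A, C, G, T) score tuple per base."""
    fields = line.split()
    read_id = fields[0]
    values = [int(value) for value in fields[1:]]  # assumption: integer scores
    if len(values) % 4 != 0:
        raise ValueError("expected four scores per base")
    per_base = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return read_id, per_base


def called_sequence(per_base: List[Tuple[int, ...]]) -> str:
    # The called base is the one with the highest of the four scores.
    # If two scores tie, this sketch picks the first in A C G T order.
    return "".join(BASES[scores.index(max(scores))] for scores in per_base)


# Made-up example row, not taken from a real run.
example_row = "1:23:456:789\t40 -40 -40 -40\t-40 -40 30 -40"
read_id, per_base = parse_quality_line(example_row)
print(read_id, called_sequence(per_base))
```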

Description of Quality Scores

Quality scores in these files are represented as non-alignment-normalized (i.e. raw) Solexa/Illumina scores. The sequence .txt and sequence quality .txt files contain numeric quality scores, while the .fastq file contains the same scores with ASCII encoding (ASCII character code = Illumina/Solexa quality value + 64). The difference between Solexa/Illumina scores and Phred quality scores is documented here: http://illumina.ucr.edu/illumina_docs/Alignment%20Scoring%20Guide%20and%20FAQ.html and here: http://maq.sourceforge.net/qual.shtml. According to Illumina "the highest Solexa base score and the Phred score are asymptotically identical. In English this means that for scores of about 15 and above they are so close as to be effectively the same." If you are filtering based on numeric quality scores, it is usually unnecessary to perform any quality score conversion if you set your threshold to around 15 or higher. Keep in mind, however, that the ASCII-encoded quality scores in a fastq file use different character ranges to represent the same values depending on whether they are Illumina or Phred quality scores.
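To make the "asymptotically identical" statement concrete, the Python sketch below applies the standard relationship between the two scales, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), to a few raw scores. The function name solexa_to_phred is chosen for this illustration only.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa/Illumina score to the equivalent Phred score."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


# Print the gap between the two scales for a range of raw scores.
for q_solexa in (-5, 0, 5, 10, 15, 20, 30, 40):
    q_phred = solexa_to_phred(q_solexa)
    print(f"Solexa {q_solexa:>3}  Phred {q_phred:6.2f}  difference {q_phred - q_solexa:5.2f}")
```

The difference shrinks as the score rises and is well under one quality unit at 15 and above, which is why a numeric threshold in that range behaves almost the same on either scale.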

Many bioinformatics tools have a command line option that lets you select either Phred or Solexa/Illumina scaled quality scores for an input file. In addition, scripts exist that convert between these formats.

Converting .fastq Files from Illumina to Phred Score Format

The MAQ alignment tool includes a file format converter that can be used as follows, once MAQ is installed:

maq sol2sanger illumina_format_input.fastq phred_format_output.fastq

It is essential to perform this conversion before using a tool that requires Phred-formatted scores; otherwise the validity of your analysis may be compromised.

Important Note: If your samples can be aligned to a valid reference genome without adapter trimming, and you intend to make use of these scores in your analysis, please contact us for a copy of your alignment-normalized quality scores as produced by Eland.
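For readers who want to see what such a conversion involves, here is a rough Python sketch of re-encoding a single quality string. It is not MAQ's code. It assumes the output uses the common Phred encoding with an ASCII offset of 33 and rounds each converted score to the nearest integer, so its output is not guaranteed to match maq sol2sanger character for character. The function name is hypothetical.

```python
import math

ILLUMINA_OFFSET = 64  # encoding used by the fastq files described on this page
SANGER_OFFSET = 33    # assumption: Phred-scaled output with ASCII offset 33


def illumina_to_sanger_quality(quality: str) -> str:
    """Re-encode one Illumina/Solexa quality string as a Phred quality string."""
    converted = []
    for char in quality:
        q_solexa = ord(char) - ILLUMINA_OFFSET
        # Solexa -> Phred: Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)
        q_phred = 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)
        # Assumption: round to the nearest integer before encoding.
        converted.append(chr(round(q_phred) + SANGER_OFFSET))
    return "".join(converted)


# 'h' encodes a raw score of 40 and '@' a raw score of 0 at offset 64.
print(illumina_to_sanger_quality("h@"))
```

Only the fourth row of each fastq record changes in such a conversion. The header, sequence, and + rows are carried over unchanged.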

Description of Quality Filter

Reads are filtered with the default Illumina pipeline quality filter, which applies a threshold of CHASTITY >= 0.6. Chastity for a given base call is defined as "the ratio of the highest of the four (base type) intensities to the sum of highest two." This filter identifies clusters with a low signal-to-noise ratio, which often results from two adjacent clusters being so physically close together that their signals cannot be measured independently.
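The chastity definition quoted above can be sketched in Python for a single base call, as below. The function names and intensity values are invented for illustration. How the pipeline combines per-base chastity values into a pass or fail decision for a whole read is not described on this page, so the sketch stops at the per-base level.

```python
from typing import Sequence

CHASTITY_THRESHOLD = 0.6  # default threshold quoted on this page


def chastity(intensities: Sequence[float]) -> float:
    """Ratio of the highest of the four intensities to the sum of the highest two."""
    if len(intensities) != 4:
        raise ValueError("expected one intensity per base type (A, C, G, T)")
    highest, second = sorted(intensities, reverse=True)[:2]
    total = highest + second
    if total <= 0:
        # Assumption: treat a non-positive sum as the worst possible chastity.
        return 0.0
    return highest / total


def base_call_passes(intensities: Sequence[float]) -> bool:
    return chastity(intensities) >= CHASTITY_THRESHOLD


# Made-up intensities: a clean signal and a mixed signal from two close clusters.
clean = (900.0, 100.0, 50.0, 20.0)
mixed = (500.0, 450.0, 30.0, 20.0)
print(chastity(clean), base_call_passes(clean))
print(chastity(mixed), base_call_passes(mixed))
```

In the mixed case two intensities are nearly equal, so the ratio falls close to 0.5 and the base call fails the threshold.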
