Constrained traversal of repeats with paired sequences. Sébastien Boisvert, Élénie Godzaridis, François Laviolette & Jacques Corbeil. First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, March 26-27 2011, Vancouver, BC, Canada.
Background: New DNA sequencers yield millions of reads. These can be paired under various physical constraints. Furthermore, statistical constraints can accompany physical constraints to enhance the resulting assembly. Repeated regions remain mostly uncharacterized in de novo assemblies obtained from short reads. We aimed to improve assembly quality with a novel approach.
Results: Here we describe an algorithm that effectively uses paired information to walk across repeated regions using statistical constraints and optimal read markers. We obtained 43 contiguous sequences covering 99.59% of the genome of Escherichia coli K-12 MG1655 (no misassembly; 1 substitution error; no unknown nucleotide) with three libraries with outer distances of 200, 1000 and 10000 base pairs. This result was obtained without scaffolding, thus showing the feasibility of repeat traversal during genome assembly.
Conclusions: Reads in pairs allow the traversal of some repeated regions, but not all: even when the outer distance is larger than the repeat length, some repeats are not bridged by any pair. The proposed algorithm permits optimal placement of the bridging pairs that do exist. Hence, statistical constraints improve traversal of repeats.
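
The bridging condition in the conclusions can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes a simplified model in which a pair bridges a repeat only when both reads lie entirely in unique sequence on opposite sides of it. The function names, coordinates, read length and repeat boundaries are all hypothetical and chosen only for illustration.

```python
from typing import Iterable


def pair_bridges_repeat(left_start: int, outer_distance: int, read_length: int,
                        repeat_start: int, repeat_end: int) -> bool:
    """Return True if both reads of a pair flank the repeat without touching it.

    Simplifying assumption: coordinates are 0-based and half-open, the repeat
    occupies [repeat_start, repeat_end), and the pair spans exactly
    outer_distance bases from the start of the left read to the end of the
    right read.
    """
    left_end = left_start + read_length
    right_start = left_start + outer_distance - read_length
    return left_end <= repeat_start and right_start >= repeat_end


def count_bridging_pairs(left_starts: Iterable[int], outer_distance: int,
                         read_length: int, repeat_start: int,
                         repeat_end: int) -> int:
    """Count the pairs (given by their left-read start) that bridge the repeat."""
    return sum(
        pair_bridges_repeat(s, outer_distance, read_length, repeat_start, repeat_end)
        for s in left_starts
    )


# Hypothetical example: a 500 bp repeat and a library with outer distance 1000.
repeat_start, repeat_end = 1000, 1500
left_starts = [400, 900, 980]  # made-up left-read start positions
print(count_bridging_pairs(left_starts, 1000, 50, repeat_start, repeat_end))  # prints 1
```

In this toy setting the outer distance (1000) exceeds the repeat length (500), yet only the pair starting at 900 bridges the repeat: the pair at 400 places its second read inside the repeat, and the pair at 980 starts its first read too close to the repeat boundary. If the library happened to contain no pair starting in the narrow window that satisfies both conditions, the repeat would not be bridged at all.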