Hybrid assembly of the large and highly repetitive genome of ... We call our system the Maryland Super-Read Celera Assembler, abbreviated MaSuRCA and pronounced "mazurka". Building on the conclusions of GAGE and the Assemblathon, iMetAMOS runs ... MaSuRCA can assemble data sets containing only short reads from Illumina sequencing, or a mixture of short reads and long reads (Sanger, 454, PacBio and Nanopore). Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742. For assembly with Illumina-only data, the NGA50 contig size of the MaSuRCA assembly was twice that of the ALLPATHS-LG assembly, whereas the number of errors was 62% larger. The following is required; all current major Linux distributions include ... Short-read assembly is a critical part of most genome studies using ... The University of Maryland genome assembly group develops methods for improving genome assembly.
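The comparison above is stated in terms of NGA50. As a rough illustration of the underlying idea, the sketch below computes the simpler N50 and NG50 statistics from a list of contig lengths. It is only an approximation of what the text reports: NGA50 is normally computed on contigs that have been aligned to a reference and split at misassemblies, which this sketch does not attempt. The function name `nx50` and the example lengths are hypothetical.

```python
def nx50(lengths, genome_size=None):
    """Return N50 of the contig lengths, or NG50 when genome_size is given.

    N50 is the length L such that contigs of length >= L cover at least
    half of the total assembly length. NG50 uses half of the (estimated)
    genome size instead of half of the assembly length.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # Only reachable for NG50: the contigs cover less than half the genome.
    return 0


# Hypothetical contig lengths, for illustration only.
contigs = [2, 2, 2, 3, 3, 4, 8, 8]
n50 = nx50(contigs)
ng50 = nx50(contigs, genome_size=40)
```

Passing `genome_size` makes assemblies of different total length comparable, which is the main reason the NG and NGA variants are preferred when evaluating assemblers against each other.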
Automated ensemble assembly and validation of microbial genomes. Not unexpectedly, the mmu16 dataset was more challenging than the bacterial genome. Underlying software includes the Jellyfish k-mer counter, a modified version of the Celera Assembler, the super-reads method for extending short reads and ... SOAPdenovo2 produced small contigs with a large number of errors. The theory and practice of genome sequence assembly. The best assembly of this dataset, as selected by iMetAMOS, was MaSuRCA with k = 35. It might work on other Unix-like systems, but it is not well tested on them. The mega-reads software, which is now incorporated into the MaSuRCA assembler, can handle hybrid assemblies of almost any plant or animal genome, including genomes as large as the 22 Gbp loblolly pine. The annotation and the genomic position are shown on the consensus sequence. We use this method to produce an assembly of the large and complex genome of ... Motivation: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. MaSuRCA (Maryland Super-Read Celera Assembler) is a whole-genome assembly package that can combine short and long reads from different sequencing hardware. MaSuRCA assembler was genome and data dependent, as it ...
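Since k-mer counting is named above as one of the underlying components, a minimal sketch of the idea may help. This is not how Jellyfish is implemented (it is a dedicated, memory-efficient tool); the sketch simply counts every length-k substring of each read with a Python `Counter`. The function name `count_kmers` and the example reads are hypothetical, and reverse complements are not merged here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every k-mer (length-k substring) in the reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical reads, for illustration only.
kmer_counts = count_kmers(["ACGTAC", "CGTA"], k=3)
```

In an assembler, counts like these are typically used to separate k-mers that are likely sequencing errors (seen very few times) from those that are likely genuine, and to flag k-mers from repeats (seen far more often than the average coverage).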