As sequencing costs have dropped significantly over recent years, the number of whole genome sequencing (WGS) projects is growing rapidly. Often the end result consists of assembled and annotated genomes. Researchers are required to submit these to the National Center for Biotechnology Information (NCBI) archive prior to publication of the manuscript. Although a genome submission can seem like a small step in the process, sharing sequence data through these databases can be a complex task because many validation criteria apply. For this reason, in this blog we provide some recommendations to help researchers who want to submit their genome sequence get the job done.
To help the scientific community with genome submissions, we have published an article in collaboration with the bioinformatics group of Wageningen University*. The aim is to save valuable time and money when submitting your genome sequences. The article also provides feedback for the organizations responsible for genome (and functional) databases, encouraging them to put more focus on data standardization and to provide clearer acceptance criteria. This blog is a short overview of our findings; for the complete article see http://bib.oxfordjournals.org/content/early/2015/12/10/bib.bbv104.short?rss=1.
Preparing genome sequences for a GenBank submission is not straightforward. Building an annotated genome involves genome assembly, structural annotation, and functional annotation, and requires proper validation after each step. Many tools and pipelines are available for this task, and although FASTA, GFF and GenBank are considered standard formats, the methods vary greatly in their output, both in content and in file format. The NCBI aims to convert this output into standardized database formats and offers conversion tools to make sure that the same rules apply to all genomes entered. Despite these efforts, scientists run into many different issues during genome submissions. The recommendations below are aimed at researchers who want to submit their data; more details can be found in the complete article. The article also provides suggestions for the developers of assembly, annotation and submission tools, which are not discussed in this blog.
The preparation of GenBank-compliant genome submissions can be a time-consuming process for many researchers. The key points are to produce a GenBank-compliant genome sequence first and only then annotate the accepted sequence, to make full use of GenBank’s manuals and the error reports of tbl2asn, and to avoid error propagation in sequence databases by using strict criteria in functional annotation. With the recommendations presented in the published article by BaseClear and Wageningen University, researchers can save valuable time and money on genome preparation and submission. We also hope to make a positive contribution to an enhanced genome data infrastructure and accessibility, a goal that is also actively pursued by the Data FAIRport initiative (http://datafairport.org/).
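To illustrate the first key point, checking the sequence before investing time in annotation, here is a minimal Python sketch of a pre-submission sanity check on the assembled contigs. The function names and the specific checks (a 200 bp minimum contig length, no N at contig ends, unique identifiers, IUPAC nucleotide characters only) are illustrative assumptions on our part, not the official validation rules and not taken from the article. GenBank's manuals and the tbl2asn error reports remain the authority.

```python
from collections import Counter

# Assumed minimum contig length for illustration; verify against current GenBank guidance.
MIN_LENGTH = 200
# IUPAC nucleotide codes, including ambiguity codes and N.
ALLOWED = set("ACGTRYSWKMBDHVN")


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # The identifier is the first word of the header line.
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_contigs(records, min_length=MIN_LENGTH):
    """Return a list of human-readable problems found in the contigs."""
    records = list(records)
    problems = []
    # Duplicate identifiers make it impossible to link annotation to sequence.
    for name, count in Counter(name for name, _ in records).items():
        if count > 1:
            problems.append(f"{name}: identifier used {count} times")
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < min_length:
            problems.append(f"{name}: only {len(seq)} bp (minimum {min_length})")
        if seq.startswith("N") or seq.endswith("N"):
            problems.append(f"{name}: starts or ends with N")
        unexpected = set(seq) - ALLOWED
        if unexpected:
            problems.append(f"{name}: unexpected characters {sorted(unexpected)}")
    return problems


if __name__ == "__main__":
    # Small in-memory example; for a real file, pass open("assembly.fasta") instead.
    example = [">contig1 demo", "ACGT" * 60, ">contig2", "NNACGT"]
    for problem in check_contigs(read_fasta(example)):
        print(problem)
```

A check like this does not replace the official validation. It only catches a few obvious issues early, so that the annotation is built on a sequence that is less likely to be rejected.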
And of course keep in mind that if you still need a little help you can always get in contact with BaseClear’s bioinformaticians through our contact form. 🙂
Walter Pirovano – Director Bioinformatics department BaseClear B.V.
* Special thanks go to Sandra Smit and Martijn Derks from Wageningen University.