As sequencing costs have dropped significantly in recent years, the number of whole genome sequencing (WGS) projects is increasing rapidly. Often the end result consists of assembled and annotated genomes. Researchers are required to submit these to the National Center for Biotechnology Information (NCBI) archive prior to publication of the manuscript. Although a genome submission can seem like a small step in the process, sharing sequence data through these databases may be a complex task, as many validation criteria apply. For this reason, this blog provides some recommendations for researchers who want to submit their genome sequences.
To help the scientific community with genome submissions, we have published an article in collaboration with the bioinformatics group of Wageningen University*. The article aims to save valuable time and money when submitting your genome sequences. It also provides feedback for the organizations responsible for genome (and functional) databases, asking them to put more focus on data standardization and to provide clearer acceptance criteria. This blog is a short overview of our findings; for the complete article see http://bib.oxfordjournals.org/content/early/2015/12/10/bib.bbv104.short?rss=1.
Preparing genome sequences for a GenBank submission is not straightforward. Building an annotated genome involves genome assembly, structural annotation, and functional annotation, and requires proper validation after each step. There are many tools and pipelines available for this task, and although FASTA, GFF and GenBank are considered standard formats, their output varies greatly in both content and file format. The NCBI aims to convert this output into standardized database formats and offers conversion tools to make sure that the same rules apply to all submitted genomes. Despite these efforts, scientists experience many different issues when submitting genomes. The recommendations below are targeted at researchers who want to submit their data. More details can be found in the complete article, which also provides suggestions for the developers of assembly, annotation and submission tools; these are not discussed in this blog.
Recommendations for NCBI-compliant genome assembly submissions
- It is strongly recommended to remove foreign DNA before annotation, to submit the ‘clean’ assembly, and to annotate the genome only after acceptance by GenBank.
- It is important to make full use of GenBank’s error reports, otherwise some issues preventing submission will remain invisible.
Recommendations for NCBI-compliant structural genome annotations
- It is recommended for researchers to specify or check the minimum intron lengths during annotation and to have a close look at the gene properties, including exon and intron lengths, before submission.
- It is the responsibility of the user to explicitly annotate five-prime and three-prime partial genes in the feature table format, a five-column, tab-delimited table of feature locations and qualifiers. In addition, tbl2asn generates a warning if a partial gene does not start on a consensus splice site, but partial genes are often annotated differently.
- At present, the NCBI also allows submission of annotated scaffolds, rather than contigs. A problem that may arise during the annotation of scaffolds is that some prediction tools allow start and end sites within the gaps. GenBank does not accept this, so these features have to be removed before submission, which can be time-consuming if done manually. The problem can be avoided by annotating the (ungapped) contigs rather than the scaffolds, but this requires coordinate conversion from contigs to scaffolds if the scaffolds are to be submitted.
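Several of the checks above are easy to script before running tbl2asn. The following is a minimal Python sketch, not part of the published article: all function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and gaps are assumed to be encoded as runs of N in the scaffold sequence. It lists the gaps in a scaffold, flags features whose start or end falls inside a gap, computes intron lengths from exon coordinates, and shifts contig coordinates to scaffold coordinates.

```python
import re


def find_gaps(sequence):
    """Return 1-based inclusive (start, end) spans of N runs in a scaffold."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def endpoint_in_gap(start, end, gaps):
    """True if the feature's start or end coordinate lies inside any gap."""
    return any(
        gap_start <= pos <= gap_end
        for pos in (start, end)
        for gap_start, gap_end in gaps
    )


def intron_lengths(exons):
    """Intron lengths between 1-based inclusive, non-overlapping exon spans."""
    ordered = sorted(exons)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def contig_to_scaffold(pos, contig_offset):
    """Shift a 1-based contig position by the contig's 0-based scaffold offset."""
    return pos + contig_offset


# Toy scaffold: a 4 bp contig, a 5 bp gap, then a 6 bp contig.
scaffold = "ACGT" + "N" * 5 + "ACGTAC"
gaps = find_gaps(scaffold)                      # [(5, 9)]
bad = endpoint_in_gap(2, 7, gaps)               # True: position 7 is in the gap
introns = intron_lengths([(1, 10), (21, 30)])   # [10]
first_base = contig_to_scaffold(1, 9)           # 10: second contig starts here
```

The intron lengths can then be compared against whatever minimum GenBank currently enforces; the sketch deliberately does not hard-code a threshold.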
Recommendations for NCBI-compliant functional genome annotation
- Avoid error propagation in sequence databases by using strict E-values and criteria for homology-based functional annotation.
- Some UniProt/Swiss-Prot protein product names need to be adjusted before genome submission. GenBank’s error report can be used to identify the names that prevent submission, and manual correction is often necessary.
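As an illustration of the first recommendation, the sketch below filters homology search hits before their annotations are transferred. It assumes the hits come from BLAST in the default 12-column tabular format (`-outfmt 6`), where the third column is percent identity and the eleventh is the E-value. The function name and both thresholds are hypothetical placeholders, not values from the article, and should be tuned to your own data.

```python
def strict_hits(lines, max_evalue=1e-10, min_identity=40.0):
    """Yield (query, subject) pairs for hits passing both thresholds.

    Expects BLAST tabular lines (-outfmt 6): qseqid, sseqid, pident, ...,
    evalue (column 11), bitscore. The default thresholds are illustrative only.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            yield fields[0], fields[1]


# Two made-up hits: the first passes, the second has a weak E-value.
example = [
    "gene1\tP12345\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene2\tQ67890\t35.0\t80\t52\t0\t1\t80\t1\t80\t0.5\t30",
]
kept = list(strict_hits(example))
```

Only hits that survive the filter would then be used as a source of product names, which limits how far an incorrect annotation can spread.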
The preparation of GenBank-compliant genome submissions can be a time-consuming process for many researchers. The key points are to produce a GenBank-compliant genome sequence first and then annotate the accepted sequence, to make full use of GenBank’s manuals and the error reports of tbl2asn, and to avoid error propagation in sequence databases by using strict criteria in functional annotation. With the recommendations presented in the published article by BaseClear and Wageningen University, researchers can save valuable time and money on genome preparation and submission. We also hope to make a positive contribution to an enhanced genome data infrastructure and accessibility, a goal that is also actively pursued by the Data FAIRport initiative (http://datafairport.org/).
And of course keep in mind that if you still need a little help you can always get in contact with BaseClear’s bioinformaticians through our contact form. 🙂
Walter Pirovano – Director Bioinformatics department BaseClear B.V.
* A special thanks goes to Sandra Smit and Martijn Derks from Wageningen University