Pathogen detection by shotgun metagenome sequencing

Whether it is Salmonella in your chicken sandwich, Listeria in your favourite cheese, or Aspergillus on your fruits, unwanted microbes in the food production chain are a danger to human health, and are costly to monitor for the food industry. Therefore, it is important to have an efficient, fast and accurate way to detect pathogens along the entire food production chain.

Historically, culture-based methods have been seen as the gold standard for detecting foodborne pathogens. Although these types of tests are very affordable and simple, they have notable downsides. Culture-based methods are time-consuming, require additional follow-up analysis, and only show pathogens that grow on your media of choice. In recent years, metagenomic sequencing techniques have developed significantly, becoming an accessible, faster and affordable alternative testing method. For this reason, an increasing number of companies in the food industry are looking at metagenomics for pathogen detection. However, data analysis and interpretation require appropriate, validated bioinformatics and biostatistical tools.

Therefore, BaseClear has developed an innovative pathogen detection method that is comprehensive, fast and able to detect contaminants at low abundance levels. This rapid method is assembly-free (based on the state-of-the-art tools Kraken2 [1] and Bracken [2]) and uses specific pathogen databases and downstream analyses to check for low-abundance, potentially pathogenic species. Our pipeline can be used on metagenomic sequencing data from a wide array of food industry samples, for example dairy, meat, plant-based products and pre-processed meals.

Detecting pathogens below 0.01% abundance

In metagenomic analysis, low-abundance species are generally difficult to distinguish from false-positive hits. Common bioinformatics tools discard all results below 1% abundance, which makes it impossible to detect low-abundance pathogens. However, many pathogens can cause infection at a low infective dose, e.g. Salmonella spp. [3].

Our newly developed bioinformatics method can distinguish false positives from low-abundance hits. We tested this with a sequenced mock community that includes the low-abundance species shown in Table 1.

Species                    Abundance
Salmonella enterica        0.01%
Enterococcus faecalis      0.001%
Clostridium perfringens    0.0001%

Table 1: Overview of pathogens present at low abundance in the ZymoBIOMICS mock community.
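
To put these abundances in perspective, the short calculation below estimates how many reads each species would contribute at a given sequencing depth. It is only a rough sketch: it assumes that relative abundance translates directly into the fraction of reads, and the depth of 10 million reads is an illustrative figure, not the depth used for this dataset.

```python
# Relative abundances (in percent) taken from Table 1.
TABLE_1 = {
    "Salmonella enterica": 0.01,
    "Enterococcus faecalis": 0.001,
    "Clostridium perfringens": 0.0001,
}


def expected_reads(total_reads: int, abundance_percent: float) -> int:
    """Expected read count, assuming abundance equals the fraction of reads."""
    return round(total_reads * abundance_percent / 100)


# Illustrative sequencing depth (an assumption, not a value from this study).
TOTAL_READS = 10_000_000

for species, pct in TABLE_1.items():
    print(f"{species}: ~{expected_reads(TOTAL_READS, pct)} reads")
```

Under these assumptions the rarest species would be represented by roughly ten reads, which is why read counts alone are a weak basis for separating true hits from false ones.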

To distinguish true positives from false positives, we use the number of distinct genomic regions that are found in the data of your samples. For species that are truly present in a sample, our pipeline usually finds sequencing reads matching throughout the entire genome. Species that are not truly present but are detected by mistake, on the other hand, show matches in only a few genomic regions. This could be due to sequencing error. We can use this number of distinct matching regions in the genome, or ‘minimizers’, to distinguish between true and false positives by setting a threshold. This is visualised in Figure 1, where we plot the number of minimizers for every hit in the data. Here we see that many species have a high number of distinct genomic regions (red/orange), while there is a long tail of species with a low number of distinct regions (blue). The species expected to be truly present in the sample coincide with the species that have a high number of distinct regions. Based on testing and validation on different simulated samples, we could set up a distinct-minimizer threshold system. For example, this allowed us to detect both S. enterica and E. faecalis, present at only 0.01% and 0.001% in the mock community.

Figure 1: The number of distinct minimizers plotted per species detected in a mock community. The red and orange data points indicate species expected and found in the sample. The green line represents the threshold used. The blue data points indicate false positives that are filtered out.
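
The thresholding idea can be sketched in a few lines of Python. This is not BaseClear's implementation: it assumes a Kraken2 report written with the `--report-minimizer-data` option (eight tab-separated columns, with the distinct minimizer estimate in the fifth), and the cut-off of 1000, the example rows and the placeholder species are all invented for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    taxid: int
    clade_reads: int
    distinct_minimizers: int


def parse_species_hits(report_text: str) -> list[Hit]:
    """Parse species-level rows from a Kraken2 report with minimizer columns.

    Assumed columns: percent, clade reads, direct reads, minimizers,
    distinct minimizers, rank code, taxid, name.
    """
    hits = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 8 or fields[5].strip() != "S":
            continue  # skip malformed rows and non-species ranks
        hits.append(Hit(
            name=fields[7].strip(),
            taxid=int(fields[6]),
            clade_reads=int(fields[1]),
            distinct_minimizers=int(fields[4]),
        ))
    return hits


def filter_hits(hits: list[Hit], min_distinct: int) -> tuple[list[Hit], list[Hit]]:
    """Split hits into (kept, dropped) using a distinct-minimizer threshold."""
    kept = [h for h in hits if h.distinct_minimizers >= min_distinct]
    dropped = [h for h in hits if h.distinct_minimizers < min_distinct]
    return kept, dropped


# Invented example rows; the last species is a placeholder false positive.
EXAMPLE_REPORT = "\n".join([
    "0.01\t1000\t0\t52000\t48000\tG\t590\t      Salmonella",
    "0.01\t1000\t1000\t52000\t48000\tS\t28901\t        Salmonella enterica",
    "0.00\t100\t100\t5100\t4900\tS\t1351\t        Enterococcus faecalis",
    "0.00\t90\t90\t120\t15\tS\t999999\t        Placeholder species",
])

HYPOTHETICAL_THRESHOLD = 1000  # illustrative only, not the validated cut-off

kept, dropped = filter_hits(parse_species_hits(EXAMPLE_REPORT), HYPOTHETICAL_THRESHOLD)
print("kept:", [h.name for h in kept])
print("dropped:", [h.name for h in dropped])
```

In this invented example the placeholder species has almost as many reads as E. faecalis, but its reads cover very few distinct minimizers, so only the minimizer count separates the two.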

Functional analysis to detect true pathogenic species

When a potentially pathogenic species is detected, it is not always known whether this species is an actual pathogen. Looking into its functional potential (the genes present in its genome) gives the user an idea of whether the detected species might actually be able to cause harm.

Therefore, we also implemented an additional tool (based on HUMAnN3 [4]) to detect which gene families and pathways are present in the sample after detection of the potential pathogen. We simulated samples in silico that contain pathogenic Salmonella enterica strains. We compared the gene families detected with the gene families known to be present in different pathogenic S. enterica subtypes [5].

We found multiple virulence factors such as:

  • Agf (thin aggregative fimbriae, or curli): Aids in attachment to the villi of enterocytes and also causes the bacteria to attach to each other.
  • Lpf (long polar fimbriae): Extracellular matrix adhesin involved in intestinal colonization.
  • Vi antigen: Prevents antibody-mediated opsonization, and increases resistance to host peroxide, to complement activation by the alternative pathway and to complement-mediated lysis.
  • CdtB: Causes chromatin disruption, which leads to G2/M-phase growth arrest of the target cell and ultimately cell death.

If such virulence factors are detected it is very likely that the pathogen detected in your sample is actually harmful. This functional analysis thus offers an extra layer of information on top of the taxonomic profiles, and can aid in distinguishing contaminated from non-contaminated samples.
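
The comparison step can be illustrated with a small Python sketch. It assumes the detected gene families have already been mapped to gene names, and the reference dictionary below is a hand-written stand-in with example marker genes, not an extract from VFDB [5]. The function name is likewise hypothetical.

```python
# Illustrative reference: virulence factor -> example marker genes.
# A real analysis would draw this mapping from a curated source such as VFDB.
REFERENCE_FACTORS = {
    "Agf (thin aggregative fimbriae)": {"agfA", "agfB"},
    "Lpf (long polar fimbriae)": {"lpfA", "lpfC"},
    "Vi antigen": {"tviB", "vexA"},
    "CdtB": {"cdtB"},
}


def match_virulence_factors(detected_genes, reference):
    """Return, per virulence factor, the reference genes found in the sample."""
    detected = {gene.lower() for gene in detected_genes}
    matches = {}
    for factor, genes in reference.items():
        found = sorted(g for g in genes if g.lower() in detected)
        if found:
            matches[factor] = found
    return matches


# Hypothetical list of genes detected in a sample.
sample_genes = ["rpoB", "lpfA", "cdtB", "gyrA"]

for factor, genes in match_virulence_factors(sample_genes, REFERENCE_FACTORS).items():
    print(f"{factor}: {', '.join(genes)}")
```

A sample with no matches returns an empty dictionary, which in this simplified picture would correspond to a species detected without supporting virulence genes.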


With the development of our pathogen detection pipeline, we can now offer false-positive-adjusted, species-level taxonomic analysis by shotgun metagenomic sequencing. The pipeline makes it possible to quickly scan metagenome data for food pathogens, and we are able to detect known and unknown pathogens. We combined state-of-the-art tools into a single pipeline with a more user-friendly output and a better way to distinguish low-abundance pathogens from false positives.

Advantages of our pathogen detection pipeline:

  • Dedicated databases covering pathogenic genera from different domains
  • The database is customizable according to the client's needs
  • False-positive adjustment reduces the number of low-abundance false positives
  • Fast detection of contamination
  • Detection of low-abundance species
  • User-friendly output table
  • Additional functional analysis to investigate the pathogenicity of the detected species

Altogether, this integrated metagenome pipeline offers a reliable approach to quickly detect pathogens in the food industry environment, which has great potential for accurate risk assessment, food safety and public health. At the same time, this method can replace more laborious and less precise traditional methods of pathogen detection.

Get in touch with one of our experts!

Contact BaseClear


  1. Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 1-13.
  2. Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 3, e104.
  3. Blaser, M. J., & Newman, L. S. (1982). A review of human salmonellosis: I. Infective dose. Reviews of Infectious Diseases, 4(6), 1096-1106.
  4. Franzosa, E. A., et al. (2018). Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11), 962-968.
  5. Chen, L., Yang, J., Yu, J., Yao, Z., Sun, L., Shen, Y., & Jin, Q. (2005). VFDB: a reference database for bacterial virulence factors. Nucleic Acids Research, 33(suppl_1), D325-D328.
