Type Strain Genome Server

Use of this form is free for academic purposes. For all other uses, please contact the authors via the feedback form.

"Academic purposes" do not include running queries just to collect and potentially plagiarize the type-strain genomes and other associated meta data included in the TYGS database. If you are unsure if and how to properly use the TYGS, do not hesitate to contact the TYGS team via the contact form.

Browsing the TYGS web page is free and does not require registration.

For submitting a TYGS job, an e-mail address has to be provided along with some genome sequences and/or GenBank accession IDs, the subject of the subsequent analysis. The e-mail address is the only piece of person-related information stored by the TYGS. All data associated with a TYGS job, including the e-mail address, are deleted after the job has finished and an additional amount of time has passed. The exact time of deletion is indicated in the notification e-mail and in the information badges in the top-right corner of the TYGS result page.

Additional information is found in the general privacy statement of the Leibniz Institute DSMZ – German Collection of Microorganisms and Cell Cultures, in accordance with the General Data Protection Regulation.

The TYGS is currently limited to 20 user genomes. This is not a limitation of the method; we simply have to keep an eye on compute cluster usage. For most use cases this limit is more than enough. On request, however, we usually increase the upload cap for a given e-mail address beyond the default of 20 user genomes, provided the requested cap is not exorbitantly high.

Before asking for an increased upload cap, please consider the following questions and prepare your answers:

  1. How large is your final dataset?
  2. Is your dataset really final?
  3. In case you want to analyze a few of your own isolates together with a rather large list of 'reference strains' which you have obtained from some other web site:
    • Which insights do you expect to get from such an analysis?
    • Why can't you analyze your isolates through the TYGS without these 'reference strains' (also please read the FAQ item on type strains vs. 'reference strains')?
  4. Have you removed redundant strains (e.g. clonal strains) from the dataset to reduce its size and, if not, why?
  5. Is your dataset a rather diverse selection of strains or do the strains belong to a group of closely related strains?
  6. Are your files properly labelled (e.g. labels will be shown in the trees)?
  7. Does your dataset contain type-strain genomes and, if yes, why? The TYGS already provides type-strain genomes.
  8. Do you want to restrict the TYGS analysis to the uploaded genome sequences or do you want to use the TYGS in standard mode (i.e. the TYGS will determine closely related type strain genomes)?

How does the procedure work?

Please use the TYGS feedback form and answer the above questions as completely as possible. Upon receipt of your request, we will check it and might come back to you with further questions (basically to narrow down the computational complexity of your specific job). If everything is fine, we will set up an exception for your e-mail address which allows you to submit TYGS requests with the requested number of user sequences. If the number of strains exceeds ~100, the web server may not be able to handle this amount of data, and we will have to submit the files on your behalf from within our network. In that case the file exchange will be organized via a confidential folder on our institutional cloud.

The exception will usually be valid for a week but can normally be extended on request. Please note that your job(s) will use some resources of our in-house compute cluster. That is, please do not start multiple redundant jobs of that size if this can be avoided.

Please see the GGDC/VICTOR FAQ for detailed instructions on how to specify accession numbers.

After a click on the "browse" button in the submission form, you can hold the CTRL key in the file selection dialog and use the left mouse button to click and select a custom set of genome files. The key combination CTRL+A will select all files in the folder. If you hold the SHIFT key instead, you can select a range of files. Both techniques should work in any type of browser and these mechanisms are entirely independent of the TYGS.

Type strains form the backbone of prokaryotic systematics as nomenclatural types of species and subspecies, and comparisons with established type strains are mandatory when classifying novel strains (PMID: 19700448).

The good news is that the TYGS is designed to automatically determine type strains closely related to your query genome(s).

In contrast, the term 'reference strain' is often used synonymously with the term 'type strain', but the former term is not sharply defined. It is an arbitrary label which can basically be put on any strain, even on strains which are not type strains. This can lead to serious taxonomic confusion if strains are mistaken for type strains. That is, be careful when preparing your dataset and when collecting lists of 'reference strains' from other web pages.

In default mode, the TYGS already determines a set of closely related type-strain genomes for each of your provided genomes. The exact procedure is described in the TYGS publication. If you nevertheless upload type-strain genomes, this will of course result in duplicate genome sequences throughout your results because your uploaded genomes will perfectly match the respective type-strain genomes from the TYGS database. In case you have clicked the checkbox 'Restrict job to above genome(s)?', the TYGS will skip the determination of closely related type-strain genomes entirely and only focus on the genome sequences and accessions you have provided via the submission form.

While the TYGS database attempts to be as comprehensive as possible, a specific type-strain genome may be missing for a variety of reasons:

  • The genome has not yet been sequenced.
  • The genome sequence has been obtained but has not been deposited in public databases.
  • The genome sequence has been deposited in public databases but cannot be identified because its metadata lack crucial information.
  • The genome sequence has been deposited in public databases but cannot be identified because a deposit of the type strain was used that is unknown to the TYGS database.
  • The genome sequence was identified in public databases but is still under investigation by the TYGS team.
  • The genome sequence can be identified in public databases but fails the TYGS quality checks.

If you are aware of a type-strain genome sequence that is missing in the TYGS database, please contact the maintainers and provide information on this genome sequence as indicated above.

Prior to sending a report on a missing type-strain genome sequence, make sure the species or subspecies is not contained in the list for manual genome selection. If it is contained, it is better to file a report on a type-strain genome sequence that is contained but not automatically found.

Please see the GGDC/VICTOR FAQ for detailed explanations of why an e-mail may get lost. Also note that the TYGS results are additionally displayed on a website.

Table 3 of the TYGS result page contains the pairwise dDDH values between your user genomes and the selected type-strain genomes. The dDDH values are provided along with their confidence intervals (C.I.) for the three different GBDP (Genome BLAST Distance Phylogeny) formulas:

  • formula d0 (a.k.a. GGDC formula 1): length of all HSPs divided by total genome length
  • formula d4 (a.k.a. GGDC formula 2): sum of all identities found in HSPs divided by overall HSP length
  • formula d6 (a.k.a. GGDC formula 3): sum of all identities found in HSPs divided by total genome length

More info on these formulas and the underlying GBDP method is found in the literature.

Note: Formula d4 is independent of genome length and is thus robust against the use of incomplete draft genomes. For other reasons for preferring formula d4, see the FAQ. Formulae d0 and d6 reflect the genome pair's (dis-)similarity in gene content.
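The three ratios listed above can be illustrated with a small worked example. This is only a sketch of the ratios as worded in the list, not the TYGS or GGDC implementation: the `HSP` record, its field names, the function `gbdp_ratios` and the way "total genome length" is passed in as a single number are all assumptions made for illustration, and the conversion of these ratios into dDDH values with confidence intervals is not shown.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """One high-scoring segment pair (hypothetical record for illustration)."""
    length: int      # length of the HSP
    identities: int  # identical positions within the HSP


def gbdp_ratios(hsps, total_genome_length):
    """Return the ratios described for formulas d0, d4 and d6.

    Assumption: total_genome_length is supplied by the caller as one number;
    how it is derived from a genome pair is not specified here.
    """
    if total_genome_length <= 0:
        raise ValueError("total_genome_length must be positive")
    hsp_length = sum(h.length for h in hsps)
    identities = sum(h.identities for h in hsps)
    return {
        # d0: length of all HSPs divided by total genome length
        "d0": hsp_length / total_genome_length,
        # d4: sum of identities in HSPs divided by overall HSP length
        "d4": identities / hsp_length if hsp_length else 0.0,
        # d6: sum of identities in HSPs divided by total genome length
        "d6": identities / total_genome_length,
    }


# Made-up numbers: two HSPs and two different genome-length values.
hsps = [HSP(length=1000, identities=950), HSP(length=500, identities=450)]
print(gbdp_ratios(hsps, total_genome_length=3000))
print(gbdp_ratios(hsps, total_genome_length=2000))
```

In this sketch the d4 value is the same in both calls because the genome length never enters its computation, whereas d0 and d6 change with it. This mirrors the remark above that formula d4 is independent of genome length.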

For some species the TYGS database contains the genome sequences of more than one strain deposit (e.g. ATCC and DSM). If a user-provided genome sequence results in a close match with such a species, all strain deposits of that species are usually included in the TYGS result. The main reason is that the scientific literature reports rare cases in which such strains unexpectedly differ to a considerable extent, thus indicating a strain confusion or contamination (please find an example here). That way, the TYGS is an important tool to uncover such irregularities.

Apart from that, we think that having more than one strain deposit of the same species included in the dataset is at most a cosmetic issue, not a scientific problem. If you still want to remove such "duplicates", you have the option to download the trees in Newick format and remove them. We however advise against post-manipulation of ready-made results.

The TYGS shows such matches because it may be important for any taxonomist to be aware of all closely related species or subspecies even if their names are not (yet) validly published. Since several criteria have to be met before a new species name is validly published (see details on LPSN page), the entire process might take some time. For example, it can well be that a novel species or subspecies name was already proposed in an effective publication but has still not been announced in a Validation List. In theory, a second team might now start working on the description of the same taxon, resulting in redundant work. That is, if your novel strain is placed in the same species or subspecies cluster as a species or subspecies with a not (yet) validly published name, we recommend getting in touch with its authors. An effective publication may be available and the valid publication of the name may be imminent. And even in the case of a low probability of a forthcoming validation it is often worth reporting phylogenetically close relationships to taxa that lack a validly published name. Their names occur in databases anyway and their analyses may yield valuable information.

When two species are regarded as heterotypic synonyms, this does not affect the status of type strains as type strains. The type strain of the younger heterotypic synonym remains the type strain of the younger heterotypic synonym; it neither becomes the type strain of the older heterotypic synonym nor does it lose its status as type strain entirely. The TYGS aims at an unambiguous relationship between taxon names and type strains. For each type strain the taxon name (or set of taxon names) for which it is the type strain must be shown to achieve this, even if the taxon name is actually believed to be a younger heterotypic synonym and thus not believed to be the correct name. Notably, as specified in the International Code of Nomenclature of Prokaryotes, if alternatives are available the choice of the correct name depends on taxonomic opinion.

Moreover, the analyses conducted with the TYGS yield taxon boundaries themselves. Information on heterotypic synonyms is thus part of the outcome and should not be part of the input. Often the TYGS results simply confirm known synonym relationships. But surprises are also possible. Additional information on known heterotypic synonyms is available from LPSN, to which the TYGS results are linked. Heterotypic synonyms at the species or subspecies level may contain distinct genus names. To assess the affiliation to a genus it is necessary to consider the position of the type species of the genus.

Click on the refresh symbol close to "Click to load or refresh tree (page needs to be viewed in https session)" on the tree page.

For very diverse datasets of strains, the average branch support may be low, even in the genome-based phylogeny; this is not unusual for such datasets. In general, if certain parts of a phylogeny are not well resolved (i.e. have low branch support), these parts cannot be interpreted.

In case of the TYGS, an optional proteome-based GBDP analysis will become available on user request if the dataset is not too large (< 30 strains) and the average branch support of the genome-based tree is smaller than 60%. If these conditions are met, you will find an order button on the respective TYGS result page below the phylogenies table.
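The two conditions stated above can be written down as a simple check. This is a sketch of the rule as worded here, not code taken from the TYGS; the function name and parameter names are hypothetical.

```python
def proteome_gbdp_offered(n_strains, avg_branch_support):
    """Return True if both stated conditions hold.

    n_strains: number of strains in the dataset (must be < 30)
    avg_branch_support: average branch support of the genome-based tree
                        in percent (must be < 60)
    """
    return n_strains < 30 and avg_branch_support < 60.0


print(proteome_gbdp_offered(25, 48.0))  # small dataset, poorly resolved tree
print(proteome_gbdp_offered(25, 85.0))  # small dataset, well-resolved tree
```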

The method behind proteome-based GBDP analyses is similar to that of the nucleotide-based GBDP analyses, except that the former uses the entire proteome. The method has been described here and here and, moreover, has been successfully applied in large-scale phylogenomic studies such as:

1. Lagkouvardos I, Pukall R, Abt B, Foesel BU, Meier-Kolthoff JP, Kumar N, et al. The Mouse Intestinal Bacterial Collection (miBC) provides host-specific insight into cultured diversity and functional potential of the gut microbiota. Nat Microbiol. 2016;1: 16131. doi:10.1038/nmicrobiol.2016.131

2. Barka EA, Vatsa P, Sanchez L, Gaveau-vaillant N, Jacquard C, Meier-Kolthoff JP, et al. Taxonomy, physiology, and natural products of Actinobacteria. Microbiol Mol Biol Rev. 2016;80: iii. doi:10.1128/MMBR.00044-16

3. Nouioui I, Ghodhbane-Gtari F, Montero-Calasanz M del C, Göker M, Meier-Kolthoff JP, Schumann P, et al. Proposal of a type strain for Frankia alni (Woronin 1866) Von Tubeuf 1895, emended description of Frankia alni, and recognition of Frankia casuarinae sp. nov. and Frankia elaeagni sp. Int J Syst Evol Microbiol. 2016;published. doi:10.1099/ijsem.0.001496

4. Hahnke RL, Meier-Kolthoff JP, García-Lopez M, Mukherjee S, Huntemann M, Ivanova NN, et al. Genome-based taxonomic classification of Bacteroidetes. Front Microbiol. 2016;7: 2003. doi:10.3389/fmicb.2016.02003

5. Simon M, Scheuner C, Meier-Kolthoff JP, Brinkhoff T, Wagner-Döbler I, Ulbrich M, et al. Phylogenomics of Rhodobacteraceae reveals evolutionary adaptation to marine and non-marine habitats. ISME J. 2017;11: 1483–1499. doi:10.1038/ismej.2016.198

6. Montero-Calasanz M del C, Meier-Kolthoff JP, Zhang DF, Yaramis A, Rohde M, Woyke T, et al. Genome-scale data call for a taxonomic rearrangement of Geodermatophilaceae. Front Microbiol. 2017;8: 1–15. doi:10.3389/fmicb.2017.02501

7. Mukherjee S, Seshadri R, Varghese NJ, Eloe-Fadrosh EA, Meier-Kolthoff JP, Göker M, et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat Biotechnol. 2017;35: 676–683. doi:10.1038/nbt.3886

8. Carro L, Nouioui I, Sangal V, Meier-Kolthoff JP, Trujillo ME, Montero-Calasanz M del C, et al. Genome-based classification of micromonosporae with a focus on their biotechnological and ecological potential. Sci Rep. 2018;8: 525. doi:10.1038/s41598-017-17392-0

9. Nouioui I, Carro L, García-López M, Meier-Kolthoff JP, Woyke T, Kyrpides NC, et al. Genome-based taxonomic classification of the phylum Actinobacteria. Front Microbiol. 2018;9: 1–119. doi:10.3389/fmicb.2018.02007

10. García-López M, Meier-Kolthoff JP, Tindall BJ, Gronow S, Woyke T, Kyrpides NC, et al. Analysis of 1,000 type-strain genomes improves taxonomic classification of Bacteroidetes. Front Microbiol. 2019;10: 2083. doi:10.3389/fmicb.2019.02083

11. Hördt A, García-López M, Meier-Kolthoff JP, Schleuning M, Weinhold LM, Tindall BJ, et al. Analysis of 1,000+ type-strain genomes substantially improves taxonomic classification of Alphaproteobacteria. Front Microbiol. 2020;11: 468. doi:10.3389/fmicb.2020.00468

12. Dedysh SN, Henke P, Ivanova AA, Kulichevskaya IS, Philippov DA, Meier-Kolthoff JP, et al. 100-year-old enigma solved: identification, genomic characterization and biogeography of the yet uncultured Planctomyces bekefii. Environ Microbiol. 2020;22: 198–211. doi:10.1111/1462-2920.14838

13. Strepis N, Naranjo HD, Meier-Kolthoff J, Göker M, Shapiro N, Kyrpides N, et al. Genome-guided analysis allows the identification of novel physiological traits in Trichococcus species. BMC Genomics. 2020;21: 1–13. doi:10.1186/s12864-019-6410-x

14. Schumann P, Kalensee F, Cao J, Criscuolo A, Clermont D, Köhler JM, et al. Reclassification of Haloactinobacterium glacieicola as Occultella glacieicola gen. nov., comb. nov., of Haloactinobacterium album as Ruania alba comb. nov, with an emended description of the genus Ruania, recognition that the genus names Haloactinobacteri. Int J Syst Evol Microbiol. 2021;71. doi:10.1099/ijsem.0.004769

15. Heidler von Heilborn D, Reinmüller J, Hölzl G, Meier-Kolthoff JP, Woehle C, Marek M, et al. Sphingomonas aliaeris sp. nov., a new species isolated from pork steak packed under modified atmosphere. Int J Syst Evol Microbiol. 2021;71: 004973. doi:10.1099/ijsem.0.004973

In recent years heat maps and clusterings/dendrograms inferred from (dis-)similarity matrices have become rather popular assets in many taxonomic papers and species descriptions of microbes. One wonders, however, whether there is an actual scientific need for such methods.

The TYGS reports all relevant dDDH values, i.e., all comparisons between user-defined genomes and closest type-strain genomes and all comparisons between user-defined genomes. If you are running the TYGS in the mode "restricted to user genomes", the TYGS will report all pairwise dDDH values and these can, in principle, be easily transformed to a cross table. But the TYGS does not offer this as a built-in feature for reasons given below.
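For readers who nevertheless need a cross table, the transformation from a list of pairwise values is mechanical. The following is a minimal sketch under stated assumptions: the input is taken to be a list of `(genome_a, genome_b, value)` tuples that you have extracted yourself from the pairwise results, the function names are hypothetical, and the numbers are made up. Self-comparisons are left empty because they are not part of the pairwise list.

```python
def cross_table(pairs):
    """Turn (genome_a, genome_b, value) tuples into a symmetric table.

    Returns the sorted labels and a nested dict table[row][col].
    Cells without a reported value (e.g. the diagonal) stay None.
    """
    pairs = list(pairs)
    labels = sorted({g for a, b, _ in pairs for g in (a, b)})
    table = {row: {col: None for col in labels} for row in labels}
    for a, b, value in pairs:
        table[a][b] = value
        table[b][a] = value  # pairwise values are symmetric
    return labels, table


def format_table(labels, table):
    """Render the table as tab-separated text, '-' for empty cells."""
    lines = ["\t".join([""] + labels)]
    for row in labels:
        cells = [
            "-" if table[row][col] is None else f"{table[row][col]:.1f}"
            for col in labels
        ]
        lines.append("\t".join([row] + cells))
    return "\n".join(lines)


# Illustrative, made-up dDDH values for three user genomes.
pairs = [("A", "B", 85.3), ("A", "C", 22.1), ("B", "C", 23.0)]
labels, table = cross_table(pairs)
print(format_table(labels, table))
```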

The three main applications of all-vs-all dDDH (dis-)similarity matrices are:

1) [Clusterings/Dendrograms]: One can apply some type of hierarchical clustering for obtaining an assignment of strains into species clusters. But if this is not properly done, the results can be misleading, especially if the underlying data are not ultrametric (i.e., if the organisms have not evolved under a molecular clock). Dendrograms inferred via hierarchical clustering are not phylogenetic trees and they do not normally show branch support. As a consequence, dendrograms should not be interpreted. Unfortunately, oftentimes such dendrograms are wrongly presented in scientific papers as "trees", thereby blurring the difference between them and phylogenetic trees.

2) [Heat maps]: Heat maps only allow for a visual exploration of a data set; they replace neither a proper species clustering nor phylogenetic approaches. It may or may not be easy to decode the quantitative information that is conveyed by a heat map because of the need for a colour gradient. In particular, the human brain may notice sharp contrasts between adjacent bits of an image but may be poor at comparing shading in non-adjacent regions of a visualization. On top of that, the depicted (dis-)similarity values are not of primary taxonomic interest. They are just crude estimates for the true phylogenetic (dis-)similarity and their proper interpretation is achieved by inferring a phylogenetic tree.

3) [Phylogenetic trees]: A phylogenetic tree is the appropriate way to use (dis-)similarity matrices for taxonomic purposes, particularly in conjunction with branch support. The TYGS thus conducts and reports not only a distinct type-based species clustering (see TYGS paper), which is different from a standard hierarchical clustering, but also infers phylogenies with branch support on which these clusters are annotated. In that manner potential misinterpretations can be avoided or at least mitigated and usually, depending on the data, a well-informed taxonomic decision can be made.

Short answer: Because there are issues with POCP and AAI that have already been discussed in the literature (see below). However, genera can usually be delineated using the TYGS by properly interpreting the TYGS phylogenies (see below).

Long answer: The pragmatic prokaryotic species concept uses DDH and a 70% threshold for the comparison of novel strains to a set of type strains (digital DDH mimics DDH without the known pitfalls of traditional DDH) [Meier-Kolthoff et al. (2013a)]. This concept works relatively well because closely related organisms usually evolved at a similar speed, thus resulting in matrices of pairwise (dis-)similarities that are oftentimes nearly ultrametric [Meier-Kolthoff et al. (2014a)]. The latter is important because the application of any type of threshold to a given distance or similarity matrix will only properly work under this condition [Meier-Kolthoff et al. (2014a)]. Now, the less related organisms are, the less ultrametric the underlying (dis-)similarity data matrix will be. This is frequently the case when working on datasets covering entire genera, families or higher taxa and this is also the reason why one won't find generally accepted universal genus (or higher taxa) delineation cutoffs for dDDH, ANI etc.
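To make the notion of ultrametricity concrete, the sketch below tests a small distance matrix with the standard three-point condition: for every triple of organisms, the two largest of the three pairwise distances must be equal. When this holds, "closer than a threshold" is a transitive relation, so a cutoff yields consistent groups. The function name, the tolerance and the example matrices are assumptions for illustration; this is not the procedure used in the cited studies, which also quantify how far a matrix deviates from ultrametricity.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix.

    dist: square list of lists with dist[i][j] == dist[j][i]
    tol:  numerical tolerance when comparing the two largest distances
    """
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances of the triple.
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if largest - middle > tol:
            return False
    return True


# Made-up example: A and B are equally distant from C (clock-like).
clock_like = [
    [0.0, 0.2, 0.6],
    [0.2, 0.0, 0.6],
    [0.6, 0.6, 0.0],
]
# Made-up example: B has diverged faster from C than A has.
non_clock = [
    [0.0, 0.2, 0.6],
    [0.2, 0.0, 0.9],
    [0.6, 0.9, 0.0],
]
print(is_ultrametric(clock_like))
print(is_ultrametric(non_clock))
```

In the second matrix a cutoff between 0.6 and 0.9 would group A with B and A with C but not B with C, which is the kind of inconsistency a threshold produces on non-ultrametric data.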

Now, even though POCP and AAI were introduced for genus delineation, these approaches were also criticized in the literature in various aspects. A brief summary of these issues is found in [Barco et al. (2020)]:

[...] Methods to demarcate genera have been proposed that are based on either AAI (18) or the percentage of conserved proteins (POCP; 19). The former method provided a range of AAI values (65 to 72%) that were originally obtained by correlation to a now-outdated 16S rRNA gene identity threshold for genus. The POCP method directly relies on the 16S rRNA gene sequence, which is in some cases insensitive to evolutionary changes in the rest of the genome of a given organism, as revealed by different species sharing >99% identity over the length of this gene. This method also arbitrarily sets a genus boundary at a POCP value of 50%. Additionally, the generally used arbitrary genus threshold of 95% 16S rRNA gene identity has been recently revisited to a lower minimum value of 94.5%, with a median sequence identity of 96.4% and confidence interval of 94.55 to 95.05% [...]

Moreover, AAI was suggested in their original work to be capable of providing insights into the higher level taxonomy. But AAI pairwise values alone, even if visualized as a dendrogram, do not replace truly genome-scale phylogenies with branch support (e.g. GBDP-based phylogenies as provided by the TYGS).

Regarding POCP, studies concluded that POCP is not universally applicable:

[...] In this context, the 50% POCP boundary is not an appropriate metric to delineate genera within Methylococcaceae. The use of the POCP has, similarly, been shown to be ineffective in delineating genera within the families Bacillaceae (Aliyu et al., 2016), Burkholderiaceae (Lopes-Santos et al., 2017), Neisseriaceae (Li et al., 2017), and Rhodobacteraceae (Wirth and Whitman, 2018), among others. [...]

But what to do instead?

To the best of our knowledge, decisions on higher level classification should be inferred from well-resolved phylogenies by comparison of, for example, relative subtree heights and by how uniform the proposed taxa (e.g. genera) are in terms of sequence divergence among one another. For example, when we conducted the large taxonomic studies on Actinobacteria or Bacteroidetes, we used the principles of phylogenetic systematics and taxonomic conservatism to repair obviously non-monophyletic taxa.

In general, when one is interested in delineating novel genera, or the higher level classification, existing taxa can serve as a guide. For example, by comparing how the different genera in a given family are nested in the phylogenetic tree (relative heights of their subtrees), one can usually find a conservative delineation into genera that makes these newly created genera uniform in terms of sequence divergence when compared to the other ones in the family.

Note, if the TYGS whole genome-based GBDP analysis is not well resolved, one can even order an additional proteome-based GBDP analysis.

Yes, the TYGS has an API for the programmatic download of results. Please find a detailed description here.

In addition to the routine import of taxonomic and genomic information, the user-visible changes that were applied to the database after the initial Nature Communications publication are listed on the News and Changelogs page.