1. What are thiopeptides
2. Definition of a biosynthetic gene cluster of thiopeptide
3. ThioFinder, identification of a thiopeptide gene cluster in a user supplied sequence. [ThioFinder Tutorial]
4. Classification of thiopeptide genotypes
5. Relationship between thiopeptide biosynthetic gene cluster and the diversified side ring system
6. ThioBase, a web-based database of thiopeptides featured in genetics and chemistry

1. Thiopeptide Antibiotic
Thiopeptides are a growing class of sulfur-rich, highly modified heterocyclic peptide antibiotics [Bagley et al., Chem Rev, 2005]. The thiopeptide family now contains nearly 100 entities. All of them possess a characteristic macrocyclic core, consisting of a monoaza six-membered ring central to multiple thiazoles and dehydroamino acids, but they vary in the side chains (and/or rings) that append additional functionalities. Clinical interest in this family was recently renewed because many members show potent activity against various drug-resistant pathogens, including methicillin-resistant Staphylococcus aureus, penicillin-resistant Streptococcus pneumoniae and vancomycin-resistant Enterococcus.

More recently, certain thiopeptides have been found to have antiproliferative activity in human cancer cells [Hegde, et al., Nat Chem, 2011], further motivating the development of new analogues that overcome the physical drawbacks limiting clinical use. [Click here for more references.]
2. Definition of a biosynthetic gene cluster of thiopeptide
We proposed a clear definition of a biosynthetic gene cluster of thiopeptide. Thiopeptide biosynthesis shares a common paradigm that features a ribosomally synthesized precursor peptide and conserved posttranslational modifications [Arndt, et al., Angew Chem Int Ed 2009; Li, et al., Nat Prod Rep 2010; Walsh, et al., J Biol Chem 2010].

Thus, to form a functional machinery that produces a given thiopeptide in bacteria, the genetic basis should be a gene cluster that minimally meets two criteria (Figure 1). Here the nos cluster for biosynthesizing nosiheptide [Yu, et al., ACS Chem Biol 2009], one of the parent compounds in the thiopeptide family, serves as the reference.
First, the cluster contains a gene encoding a NosM-like precursor peptide (termed prep), in which the structural peptide part is Cys and Ser/Thr-rich and consistent with the amino acid sequence of the resultant thiopeptide backbone.

Second, prep is clustered with a highly conserved gene cassette encoding thiopeptide-specific framework (termed thiosf) formation, which involves (i) a NosG-like cyclodehydratase/NosF-like dehydrogenase complex to produce polyazole, (ii) a NosD and NosE-like dehydratase pair to form multiple dehydroamino acids, and (iii) at least a NosO or NosH-like protein to afford the six-membered nitrogen domain.

These general criteria for furnishing the thiopeptide-characteristic framework are applied in the ThioFinder tool to evaluate whether or not a gene cluster encodes thiopeptide biosynthesis. This distinguishes the database from the available web-based resources for other ribosomally synthesized peptides, such as APD2, CAMP, and BAGEL2.
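The two criteria can be read as a simple boolean rule. The following is a minimal sketch of that rule only, not ThioFinder's actual implementation; the role labels ("prep", "NosG-like" and so on) and the function name are hypothetical names chosen for illustration.

```python
def meets_minimal_criteria(roles):
    """Check a set of role labels against the two criteria described above.

    The labels are hypothetical; they stand for the proteins named in the text.
    """
    roles = set(roles)
    # Criterion 1: a NosM-like precursor peptide gene (prep) is present.
    has_prep = "prep" in roles
    # Criterion 2 (i): cyclodehydratase/dehydrogenase complex for polyazole formation.
    has_azole = {"NosG-like", "NosF-like"} <= roles
    # Criterion 2 (ii): dehydratase pair for dehydroamino acid formation.
    has_dehydratase = {"NosD-like", "NosE-like"} <= roles
    # Criterion 2 (iii): at least one of NosO-like or NosH-like.
    has_ring = bool({"NosO-like", "NosH-like"} & roles)
    return has_prep and has_azole and has_dehydratase and has_ring


# Toy example: a nos-like cluster that carries every required role.
nos_like = {"prep", "NosG-like", "NosF-like", "NosD-like", "NosE-like", "NosO-like"}
print(meets_minimal_criteria(nos_like))
```

Representing a cluster as a set of role labels keeps the rule independent of how the homologues were detected.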

figure 1 
Figure 1. Organization of the biosynthetic genes, as exemplified by that for nosiheptide utilized as the reference in this study.

figure 2 
Figure 2. Formation of the thiopeptide-characteristic framework. Shape indicates the azoles, dehydroamino acids and six-membered nitrogen heterocycle.

3. ThioFinder, identification of thiopeptide gene clusters
[ThioFinder Tutorial]

The online tool ThioFinder uses a hidden Markov model (HMM)-based approach to automatically predict thiopeptide biosynthetic gene clusters in user-supplied nucleotide sequences (Figure 3). Compared with frequently used sequence alignment tools based on older scoring methodology, such as BLAST, HMMER3 can detect remote protein homologs because of the strength of its underlying probabilistic models, called profile hidden Markov models (profile HMMs).

The highly conserved gene cassette (thiosf), involved in thiopeptide-specific framework formation, and the precursor peptide gene (prep) are searched for in the query nucleotide sequence with the HMM profiles summarized in Table 1. By searching Pfam 26.0 with five proteins involved in nosiheptide biosynthesis as queries (NosG, NosE, NosF, NosD and NosL), we obtained the HMM profiles for five protein families or domains (Table 1): YcaO, lantibiotic-type dehydratase, nitroreductase, SpaB C-terminal domain, and biotin and thiamin synthesis-associated domain. In addition, the HMM profiles of the NosO-like, NosH-like and Prep-like proteins, which have no significant hit in Pfam, were built from 11 known biosynthetic gene clusters for thiopeptides.

When a user submits a nucleotide sequence, ThioFinder first identifies the protein-encoding regions with the embedded gene-finding tool Prodigal or Glimmer3 (Figure 3). Users can also upload their own gene annotations. The tool then searches for profiled homologues of thiopeptide biosynthesis proteins with HMMER3::hmmsearch. A region containing co-localized genes encoding the YcaO-like cyclase (a homologue of NosG, with nosiheptide biosynthesis as the reference) and the lantibiotic-type dehydratase (a homologue of NosE) is considered a candidate thiopeptide biosynthetic gene cluster. The tool then looks for the prep gene within a flanking 25 kb DNA region. Any additional conserved proteins encoded by the thiosf cassette (NosD, NosF, NosH or NosO) are also searched for. Finally, ThioFinder recognizes the putative cleavage sites in precursor peptide sequences. The conserved motif of the structural peptide was obtained with MEME [26] from 38 thiopeptides of known chemical structure. The sequence logo is shown in Figure 1, and the MEME-defined regular expression for the structural peptide is 'SCTT[CS][GI]CT[CS]S[CS]'. The motif was subsequently used with FIMO [27] to identify the cleavage site in the precursor peptide sequence. A broad range of primer design, multiple sequence alignment and phylogenetic tools are readily accessible from ThioFinder, allowing user-directed analyses of thiopeptides or their biosynthetic genes.
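The candidate-detection and cleavage-site steps above can be sketched as follows. This is an illustrative simplification under stated assumptions, not ThioFinder's code: the `Hit` record, the function names and the `MAX_GAP_BP` cut-off for "co-localized" are hypothetical, a plain regular expression stands in for the FIMO motif scan, and the cleavage site is assumed to lie immediately before the motif match. Only the 25 kb flank and the regular expression come from the text.

```python
import re
from dataclasses import dataclass

FLANK_BP = 25_000    # flanking region searched for prep (from the text)
MAX_GAP_BP = 10_000  # ASSUMPTION: cut-off for "co-localized"; not given in the source
MOTIF = re.compile(r"SCTT[CS][GI]CT[CS]S[CS]")  # MEME-defined expression from the text


@dataclass
class Hit:
    gene: str    # gene identifier
    family: str  # matching profile, e.g. "YcaO", "Lant_dehyd_C", "Prep"
    start: int   # start coordinate (bp)
    end: int     # end coordinate (bp)


def candidate_clusters(hits):
    """Pair YcaO and Lant_dehyd_C hits that lie close together, then list nearby prep hits."""
    ycao = [h for h in hits if h.family == "YcaO"]
    dehyd = [h for h in hits if h.family == "Lant_dehyd_C"]
    preps = [h for h in hits if h.family == "Prep"]
    clusters = []
    for y in ycao:
        for d in dehyd:
            # Distance between the two genes (negative if they overlap).
            gap = max(y.start, d.start) - min(y.end, d.end)
            if gap > MAX_GAP_BP:
                continue
            lo, hi = min(y.start, d.start), max(y.end, d.end)
            # Keep prep hits overlapping the core span extended by the flank.
            near = [p.gene for p in preps
                    if p.end >= lo - FLANK_BP and p.start <= hi + FLANK_BP]
            clusters.append({"core": (y.gene, d.gene), "span": (lo, hi), "prep": near})
    return clusters


def split_precursor(peptide):
    """Split a precursor peptide into (leader, structural part) at the first motif match."""
    m = MOTIF.search(peptide)
    if m is None:
        return None
    return peptide[:m.start()], peptide[m.start():]


# Toy data: made-up coordinates and a made-up peptide, for illustration only.
hits = [
    Hit("g1", "YcaO", 10_000, 12_000),
    Hit("g2", "Lant_dehyd_C", 12_500, 15_000),
    Hit("g3", "Prep", 30_000, 30_150),
    Hit("g4", "Prep", 60_000, 60_150),
]
print(candidate_clusters(hits))
print(split_precursor("MKELDLSGLEVASCTTCGCTCSCSS"))
```

In this toy input, only `g3` falls inside the 25 kb flank of the `g1`/`g2` pair, and the peptide is split just before the `SCTT...` match.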

Currently, a few bioinformatic tools such as BAGEL2 can successfully mine the biosynthetic gene clusters of bacteriocins, the bacterially produced antimicrobial peptides that include lantibiotics, small simple peptides and large lytic proteins. Their accurate prediction relies on identifying homologues of typical bacteriocin biosynthetic genes, such as genes for lanthionine synthase, for a small cysteine/serine/threonine-rich prepeptide, and for ABC transporters conferring immunity. Thiopeptides, as a class of newly identified antimicrobial peptides, differ greatly from the bacteriocins in their unique posttranslational modifications. Tools that successfully mine bacteriocins are thus insensitive to thiopeptides, as exemplified by BAGEL2, whose back-end database includes only two thiopeptides (GE37468 and GE2270). Furthermore, a few databases, including APD2, CAMP and DAMPD, collect comprehensive information on antimicrobial peptides to facilitate the development of new ones. However, they are also not specific for thiopeptides and currently contain only four (thiostrepton, thiocillin, GE37468 and GE2270). [more details]

Table 1. HMM profiles of the protein families or domains involved in thiopeptide biosynthesis
Reference protein | Involvement in thiopeptide biosynthesis | Pfam-recorded protein family/domain (accession number)
NosG | azole formation | YcaO (PF02624), YcaO-like cyclase
NosE | dehydratase | Lant_dehyd_C (PF04738), C terminus of lantibiotic-type dehydratase
NosF | azole formation | Nitroreductase (PF00881)
NosD | dehydratase | SpaB_C (PF14028), C terminus of SpaB involved in subtilin biosynthesis
NosH | 6-membered nitrogen heterocycle | nosH.hmm, unavailable in Pfam
NosO | 6-membered nitrogen heterocycle | nosO.hmm, unavailable in Pfam
NosL | Type I thiopeptide specific | BATS (PF06968), biotin and thiamin synthesis associated domain
TsrT | Type II thiopeptide specific | Radical_SAM (PF04055), radical SAM superfamily
TsrD | Type II thiopeptide specific | SnoaL (PF07366), SnoaL-like polyketide cyclase
Prep | Precursor peptide | Precursor peptide.hmm, unavailable in Pfam

HMMER E-value
This is the E-value against which the inclusion and reporting significance thresholds are measured. The conditional E-value is an attempt to measure the statistical significance of each domain, given that it has already been decided that the target sequence is a true homolog. It is the expected number of additional domains or hits that would be found with a domain/hit score this big in the set of sequences reported in the top hits list, if those sequences consisted only of random nonhomologous sequence outside the region that sufficed to define them as homologs.
[More details.]
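As a rough illustration of how an E-value threshold is applied to search results, the sketch below filters the per-target table that hmmsearch writes with its `--tblout` option, using the full-sequence E-value column. The function name and the `1e-5` cut-off are assumptions for illustration and are not ThioFinder's settings; the two data rows are made up.

```python
def parse_tblout(lines, max_evalue=1e-5):
    """Return (target, query, E-value, score) for hits at or below max_evalue.

    Expects lines in HMMER3 --tblout layout: whitespace-separated fields with
    target name, target accession, query name, query accession, then the
    full-sequence E-value and bit score. The cut-off is an assumed example value.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, query, evalue, score))
    return hits


# Made-up rows in tblout layout, for illustration only.
example = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "gene_0012 - YcaO PF02624 1.2e-45 155.3 0.1 1.5e-45 155.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein",
    "gene_0040 - YcaO PF02624 0.5 3.1 0.0 0.8 2.5 0.0 1.0 1 0 0 1 1 1 0 hypothetical protein",
]
print(parse_tblout(example))
```

Only the first made-up row passes the assumed cut-off; the second, with an E-value of 0.5, is discarded.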

figure 3
Figure 3. Schematic modular pipeline of ThioFinder
4. Classification of thiopeptide genotypes
Despite a similar macrocyclic framework, the members of the thiopeptide family differ in substitution of the six-membered central domain, installation of the side ring system, decoration of the core system, and C-terminal functionalization of the extended side chain. ThioBase features a combinatorial strategy for classifying thiopeptide genotypes and relating them to chemotypes: it considers the specific genes responsible for structural diversity together with the structural features that result from the reactions they encode.

This strategy focuses on the side ring system of the structure, whose formation is independent of the precursor peptide. Biochemical investigations indicated that this functionalization shares L-tryptophan as a common substrate but can proceed in completely different ways, affording different groups such as the indolic acid (MIA) moiety of nosiheptide and the quinaldic acid (QA) moiety of thiostrepton. For MIA formation, we have recently characterized a radical S-adenosylmethionine (SAM) 3-methyl-2-indolic acid synthase (e.g., NosL in nosiheptide biosynthesis) that catalyzes an unprecedented fragmentation-recombination to reconstitute the carbon side chain. By contrast, QA formation, as in thiostrepton biosynthesis, involves an unusual methyl transfer onto the indole part (catalyzed by a radical SAM/methylcobalamin-dependent methyltransferase, TsrT) and, in particular, a key ring expansion of it (involving a cyclase-like protein, TsrD).

Comparative analysis of the corresponding gene(s) for L-tryptophan processing among the 11 available biosynthetic gene clusters revealed
(i) that the routes to MIA and QA are shared among the bi-macrocyclic members containing the respective moiety, consistent with the nosL homologue for MIA found in the nocathiacin gene cluster and the tsrT and tsrD homologues for QA found in the siomycin gene cluster;
and (ii) that the gene clusters for the members without an L-tryptophan-derived moiety, most of which are mono-macrocyclic, apparently lack these counterparts.

These findings indicate that the specific gene(s) involved in L-tryptophan processing can serve as a standard for classifying the genotypes of thiopeptides into three types, as shown in Figure 4.
Figure 4. Classification of thiopeptide genotypes.
Type I, conserved NosL-like enzymes synthesizing the indolyl structure (MIA) in a side ring from L-Trp.
Type II, two enzymes, a hypothetical amidotransferase and a putative ester cyclase, for side rings with quinaldic acid (QA).
Type III, no genes for synthesizing L-Trp derivatives (MIA or QA).
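The three-way rule can be sketched as a small function. This is a minimal illustration, assuming a cluster is summarized as a set of hypothetical role labels ("NosL-like", "TsrT-like", "TsrD-like"). The order of the checks, and the treatment of a cluster carrying only one of the two Type II genes as Type III, are assumptions that the source does not spell out.

```python
def classify_genotype(gene_roles):
    """Assign a genotype from the L-tryptophan-processing genes found in a cluster.

    Labels are hypothetical stand-ins for homologues of NosL, TsrT and TsrD.
    """
    roles = set(gene_roles)
    if "NosL-like" in roles:
        return "Type I"    # MIA (indolyl) side ring
    if {"TsrT-like", "TsrD-like"} <= roles:
        return "Type II"   # QA (quinaldic acid) side ring
    return "Type III"      # no gene for an L-Trp-derived moiety


print(classify_genotype({"NosG-like", "NosE-like", "NosL-like"}))
print(classify_genotype({"NosG-like", "NosE-like", "TsrT-like", "TsrD-like"}))
print(classify_genotype({"NosG-like", "NosE-like"}))
```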

5. Relationship between thiopeptide biosynthetic gene cluster and the diversified side ring system
Using the genotype-based classification strategy, we predicted the genetic bases of 99 structurally known thiopeptides: 14 in the Type I group, featuring a nosL-like gene for formation of the MIA (indolyl) structure; 21 in the Type II group, possessing tsrT- and tsrD-like genes for the QA (quinaldic acid) moiety in the side ring; and 64 in the Type III group, containing none of the above genes for the side ring system.

Remarkably, the above genotypes are consistent with the chemotypes of thiopeptides, which are classified into series a-e according to the oxidation state of the central heterocyclic domain [Bagley et al., Chem Rev, 2005]. The Type I genotypes are completely in line with the thiopeptides of series e (the bi-macrocyclic members with a hydroxypyridine central domain and an indolic side ring system), and the Type III genotypes with those of series d (the monocyclic members with a trisubstituted pyridine central domain). The Type II genotypes agree with the members of series a, b and c, all of which share a piperidine central domain and a QA (quinaldic acid) moiety in the side ring system.

For the 42 new putative gene clusters identified by ThioFinder, the structural features of their potential products can also be briefly predicted from these specific genetic features. One gene cluster harboring the tsrT and tsrD counterparts (NCBI accession no. NZ_GG657738) belongs to the Type II genotype and may encode the biosynthesis of a bi-macrocyclic thiopeptide containing a QA (quinaldic acid) moiety. The other 41, which lack gene(s) for processing L-tryptophan, fall into the Type III genotype and are likely involved in producing members without the L-tryptophan-derived side ring.

An advantage of this combinatorial classification system is that it groups together members that are structurally almost identical and differ only in the central domain, such as the thiopeptins, whose 8 members otherwise have to be assigned to distinct chemotype series.

Table 2. Relationship between biosynthetic gene cluster features and the thiopeptide side ring system.
Gene cluster pattern | Type I | Type II | Type III
Genotype features | nosL-like gene | tsrT-like and tsrD-like genes | no characteristic gene
Chemical structure features | MIA (indolyl), blue in Figure 5 | QA (quinaldic acid), orange in Figure 5 | No MIA (indolyl), no QA (quinaldic acid)
Numbers of members with known structures
Numbers of members with known structures & clusters | 2 [nosiheptide, | 3 [siomycin, thiostrepton *] | 6 [GE2270, thiocillin, thiomuracins, TP-1161, cyclothiazomycin, GE37468]
Numbers of members with known clusters & without
Classical classification (Series)
Series according to the oxidation state of the 6-membered N heterocycle (pink in Figure 5)
Piperidine or
Number of members (chemical structures required for

figure 5 
Figure 5. Example structures. MR, macrocyclic ring; SR, light grey, side ring. Pink, 6-membered N heterocycle (including pyrimidine); orange, QA (quinaldic acid); blue, MIA (indolyl structure)

6. ThioBase, a web-based database of thiopeptides featured in genetics and chemistry
ThioBase is a web-based, open-access database that holds information on thiopeptides regarding chemical structure, biological activity, producing organism, and biosynthetic gene (cluster), along with the associated genome. Systematic organization of these data can facilitate the discovery of new thiopeptides and the enrichment of unique biosynthetic elements for producing novel drug leads by applying the principles of synthetic biology.

As of June 1, 2012, ThioBase includes the following information.
(i) 99 known thiopeptides are listed with the metabolite structures. For each entity, the CAS registry number, analogues, structural peptide sequence, biological activities (antibacterial and/or anticancer data collected from publications and patents), producing organism and hyperlinks to NCBI PubChem are provided.
(ii) 65 biosynthetic gene clusters in 63 bacterial species are depicted. 11 of them have been correlated with their coding metabolites, while 54 are newly identified by ThioFinder in the sequenced bacterial genomes currently available at NCBI.
(iii) 102 microorganisms are recorded, including 49 reported thiopeptide producers and 53 ThioFinder-predicted bacteria.
(iv) Nearly 380 publications relevant to thiopeptides were collected by text mining of NCBI PubMed and SciFinder and are classified into the following categories: 'isolation and structure characterization', 'fermentation and production', 'biosynthesis', 'biological activity', as well as 'chemical synthesis'.