THIOBASE 
Thiopeptide Antibiotic
Thiopeptides are a growing class of sulfur-rich, highly modified heterocyclic peptide antibiotics. According to our survey, the thiopeptide family now contains nearly 100 entities, all of which possess a characteristic macrocyclic core consisting of a monoaza six-membered ring central to multiple thiazoles and dehydroamino acids, but which vary in the side chains (and/or rings) that append additional functionalities. Clinical interest in this family was recently renewed because many members show potent activity against various drug-resistant pathogens, including methicillin-resistant Staphylococcus aureus, penicillin-resistant Streptococcus pneumoniae and vancomycin-resistant Enterococcus. This has motivated extensive efforts to develop new analogues by chemical modification; however, the complex architecture of thiopeptides poses a tremendous challenge to synthetic approaches.

To date, ten biosynthetic gene clusters of thiopeptides have been published. These studies substantiate a unifying rationale: the structural complexity of all thiopeptides probably arises from previously unforeseen posttranslational modifications of genetically encoded, ribosomally translated peptides. By mining genomes for the genetic features of the Cys- and Ser/Thr-rich precursor peptide and the conserved posttranslational modification enzymes, we also confirmed thiocillin production in Bacillus cereus ATCC 14579, a strain not previously known to be a thiopeptide producer (Liao et al. 2009). This demonstrated the generality of the newly emerged thiopeptide biosynthetic paradigm.

Database for Thiopeptide Antibiotic Characterization and Biosynthesis
Based on the above understanding of the general and specific features of thiopeptide biosynthesis, it is becoming practical in the post-genomic era to evaluate the potential for thiopeptide production and to predict structural features at the genetic level, as supported by our success in mining thiocillin from a Bacillus strain. We therefore carried out a comprehensive survey and built a web-based database, “Chemical & Genetic Characterization of Thiopeptide Antibiotic” (THIOBASE), to document the structures, producing systems and characteristic biosynthetic genes of thiopeptides. The aims are to expedite the discovery of new thiopeptides and to accumulate specific building elements for combinatorial biosynthesis, in order to meet the need for diversity in drug discovery and development. The database includes 1) known thiopeptide entities along with their chemical structures; 2) microorganisms known to be thiopeptide producers (in 10 of which the biosynthetic gene cluster has been correlated with the chemical product) and those that have the potential to produce certain thiopeptides; 3) the biosynthetic gene clusters; and 4) genes homologous to those encoding the highly conserved posttranslational modification enzymes that form the thiopeptide-characteristic framework (with nosD, E, F, G, H or O in nosiheptide biosynthesis as the reference).

Based on an analysis of the genotypic distinctions in this database, we provide a new biosynthesis-based means of thiopeptide classification that takes into account the genes specific to certain structural features. Type I features the genes encoding the indolic acid side ring (e.g. nosiheptide and nocathiacin), Type II possesses those for quinaldic acid moiety formation (e.g. thiostrepton and siomycin), and Type III contains none of the above and comprises the mono-macrocyclic members (e.g. thiocillin and GE2270A). These distinctions support the postulation that specific tailoring genes can help define the structural divergence of a predicted thiopeptide before isolation/characterization.

Users can search THIOBASE for a particular thiopeptide or producing organism, and can query a sequence against THIOBASE with HMMER or BLAST to find homologous matches. A broad range of primer design, multiple sequence alignment and phylogenetic tools is readily accessible at THIOBASE, allowing user-directed analyses of thiopeptides or their biosynthetic genes to support individual research directions.

Some web-based resources for a group of ribosomally synthesized peptides, termed bacteriocins, are available (e.g. BACTIBASE for bacteriocin characterization and BAGEL2 for genome mining). Distinct from these, the THIOBASE-archived data focus on the characterization of thiopeptides, whose biosynthesis features not only a ribosomal origin but also a set of highly conserved enzymes for posttranslational modifications (i.e. cyclodehydration and dehydrogenation, dehydration and hetero-cyclization), and provide detailed information on both chemistry and genetics.

We will continue to enrich the database with structures, producing systems and biosynthetic genes (and/or clusters), to provide a pipeline for the automated discovery of new thiopeptides and their associated biosynthetic pathways.

ThioFinder for prediction of thiopeptide biosynthetic gene clusters and structural peptide
The online tool ThioFinder uses HMMER3 to automatically predict potential thiopeptide gene clusters and precursor peptides in user-supplied nucleotide sequences. Compared with frequently used sequence alignment tools based on older scoring methodology, such as BLAST, HMMER3 is better able to detect remote protein homologs because of the strength of its underlying probabilistic models, called profile hidden Markov models (profile HMMs).

THIOBASE archives a manually curated dataset from the nine known biosynthetic gene clusters, covering the genes that encode the posttranslational modifications forming the thiopeptide-characteristic framework, typically the homologs of NosD, E, F, G, H and O in nosiheptide biosynthesis. The profile HMMs used by ThioFinder for these six families of biosynthetic proteins and for the short precursor peptides were obtained from this dataset.

When a user submits a nucleotide sequence, the ThioFinder server first identifies the protein-coding regions, either from the user-supplied annotation or by using the embedded Prodigal or Glimmer3. A local installation of HMMER3 (hmmsearch) is then used to identify gene clusters coding for homologs of NosD, E, F, G, H and O, together with the putative short precursor peptide (<100 amino acid residues), within 20-kb fragments. A detailed report, including a graphical representation, is generated within a minute.
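
The clustering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ThioFinder implementation: the Hit record, the function name find_candidate_clusters and the rule that a candidate must contain all six families plus a precursor are hypothetical. Only the 20-kb window and the <100-residue precursor limit come from the text.

```python
from dataclasses import dataclass

WINDOW_BP = 20_000        # fragment size stated in the text
MAX_PRECURSOR_AA = 100    # precursor peptides are shorter than this
CORE_FAMILIES = frozenset({"NosD", "NosE", "NosF", "NosG", "NosH", "NosO"})


@dataclass(frozen=True)
class Hit:
    """One profile-HMM hit mapped to genome coordinates (hypothetical record)."""
    family: str      # "NosD" ... "NosO", or "precursor"
    start: int       # 1-based CDS start on the genome
    end: int         # 1-based CDS end on the genome
    length_aa: int   # protein length in residues


def find_candidate_clusters(hits, required=CORE_FAMILIES, window=WINDOW_BP):
    """Return groups of hits that fit in one window and contain every
    required family plus a short precursor peptide (assumed rule)."""
    # Discard precursor hits that are too long to be precursor peptides.
    usable = sorted(
        (h for h in hits
         if h.family != "precursor" or h.length_aa < MAX_PRECURSOR_AA),
        key=lambda h: h.start,
    )
    clusters, reported = [], set()
    for i, first in enumerate(usable):
        # All hits that end within `window` bp of the first hit's start.
        members = [h for h in usable[i:] if h.end - first.start + 1 <= window]
        families = {h.family for h in members}
        complete = required <= families and "precursor" in families
        # Skip windows that only repeat hits already reported.
        if complete and not set(members) <= reported:
            clusters.append(members)
            reported.update(members)
    return clusters
```

Passing a smaller set as required (for example only NosD and NosG) would relax the completeness rule; which families must be present is a design choice that the text does not specify.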

ThioFinder input file format
A single genome sequence file: FASTA
A single genome sequence file is prepared in FASTA format. It begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters. Users can download the *.fna or other required genome files in FASTA format from the NCBI at ftp.ncbi.nih.gov/genome/bacteria or from specific genome sequencing centres.

Note: only a single gzip-compressed FASTA file (.gz) is accepted.
[Example] The genome sequence file of Bacillus cereus ATCC 14579: NC_004722.fna.gz (1.6Mb)
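
Before uploading, the rules above can be checked locally with a short script such as the one below. It is a minimal sketch, not part of ThioFinder: the function name check_fasta_gz is hypothetical, and treating a second ">" line as an error is an assumption based on the requirement for a single FASTA file.

```python
import gzip


def check_fasta_gz(path, max_width=80):
    """Check a gzip-compressed, single-record FASTA file.

    Returns (description, sequence_length, warnings).
    """
    description = None
    seq_len = 0
    warnings = []
    with gzip.open(path, "rt") as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if lineno == 1:
                # The description line must start with ">" in column one.
                if not line.startswith(">"):
                    raise ValueError("first line must start with '>'")
                description = line[1:].strip()
                continue
            if line.startswith(">"):
                # Assumption: only one record is allowed per file.
                raise ValueError(f"line {lineno}: second record found")
            if len(line) >= max_width:
                # Lines are recommended to be shorter than 80 characters.
                warnings.append(f"line {lineno} has {len(line)} characters")
            seq_len += len(line.strip())
    if description is None:
        raise ValueError("file is empty")
    return description, seq_len, warnings
```

A call such as check_fasta_gz("NC_004722.fna.gz") would return the description line, the total sequence length and any line-length warnings for the example file.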

CDS annotation file: NCBI PTT format
A tabular list of all protein-coding regions (CDS) in the corresponding genome sequence should be prepared in the NCBI PTT format.
The PTT file format is a table of protein features. It is used mainly by NCBI, which produces PTT files for all its published genomes, available at ftp://ftp.ncbi.nih.gov/genomes/.

[Example] The CDS annotation file of Bacillus cereus ATCC 14579: NC_004722.ptt
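
For readers who want to inspect a PTT file before submitting it, the sketch below reads one into simple records. It assumes the usual NCBI layout, which is not spelled out on this page: a few title lines, then a tab-separated header row beginning with "Location" (Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product), with Location written as start..end. The Cds type and the parse_ptt function are hypothetical names.

```python
from typing import NamedTuple


class Cds(NamedTuple):
    start: int        # genome coordinate where the CDS begins
    end: int          # genome coordinate where the CDS ends
    strand: str       # "+" or "-"
    length_aa: int    # protein length in residues
    pid: str
    gene: str
    synonym: str      # locus tag
    product: str


def parse_ptt(lines):
    """Parse the lines of a PTT file into a list of Cds records."""
    rows = iter(lines)
    # Skip the title lines until the column header is reached.
    for line in rows:
        if line.startswith("Location"):
            header = line.rstrip("\r\n").split("\t")
            break
    else:
        raise ValueError("no 'Location' header line found")

    records = []
    for line in rows:
        line = line.rstrip("\r\n")
        if not line:
            continue
        row = dict(zip(header, line.split("\t")))
        start, end = (int(x) for x in row["Location"].split(".."))
        records.append(Cds(
            start, end, row["Strand"], int(row["Length"]),
            row.get("PID", "-"), row.get("Gene", "-"),
            row.get("Synonym", "-"), row.get("Product", "-"),
        ))
    return records
```

With a file on disk, this could be used as parse_ptt(open("NC_004722.ptt")); the header row is located by name rather than by line number so that a varying number of title lines does not matter.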
Homology analysis using profile Hidden Markov Models
Protein family       Amino acid sequences       Profile HMM
NosD                 NosD.fasta                 NosD.hmm
NosE                 NosE.fasta                 NosE.hmm
NosF                 NosF.fasta                 NosF.hmm
NosG                 NosG.fasta                 NosG.hmm
NosH                 NosH.fasta                 NosH.hmm
NosO                 NosO.fasta                 NosO.hmm
Precursor peptide    Precursor peptide.fasta    Precursor peptide.hmm
NosD amino acid set: eleven experimentally reported sequences of NosD homologues.
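
The profile HMM files listed above could also be used locally with HMMER3. The sketch below builds an hmmsearch command line and parses the tabular output written by its --tblout option. It is an illustration only: the E-value cutoff of 1e-5, the file name proteins.faa and the helper names are assumptions and do not reflect ThioFinder's actual settings.

```python
def hmmsearch_command(hmm_file, protein_fasta, tblout, evalue=1e-5):
    """Return an argument list for hmmsearch (the cutoff is an assumed value)."""
    return ["hmmsearch", "--tblout", tblout, "-E", str(evalue),
            hmm_file, protein_fasta]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of dictionaries.

    Comment lines start with '#'. The first 18 whitespace-separated
    fields are fixed; the remainder is the target description.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(maxsplit=18)
        hits.append({
            "target": fields[0],          # protein that matched
            "query": fields[2],           # name of the profile HMM
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
        })
    return hits
```

The command list can be passed to subprocess.run, for example subprocess.run(hmmsearch_command("NosD.hmm", "proteins.faa", "NosD.tbl"), check=True), after which parse_tblout(open("NosD.tbl")) yields one record per matching protein.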