Giannopoulos, N.G., Michalopoulos, I., Papandreou, N.C., Malatras, A., Iconomidou, V.A., and Hamodrakas, S.J. (2013). LepChorionDB, a database of Lepidopteran chorion proteins and a set of tools useful for the identification of chorion proteins in Lepidopteran proteomes. Insect Biochem Mol Biol 43, 189-196.
The major component of the eggshell (90-95%) of many insect and fish eggs is chorion (Kafatos et al., 1977). Fig. 1A shows a scanning electron micrograph of a Bombyx mori chorion, a typical chorion of Lepidoptera. Proteins account for more than 95% of its dry mass. This proteinaceous shell forms the outer layer of the eggshell and has extraordinary mechanical and physiological properties, protecting the oocyte and the developing embryo from a series of environmental hazards such as temperature variations, mechanical pressure, proteases, bacteria and viruses. Fig. 1B shows an electron micrograph of a thin transverse section of a silkmoth chorion. A lamellar ultrastructure of packed fibrils is seen: silkmoth chorion is a biological analogue of a cholesteric liquid crystal. The proteinaceous Lepidopteran silkmoth chorion is used in our lab as a model system for unraveling the routes and rules of formation of natural protective amyloids (Iconomidou and Hamodrakas, 2008). Therefore, we constructed a relational database, available on the web, containing all Lepidoptera chorion proteins identified to date.
Lepidoptera chorion proteins can be classified into two major protein families, A and B. This classification was based on multiple sequence alignments of conserved key residues in the central domain of well-characterised silkmoth chorion proteins. HMMER3 (Eddy, 2009) was used to build Hidden Markov Models (HMMs) from these core multiple sequence alignments. These HMMs were then used to run sensitive searches against various proteomes and identify chorion proteins of interest.
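As a rough illustration of the model-building step, the sketch below assembles the command line for HMMER3's hmmbuild program (usage: hmmbuild <hmmfile_out> <msafile>) for one alignment per chorion class. The file names classA_core.sto and classB_core.sto are hypothetical placeholders, not files distributed with LepChorionDB, and the exact options used to build the LepChorionDB models are not stated here.

```python
import shlex


def hmmbuild_command(hmm_out: str, alignment: str) -> list[str]:
    """Return the argument list for `hmmbuild <hmmfile_out> <msafile>`."""
    return ["hmmbuild", hmm_out, alignment]


def build_class_models(alignments: dict[str, str]) -> dict[str, list[str]]:
    """Map each chorion class to the hmmbuild command that would build its HMM."""
    return {
        cls: hmmbuild_command(f"{cls}.hmm", msa)
        for cls, msa in alignments.items()
    }


# Hypothetical alignment file names, one core alignment per class.
commands = build_class_models(
    {"classA": "classA_core.sto", "classB": "classB_core.sto"}
)

for argv in commands.values():
    # Print the command; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

The commands are only printed, so the sketch runs without HMMER3 installed.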
Scope of the
LepChorionDB is the first database of Lepidoptera class A and B chorionic proteins. The goal was the collection of sequences and annotations of all chorionic proteins that have been stored in a series of databases to date, to facilitate research on chorionic and structural proteins in general. It is hoped that this database will be of help to genome annotators in the near future when more arthropod genomes become available. Furthermore, it is hoped that LepChorionDB will facilitate the detection of common properties of chorion proteins, as well as the recognition of important differences that are responsible for chorion functions. The database will be updated on a regular basis.
From the 'Search' page, the users retrieve one or more entries. This search can be done in two ways:
After querying the database, one or more Chorion DB tables will appear (Fig. 4).
The Compare tool performs a protein sequence homology search against LepChorionDB entries. It is based on the jackhmmer program from the HMMER3 software package. The Compare tool input is a single protein sequence, which is queried iteratively against LepChorionDB, much as PSI-BLAST would do. The first round is identical to a phmmer search (BLAST-like). All the matches that pass the inclusion thresholds are put in a multiple alignment. In the second (and subsequent) rounds, a profile is made from these results, and the database is searched again with the profile. Iterations continue until either no new sequences are detected or the maximum number of iterations is reached (PSI-BLAST-like). The user can set the maximum number of iterations from one to five. The original query sequence is always included in the multiple alignments, whether or not it appears in LepChorionDB (Eddy, 2009).
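The stopping rule of the iterative search can be sketched with a toy example. The code below is not jackhmmer: the function search_round is a hypothetical stand-in for one profile search, and the "database" is a small hand-made similarity graph. It only illustrates the control flow: keep the query in the included set, stop when a round adds no new sequences, or when the iteration limit (one to five) is reached.

```python
from typing import Callable, Set, Tuple


def iterative_search(
    query: str,
    search_round: Callable[[Set[str]], Set[str]],
    max_iterations: int = 5,
) -> Tuple[Set[str], int]:
    """Repeat search rounds until convergence or the iteration limit."""
    if not 1 <= max_iterations <= 5:
        raise ValueError("max_iterations must be between 1 and 5")
    included = {query}  # the query always stays in the alignment
    rounds = 0
    for rounds in range(1, max_iterations + 1):
        hits = search_round(included)
        new_hits = hits - included
        if not new_hits:  # converged: no new sequences detected
            break
        included |= new_hits
    return included, rounds


# Toy similarity graph (made-up names): each sequence "detects" its neighbours.
toy_db = {"query": {"A"}, "A": {"B"}, "B": set()}


def toy_round(included: Set[str]) -> Set[str]:
    """Stand-in for one search: everything reachable from the included set."""
    found: Set[str] = set()
    for name in included:
        found |= toy_db.get(name, set())
    return found


members, rounds_used = iterative_search("query", toy_round, max_iterations=5)
print(sorted(members), rounds_used)
```

In this toy graph the third round finds nothing new, so the search stops before the limit of five; with max_iterations=1 it behaves like a single, non-iterative search.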
Compare tool - input fields
In the Compare tool input interface, the user can submit the sequence query and optionally determine another two fields (Fig. 6), as follows.
Filter tool is used to recognise class A or B proteins from a protein sequence database by performing HMM searches.
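A search of this kind can be approximated locally with HMMER3's hmmsearch program (usage: hmmsearch [options] <hmmfile> <seqdb>). The sketch below, written under stated assumptions, reads FASTA-formatted text into a dictionary and assembles one hmmsearch command per class model. The file names classA.hmm, classB.hmm and proteome.fasta and the E-value threshold are hypothetical; how the Filter tool itself invokes HMMER3 is not described here.

```python
from typing import Dict, List, Optional


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {sequence name: sequence}."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence name")
            name = fields[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first FASTA header")
        else:
            records[name] += line
    return records


def hmmsearch_command(
    hmm_file: str, fasta_file: str, evalue: Optional[float] = None
) -> List[str]:
    """Return the argument list for `hmmsearch [-E <x>] <hmmfile> <seqdb>`."""
    argv = ["hmmsearch"]
    if evalue is not None:
        argv += ["-E", str(evalue)]  # -E sets the reporting E-value threshold
    return argv + [hmm_file, fasta_file]


# Made-up example input with two short sequences.
example = ">lepprot1 test protein\nMSTFAL\nLCVQAC\n>lepprot2\nMKYLLA\n"
sequences = read_fasta(example)

# Hypothetical model and proteome file names, one search per class.
searches = [
    hmmsearch_command(model, "proteome.fasta", evalue=1e-5)
    for model in ("classA.hmm", "classB.hmm")
]
print(sequences)
print(searches)
```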
In the Filter tool input interface, the user can submit a list of FASTA-formatted protein sequences and optionally determine another two fields (Fig. 7), as follows.
Compare & Filter tool Output results
The Compare and Filter tools share the same output format. The output has two sections, 'Ranked list of top hits' (Fig. 8) and 'Domain annotation for each sequence (and alignments)' (Fig. 9), which are described below:
The first section is the sequence top hit list. It is a list of ranked top hits (sorted by E-value, most significant hit first), formatted in a BLAST-like style. The most important number here is the first one, the sequence E-value. This is the statistical significance of the match to this sequence. The lower the E-value, the more significant the hit. The E-value is based on the sequence bit score, which is the second number. This is the log-odds score for the complete sequence. Some people like to see a bit score instead of an E-value, because the bit score does not depend on the size of the sequence database, only on the profile HMM and the target sequence. The next number, the bias, is a correction term for biased sequence composition that has been applied to the sequence bit score. The only time the user really needs to pay attention to this value is when it is large, and on the same order of magnitude as the sequence bit score. This might be a sign that the target sequence is not really a homolog, but merely shares a similar strong biased composition with the query model. The next three numbers are again an E-value, score, and bias, but only for the single best-scoring domain in the sequence, rather than the sum of all its identified domains. The two columns headed #doms are two different estimates of the number of distinct domains that the target sequence contains. The first, the column marked exp, is the expected number of domains according to HMMER's statistical model. It is an average, calculated as a weighted marginal sum over all possible alignments. Because it is an average, it is not necessarily a round integer. The second, the column marked N, is the number of domains that HMMER3's domain postprocessing and annotation pipeline finally decided to identify, annotate, and align in the target sequence. This is the number of alignments that will show up in the domain report later in the output file.
The last two columns are the name of each target sequence which is linked with LepChorionDB and a description (Eddy, 2009).
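The column layout described above can be read programmatically. The sketch below splits one row of the top hit list into the ten fields in the order given in the text (sequence E-value, score, bias; best-domain E-value, score, bias; exp; N; name; description). The example row uses made-up numbers, and the bias check is only a rough heuristic assumed here for "same order of magnitude" (bias at least one tenth of the score); it is not a rule taken from HMMER3.

```python
from dataclasses import dataclass


@dataclass
class TopHit:
    evalue: float           # full-sequence E-value
    score: float            # full-sequence bit score
    bias: float             # composition bias correction
    best_dom_evalue: float  # E-value of the single best-scoring domain
    best_dom_score: float
    best_dom_bias: float
    exp_doms: float         # expected number of domains (an average)
    n_doms: int             # number of domains actually annotated
    name: str
    description: str


def parse_top_hit(line: str) -> TopHit:
    """Parse one whitespace-separated row of the ranked top hit list."""
    parts = line.strip().split(maxsplit=9)
    if len(parts) < 9:
        raise ValueError("expected at least 9 columns in a top hit row")
    description = parts[9] if len(parts) > 9 else ""
    return TopHit(
        float(parts[0]), float(parts[1]), float(parts[2]),
        float(parts[3]), float(parts[4]), float(parts[5]),
        float(parts[6]), int(parts[7]), parts[8], description,
    )


def bias_is_comparable(hit: TopHit) -> bool:
    """Heuristic (an assumption): bias within a factor of ten of the score."""
    return hit.bias > 0 and hit.bias * 10 >= hit.score


# Illustrative row with made-up numbers.
row = "  1.2e-30  105.4   0.3    2.0e-30  104.7   0.3    1.0  1  CLASSA0164  chorion class A protein"
hit = parse_top_hit(row)
print(hit.name, hit.evalue, bias_is_comparable(hit))
```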
In the second section (Fig. 9), for each sequence in the top hits list, there will be a section containing a table of where HMMER3 thinks all the domains are, followed by the alignment inferred for each domain. Domains are reported in the order they appear in the sequence, not in order of their significance. The ! or ? symbol indicates whether this domain does or does not satisfy both per-sequence and per-domain inclusion thresholds. The bit score and bias values are as described above for sequence scores, but are the score of just one domain's envelope. The first of the two E-values is the conditional E-value. The second number is the independent E-value. The next four columns give the endpoints of the reported local alignment with respect to both the query model (“hmm from” and “hmm to”) and the target sequence (“ali from” and “ali to”). The next two columns (“env from” and “env to”) define the envelope of the domain’s location on the target sequence. The last column is the average posterior probability of the aligned target sequence residues; effectively, the expected accuracy per residue of the alignment.
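A row of the per-domain table can be parsed in the same spirit. The sketch below assumes the row layout printed by HMMER3, in which each coordinate pair is followed by a two-character boundary marker (such as ".." or "[]") that is skipped here; the example row uses made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    index: int
    included: bool   # True for "!", False for "?"
    score: float
    bias: float
    c_evalue: float  # conditional E-value
    i_evalue: float  # independent E-value
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    env_from: int
    env_to: int
    acc: float       # average posterior probability of aligned residues


def parse_domain_row(line: str) -> DomainHit:
    """Parse one row of the domain table (16 whitespace-separated tokens)."""
    t = line.split()
    if len(t) != 16 or t[1] not in ("!", "?"):
        raise ValueError("not a domain table row")
    return DomainHit(
        index=int(t[0]), included=(t[1] == "!"),
        score=float(t[2]), bias=float(t[3]),
        c_evalue=float(t[4]), i_evalue=float(t[5]),
        hmm_from=int(t[6]), hmm_to=int(t[7]),    # t[8] is a boundary marker
        ali_from=int(t[9]), ali_to=int(t[10]),   # t[11] is a boundary marker
        env_from=int(t[12]), env_to=int(t[13]),  # t[14] is a boundary marker
        acc=float(t[15]),
    )


# Illustrative row with made-up numbers.
row = "   1 !  104.7   0.3   1.5e-34   2.0e-30       1      60 [.       5      64 ..       3      66 ..    0.97"
dom = parse_domain_row(row)
print(dom.included, dom.ali_from, dom.ali_to, dom.env_from, dom.env_to, dom.acc)
```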
Next, in the alignments section for each domain, an “optimal posterior accuracy” alignment (Holmes, 1998) is computed within each domain's envelope and displayed. The line starting with the user's protein name, here lepprot1010, is the consensus of the query model. Capital letters represent the most conserved (high information content) positions. Dots (.) in this line indicate insertions in the target sequence with respect to the model. The midline indicates matches between the query model and target sequence. A + indicates a positive score, which can be interpreted as a “conservative substitution” with respect to what the model expects at that position. The line starting with a LepChorionDB accession name, here CLASSA0164, is the target sequence. Dashes (-) in this line indicate deletions in the target sequence with respect to the model. The bottom line represents the posterior probability (essentially the expected accuracy) of each aligned residue. A 0 means 0-5%, 1 means 5-15%, and so on; 9 means 85-95%, and a * means 95-100% posterior probability. These posterior probabilities can be used to decide which parts of the alignment are well-determined or not. The user will also see expected alignment accuracy degrade at the ends of an alignment (Eddy, 2009). For more information, please refer to the HMMER3 User Guide.
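The posterior probability encoding described above is easy to decode. The following minimal sketch maps each symbol to its percentage range and flags the columns of a posterior probability line that reach a chosen lower bound; the function names and the 85% cut-off are illustrative choices, not part of HMMER3, and any other character in the line is simply treated as carrying no probability.

```python
from typing import List, Tuple


def posterior_range(symbol: str) -> Tuple[int, int]:
    """Return the (low, high) posterior probability range in percent."""
    if symbol == "*":
        return (95, 100)
    if len(symbol) == 1 and symbol in "0123456789":
        digit = int(symbol)
        if digit == 0:
            return (0, 5)
        return (10 * digit - 5, 10 * digit + 5)
    raise ValueError(f"not a posterior probability symbol: {symbol!r}")


def confident_mask(pp_line: str, min_lower_bound: int = 85) -> List[bool]:
    """Mark columns whose lower bound is at least min_lower_bound percent."""
    mask = []
    for ch in pp_line:
        if ch == "*" or ch in "0123456789":
            mask.append(posterior_range(ch)[0] >= min_lower_bound)
        else:
            mask.append(False)  # no posterior probability for this column
    return mask


print(posterior_range("1"), posterior_range("9"), posterior_range("*"))
print(confident_mask("79*5"))
```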
Iconomidou, V.A., and Hamodrakas, S.J. (2008). Natural protective amyloids. Curr Protein Pept Sci 9, 291-309.
Eddy, S.R. (2009). A new generation of homology search tools based on probabilistic inference. Genome Inform 23, 205-211.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410.
Edgar, R.C. (2004a). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.
Edgar, R.C. (2004b). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797.
Lipman, D.J., and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
Pearson, W.R., and Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-2448.
Kafatos, F.C., Regier, J.C., Mazur, G.D., Nadel, M.R., Blau, H.M., Petri, W.H., Wyman, A.R., Gelinas, R.E., Moore, P.B., Paul, M., Efstratiadis, A., Vournakis, J.N., Goldsmith, M.R., Hunsley, J.R., Baker, B., Nardi, J., and Koehler, M. (1977). The eggshell of insects: differentiation-specific proteins and the control of their synthesis and accumulation during development. Results Probl Cell Differ 8, 45-145.
Dawson, R.M.C., Elliott, D.C., Elliott, W.H., and Jones, K.M. (1989). Data for Biochemical Research. 3rd edition, Oxford Science Publications.
Halligan, B.D. (2009). ProMoST: a tool for calculating the pI and molecular mass of phosphorylated and modified proteins on two-dimensional gels. Methods Mol Biol 527, 283-298.
Holmes, I. (1998). Studies in Probabilistic Sequence Alignment and Evolution. PhD Thesis, University of Cambridge.
This work was a collaboration of the Department of Cell Biology and Biophysics, University of Athens and the Centre of Immunology & Transplantation Biomedical Research Foundation, Academy of Athens.