Chorion Lepidoptera DataBase

LepChorionDB: A relational database of Lepidoptera chorion proteins

Nikolaos G. Giannopoulos1,2, Ioannis Michalopoulos1, Nikos C. Papandreou2, Apostolos Malatras1,2, Vassiliki A. Iconomidou2 and Stavros J. Hamodrakas2

1Centre of Immunology & Transplantation, Biomedical Research Foundation, Academy of Athens
2Biophysics and Bioinformatics Laboratory, Faculty of Biology, University of Athens


Giannopoulos, N.G., Michalopoulos, I., Papandreou, N.C., Malatras, A., Iconomidou, V.A., and Hamodrakas, S.J. (2013). LepChorionDB, a database of Lepidopteran chorion proteins and a set of tools useful for the identification of chorion proteins in Lepidopteran proteomes. Insect Biochem Mol Biol 43, 189-196.

Basic Theory
The major component of the eggshell (90-95%) of many insect and fish eggs is chorion (Kafatos et al., 1977). Fig. 1A shows a scanning electron micrograph of a Bombyx mori chorion, a typical chorion of Lepidoptera. Proteins account for more than 95% of its dry mass. This proteinaceous shell forms the outer layer of the eggshell and has extraordinary mechanical and physiological properties, protecting the oocyte and the developing embryo from a series of environmental hazards such as temperature variations, mechanical pressure, proteases, bacteria and viruses. Fig. 1B shows an electron micrograph of a thin transverse section of a silkmoth chorion. A lamellar ultrastructure of packed fibrils is seen: silkmoth chorion is a biological analogue of a cholesteric liquid crystal. The proteinaceous silkmoth chorion is used in our lab as a model system for unravelling the routes and rules of formation of natural protective amyloids (Iconomidou and Hamodrakas, 2008). We therefore constructed a relational database, available on the web, containing all Lepidoptera chorion proteins identified to date.
Lepidoptera chorion proteins can be classified into two major protein families, A and B. This classification is based on multiple sequence alignments of conserved key residues in the central domain of well-characterised silkmoth chorion proteins. HMMER3 (Eddy, 2009) was used to build Hidden Markov Models (HMMs) from these core multiple sequence alignments. These HMMs were then used to run sensitive searches against various proteomes and identify chorion proteins of interest.
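The build-then-search workflow described above can be sketched as two HMMER3 command lines. The snippet below only assembles the argument lists for `hmmbuild` and `hmmsearch`; it does not run them. All file names (`classA_core.sto`, `classA.hmm`, `proteome.fasta`, `hits.tbl`) are hypothetical placeholders, and the E-value cut-off is an arbitrary illustration, not the value used for LepChorionDB.

```python
import shlex


def hmmbuild_command(msa_file: str, hmm_out: str) -> list[str]:
    """Argument list for building a profile HMM from a multiple alignment."""
    # HMMER3 usage: hmmbuild [options] <hmmfile_out> <msafile>
    return ["hmmbuild", hmm_out, msa_file]


def hmmsearch_command(hmm_file: str, proteome_fasta: str, tblout: str,
                      evalue: float = 10.0) -> list[str]:
    """Argument list for searching a proteome with a profile HMM."""
    # HMMER3 usage: hmmsearch [options] <hmmfile> <seqdb>
    # -E sets the per-sequence E-value reporting threshold,
    # --tblout writes a parseable per-sequence hit table.
    return ["hmmsearch", "-E", str(evalue), "--tblout", tblout,
            hmm_file, proteome_fasta]


# Hypothetical file names, shown only to illustrate the order of the steps.
build = hmmbuild_command("classA_core.sto", "classA.hmm")
search = hmmsearch_command("classA.hmm", "proteome.fasta", "hits.tbl", 0.001)
print(shlex.join(build))
print(shlex.join(search))
```

The lists can be passed to `subprocess.run` on a machine where HMMER3 is installed.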

Fig. 1: A. A scanning electron micrograph of a Bombyx mori chorion (bar 400 μm). B. Transmission electron micrograph of a thin section cut through a B. mori silkmoth chorion, showing the fibrous ultrastructure of its lamellae. In oblique sections, a parabolic pattern of fibres (ca. 110 Å) is seen within each lamella (white), which indicates that silkmoth chorion is a biological analogue of a cholesteric liquid crystal (bar 0.45 μm).

Scope of the database
LepChorionDB is the first database of Lepidoptera class A and B chorionic proteins. The goal was the collection of sequences and annotations of all chorionic proteins that have been stored in a series of databases to date, to facilitate research on chorionic and structural proteins in general. It is hoped that this database will be of help to genome annotators in the near future when more arthropod genomes become available. Furthermore, it is hoped that LepChorionDB will facilitate the detection of common properties of chorion proteins, as well as the recognition of important differences that are responsible for chorion functions. The database will be updated on a regular basis.


From the 'Search' page, users can retrieve one or more entries. A search can be performed in two ways:

  1. Search by parameters
  2. Search by accession name(s)

Search by parameters

In the first section of the search page, data can be retrieved by filling in the available fields (Fig. 2), as follows:
Fig. 2: The first section of the search page.

Search by accession name(s)

In the second section of the search page, data can be retrieved by submitting one or more protein names from LepChorionDB or other databases. Protein names can be separated by tabs, commas, spaces or newline characters (Fig. 3).
Fig. 3: The second section of the search page.
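As an illustration of the separator rule above, the following sketch splits a submitted string on tabs, commas, spaces and newline characters. It is not the server's actual code; the function name `split_accessions` and the choice to drop repeated names are assumptions made for this example.

```python
import re


def split_accessions(raw: str) -> list[str]:
    """Split user input into protein names on commas and any whitespace."""
    tokens = re.split(r"[,\s]+", raw.strip())
    seen = set()
    names = []
    for token in tokens:
        # Skip empty tokens (e.g. from a leading comma) and repeated names.
        # De-duplication is an assumption of this sketch.
        if token and token not in seen:
            seen.add(token)
            names.append(token)
    return names


print(split_accessions("CLASSB0002, CLASSA0164\tCLASSB0002\nCLASSA0164"))
```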

Result Set

After querying the database, one or more LepChorionDB tables will appear (Fig. 4).

Fig. 4: LepChorionDB output for the "CLASSB0002" protein.

The available fields of a LepChorionDB protein entry are:

Compare tool

The Compare tool performs a protein sequence homology search against LepChorionDB entries. It is based on the jackhmmer program from the HMMER3 software package. Its input is a single protein sequence, which is queried iteratively against LepChorionDB, much as PSI-BLAST would do. The first round is identical to a phmmer search (BLAST-like). All matches that pass the inclusion thresholds are put in a multiple alignment. In the second and subsequent rounds, a profile is built from this alignment, and the database is searched again with the profile. Iterations continue until either no new sequences are detected or the maximum number of iterations is reached (PSI-BLAST-like). The user can set the maximum number of iterations from one to five. The original query sequence is always included in the multiple alignments, whether or not it appears in LepChorionDB (Eddy, 2009).
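The stopping rule of the iterative search can be illustrated with a toy loop. This is a sketch of the control flow only, not of jackhmmer itself: the `search` callback stands in for a real profile search, and `NEIGHBOURS`, `toy_search` and the sequence names are hypothetical.

```python
from typing import Callable, FrozenSet, Set, Tuple


def iterative_search(search: Callable[[FrozenSet[str]], Set[str]],
                     query: str,
                     max_iterations: int = 5) -> Tuple[Set[str], int]:
    """Repeat a search until no new sequences appear or the limit is hit."""
    if not 1 <= max_iterations <= 5:
        raise ValueError("max_iterations must be between 1 and 5")
    included = {query}  # the original query is always part of the profile
    rounds = 0
    for rounds in range(1, max_iterations + 1):
        hits = search(frozenset(included))
        new = hits - included
        if not new:  # converged: this round detected nothing new
            break
        included |= new
    return included, rounds


# Toy stand-in for a database: each sequence "detects" its neighbours.
NEIGHBOURS = {"query": {"seqA"}, "seqA": {"seqB"}, "seqB": set()}


def toy_search(profile: FrozenSet[str]) -> Set[str]:
    hits: Set[str] = set()
    for member in profile:
        hits |= NEIGHBOURS.get(member, set())
    return hits


print(iterative_search(toy_search, "query"))
```

In the toy data, the third round finds nothing new, so the loop stops before reaching the limit of five.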

Compare tool - input fields

In the Compare tool input interface, the user submits the query sequence and can optionally set two further fields (Fig. 6), as follows.

Fig. 6: Compare tool input interface.

The available fields in Compare tool of LepChorionDB are:

Filter tool

Filter tool is used to recognise class A or B proteins from a protein sequence database by performing HMM searches.

Filter tool input interface

In the Filter tool input interface, the user submits a list of FASTA-formatted protein sequences and can optionally set two further fields (Fig. 7), as follows.

Fig. 7: Filter tool input interface.
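For readers preparing input for the Filter tool, the sketch below shows one way to check a FASTA-formatted list of sequences before submitting it. The function `parse_fasta` and the example sequences are hypothetical, and the sketch makes no claim about how the server itself validates input.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each sequence name (first word of its header) to its residues."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        else:
            if name is None:
                raise ValueError("sequence data before the first header")
            records[name] += line  # sequences may span several lines
    return records


# Made-up residues, used only to show the expected layout.
example = ">lepprot1010 example protein\nMSTF\nALLC\n>lepprot1011\nGGYG\n"
print(parse_fasta(example))
```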

The fields available in Filter tool of LepChorionDB are:

Compare & Filter tool Output results

The Compare and Filter tools share the same output format. The output has two sections, 'Ranked list of top hits' (Fig. 8) and 'Domain annotation for each sequence (and alignments)' (Fig. 9), which are described below:

Fig. 8: Ranked list of top hits

Ranked list of top hits

The first section is the sequence top hit list. It is a list of ranked top hits (sorted by E-value, most significant hit first), formatted in a BLAST-like style. The most important number here is the first one, the sequence E-value. This is the statistical significance of the match to this sequence. The lower the E-value, the more significant the hit. The E-value is based on the sequence bit score, which is the second number. This is the log-odds score for the complete sequence. Some people like to see a bit score instead of an E-value, because the bit score does not depend on the size of the sequence database, only on the profile HMM and the target sequence. The next number, the bias, is a correction term for biased sequence composition that has been applied to the sequence bit score. The user only needs to pay attention to this value when it is large and of the same order of magnitude as the sequence bit score. This might be a sign that the target sequence is not really a homolog, but merely shares a similarly strong biased composition with the query model.
The next three numbers are again an E-value, score, and bias, but only for the single best-scoring domain in the sequence, rather than the sum of all its identified domains.
The two columns headed #doms are two different estimates of the number of distinct domains that the target sequence contains. The first, the column marked exp, is the expected number of domains according to HMMER's statistical model. It is an average, calculated as a weighted marginal sum over all possible alignments. Because it is an average, it is not necessarily a round integer. The second, the column marked N, is the number of domains that HMMER3's domain postprocessing and annotation pipeline finally decided to identify, annotate, and align in the target sequence. This is the number of alignments that will show up in the domain report later in the output file.
The last two columns are the name of each target sequence which is linked with LepChorionDB and a description (Eddy, 2009).
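The column layout described above can be made concrete with a small parser for one row of the ranked list. This is a sketch under stated assumptions: it expects a plain row containing exactly the columns listed above, the example row is invented, and the 0.1 ratio used to flag a large bias is an arbitrary illustration of "same order of magnitude", not a HMMER rule.

```python
from dataclasses import dataclass


@dataclass
class TopHit:
    seq_evalue: float   # full-sequence E-value
    seq_score: float    # full-sequence bit score
    seq_bias: float     # composition bias correction
    dom_evalue: float   # best single domain E-value
    dom_score: float
    dom_bias: float
    exp_doms: float     # expected number of domains (an average)
    n_doms: int         # number of domains actually reported
    name: str
    description: str


def parse_top_hit(line: str) -> TopHit:
    """Parse one whitespace-separated row of the ranked top hit list."""
    fields = line.split(None, 9)  # keep the description in one piece
    if len(fields) < 9:
        raise ValueError("expected at least nine columns")
    description = fields[9].strip() if len(fields) > 9 else ""
    numbers = [float(value) for value in fields[:7]]
    return TopHit(*numbers, int(fields[7]), fields[8], description)


def bias_looks_large(hit: TopHit, ratio: float = 0.1) -> bool:
    """Flag hits whose bias is at least `ratio` times the bit score."""
    # The ratio is an assumption chosen for illustration only.
    return hit.seq_bias >= ratio * hit.seq_score


# Invented example row in the column order described above.
row = "1.2e-50  170.1  0.3  1.4e-50  169.9  0.3  1.0  1  CLASSA0164  chorion class A protein"
hit = parse_top_hit(row)
print(hit.name, hit.seq_evalue, bias_looks_large(hit))
```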

Fig. 9: Domain annotation for each sequence (and alignments)

Domain annotation for each sequence (and alignments)

In the second section (Fig. 9), each sequence in the top hits list has its own subsection containing a table of where HMMER3 thinks all the domains are, followed by the alignment inferred for each domain. Domains are reported in the order they appear in the sequence, not in order of their significance. The ! or ? symbol indicates whether the domain does (!) or does not (?) satisfy both the per-sequence and per-domain inclusion thresholds. The bit score and bias values are as described above for sequence scores, but are the score of just one domain's envelope. The first of the two E-values is the conditional E-value. The second number is the independent E-value. The next four columns give the endpoints of the reported local alignment with respect to both the query model (“hmm from” and “hmm to”) and the target sequence (“ali from” and “ali to”). The next two columns (“env from” and “env to”) define the envelope of the domain’s location on the target sequence. The last column is the average posterior probability of the aligned target sequence residues; effectively, the expected accuracy per residue of the alignment.
Next, in the 'alignments for each domain' section, an “optimal posterior accuracy” alignment (Holmes, 1998) is computed within each domain's envelope and displayed. The line starting with the user's protein name, here lepprot1010, is the consensus of the query model. Capital letters represent the most conserved (high information content) positions. Dots (.) in this line indicate insertions in the target sequence with respect to the model. The midline indicates matches between the query model and target sequence. A + indicates a positive score, which can be interpreted as a “conservative substitution” with respect to what the model expects at that position. The line starting with a LepChorionDB accession name, here CLASSA0164, is the target sequence. Dashes (-) in this line indicate deletions in the target sequence with respect to the model. The bottom line represents the posterior probability (essentially the expected accuracy) of each aligned residue. A 0 means 0-5%, 1 means 5-15%, and so on; 9 means 85-95%, and a * means 95-100% posterior probability. These posterior probabilities can be used to decide which parts of the alignment are well-determined and which are not. The user will also see the expected alignment accuracy degrade at the ends of an alignment (Eddy, 2009). For more information, please refer to the HMMER3 User's Guide.
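The posterior probability code on the bottom line can be decoded mechanically. The sketch below maps each symbol to the percentage range given above and lists the alignment columns whose lower bound reaches a chosen cut-off. The function names and the default cut-off of 85% are assumptions for illustration; any other character (such as a gap marker) is treated as carrying no probability.

```python
from typing import List, Optional, Tuple


def pp_range(symbol: str) -> Optional[Tuple[int, int]]:
    """Return the (low, high) percentage range encoded by one PP symbol."""
    if symbol == "*":
        return (95, 100)
    if symbol == "0":
        return (0, 5)
    if len(symbol) == 1 and symbol in "123456789":
        digit = int(symbol)
        return (10 * digit - 5, 10 * digit + 5)  # e.g. 1 -> 5-15%, 9 -> 85-95%
    return None  # not a posterior probability symbol


def well_determined_columns(pp_line: str, min_lower: int = 85) -> List[int]:
    """Indices of columns whose lower probability bound is >= min_lower."""
    columns = []
    for index, symbol in enumerate(pp_line):
        bounds = pp_range(symbol)
        if bounds is not None and bounds[0] >= min_lower:
            columns.append(index)
    return columns


print(pp_range("9"))
print(well_determined_columns("589*.9"))
```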

Iconomidou, V.A., and Hamodrakas, S.J. (2008). Natural protective amyloids. Curr Protein Pept Sci 9, 291-309.
Eddy, S.R. (2009). A new generation of homology search tools based on probabilistic inference. Genome Inform 23, 205-211.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410.
Edgar, R.C. (2004a). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.
Edgar, R.C. (2004b). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797.
Lipman, D.J., and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
Pearson, W.R., and Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-2448.
Kafatos, F.C., Regier, J.C., Mazur, G.D., Nadel, M.R., Blau, H.M., Petri, W.H., Wyman, A.R., Gelinas, R.E., Moore, P.B., Paul, M., Efstratiadis, A., Vournakis, J.N., Goldsmith, M.R., Hunsley, J.R., Baker, B., Nardi, J., and Koehler, M. (1977). The eggshell of insects: differentiation-specific proteins and the control of their synthesis and accumulation during development. Results Probl Cell Differ 8, 45-145.
Dawson, R.M.C., Elliott, D.C., Elliott, W.H., and Jones, K.M. (1989). Data for Biochemical Research. 3rd edition, Oxford Science Publications.
Halligan, B.D. (2009). ProMoST: a tool for calculating the pI and molecular mass of phosphorylated and modified proteins on two-dimensional gels. Methods Mol Biol 527, 283-298.
Holmes, I. (1998). Studies in Probabilistic Sequence Alignment and Evolution. PhD Thesis, University of Cambridge.

This work was a collaboration of the Department of Cell Biology and Biophysics, University of Athens, and the Centre of Immunology & Transplantation, Biomedical Research Foundation, Academy of Athens.