LepChorionDB Manual

LepChorionDB: A relational database of Lepidoptera chorion proteins

Nikolaos G. Giannopoulos^1,2, Ioannis Michalopoulos¹, Nikos C. Papandreou², Apostolos Malatras^1,2, Vassiliki A. Iconomidou² and Stavros J. Hamodrakas²

¹Centre of Immunology & Transplantation, Biomedical Research Foundation, Academy of Athens
²Biophysics and Bioinformatics Laboratory, Faculty of Biology, University of Athens

Citation
Basic Theory
Scope of the database
Search
Compare tool
- Compare input data
- Compare output results
Filter tool
- Filter input data
- Filter output results
References
Acknowledgements

Citation
Giannopoulos, N.G., Michalopoulos, I., Papandreou, N.C., Malatras, A., Iconomidou, V.A., and Hamodrakas, S.J. (2013). LepChorionDB, a database of Lepidopteran chorion proteins and a set of tools useful for the identification of chorion proteins in Lepidopteran proteomes. Insect Biochem Mol Biol 43, 189-196.

Basic Theory
The major component of the eggshell (90-95%) of many insect and fish eggs is chorion (Kafatos et al., 1977). Fig. 1A shows a scanning electron micrograph of a Bombyx mori chorion, a typical chorion of Lepidoptera. Proteins account for more than 95% of its dry mass. This proteinaceous shell forms the outer layer of the eggshell and has extraordinary mechanical and physiological properties, protecting the oocyte and the developing embryo from a series of environmental hazards such as temperature variations, mechanical pressure, proteases, bacteria, viruses etc. Fig.1B shows an electron micrograph of a thin transverse section of a silkmoth chorion. A lamellar ultrastructure of packed fibrils is seen: silkmoth chorion is a biological analogue of a cholesteric liquid crystal. The proteinaceous Lepidopteran silkmoth chorion is used in our lab, as a model system towards unraveling the routes and rules of formation of natural protective amyloids (Iconomidou and Hamodrakas, 2008). Therefore, we constructed a relational database, available on the web, containing all Lepidoptera chorion proteins identified to date.
Lepidoptera chorion proteins can be classified into two major protein families, A and B. This classification was based on multiple sequence alignments of conserved key residues, in the central domain of well characterised silkmoth chorion proteins. HMMER3 (Eddy, 2009) was used to build Hidden Markov Models (HMMs), from these core multiple sequence alignments. Then, these HMMs were used in order to make sensitive searches against various proteomes and identify chorion proteins of interest.

**Fig. 1: A.** A scanning electron micrograph of a *Bombyx mori* chorion (bar 400 μm). B. Transmission electron micrograph of a thin section cut through a *B. mori* silkmoth chorion, showing the fibrous ultrastructure of its lamellae. The parabolic pattern of fibres (ca. 110 Å) within each lamella (white), in oblique sections, is seen, which indicates that silkmoth chorionic is a biological analogue of a cholesteric liquid crystal (bar 0.45 μm).
A
B

Scope of the database
LepChorionDB is the first database of Lepidoptera class A and B chorionic proteins. The goal was the collection of sequences and annotations of all chorionic proteins that have been stored in a series of databases to date, to facilitate research on chorionic and structural proteins in general. It is hoped that this database will be of help to genome annotators in the near future when more arthropod genomes become available. Furthermore, it is hoped that LepChorionDB will facilitate the detection of common properties of chorion proteins, as well as the recognition of important differences that are responsible for chorion functions. The database will be updated on a regular basis.

Search

From the 'Search' page, the users retrieve one or more entries. This search can be done in two ways:

Search by parameters
Search by accession name(s)

Search by parameters

In the first section at search page, data can be obtained by submitting information to the fields available (Fig. 2), as follows:

**Fig. 2:** The first section at search page.

In the "Chorion Protein Class" field the users can choose between class A and B chorion proteins in order to limit the results to one of the chorionic families (if left blank, both classes are included).
In the "Organism" field the users can choose between the following ten species, in order to obtain results for a specific Lepidopteron (if left blank, all species are included):
- Agrotis segetum
- Antheraea assama
- Antheraea polyphemus
- Bicyclus anynana
- Bombyx mori
- Heliothis virescens
- Hyalophora cecropia
- Lymantria dispar
- Manduca sexta
- Melitaea cinxia
In the "Source" field the users can choose between the following databases (if left blank, all databases are included):
- ButterflyBase
- Entrez
- InsectaCentral
- silkpep
- silkworm_gene_bgf
- UniProt
In the "P-value Cut-off" field the users can determine a P-value threshold. The lower the p-value the more statistically significant the result is. The p-value is based on the HMMER3 program.
In the "Score Cut-off" field the users can determine a HMMER score threshold. The score matrix is based on the BLOSUM62.
In the "Molecular Weight (kD)" field the users can determine the range of molecular weight. The amino acid molecular weights were based on webQC chemical portal.
In the "Isoelectric Point" field the users can determine the range of pI. The calculations where based on the ProMoST algorithm (Halligan, 2009) and amino acid pK_a and pK_b were found from "Data for Biochemical Research" (Dawson et al., 1986).

Search by accession name(s)

In the second section of the search page data can be obtained by submitting one or more protein names from LepChorionDB or other databases. Protein names can be separated by tabs, commas, spaces or new line characters (Fig. 3).

**Fig. 3:** The second section at search page.

Result Set

After querying the database, one or more Chorion DB tables will appear (Fig. 4).

**Fig. 4:** LepChorionDB output for the "CLASSB0002" protein.

The available fields of a LepChorionDB protein entry are:

ProteinName: Identical proteins from different sources share a common LepChorionDB protein name. ProteinName prefix corresponds to the chorionic family the protein belongs to and its suffix is a unique four-digit number.
Sequence: It shows the protein sequence in FASTA format. The header line (starting with the ">" symbol) contains the LepChorionDB ProteinName followed by names from external databases that refer to the same protein. Also, the conserved domain is underlined in the protein sequence.
Organism: It shows the species from where the protein is originated. The species name is linked to NCBI taxonomy page for detailed information about every Lepidoptera species.
Links: It shows links to external databases that refer to the same protein.
Source: It shows the database where the protein has been downloaded from.
PubMed: It shows the bibliographic reference where the protein has been described.
Pvalue: It shows the statistical significance as described at the parametric search section above.
Score: It is a HMMER alignment score as described at the parametric search section above.
From: It is the first amino acid of the conserved domain of the protein sequence.
To: It is the last amino acid of the conserved domain of the protein sequence.
MW: It is the calculated molecular weight of the protein as above.
pI: It is the estimated isoelectric point of the protein, calculated as described above.
Group: It shows similar proteins which are grouped together in multiple sequence alignments in CLUSTAL W (2.0.12) format.

Compare tool

Compare tool is used in order to perform a protein sequence homology search against LepChorionDB entries. It is based on the jackhmmer program from HMMER3 software package. Compare tool input is a single protein sequence which is queried iteratively against LepChorionDB, much as PSI-BLAST would do. The ﬁrst round is identical to a phmmer search (BLAST like). All the matches that pass the inclusion thresholds are put in a multiple alignment. In the second (and subsequent) rounds, a proﬁle is made from these results, and the database is searched again with the proﬁle. Iterations continue either until no new sequences are detected or the maximum number of iterations is reached (PSI-BLAST like). The given range of iterations to the user is one to five. The original query sequence is always included in the multiple alignments, whether or not it appears in LepChorionDB (Eddy, 2009).

Compare tool - input fields

In the Compare tool input interface, the user can submit the sequence query and optionally determine another two fields (Fig. 6), as follows.

**Fig. 6:** Compare tool input interface.

The available fields in Compare tool of LepChorionDB are:

Protein sequence text area: In this box the user should input strictly one FASTA-formatted protein sequence which homologous searching will be based on.
upload file: The upload form allows user to upload a single FASTA-formatted protein sequence from a file.
Iteration(s): In this field the user sets the maximum number of iterations. One iteration gives results equivalent to a phmmer search (BLAST like) and it is used for closer homologues. The default is one but the system can support up to five iterations which is better for distant homologues (PSI-BLAST like).
Choose E-value: In this field the user can limit the output results by setting a per-sequence E-value threshold. The default is 1.0E-5.

Filter tool

Filter tool is used to recognise class A or B proteins from a protein sequence database by performing HMM searches.

Filter tool input interface

In the Filter tool input interface, the user can submit a list of FASTA-formatted protein sequences and optionally determine another two fields (Fig. 7), as follows.

**Fig. 7:** Compare tool input interface.

The fields available in Filter tool of LepChorionDB are:

Sequence database text area: In this field the user inputs a sequence database in "FASTA" or "EMBL/Uniprot text" in which the sequence homology search will be performed.
upload file: The upload form allows user to upload sequence database from a file.
Choose protein class: In this field the user determines the chorion class to search for. The user can choose between class A and B chorion proteins.
Choose E-value: In this field the user can limit the output results by setting a per-sequence E-value threshold. The default is 1.0E-5.

Compare & Filter tool Output results

Both Compare & Filter tool have the same output results format. The two sections of the result output are the 'Ranked list of top hits (Fig. 8) and 'Domain annotation for each sequence (and alignments) (Fig. 9) which are described below:

**Fig. 8:** Ranked list of top hits

Ranked list of top hits

The first section is the sequence top hit list. It is a list of ranked top hits (sorted by E-value, most signiﬁcant hit ﬁrst), formatted in a BLAST-like style. The most important number here is the ﬁrst one, the sequence E-value. This is the statistical signiﬁcance of the match to this sequence. The lower the E-value, the more signiﬁcant the hit. The E-value is based on the sequence bit score, which is the second number. This is the log-odds score for the complete sequence. Some people like to see a bit score instead of an E-value, because the bit score does not depend on the size of the sequence database, only on the proﬁle HMM and the target sequence. The next number, the bias, is a correction term for biased sequence composition that has been applied to the sequence bit score. The only time the user really needs to pay attention to this value is when it is large, and on the same order of magnitude as the sequence bit score. This might be a sign that the target sequence is not really a homolog, but merely shares a similar strong biased composition with the query model. The next three numbers are again an E-value, score, and bias, but only for the single best-scoring domain in the sequence, rather than the sum of all its identiﬁed domains. The two columns headed #doms are two different estimates of the number of distinct domains that the target sequence contains. The ﬁrst, the column marked exp, is the expected number of domains according to HMMER's statistical model. It is an average, calculated as a weighted marginal sum over all possible 18alignments. Because it is an average, it is not necessarily a round integer. The second, the column marked N, is the number of domains that HMMER3's domain postprocessing and annotation pipeline ﬁnally decided to identify, annotate, and align in the target sequence. This is the number of alignments that will show up in the domain report later in the output ﬁle. The last two columns are the name of each target sequence which is linked with LepChorionDB and a description (Eddy, 2009).

**Fig. 9:** Domain annotation for each sequence (and alignments)

Domain annotation for each sequence (and alignments)

In the second section (Fig.9) for each sequence in the top hits list, there will be a section containing a table of where HMMER3 thinks all the domains are, followed by the alignment inferred for each domain . Domains are reported in the order they appear in the sequence, not in order of their significance. The ! or ? symbol indicates whether this domain does or does not satisfy both per-sequence and per-domain inclusion thresholds. The bit score and bias values are as described above for sequence scores, but are the score of just one domain's envelope. The first of the two E-values is the conditional E-value. The second number is the independent E-value. The next four columns give the endpoints of the reported local alignment with respect to both the query model (“hmm from” and “hmm to”) and the target sequence (“ali from” and “ali to”).The next two columns (“env from” and “env to”) deﬁne the envelope of the domain’s location on the target sequence. The last column is the average posterior probability of the aligned target sequence residues; effectively, the expected accuracy per residue of the alignment.
Next, at the alignments for each domain section, is following an “optimal posterior accuracy” alignment (Holmes, 1998) which is computed within each domain's envelope, and displayed. The line starting with a user's protein name, here lepprot1010, is the consensus of the query model. Capital letters represent the most conserved (high information content) positions. Dots (.) in this line indicate insertions in the target sequence with respect to the model. The midline indicates matches between the query model and target sequence. A + indicates positive score, which can be interpreted as “conservative substitution”, with respect to what the model expects at that position. The line starting with a LepChorionDB accession name, here CLASSA0164, is the target sequence. Dashes (-) in this line indicate deletions in the target sequence with respect to the model. The bottom line represents the posterior probability (essentially the expected accuracy) of each aligned residue. A 0 means 0-5%, 1 means 5-15%, and so on; 9 means 85-95%, and a * means 95-100% posterior probability. These posterior probabilities can be used to decide which parts of the alignment are well-determined or not. The user will also see expected alignment accuracy degrade at the ends of an alignment (Eddy, 2009). For more information please refer to HMMER3 User guide.

References
Iconomidou, V.A., and Hamodrakas, S.J. (2008). Natural protective amyloids. Curr Protein Pept Sci 9, 291-309.
Eddy, S.R. (2009). A new generation of homology search tools based on probabilistic inference. Genome Inform 23, 205-211.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410.
Edgar, R.C. (2004a). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.
Edgar, R.C. (2004b). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797.
Lipman, D.J., and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
Pearson, W.R., and Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-2448.
Kafatos, F.C., Regier, J.C., Mazur, G.D., Nadel, M.R., Blau, H.M., Petri, W.H., Wyman, A.R., Gelinas, R.E., Moore, P.B., Paul, M., Efstratiadis, A., Vournakis, J.N., Goldsmith, M.R., Hunsley, J.R., Baker, B., Nardi, J., and Koehler, M. (1977). The eggshell of insects: differentiation-specific proteins and the control of their synthesis and accumulation during development. Results Probl Cell Differ 8, 45-145.
Dawson, R.M.C., Elliott, D.C., Elliott, W.H., Jones, K.M. (1989) Data for Biochemical Research. 3rd edition, Oxford Science Publications.
Halligan, B.D. (2009) ProMoST: A Tool for Calculating the pI and Molecular Mass of Phosphorylated and Modified Proteins on Two-Dimensional Gels. Methods Mol Biol 527, 283-298.
Holmes, I. (1998). Studies in Probabilistic Sequence Alignment and Evolution. PhD Thesis, University of Cambridge.

Acknowledgements
This work was a collaboration of the Department of Cell Biology and Biophysics, University of Athens and the Centre of Immunology & Transplantation Biomedical Research Foundation, Academy of Athens.