Arabidopsis thaliana:
|
Arabidopsis thaliana is the first eukaryotic
organism on which EuGène has been extensively used.
The following tables compare the results obtained
with EuGène and with several other gene prediction
programs (listed in Table 1).
| Program | Reference | Version |
|---|---|---|
| Grail | Xu and Uberbacher, 1997 | 1.3, data from [*] |
| Fex | Solovyev et al., 1994 | data from [*] |
| MZEF | Zhang, 1998 | prior p=0.04, data from [*] |
| GenScan | Burge and Karlin, 1997 | data from [*] |
| GlimmerA | Salzberg et al., 1999 | 1.0 |
| GMhmm1 | Lukashin and Borodovsky, 1998 | data from [*] |
| GMhmm | Lukashin and Borodovsky, 1998 | 2.2a |
| FgenesP | Solovyev, unpublished, 1997 | data from [*] |
| FgenesH | Salamov and Solovyev, unpublished, 1999 | 1.0 |
| EuGène50 | - | Rel. 1.2, 50% sensitivity |
| EuGène80 | - | Rel. 1.2, 80% sensitivity |
Table 1: The gene prediction programs compared, with
the corresponding bibliographic reference and version.
[*]: N. Pavy, S. Rombauts, P. Déhais, C. Mathé,
D.V.V. Ramana, P. Leroy, and P. Rouzé. Evaluation of
gene prediction software using a genomic data set:
application to Arabidopsis thaliana
sequences. Bioinformatics, 1999, 15(11):887-99.
| Program | Sn (base) | Sp (base) | CC (base) | Pred | Corr | Olap | Wrong | Miss | Split | Merged | Sn (exon) | Sp (exon) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grail | - | - | - | 1184 | 449 | 506 | 229 | 80 | 12 | 16 | 44% | 38% |
| Fex | - | - | - | 1745 | 562 | 484 | 699 | 155 | 180 | 23 | 55% | 32% |
| MZEF | - | - | - | 846 | 459 | 236 | 151 | 358 | 32 | 14 | 45% | 54% |
| GenScan | - | - | - | 938 | 652 | 204 | 82 | 175 | 10 | 16 | 63% | 70% |
| FgenesP | - | - | - | 737 | 433 | 195 | 109 | 403 | 7 | 8 | 42% | 59% |
| GMhmm1 | - | - | - | 1104 | 845 | 172 | 87 | 26 | 10 | 4 | 82% | 77% |
| GMhmm | 0.97 | 0.93 | 0.94 | 1093 | 854 | 157 | 85 | 28 | 6 | 6 | 83% | 78% |
| GlimmerA | 0.87 | 0.89 | 0.84 | 1034 | 697 | 186 | 164 | 149 | 5 | 26 | 67% | 67% |
| FgenesH | 0.98 | 0.93 | 0.94 | 1070 | 900 | 105 | 72 | 23 | 1 | 14 | 88% | 84% |
| FgenesHGC | 0.98 | 0.93 | 0.94 | 1021 | 902 | 100 | 78 | 25 | 0 | 12 | 88% | 88% |
| EuGène50 | 0.95 | 0.95 | 0.93 | 974 | 849 | 94 | 38 | 86 | 2 | 14 | 83% | 87% |
| EuGène80 | 0.95 | 0.95 | 0.93 | 991 | 862 | 89 | 47 | 71 | 1 | 14 | 84% | 87% |

| Program | Pred | Corr | Miss | Part | Wrong | Split | Merged | Sn | Sp |
|---|---|---|---|---|---|---|---|---|---|
| GenScan | 150 | 28 | 1 | 139 | 13 | 1 | 60 | 17% | 19% |
| FgenesP | 92 | 10 | 47 | 111 | 3 | 0 | 60 | 6% | 11% |
| GMhmm1 | 208 | 67 | 1 | 100 | 27 | 18 | 12 | 40% | 32% |
| GMhmm | 187 | 69 | 1 | 98 | 24 | 2 | 12 | 41% | 37% |
| GlimmerA | 265 | 50 | 2 | 116 | 58 | 35 | 0 | 30% | 19% |
| FgenesH | 176 | 94 | 1 | 73 | 14 | 0 | 10 | 56% | 53% |
| FgenesHGC | 175 | 96 | 1 | 71 | 14 | 0 | 12 | 57% | 55% |
| EuGène50 | 178 | 101 | 5 | 62 | 15 | 1 | 2 | 60% | 57% |
| EuGène80 | 199 | 112 | 2 | 54 | 22 | 11 | 0 | 67% | 56% |

Table 2: Results of the gene prediction software
evaluation on Araset at the base and exon levels. Each
line corresponds to a gene prediction program described
in Table 1. At the base level, on the left,
sensitivity (Sn), specificity (Sp) and the
correlation coefficient (CC) are reported. At the exon
level, the columns successively give the number of
predicted exons (Pred), correct exons (Corr), exons
overlapping an annotated exon in the same coding frame
(Olap), overpredicted exons (Wrong) and missing exons
(Miss). The next two columns give the number of
annotated exons that are predicted as two split exons
(Split) and the number of predicted exons that merge
several annotated exons into one (Merged). The last two
columns report sensitivity (Sn) and specificity (Sp) at
the exon level.
|
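The exon-level categories described above can be illustrated with a small sketch. It is not the evaluation script used for these tables: the `Exon` type, the `evaluate_exons` function and the toy coordinates are hypothetical, and the rule that any predicted exon that is neither correct nor overlapping in frame counts as "wrong" is an assumption made here for simplicity.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int  # first coding base (inclusive)
    end: int    # last coding base (inclusive)
    frame: int  # coding frame: 0, 1 or 2


def overlaps(a: Exon, b: Exon) -> bool:
    """True when the two intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def evaluate_exons(predicted: list[Exon], annotated: list[Exon]) -> dict:
    """Count exon-level categories and derive Sn and Sp (hypothetical helper)."""
    annotated_set = set(annotated)
    corr = olap = wrong = 0
    for p in predicted:
        if p in annotated_set:
            corr += 1   # same boundaries and same frame
        elif any(overlaps(p, a) and p.frame == a.frame for a in annotated):
            olap += 1   # overlaps an annotated exon in the same frame
        else:
            wrong += 1  # assumption: everything else is an overprediction
    # An annotated exon is missing when no predicted exon touches it.
    miss = sum(
        1 for a in annotated if not any(overlaps(a, p) for p in predicted)
    )
    return {
        "pred": len(predicted),
        "corr": corr,
        "olap": olap,
        "wrong": wrong,
        "miss": miss,
        "sn": corr / len(annotated),  # correct exons / annotated exons
        "sp": corr / len(predicted),  # correct exons / predicted exons
    }


# Toy data, invented for illustration only.
annotated = [Exon(100, 200, 0), Exon(300, 400, 1), Exon(500, 600, 2)]
predicted = [Exon(100, 200, 0), Exon(310, 400, 1), Exon(700, 750, 0)]
result = evaluate_exons(predicted, annotated)
print(result)
```

Split and merged exons are left out of this sketch; counting them would require tracking how many predicted exons overlap each annotated exon and vice versa.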
Table 3: Results of the gene prediction software
evaluation on Araset at the whole gene level. Each line
corresponds to a gene prediction program described
in Table 1. The first columns give the number of
predicted genes (Pred), completely correct genes (Corr),
completely missing genes (Miss), partially predicted
genes (Part) and overpredicted genes (Wrong). The next
two columns give the number of annotated genes that are
split in the predictions (Split) and the number of
predicted genes that merge several annotated genes into
one (Merged). The last two columns report sensitivity
(Sn) and specificity (Sp) at the gene level.
|
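The gene-level Sn and Sp columns of Table 3 appear to follow directly from the counts: Sp matches Corr / Pred, and Sn matches Corr divided by the number of annotated genes. The sketch below recomputes a few rows under that reading. The total of 168 annotated genes is not stated in the text; it is inferred here from the reported percentages and should be treated as an assumption, as should the helper name `sn_sp`.

```python
# Assumption: 168 annotated genes in Araset, inferred from the reported
# percentages rather than stated in the text.
ANNOTATED_GENES = 168

# (Pred, Corr) pairs copied from Table 3.
GENE_COUNTS = {
    "GenScan": (150, 28),
    "FgenesH": (176, 94),
    "EuGène50": (178, 101),
    "EuGène80": (199, 112),
}


def sn_sp(pred: int, corr: int, annotated: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) in percent."""
    sn = 100.0 * corr / annotated  # share of annotated genes found exactly
    sp = 100.0 * corr / pred       # share of predicted genes that are exact
    return sn, sp


recomputed = {
    name: sn_sp(pred, corr, ANNOTATED_GENES)
    for name, (pred, corr) in GENE_COUNTS.items()
}
for name, (sn, sp) in recomputed.items():
    print(f"{name}: Sn={sn:.2f}% Sp={sp:.2f}%")
```

The same relation seems to hold at the exon level with Corr and Pred taken from the exon columns, although the number of annotated exons would again have to be inferred.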
| Version | Sn (base) | Sp (base) | CC (base) | Pred | Corr | Olap | Wrong | Miss | Split | Merged | Sn (exon) | Sp (exon) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuGène50 | 0.948 | 0.952 | 0.934 | 974 | 849 | 94 | 38 | 86 | 2 | 14 | 82.67% | 87.17% |
| EuGène50EST | 0.961 | 0.952 | 0.942 | 1014 | 895 | 79 | 46 | 53 | 0 | 12 | 87.15% | 88.26% |
| EuGène50Prot | 0.981 | 0.955 | 0.957 | 1035 | 926 | 75 | 40 | 27 | 1 | 12 | 90.17% | 89.47% |
| EuGène50Full | 0.980 | 0.953 | 0.955 | 1047 | 936 | 65 | 50 | 26 | 0 | 8 | 91.14% | 89.40% |

Table 4: Results of EuGène's evaluation on Araset at
the base and exon levels when similarity information is
used. Each line corresponds to a variant of EuGène that
is given either no similarity information
(EuGène50), similarity information from ESTs and
cDNAs from the dbEST and PlantGene databases
(EuGène50EST), similarity information from
SwissProt rel. 40, from which all Arabidopsis
thaliana sequences have been removed
(EuGène50Prot), or both EST and protein
similarities (EuGène50Full). At the base
level, on the left, sensitivity (Sn),
specificity (Sp) and the correlation
coefficient (CC) are reported. At the exon level, the
columns successively give the number of predicted exons
(Pred), correct exons (Corr), exons overlapping an
annotated exon in the same coding frame (Olap),
overpredicted exons (Wrong) and missing exons (Miss).
The next two columns give the number of annotated exons
that are predicted as two split exons (Split) and the
number of predicted exons that merge several annotated
exons into one (Merged). The last two columns report
sensitivity (Sn) and specificity (Sp) at the exon
level.
|
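The base-level measures can be sketched as follows. The page does not spell out the formulas, so this assumes the definitions commonly used in gene prediction evaluation: Sn = TP / (TP + FN), Sp = TP / (TP + FP), and CC as the correlation coefficient computed from the four base counts. The function name `base_level_metrics` and the toy strings are hypothetical.

```python
import math


def base_level_metrics(predicted: str, annotated: str) -> tuple[float, float, float]:
    """Return (Sn, Sp, CC) for two equal-length strings of '0'/'1'.

    '1' marks a coding base, '0' a non-coding base. Assumes that each of
    the four marginal totals is non-zero, so no division by zero occurs.
    """
    if len(predicted) != len(annotated):
        raise ValueError("sequences must have the same length")
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, annotated):
        if p == "1" and a == "1":
            tp += 1  # coding base predicted as coding
        elif p == "1":
            fp += 1  # non-coding base predicted as coding
        elif a == "1":
            fn += 1  # coding base predicted as non-coding
        else:
            tn += 1  # non-coding base predicted as non-coding
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc


# Toy sequences, invented for illustration only.
annotated = "0011110000"
predicted = "0001111000"
sn, sp, cc = base_level_metrics(predicted, annotated)
print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.3f}")
```

Note that "specificity" in this sense is the fraction of predicted coding bases that are truly coding, which is consistent with the exon-level Sp = Corr / Pred seen in the tables.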
| Version | Pred | Corr | Miss | Part | Wrong | Split | Merged | Sn | Sp |
|---|---|---|---|---|---|---|---|---|---|
| EuGène50 | 178 | 101 | 5 | 62 | 15 | 1 | 2 | 60.12% | 56.74% |
| EuGène50EST | 185 | 115 | 5 | 48 | 19 | 4 | 2 | 68.45% | 62.16% |
| EuGène50Prot | 185 | 122 | 1 | 45 | 17 | 2 | 2 | 72.62% | 65.95% |
| EuGène50Full | 187 | 127 | 2 | 39 | 21 | 1 | 2 | 75.60% | 67.91% |

| Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| f | 89.0 | 89.0 | 89.1 | 89.4 | 89.7 | 89.7 | 90.1 | 90.1 | 90.5 |
| Sne | 96 | 96.2 | 96.1 | 96.5 | 96.6 | 96.6 | 96.6 | 96.7 | 96.9 |
| Spe | 96.8 | 96.7 | 96.1 | 96.6 | 96.6 | 96.6 | 97 | 96.6 | 96.7 |
| Sng | 83.3 | 84 | 84 | 84 | 84.7 | 84.7 | 85.4 | 85.4 | 86.1 |
| Spg | 81.6 | 79.1 | 80.7 | 81.8 | 81.3 | 81.3 | 82 | 82 | 81.6 |

Table 5: Results of EuGène's evaluation on Araset at
the whole gene level when similarity information is
used. Each line corresponds to a variant of EuGène that
is given either no similarity information
(EuGène50), similarity information from ESTs
and cDNAs from the dbEST and PlantGene databases
(EuGène50EST), similarity information from
SwissProt rel. 40, from which all Arabidopsis
thaliana sequences have been removed
(EuGène50Prot), or both EST and protein
similarities (EuGène50Full). The first columns give
the number of predicted genes (Pred), completely
correct genes (Corr), completely missing genes (Miss),
partially predicted genes (Part) and overpredicted
genes (Wrong). The next two columns give the number of
annotated genes that are split in the predictions
(Split) and the number of predicted genes that merge
several annotated genes into one (Merged). The last two
columns report sensitivity (Sn) and
specificity (Sp) at the gene level.
|
Table 6: Evaluation of the robustness of the
algorithm used to estimate EuGène's parameters.
Nine independent runs of the (alpha, beta) parameter
estimation were performed; they are listed sorted by
the value of the criterion. The optimized criterion is a
combination of exon-level and gene-level sensitivity and
specificity such that 100% represents perfect
prediction. The lines successively report the value
of the criterion obtained (f), the corresponding
exon-level sensitivity (Sne) and specificity
(Spe), and the corresponding gene-level
sensitivity (Sng) and specificity (Spg).
|
|
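One simple way to read Table 6 is to look at how much each measure varies across the nine runs. The sketch below summarises that spread with the minimum, maximum, mean and population standard deviation. It only restates the values of Table 6; the exact formula combining the four measures into f is not given on this page and is not reproduced here. The names `RUNS` and `spread` are hypothetical.

```python
from statistics import mean, pstdev

# Values copied from Table 6, one list per line, one entry per run.
RUNS = {
    "f":   [89.0, 89.0, 89.1, 89.4, 89.7, 89.7, 90.1, 90.1, 90.5],
    "Sne": [96.0, 96.2, 96.1, 96.5, 96.6, 96.6, 96.6, 96.7, 96.9],
    "Spe": [96.8, 96.7, 96.1, 96.6, 96.6, 96.6, 97.0, 96.6, 96.7],
    "Sng": [83.3, 84.0, 84.0, 84.0, 84.7, 84.7, 85.4, 85.4, 86.1],
    "Spg": [81.6, 79.1, 80.7, 81.8, 81.3, 81.3, 82.0, 82.0, 81.6],
}


def spread(values: list[float]) -> dict:
    """Summarise the variation of one measure across runs."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "pstdev": pstdev(values),  # population standard deviation
    }


summary = {name: spread(values) for name, values in RUNS.items()}
for name, s in summary.items():
    print(
        f"{name}: min={s['min']:.1f} max={s['max']:.1f} "
        f"mean={s['mean']:.2f} sd={s['pstdev']:.2f}"
    )
```

A small spread of f relative to its mean would suggest that the estimation procedure converges to similar solutions from run to run.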