AM I a Gene ?
AMIGene (Annotation of MIcrobial Genes) is an application for automatically identifying the most likely CoDing Sequences (CDSs) in a large contig or a complete bacterial genome sequence. The first step in AMIGene is dedicated to the construction of Markov models that fit the input genomic data (i.e the gene model), followed by the combination of well-known gene-finding methods and an heuristic approach for the selection of the most likely CDSs.
Manual annotation of the CDSs from several analyses. This graphical map has been obtained with the Imagene system (Medigue et al. 1999 Bioinformatics) after analysis of a B. subtilis chromosomal region. Results obtained with the CoDing Sequences search method are shown as red boxes (CDSs) and those obtained with the GeneMark coding predictions are displayed as blue curves. The results are superimposed in each of the six reading frames. The selected CDSs are represented in green and correspond to the longest CDSs having a high coding probability, which also gives a good coverage of the sequence (minimum overlaps between adjacent CDSs and no inclusion between CDSs). The CDS 5' (circled in green) shows an important overlap with the CDS 5, revealing a possible frameshift error in the DNA sequence. In the case of the CDS 8', the coding probability is better than that of the CDS 8' selected on the opposite strand. Here, searching for similarities in protein databanks will help the user to choose among them.
As shown in this Figure, several short CDSs are included into other longer CDSs and are undoubtedly false positives. In other cases the choice between two overlapping CDSs can be more tricky (for examples, the CDSs number 5 and 8). In AMIGene, the selection of the most likely CDSs consists in the elimination of the false positives according to overlapping criteria between adjacent CDSs, these overlaps being either total (they are called inclusion) or partial.
Computer methods of accurate gene finding in DNA sequences require models of protein coding and non-coding regions derived either from annotated DNA sequences or from a large enough set of anonymous DNA sequences.
These two choices depend on the DNA sequence you are going to analyze :
1) if the corresponding organism is very closed to one of the listed species (i.e, strains), you can select the corresponding species in order to use the gene models we have built up, as explained below.
2) if your genome is really new, it is recommended to build an appropriate gene model that will better bit with the input DNA sequence.
"Build a Gene Model"
The minium length of the input sequence should be at least 10 kb. The prokov-learn method (Alain Viari, personnal communication) will run with the highest Markov order model possible (2, 3, or 4).
The minimum size of the Open Reading Frame (ORF) being extracted from the input sequence depends on its GC content. For our three reference genomes, the minimum ORF size were the following :
- B. subtilis (GC content of 43%) : 500 bp
- E. coli (GC content of 51%) : 700 bp
- M. tuberculosis (GC content of 66%) : 900 bp
We then first compute the GC% of the input sequence, and as a first approximation, we used a linear regression (obtained with the values of the three genome models) to determine the minimum size of the ORFs being extracted.
For a Markov model of a noncoding sequence, we used the zero order model with four probability parameters estimated by genome specific frequencies of mononucleotides.
=> the corresponding gene model is called pre-matrix and is used by AMIGene to predict CDSs.
"Use specific gene models for an organism"
The AGC (Atelier de Génomique Comparative) group systematically investigates codon usage differences in genomes, to identify major factors influencing synonymous codons frequencies within a set of CDSs. The following procedure is performed for complete genomes (public or non-public) we are interested in :
The multivariate statistical technique of Factorial Correspondence Analysis (FCA) is used to identify major trends acting on codon usage within the annotated or predicted CDSs.
The K-Means algorithm is also used on absolute and/or relative codon frequencies of the coding sequences, k being equal to 2,3 or 4 depending on the contribution of each major factor to codon usage bias.
The gene classes defined the training sets for protein-coding regions, the rest of the sequence being included into the noncoding training set. Corresponding gene models (2,3 or 4 in total) are generated by the prokov-learn program and subsequently used in the core of the AMIGene method.
To date, this analysis has been done for the following bacterial genomes :
- Escherichia coli K12
- Bacillus subtilis 168
- Mycobacterium tuberculosis H37Rv
- Mycobacterium tuberculosis CDC1551
- Helicobacter pylori 26695
- Yersinia pseudotuberculosis IP32953 (unpublished)
- Bacillus halodurans
- Escherichia coli O157H7-EDL933
- Salmonella typhimurium LT2
- Photorhabdus luminescens (unpublished)
- Acinetobacter sp. str. ADP1 (unpublished)
For these bacterial genomes, genes have been clustered in at least 3 different classes. By selecting one of these organisms the AMIGene program is going to used the corresponding gene models for the automatic gene predictions on your input sequence.
The core of the AMIGene method uses several parameter values which are explained in detail below, and summarize in the following Table :
AMIGene METHOD : ATTRIBUTION OF CODING PROBABILITIES
Given the sequence of a complete genome, we first look for the maximal CDS (i.e., maximal segments in-frame between start and stop codons) and the positions in the six reading frames. The putative CDSs longer than 60 bp are kept (M1, prokov-orf method). Then, the Prokov-curve method (M2) uses in parallel the specific models built for the gene classes of the genome (see Gene Model section), in order to compute the coding probabilities of each identified CDS. AMIGene indicates the model (matrix 1, 2, 3 or 4) that fits best in terms of coding probability.
In order to speed up the following step, a first selection of CDSs is made at this level, depending only on their coding probability. Two probability thresholds, prob-PC and sure-PC, are defined (above Table). If the value of the coding probability is
1) below the prob-PC threshold, the corresponding CDS is discarded
2) between the prob-PC and sure-PC thresholds, the corresponding CDS is kept in a list containing the probable CDSs (lst-prob-CDSs), ONLY IF its length is longer than the Prob-LMin threshold.
3) above the sure-PC threshold, the corresponding CDS is kept in a list containing the sure CDSs (lst-sure-CDSs).
AMIGene METHOD : CDSs SELECTION HEURISTIC
We defined an heuristic to select CDSs in order to take into account tricky choices between two overlapping CDSs, and/or presence of a frameshift error in the DNA sequence. Our selection of the most likely CDSs consists in the elimination of the false positives according to overlaping criteria between adjacent CDSs, these overlaps being either total (they are called inclusion) or partial. In order to avoid overlaps due to the choice of the leftmost start codon, an alternative start can be selected by AMIGene according to the coding prediction curve climbing. The AMIGene method is divided into three main steps :
A. step 1
From the list of sure CDSs (lst-sure-CDSs), we first examine the inclusion cases only. Two inclusion thresholds are defined according to the transcription orientation of the included CDSs : the sure-ss-I threshold for two CDSs transcribed in the same strand, and the sure-os-I threshold for two CDSs transcribed in the opposite strand. An inclusion percentage, given by the ratio of the size of the smallest CDS to the size of the longest CDS, is then computed. If this value is below the threshold corresponding to the transcription orientation of the two CDSs, the smallest CDS is eliminated. In the figure (A) most of the included CDSs are eliminated, and only one included CDS is kept, this inclusion being characteristic of the presence of a compensating frameshift in the sequence.
B. step 2
In a second step we add, to this new list of sure CDSs, the list of probable CDSs (lst-prob-CDSs). Overlaping and inclusion cases between a probable CDS and a sure CDS are then examined, the sure CDSs being always conserved. If a probable CDS is included in a sure CDS, the probable CDS is always eliminated. If a sure CDS overlaps a probable CDS (partially or totally), the probable CDS is always conserved (these situations are represented in the figure B and may point out cases of putative frameshifts in the DNA sequence). In the case of overlaping CDSs transcribed in the opposite strand, an overlap percentage, given by the ratio of the size of this overlap to the size of the probable CDS, is computed. If this value is above the overlaping threshold between a sure and a probable CDS, sure-prob-O, the probable CDS is eliminated.
C. step 3
In the last step of AMIGene, we look for overlaping cases only between CDSs of the new list of probable CDSs. As opposed to the previous steps, it is important here to allow for all overlaps between a CDS with others probable adjacent CDSs. Therefore, for each probable CDS of the list, a total overlap score is first computed (it is the sum of the overlap percentages of the probable CDS being examined with adjacent probable CDSs). Then, the list of probable CDSs is re-examined and a CDS having a score value above a total overlaping threshold, prob-glob-IO, is eliminated. An example of 3 probable CDSs having an important overlap is given in the Figure C. If the 3 CDSs have a score value above the prob-glob-IO threshold, the CDS having the highest coding probability will be kept.
SETTING THE THRESHOLD PARAMETER VALUES :
The following table has been obtained after an optimization process of the threshold values on three reference genomes annotations. Cases of false negatives are penalized twice (k=2) and ten times (k=10).
ECOLI = Escherichia coli K12, BACSU = Bacillus subtilis 168 , MYCTU = Mycobacterium tuberculosis H37Rv
"Use specific parameters"
Depending on your selection (Low, Medium or High GC percent), AMIGene is going to used the set of threshold values we have obtained for the corresponding reference genome (k=10), that is to say :
- B. subtilis (BACSU) for a LOW GC content
- E. coli (ECOLI) for a MEDIUM GC content
- M. tuberculosis (MYCTU) for a HIGH GC content
"Choose your own parameters"
In the above Table we propose some thresholds taking as an example the analysis an anonymous DNA sequence. The required parameter values can be selected from a predefined list, and several ckecks are done before the submission of your query to AMIGene, in order to avoid some inconsistencies (such as a Prob-Pc > Sure_Pc !).
This part of the input AMIGene home page allows the user to paste a sequence in fasta format or to upload a file containing the sequence. He can use a complete genome sequence or a part of a genome (sequence length must be greater than 10 kb if he wants to build a new gene model).
The label for the sequence is required to name the predicted CDSs. If this label is for example PL (for Photorhabdus luminescens) the CDSs listed in the output page will be named PL0001, PL0002, .., etc.
The four possible start codons can be selected and the set of stop codons can be limited to TAA and TAG.
The home page of the AMIGene results includes the list of predicted CDSs in text format, and a graph reporting the protein coding potentials (both the position of the CDSs and the coding prediction curves in the 6 reading frames of the input sequence). The graph is fully dynamic and allows the user to navigate along the genome (or contig); the corresponding list of predicted CDSs is updated accordingly.
->The URL which is sent to the user's e-mail address is available for one week and allows him to consult AMIGene predictions via this graphical interface.
In addition, AMIGene creates several output files that can be subsequently downloaded :
1) one file containing the list of predicted CDSs
2) one fasta file containing the nucleic sequences of the predicted CDSs
3) one faste file containing the corresponding protein sequences : ONLY if the user selects the "Translate CDSs" item in the "Options" section.
4) one file containing the position of putative frameshift errors which have been found by the ProFED method described above : ONLY if the user selects the "Add FrameShift Errors Detection (ProFED)" item in the "Options" section.
ProFED : Procaryotic Frameshift Errors Detection
The ProFED method allows for frameshifts prediction in raw DNA sequences or complete bacterial genome without looking for sequence similarity in databanks. It only uses frame-dependent properties of the protein coding regions, namely the stop and the start codon locations combined with the predicted coding probabilities in the six reading frames. Our method is thus only based on the intrinsic properties of the coding sequences. It combines the results of two analyses : the search for translational initiation/termination sites and the prediction of coding regions.
The input genome sequence is first split into smaller fragments (of length L=20 kb) and the following steps are iterated on each DNA fragment :
(i) A CDS searching method is executed (Prokov-orf; A. Viari, personnal communication), and the positions of the putative CDSs in the six reading frames are kept (M1).
(ii) this list is then parsed to create, for each reading frame k, a boolean (0/1) vector Bk of length L, which contains 1 if the position is in a CDS and 0 if the position is not in a CDS.
(iii) the Prokov-curve method (A. Viari, personnal communication) is used to produce six numeric vectors Pk, corresponding to the coding probabilities along the DNA fragment for each of the six frames (M2).
(iv) the previous results are then merged in the following way (M1+M2):
(iv-a) for each numeric vector Pk, a sliding window of width W is moved along the sequence to calculate the mean of the GeneMark coding probability (mean-gm) at each position i in the sequence.
(iv-b) if mean-gm exceeds a preset threshold called Tgm, the value Bk(i) of the boolean vector Bk, at position i, is considered.
(iv-b-1) If Bk(i) is 0 (case A), this indicates a desagreement between the coding prediction and the location maximal CDS and therefore a putative frameshift introducing an incorrect in-frame stop codon.
(iv-b-2) if Bk(i) is 1, the program looks for an another frame k' containing, at the same position i, a boolean vector B'k(i) equal to 1 and a numeric vector P'k with a mean-gm value greater or equal to the threshold Tgm. This second case (case B) indicates partially overlapping CDSs that should putatively be merged together.
(iv-c) if one of the conditions (iv-b-1) or (iv-b-2) holds, then position i is suspected to contain a frameshift error and is reported in the output file. Consecutive erroneous positions are further merged into a single putative sequencing error.