POLYAR - Human Polyadenylation Site Prediction
The POLYAR program predicts potential PAS-strong, PAS-weak and PAS-less cleavage/poly(A) sites in human sequences using a linear discriminant function (LDF) that combines characteristics describing functional motifs (polyadenylation signal [PAS]; cleavage site [CS] motif; GU/U motif) with the oligonucleotide composition upstream and/or downstream of these sites.
PAS-strong poly(A) sites: with the AAUAAA or AUUAAA forms of PAS;
PAS-weak poly(A) sites: with the AGUAAA, UAUAAA, CAUAAA, GAUAAA, AAUAUA, AAUACA, AAUAGA, ACUAAA, AAGAAA and AAUGAA forms of PAS;
PAS-less poly(A) sites: which lack any of the PAS-strong and PAS-weak forms.
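The three classes above can be illustrated with a minimal Python sketch. It is not POLYAR's implementation: it assumes an exact hexamer match in the 40 nt upstream of a candidate cleavage position, whereas POLYAR uses a PAS-motif score with a threshold. The names `pas_class`, `cs_index` and `window` are hypothetical, introduced only for this example.

```python
# Hexamer sets copied from the class definitions above (RNA alphabet).
STRONG_PAS = {"AAUAAA", "AUUAAA"}
WEAK_PAS = {
    "AGUAAA", "UAUAAA", "CAUAAA", "GAUAAA", "AAUAUA",
    "AAUACA", "AAUAGA", "ACUAAA", "AAGAAA", "AAUGAA",
}


def pas_class(seq: str, cs_index: int, window: int = 40) -> str:
    """Label a candidate cleavage position by the PAS form found upstream.

    seq      -- DNA or RNA sequence (T is treated as U)
    cs_index -- 0-based index of the candidate position (+1 in the text)
    window   -- number of upstream nucleotides to scan (assumed: 40)

    Assumption: exact string matching stands in for the motif score.
    """
    rna = seq.upper().replace("T", "U")
    upstream = rna[max(0, cs_index - window):cs_index]
    # Collect every hexamer that lies fully inside the upstream window.
    hexamers = {upstream[i:i + 6] for i in range(len(upstream) - 5)}
    if hexamers & STRONG_PAS:
        return "PAS-strong"
    if hexamers & WEAK_PAS:
        return "PAS-weak"
    return "PAS-less"
```

For example, `pas_class("GGGGAATAAA" + "C" * 20, 30)` finds AAUAAA inside the window and labels the position PAS-strong, while a position with no listed hexamer upstream is labeled PAS-less.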
Initially, POLYAR classifies each position on a given sequence as a potential CS or non-CS based on three LDF classifiers for PAS-strong, PAS-weak and PAS-less poly(A) sites. In the cases of PAS-strong and PAS-weak poly(A) sites, only positions with a PAS-motif score higher than a predefined threshold in the region (-40,-1) relative to the current position (+1) are selected for further consideration. The LDF for each position is first estimated with the classifier for PAS-strong sites; if that is not applicable, with the classifier for PAS-weak sites; otherwise, with the classifier for PAS-less sites. The LDF values of the candidate positions are then compared with the thresholds for PAS-strong, PAS-weak and PAS-less poly(A) sites defined on the training dataset. For the final selection of potential CSs, the following criteria are applied:
(i) for any pair of predicted PAS-strong and PAS-weak CSs, or PAS-strong and PAS-less CSs, within 100 bp of each other, only the PAS-strong site is retained;
(ii) for any pair of predicted PAS-weak and PAS-less CSs, within 100 bp of each other, only the PAS-weak site is retained;
(iii) for any pair of predicted CSs of the same class, within 100 bp of each other, only the one with the highest score is retained.
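The selection criteria above could be sketched as a greedy filter in Python. This is only one possible reading, not the procedure POLYAR is known to use: candidates are visited in order of class priority (PAS-strong, then PAS-weak, then PAS-less) and, within a class, by decreasing score, and a candidate is kept only if no already retained site lies within 100 bp. The `Site` type, the `select_sites` function and the choice to count a distance of exactly 100 bp as "within" are assumptions made for this example.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    pos: int          # position of the predicted cleavage site
    pas_class: str    # "PAS-strong", "PAS-weak" or "PAS-less"
    score: float      # LDF score (only compared within one class)


# Lower rank wins when two sites of different classes are close together.
RANK = {"PAS-strong": 0, "PAS-weak": 1, "PAS-less": 2}


def select_sites(candidates: List[Site], max_distance: int = 100) -> List[Site]:
    """Greedy filter: a hypothetical reading of criteria (i)-(iii)."""
    kept: List[Site] = []
    # Best class first; within a class, highest score first.
    ordered = sorted(candidates, key=lambda s: (RANK[s.pas_class], -s.score))
    for site in ordered:
        # Keep the site only if every retained site is farther than 100 bp.
        if all(abs(site.pos - k.pos) > max_distance for k in kept):
            kept.append(site)
    return sorted(kept, key=lambda s: s.pos)
```

With this ordering, a PAS-weak candidate 50 bp from a PAS-strong one is dropped regardless of its score, which matches criterion (i); scores are compared only between sites of the same class, as in criterion (iii).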
In comparison with polya_svm, the most accurate program published (Cheng et al., Bioinformatics, 2006, 22, 2320), POLYAR shows significantly higher prediction sensitivity (80.8% versus 65.7%) and specificity (66.4% versus 51.7%) in the search for PAS-strong CS/poly(A) sites in human sequences. At the same time, in the search for PAS-weak and PAS-less CSs, both programs show very low prediction accuracy. POLYAR shows almost the same accuracy in the analysis of mouse and rat gene end sequences.