• Summary

  • Inputs Required

  • Parameters to be input by the user

  • Output files

  • Algorithms

  • Sample datasets


  • Summary

    Local Matching Score (LMS) based distance calculation is an alignment-free approach to cluster any given set of amino acid sequences. Through LMS, we aim to achieve a classification scheme that is independent of sequence alignment and domain definitions, which is especially useful for multi-domain proteins. The algorithm was first developed by Martin et al. in 2010. In the present version of 'CLAP', the pairwise LMS distances between sequences are clustered using Ward's method in the R statistical package. The main output files are a distance matrix of the LMS distances and a dendrogram that helps to visually inspect the phylogenetic relationships between different sequences. A scaled version of the dendrogram (height between 0 and 1) is also available for download.

    To study the classification of the input dataset, one can also parse the dendrogram at a desired distance cut-off. The clusters generated are shown in the tree as separate branch colors. The domain similarity index of each cluster can be measured if the user supplies a file containing the domain architecture details of each sequence. Three similarity indices are used, based on domain composition, domain order and domain duplication similarity respectively. For each cluster, we also provide a representative domain architecture, which is simply the architecture adopted by the majority of the sequences in that cluster.

  • Inputs Required        

    a. Set of amino acid sequences in FASTA format. See sample input file for reference.
    b. (Optional) Domain architecture details for the input dataset in tab-delimited format. See sample domain architecture file for reference.
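
    As a rough illustration of checking the two inputs before submission, the sketch below reads FASTA text and a tab-delimited domain table. The function names are hypothetical, and the domain table layout (sequence ID in the first column, one domain per following column, N- to C-terminal) is an assumption; the sample domain architecture file remains the authoritative reference for the real layout.

```python
def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}.

    The first word after '>' is used as the sequence ID.
    """
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("empty FASTA header")
            name = parts[0]
            if name in records:
                raise ValueError(f"duplicate sequence ID: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in records.items()}


def read_domain_table(text):
    """Parse a tab-delimited table into {sequence_id: [domain, ...]}.

    ASSUMED layout (not confirmed by the documentation):
    ID <TAB> domain_1 <TAB> domain_2 ...
    """
    table = {}
    for raw in text.splitlines():
        if not raw.strip():
            continue
        fields = [f.strip() for f in raw.split("\t")]
        table[fields[0]] = [f for f in fields[1:] if f]
    return table
```

    Comparing the keys of the two dictionaries is a quick way to confirm that every sequence has a domain architecture entry before the optional domain scoring is requested.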

  • Parameters to be input by the user        

    (Optional) Tree parsing cut-off between 0 and 1. Supplying a cut-off is recommended, since it is required to generate the dendrogram colored according to the clusters formed at that cut-off.

  • Output files        

    a. List of given proteins - The names of the proteins whose FASTA sequences were input.
    b. LMS distance matrix - Labelled square symmetric distance matrix based on Local Matching Scores.
    c. Phylogenetic tree (can also be downloaded as a PostScript file).

    Results of parsing tree at a user-defined cut-off:
    d. Colored dendrogram according to clusters generated.
    e. Domain distance scores as a text file. For each cluster, the mean and standard deviation of the domain architecture similarity scores are given (Jaccard Index (JC), Goodman-Kruskal Gamma index (GK) and Domain Duplication Similarity index (DS)). A representative domain architecture is also assigned to each cluster.
    f. Results.tar - This contains all the input as well as output files.
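
    The per-cluster summary in item e can be pictured with the following minimal sketch. The function names are hypothetical. It takes the "representative" architecture to be the most frequent one in the cluster, and it uses the sample standard deviation; whether CLAP reports the sample or the population standard deviation is not stated here, so that choice is an assumption.

```python
from collections import Counter
from statistics import mean, stdev


def representative_architecture(architectures):
    """Return the most frequent domain architecture in a cluster.

    Each architecture is a sequence of domain names; ties are resolved
    in favour of the architecture seen first.
    """
    counts = Counter(tuple(arch) for arch in architectures)
    return counts.most_common(1)[0][0]


def summarize_scores(scores):
    """Return (mean, standard deviation) of a list of similarity scores.

    ASSUMPTION: sample standard deviation; 0.0 for a single score.
    """
    sd = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), sd


cluster = [["ig", "ig"], ["ig", "ig"], ["ig"]]
print(representative_architecture(cluster))
print(summarize_scores([0.5, 1.0]))
```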

  • Algorithms

    a. The Local Matching Score (LMS) expresses the distance between two sequences in terms of a similarity measure between the sequence patterns s and s'.

    [Equation: definition of LMS(s, s')]
    Here {s,s'} denotes the set of amino acids in s and s' that lie in sub-patterns 5 residues long, and M[i,i] is the BLOSUM62 substitution score of amino acid i with itself. The computed scores are normalized as follows to give a distance measure ranging from 0 to 1.

    [Equation: normalization of the LMS to a distance between 0 and 1]

    The pairwise distances computed with LMS-dist for full-length sequences are then used for agglomerative clustering of the sequences using Ward's minimum variance method as implemented in the R statistical package. The hierarchical clusters obtained are represented as a dendrogram. The dendrogram is then parsed at a given cut-off value between 0 and 1 to obtain distinct clusters. We believe that these clusters are representative of the subfamily organization in a dataset.
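
    The tree-parsing step can be sketched as follows. This is not the CLAP implementation (which relies on R); it is a small stand-alone illustration under three assumptions: the merge list is indexed the way SciPy's linkage matrix is (leaves are 0..n-1, and the k-th merge creates cluster n+k), merge heights are non-decreasing (as they are for Ward's method), and the tree is scaled to [0, 1] by dividing each height by the largest one. The function name cut_tree is hypothetical.

```python
def cut_tree(n_leaves, merges, cutoff):
    """Assign a cluster label to each leaf by cutting a scaled dendrogram.

    merges: list of (cluster_a, cluster_b, height) in merge order.
    cutoff: value in [0, 1] applied to heights divided by the maximum height.
    Merges at a scaled height <= cutoff are kept; higher ones are cut.
    """
    max_height = max(h for _, _, h in merges) or 1.0
    parent = list(range(n_leaves))  # union-find structure over the leaves

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # representative[c] is one leaf that belongs to cluster index c
    representative = list(range(n_leaves))
    for a, b, height in merges:
        root_a = find(representative[a])
        root_b = find(representative[b])
        representative.append(root_a)
        if height / max_height <= cutoff:
            parent[root_b] = root_a

    labels, ids = [], {}
    for leaf in range(n_leaves):
        labels.append(ids.setdefault(find(leaf), len(ids) + 1))
    return labels


# Four sequences: (0,1) merge at 1.0, (2,3) at 2.0, everything at 10.0
merges = [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 10.0)]
print(cut_tree(4, merges, 0.5))
```

    A lower cut-off splits the tree into more, tighter clusters, while a cut-off of 1 returns the whole dataset as a single cluster.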

    [Figure taken from Martin J et al., 2010.]

    b. Domain architecture similarity scores
    Three scoring metrics, namely the Jaccard index, the Goodman-Kruskal index and the duplication similarity index, are used to capture different aspects of domain architectural similarity among proteins. These metrics were previously used to quantify architectural similarities in multi-domain proteins (Lin K et al., 2006).

    The Jaccard index (J_PQ) is the ratio of the number of shared domains to the number of distinct domains in the two proteins being compared:

        J_PQ = N'_PQ / (N_P + N_Q - N'_PQ)

    where N'_PQ is the number of domains shared between proteins P and Q, and N_P and N_Q are the total numbers of domains belonging to proteins P and Q respectively.

    The Goodman-Kruskal γ index (γ_PQ) measures the conservation of N- to C-terminal domain order between proteins P and Q:

        γ_PQ = (N^S_PQ - N^R_PQ) / (N^S_PQ + N^R_PQ)

    where N^S_PQ and N^R_PQ are the numbers of pairs of domains shared between proteins P and Q that occur in the same order and in reverse order respectively. This score was rescaled to values ranging from 0 to 1.

    The duplication similarity (D_PQ) between proteins P and Q is defined as described previously by Lin and co-workers.

    All three scores, i.e. J_PQ, γ_PQ and D_PQ, were computed for all pairs of proteins within each cluster formed at a given tree parsing cut-off, and then averaged.
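
    A minimal sketch of the first two scores and of the within-cluster averaging is given below. Several details are assumptions rather than documented behaviour: domains are compared as sets of distinct names, the position of a repeated domain is taken from its first occurrence, the γ index is rescaled to [0, 1] as (γ + 1) / 2, and pairs with fewer than two shared domains (for which γ is undefined) are left out of the average. The duplication similarity is not sketched because its formula is not given here. All function names are hypothetical.

```python
from itertools import combinations


def jaccard(p, q):
    """Shared distinct domains divided by all distinct domains of P and Q."""
    set_p, set_q = set(p), set(q)
    union = set_p | set_q
    return len(set_p & set_q) / len(union) if union else 0.0


def goodman_kruskal(p, q):
    """Rescaled Goodman-Kruskal gamma for domain order, or None if undefined.

    ASSUMPTIONS: first occurrence gives a domain's position;
    rescaling is (gamma + 1) / 2.
    """
    shared = sorted(set(p) & set(q))
    same = reverse = 0
    for a, b in combinations(shared, 2):
        if (p.index(a) < p.index(b)) == (q.index(a) < q.index(b)):
            same += 1
        else:
            reverse += 1
    if same + reverse == 0:
        return None
    gamma = (same - reverse) / (same + reverse)
    return (gamma + 1) / 2


def cluster_mean(architectures, score):
    """Average a pairwise score over all pairs of proteins in one cluster."""
    values = [score(p, q) for p, q in combinations(architectures, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


p = ["A", "B", "C"]
q = ["C", "B", "A"]
print(jaccard(p, q), goodman_kruskal(p, q))
```

    In this example the two proteins share every domain but in opposite order, so the composition score is at its maximum while the order score is at its minimum.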

  • Sample datasets

    a. Test data sets
    Set of 50 protein sequences (FASTA format)
    Domain architecture file (tab-delimited)

    b. Pilot test data sets
    Immunoglobulin sequences (FASTA format)
    Immunoglobulin domain architecture (tab-delimited)
    Pkinase sequences (FASTA format)
    Pkinase domain architecture (tab-delimited)