tranalign
tranalign produces aligned cDNA sequences from aligned protein sequences. The result is useful as input to phylogeny programs, e.g. the PHYLIP programs dnadist, dnapars and dnaml. In general, it is better to use protein sequences for multiple alignments, but to use DNA sequences for phylogeny. Transferring a protein alignment to the DNA by hand is time-consuming when there are gaps in the aligned protein sequences.
tranalign takes a set of (unaligned) nucleic sequences and a set of aligned protein sequences. It reads the first nucleic sequence and the first protein sequence, translates the nucleic sequence in each of the three forward frames, compares the protein sequence to the translated nucleic sequence to find the protein coding region, and then writes out the nucleic sequence that encoded the protein.
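The search described above can be sketched in Python. This is a minimal illustration rather than tranalign's actual code: it assumes the standard genetic code and exact matching, and the names `CODON_TABLE`, `translate` and `find_coding_region` are hypothetical. The sample data is the start of HSFAU1 and of HSFAU1_3 from the example files.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order (assumption:
# the default translation table is the standard code).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc):
    """Translate a nucleic sequence in frame 1; unknown codons become 'X'."""
    nuc = nuc.lower()
    return "".join(
        CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(0, len(nuc) - 2, 3)
    )


def find_coding_region(nuc, protein):
    """Return the stretch of nuc that encodes protein, or None if not found.

    Gap characters in the protein are ignored during the comparison.
    Only the three forward frames are searched.
    """
    target = protein.replace("-", "").upper()
    for frame in range(3):
        pos = translate(nuc[frame:]).find(target)
        if pos >= 0:
            start = frame + 3 * pos
            return nuc[start:start + 3 * len(target)]
    return None


# The first bases of HSFAU1 and the first residues of HSFAU1_3.
region = find_coding_region("ttcctctttctcgactccatcttcgcg", "PLSRLHL")
print(region)
```

In this sketch the match is found in the third forward frame, so the first two bases of the nucleic sequence are dropped, as in the example output.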
The sequences must be in the same order in both sets of sequences. A common problem you should be aware of is that some alignment programs (including clustalw/emma) re-order the aligned sequences to group similar sequences together.
The aligned protein sequences may include '-' characters to mark gaps. Each '-' character in a protein sequence is ignored during the sequence comparison but replaced by '---' in the nucleic sequence output to form the aligned nucleic sequences.
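The gap handling can be illustrated with a short Python sketch. It assumes the coding region has already been located and is passed in ungapped; the function name `thread_gaps` is hypothetical and this is not tranalign's own implementation.

```python
def thread_gaps(aligned_protein, coding_nuc):
    """Copy the gaps of an aligned protein onto its ungapped coding sequence.

    Each residue consumes one codon of coding_nuc; each '-' emits '---'.
    """
    out = []
    pos = 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(coding_nuc[pos:pos + 3])
            pos += 3
    if pos != len(coding_nuc):
        raise ValueError("coding sequence length does not match the protein")
    return "".join(out)


# 'S-L' with codons tca (S) and ctg (L), as in the example alignment.
print(thread_gaps("S-L", "tcactg"))
```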
tranalign finds the coding regions for contiguous sequences only. It will not splice together different exons to produce a coding sequence. You should therefore use either mRNA sequences, or nucleic sequences which you have constructed to hold a contiguous coding region (for example using extractseq, or yank and union).
% tranalign ../data/tranalign.pep tranalign2.seq
Align nucleic coding regions given the aligned proteins
   Standard (Mandatory) qualifiers:
  [-asequence]         seqall     Sequence database USA
  [-bsequence]         seqset     Sequence set USA
  [-outseq]            seqoutset  Output sequence set USA

   Additional (Optional) qualifiers:
   -table              menu       Code to use

   Advanced (Unprompted) qualifiers: (none)

   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2            integer    Start of each sequence to be used
   -send2              integer    End of each sequence to be used
   -sreverse2          boolean    Reverse (if DNA)
   -sask2              boolean    Ask for begin/end/reverse
   -snucleotide2       boolean    Sequence is nucleotide
   -sprotein2          boolean    Sequence is protein
   -slower2            boolean    Make lower case
   -supper2            boolean    Make upper case
   -sformat2           string     Input sequence format
   -sdbname2           string     Database name
   -sid2               string     Entryname
   -ufo2               string     UFO features
   -fformat2           string     Features format
   -fopenfile2         string     Features file name

   "-outseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Standard (Mandatory) qualifiers | | | |
[-asequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
[-bsequence] (Parameter 2) | Sequence set USA | Readable set of sequences | Required |
[-outseq] (Parameter 3) | Output sequence set USA | Writeable sequences | <sequence>.format |
Additional (Optional) qualifiers | | | |
-table | Code to use | | 0 |
Advanced (Unprompted) qualifiers | | | |
(none) | | | |
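When tranalign is called from a script, spelling out the qualifiers listed above avoids relying on parameter order. The following Python sketch only builds and runs the command line. It assumes that `-asequence` takes the nucleic sequences and `-bsequence` the aligned proteins (following the order in the description), that the `tranalign` executable is on the PATH, and the function names and file names used are hypothetical.

```python
import shutil
import subprocess


def tranalign_command(nucleic, aligned_protein, outseq, table=0):
    """Build an argument list using only qualifiers documented above."""
    return [
        "tranalign",
        "-asequence", str(nucleic),          # assumed: unaligned nucleic set
        "-bsequence", str(aligned_protein),  # assumed: aligned protein set
        "-outseq", str(outseq),
        "-table", str(table),                # 0 is the documented default
        "-auto",                             # turn off prompts
    ]


def run_tranalign(nucleic, aligned_protein, outseq, table=0):
    """Run tranalign and raise if it is missing or exits with an error."""
    cmd = tranalign_command(nucleic, aligned_protein, outseq, table)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("tranalign executable not found on PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names, shown without executing anything.
print(" ".join(tranalign_command("nucleic.seq", "aligned.pep", "aligned.seq")))
```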
The ID names of the nucleic acid and protein sequences are NOT checked to see if they correspond to each other. They can have any names.
There must be at least as many protein sequences as nucleic acid sequences. Any extra protein sequences are ignored.
Each of the nucleic acid sequences must have a corresponding protein sequence which is derived from the coding region of that nucleic acid sequence. The two sets of sequences must be in the same order.
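Because the ID names are not checked, it can help to list how the two input sets will be paired before running the program. The sketch below is a hypothetical pre-flight check in Python, not part of EMBOSS: `read_fasta` is a deliberately simple FASTA reader and `pair_by_position` applies the counting rule stated above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def pair_by_position(nucleic, protein):
    """Pair records by position; extra protein records are dropped."""
    if len(protein) < len(nucleic):
        raise ValueError(
            f"{len(nucleic)} nucleic sequences but only "
            f"{len(protein)} protein sequences"
        )
    return [(n[0], p[0]) for n, p in zip(nucleic, protein)]


nucleic = read_fasta(">HSFAU1\nttcc\ntctt\n>HSFAU2\nttcc\n")
protein = read_fasta(">HSFAU1_3\nPL\n>HSFAU2_3\nPL\n>EXTRA\nPL\n")
for nuc_name, prot_name in pair_by_position(nucleic, protein):
    print(nuc_name, "<->", prot_name)
```

Reading the printed pairs by eye is the point of the check: a re-ordered alignment shows up as mismatched names on the same line.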
>HSFAU1
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggccccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU2
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU3
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU4
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU5
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtaggccgcatgctttttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

>HSFAU1_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAG-PLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU2_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDALWASAGWRP
>HSFAU3_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKGAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU4_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHEIASLEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLARGKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU5_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVGRMLFG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS

>HSFAU1
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggc---cccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU2
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc---
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------
>HSFAU3
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU4
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU5
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtaggccgcatgctttttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
The output is the regions of the nucleic acid sequences which code for the corresponding protein sequence, with gap characters ('-') introduced so that they have the same alignment as the corresponding protein sequences.
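One way to sanity-check an output record against its guide protein is to compare gap positions column by column. The Python sketch below is an assumption-laden illustration, not an EMBOSS tool: the name `matches_alignment` is hypothetical, and it allows the nucleic record to carry extra trailing gap codons, as HSFAU2 does in the example output where the guide protein is shorter than the others.

```python
def matches_alignment(aligned_protein, aligned_nucleic):
    """Check that gaps in the nucleic record mirror those in the protein.

    Every protein column must map to one codon: '-' to '---' and a residue
    to a non-gap codon. Trailing '---' padding in the nucleic record is
    accepted when the protein record is shorter.
    """
    if len(aligned_nucleic) % 3 != 0:
        return False
    if len(aligned_nucleic) < 3 * len(aligned_protein):
        return False
    padded = aligned_protein.ljust(len(aligned_nucleic) // 3, "-")
    return all(
        (residue == "-") == (aligned_nucleic[3 * i:3 * i + 3] == "---")
        for i, residue in enumerate(padded)
    )


print(matches_alignment("S-L", "tca---ctg"))
print(matches_alignment("SL", "tcactg---"))
print(matches_alignment("S-L", "tcactg---"))
```

This checks only the placement of gaps. It does not confirm that each codon actually translates to the residue in the same column.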
"Guide protein sequence xxx not found in nucleic sequence xxx" - the region of the nucleic sequence which codes for the protein was not found. The coding region in the nucleic acid sequence must be a single contiguous sequence. If the two sets of sequences are not in the same order, the protein sequence being compared may not be the one that corresponds to this nucleic acid sequence.
Program name | Description |
---|---|
emma | Multiple alignment program - interface to ClustalW program |
infoalign | Information on a multiple sequence alignment |
plotcon | Plot quality of conservation of a sequence alignment |
prettyplot | Displays aligned sequences, with colouring and boxing |
showalign | Displays a multiple sequence alignment |
tranalign was written in EMBOSS code using the
description of mrtrans as a guide by
Gary Williams (gwilliam © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
tranalign written (March 2002) - Gary Williams