tranalign

 

Function

Align nucleic coding regions given the aligned proteins

Description

tranalign is a re-implementation in EMBOSS of the program mrtrans by Bill Pearson.

tranalign is a simple program that allows you to produce aligned cDNA sequences from aligned protein sequences. This can be very useful for phylogeny programs, e.g. in PHYLIP - dnadist, dnapars, dnaml, etc. In general, it is better to use protein sequences for multiple alignments, but to use DNA sequences for phylogeny. This can be time consuming when there are gaps in the aligned protein sequences.

tranalign takes a set of (unaligned) nucleic sequences and a set of aligned protein sequences. It reads the first nucleic sequence and the first protein sequence, translates the nucleic sequence in each of the three forward frames, compares the protein sequence to the translated nucleic sequence to find the protein coding region, and then writes out the nucleic sequence that encoded the protein.

The sequences must be in the same order in both sets of sequences. A common problem you should be aware of is that some alignment program (including clustalw/emma) will re-order the aligned sequences to group similar sequences together.

The protein library may include '-' characters to specify alignments. Each '-' character in the protein library is ignored during the sequence comparison but replaced by '---' in the nucleic sequence output to form the aligned nucleic sequences.

tranalign finds the coding regions for contiguous sequences only. It will not splice together different exons to produce a coding sequence. You should therefore use either mRNA sequences, or nucleic sequences which you have constructed to hold a contiguous coding region (maybe using extractseq or yank and union?).

Usage

Here is a sample session with tranalign


% tranalign ../data/tranalign.pep tranalign2.seq 
Align nucleic coding regions given the aligned proteins

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-asequence]         seqall     Sequence database USA
  [-bsequence]         seqset     Sequence set USA
  [-outseq]            seqoutset  Output sequence set USA

   Additional (Optional) qualifiers:
   -table              menu       Code to use

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1             integer    Start of each sequence to be used
   -send1               integer    End of each sequence to be used
   -sreverse1           boolean    Reverse (if DNA)
   -sask1               boolean    Ask for begin/end/reverse
   -snucleotide1        boolean    Sequence is nucleotide
   -sprotein1           boolean    Sequence is protein
   -slower1             boolean    Make lower case
   -supper1             boolean    Make upper case
   -sformat1            string     Input sequence format
   -sdbname1            string     Database name
   -sid1                string     Entryname
   -ufo1                string     UFO features
   -fformat1            string     Features format
   -fopenfile1          string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2             integer    Start of each sequence to be used
   -send2               integer    End of each sequence to be used
   -sreverse2           boolean    Reverse (if DNA)
   -sask2               boolean    Ask for begin/end/reverse
   -snucleotide2        boolean    Sequence is nucleotide
   -sprotein2           boolean    Sequence is protein
   -slower2             boolean    Make lower case
   -supper2             boolean    Make upper case
   -sformat2            string     Input sequence format
   -sdbname2            string     Database name
   -sid2                string     Entryname
   -ufo2                string     UFO features
   -fformat2            string     Features format
   -fopenfile2          string     Features file name

   "-outseq" associated qualifiers
   -osformat3           string     Output seq format
   -osextension3        string     File name extension
   -osname3             string     Base file name
   -osdirectory3        string     Output directory
   -osdbname3           string     Database name to add
   -ossingle3           boolean    Separate file for each entry
   -oufo3               string     UFO features
   -offormat3           string     Features format
   -ofname3             string     Features file name
   -ofdirectory3        string     Output directory

   General qualifiers:
   -auto                boolean    Turn off prompts
   -stdout              boolean    Write standard output
   -filter              boolean    Read standard input, write standard output
   -options             boolean    Prompt for standard and additional values
   -debug               boolean    Write debug output to program.dbg
   -verbose             boolean    Report some/full command line options
   -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning             boolean    Report warnings
   -error               boolean    Report errors
   -fatal               boolean    Report fatal errors
   -die                 boolean    Report deaths


Standard (Mandatory) qualifiers Allowed values Default
[-asequence]
(Parameter 1)
Sequence database USA Readable sequence(s) Required
[-bsequence]
(Parameter 2)
Sequence set USA Readable set of sequences Required
[-outseq]
(Parameter 3)
Output sequence set USA Writeable sequences <sequence>.format
Additional (Optional) qualifiers Allowed values Default
-table Code to use
0 (Standard)
1 (Standard (with alternative initiation codons))
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate Macronuclear and Dasycladacean)
9 (Echinoderm Mitochondrial)
10 (Euplotid Nuclear)
11 (Bacterial)
12 (Alternative Yeast Nuclear)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma Macronuclear)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus)
23 (Thraustochytrium Mitochondrial)
0
Advanced (Unprompted) qualifiers Allowed values Default
(none)

Input file format

The input is a set of unaligned nucleic sequences and the set of aligned protein sequences to be used as a guide in the alignment of the output nucleic sequences.

The ID names of the nucleic acid and protein sequences are NOT checked to see if they correspond to each other. They can have any names.

There must be at least as many protein sequences as nucleic acid sequence - extra protein sequences are ignored.

Each of the nucleic acid sequences must have a corresponding protein sequence which is derived from the coding region of that nucleic acid sequence. The two sets of sequences must be in the same order.

Input files for usage example

File: tranalign.seq

>HSFAU1
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggccccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU2
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU3
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU4
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU5
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtaggccgcatgctttttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa

File: tranalign.pep

>HSFAU1_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAG-PLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU2_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDALWASAGWRP
>HSFAU3_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKGAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU4_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHEIASLEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVAGRMLARGKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS
>HSFAU5_3
PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV
LLAGAPLEDEATLGQCGVEALTTLEVGRMLFG-GKVHGSLARAGKVRGQTPKVAKQEKKK
KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS

Output file format

Output files for usage example

File: tranalign2.seq

>HSFAU1
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggc---cccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU2
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc---
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------
>HSFAU3
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU4
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct
>HSFAU5
cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc
cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg
ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc
ctgactaccctggaagtaggccgcatgctttttgga---ggtaaagttcatggttccctg
gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag
aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg
cccacctttggcaagaagaagggccccaatgccaactct

The output is the regions of the nucleic acid sequences which code for the corresponding protein sequence, with gap characters ('-') introduced so that they have the same alignment as the corresponding protein sequences.

Data files

None.

Notes

The sequences must be in the same order in both sets of sequences. A common problem you should be aware of is that some alignment program (including clustalw/emma) will re-order the aligned sequences to group similar sequences together.

References

None.

Warnings

None.

Diagnostic Error Messages

"No guide protein sequence available for nucleic sequence xxx" - the corresponding protein sequence for this nucleic sequence has not been input. You have input more nucleic acid sequences than protein sequences.

"Guide protein sequence xxx not found in nucleic sequence xxx" - the region of the nucleic sequence which codes for the protein was not found. The coding region in the nucleic acid sequence must be a single contiguous sequence. The protein sequence might not be the corresponding one for this nucleic acid sequence if they are out of order.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program nameDescription
emmaMultiple alignment program - interface to ClustalW program
infoalignInformation on a multiple sequence alignment
plotconPlot quality of conservation of a sequence alignment
prettyplotDisplays aligned sequences, with colouring and boxing
showalignDisplays a multiple sequence alignment

Author(s)

The original program mrtrans was written by Bill Pearson (wrp@virginia.edu)

tranalign was written in EMBOSS code using the description of mrtrans as a guide by Gary Williams (gwilliam © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

mrtrans written (Jan 1991, July 1987) - Bill Pearson

tranalign written (March 2002) - Gary Williams

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None