dotpath

 

Function

Non-overlapping wordmatch dotplot of two sequences

Description

A dotplot is a graphical representation of the regions of similarity between two sequences.

The two sequences are placed on the axes of a rectangular image and wherever there is a similarity between the sequences a dot is placed on the image.

Where the two sequences have substantial regions of similarity, many dots align to form diagonal lines. It is therefore possible to see at a glance where there are local regions of similarity.
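
As a minimal illustration of the idea (not of how dotpath itself draws its plots), the Python sketch below marks every pair of identical residues in a small text grid. The function name dot_matrix and the toy sequences are assumptions made for this example only.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per residue of seq_b.

    Column x of row y is '*' when seq_a[x] == seq_b[y], otherwise '.'.
    """
    return [
        "".join("*" if a == b else "." for a in seq_a)
        for b in seq_b
    ]


# Toy sequences (hypothetical): seq_b is a substring of seq_a.
for row in dot_matrix("GATTACA", "ATTAC"):
    print(row)
```

In the printed grid, each row has a '*' one column to the right of the '*' in the row above it, tracing the diagonal where ATTAC occurs in both sequences. The other '*' characters are chance single-residue matches, which is the kind of noise that word matching suppresses.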

dotpath is very similar to the program dottup which looks for places where words (tuples) of a specified length have an exact match in both sequences and draws a diagonal line over the position of these words.

A longer word size displays less random noise and runs extremely quickly, but is less sensitive.

dotpath finds all matches of size -wordsize or greater between two sequences. It then reduces the matches found to the minimal set of long matches that do not overlap. This is a way of finding the (nearly) optimal path aligning two sequences. It is not the true optimal path as produced by the algorithms used in water or needle, but for very closely related sequences it will produce the same result and will work well with very long sequences.
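
The two steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the EMBOSS implementation: the names Match, word_matches and non_overlapping are hypothetical, and the reduction shown is a simple greedy rule (keep the longest matches first and discard any match that overlaps a kept one in either sequence). The real program may resolve overlaps differently, for example by trimming matches rather than discarding them.

```python
from collections import defaultdict
from typing import NamedTuple


class Match(NamedTuple):
    start_a: int  # 0-based start in sequence A
    start_b: int  # 0-based start in sequence B
    length: int   # number of identical residues


def word_matches(seq_a: str, seq_b: str, wordsize: int) -> list[Match]:
    """Find all maximal exact matches of length >= wordsize."""
    # Index every word of seq_b by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - wordsize + 1):
        index[seq_b[j:j + wordsize]].append(j)

    matches = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], ()):
            # Skip words that continue a run which started further left.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend the match to the right along the diagonal.
            length = wordsize
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            matches.append(Match(i, j, length))
    return matches


def _disjoint(m: Match, k: Match) -> bool:
    """True when m and k share no position in either sequence."""
    a_clear = (m.start_a + m.length <= k.start_a
               or k.start_a + k.length <= m.start_a)
    b_clear = (m.start_b + m.length <= k.start_b
               or k.start_b + k.length <= m.start_b)
    return a_clear and b_clear


def non_overlapping(matches: list[Match]) -> list[Match]:
    """Greedy reduction (an assumption): longest matches win."""
    kept = []
    ordered = sorted(matches, key=lambda m: (-m.length, m.start_a, m.start_b))
    for m in ordered:
        if all(_disjoint(m, k) for k in kept):
            kept.append(m)
    return sorted(kept)


# A repeated motif produces off-diagonal matches that overlap the main one.
all_matches = word_matches("ACGTACGT", "ACGTACGT", 4)
print(all_matches)
print(non_overlapping(all_matches))
```

In this toy case the full-length diagonal match is kept and the two shorter matches caused by the ACGT repeat are dropped, which corresponds to the matches that -overlaps would show in red.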

To compare the path found by dotpath with the set of all matches, use the -overlaps qualifier: all matches are then shown in red, except for those in the minimal path, which are shown in black as normal.

Usage

Here is a sample session with dotpath:


% dotpath tembl:AF129756 tembl:AP000504 -word 20 -graph cps -overlaps 
Non-overlapping wordmatch dotplot of two sequences

Created dotpath.ps
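
For use from a script, the same session can be assembled in Python. The sketch below is only an illustration: the helper names dotpath_command and run_dotpath are hypothetical, and it assumes that dotpath is on the PATH and that the tembl example database is configured. It uses only qualifiers listed in this document (-wordsize, -graph, -overlaps and -auto).

```python
import subprocess


def dotpath_command(aseq: str, bseq: str, wordsize: int = 4,
                    graph: str = "cps", overlaps: bool = False) -> list[str]:
    """Build the argument list for a non-interactive dotpath run."""
    cmd = ["dotpath", aseq, bseq,
           "-wordsize", str(wordsize),
           "-graph", graph]
    if overlaps:
        cmd.append("-overlaps")
    cmd.append("-auto")  # turn off prompts so the script never blocks
    return cmd


def run_dotpath(cmd: list[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = dotpath_command("tembl:AF129756", "tembl:AP000504",
                      wordsize=20, overlaps=True)
print(" ".join(cmd))
```

Calling run_dotpath(cmd) on a machine with EMBOSS installed should then write the plot file, as in the session above. The command is passed as a list rather than a shell string so that sequence addresses are not reinterpreted by the shell.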


Command line arguments

   Standard (Mandatory) qualifiers:
  [-asequence]         sequence   Sequence USA
  [-bsequence]         sequence   Sequence USA
   -wordsize           integer    Word size
   -graph              graph      Graph type

   Additional (Optional) qualifiers:
   -overlaps           boolean    Displays the overlapping matches (in red) as
                                  well as the minimal set of non-overlapping
                                  matches
   -[no]boxit          boolean    Draw a box around dotplot

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1             integer    Start of the sequence to be used
   -send1               integer    End of the sequence to be used
   -sreverse1           boolean    Reverse (if DNA)
   -sask1               boolean    Ask for begin/end/reverse
   -snucleotide1        boolean    Sequence is nucleotide
   -sprotein1           boolean    Sequence is protein
   -slower1             boolean    Make lower case
   -supper1             boolean    Make upper case
   -sformat1            string     Input sequence format
   -sdbname1            string     Database name
   -sid1                string     Entryname
   -ufo1                string     UFO features
   -fformat1            string     Features format
   -fopenfile1          string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2             integer    Start of the sequence to be used
   -send2               integer    End of the sequence to be used
   -sreverse2           boolean    Reverse (if DNA)
   -sask2               boolean    Ask for begin/end/reverse
   -snucleotide2        boolean    Sequence is nucleotide
   -sprotein2           boolean    Sequence is protein
   -slower2             boolean    Make lower case
   -supper2             boolean    Make upper case
   -sformat2            string     Input sequence format
   -sdbname2            string     Database name
   -sid2                string     Entryname
   -ufo2                string     UFO features
   -fformat2            string     Features format
   -fopenfile2          string     Features file name

   "-graph" associated qualifiers
   -gprompt             boolean    Graph prompting
   -gtitle              string     Graph title
   -gsubtitle           string     Graph subtitle
   -gxtitle             string     Graph x axis title
   -gytitle             string     Graph y axis title
   -goutfile            string     Output file for non interactive displays
   -gdirectory          string     Output directory

   General qualifiers:
   -auto                boolean    Turn off prompts
   -stdout              boolean    Write standard output
   -filter              boolean    Read standard input, write standard output
   -options             boolean    Prompt for standard and additional values
   -debug               boolean    Write debug output to program.dbg
   -verbose             boolean    Report some/full command line options
   -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning             boolean    Report warnings
   -error               boolean    Report errors
   -fatal               boolean    Report fatal errors
   -die                 boolean    Report deaths


   Standard (Mandatory) qualifiers:
   [-asequence] (Parameter 1)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required
   [-bsequence] (Parameter 2)
      Description:    Sequence USA
      Allowed values: Readable sequence
      Default:        Required
   -wordsize
      Description:    Word size
      Allowed values: Integer 2 or more
      Default:        4
   -graph
      Description:    Graph type
      Allowed values: EMBOSS has a list of known devices, including
                      postscript, ps, hpgl, hp7470, hp7580, meta, colourps,
                      cps, xwindows, x11, tektronics, tekt, tek4107t, tek,
                      none, null, text, data, xterm, png, xml
      Default:        EMBOSS_GRAPHICS value, or x11

   Additional (Optional) qualifiers:
   -overlaps
      Description:    Displays the overlapping matches (in red) as well as
                      the minimal set of non-overlapping matches
      Allowed values: Boolean value Yes/No
      Default:        No
   -[no]boxit
      Description:    Draw a box around dotplot
      Allowed values: Boolean value Yes/No
      Default:        Yes

   Advanced (Unprompted) qualifiers: (none)

Input file format

Input files for usage example

'tembl:AF129756' is a sequence entry in the example nucleic acid database 'tembl'

Database entry: tembl:AF129756

ID   AF129756   standard; DNA; HUM; 184666 BP.
XX
AC   AF129756;
XX
SV   AF129756.1
XX
DT   12-MAR-1999 (Rel. 59, Created)
DT   29-OCT-1999 (Rel. 61, Last updated, Version 2)
XX
DE   Homo sapiens MSH55 gene, partial cds; and CLIC1, DDAH, G6b, G6c, G5b, G6d,
DE   G6e, G6f, BAT5, G5b, CSK2B, BAT4, G4, Apo M, BAT3, BAT2, AIF-1, 1C7, LST-1,
DE   LTB, TNF, and LTA genes, complete cds.
XX
KW   .
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-184666
RA   Rowen L., Madan A., Qin S., Shaffer T., James R., Ratcliffe A., Abbasi N.,
RA   Dickhoff R., Loretz C., Madan A., Dors M., Young J., Lasky S., Hood L.;
RT   "Sequence of the human major histocompatibility complex class III region";
RL   Unpublished.
XX
RN   [2]
RP   1-184666
RA   Rowen L.;
RT   ;
RL   Submitted (22-FEB-1999) to the EMBL/GenBank/DDBJ databases.
RL   Department of Molecular Biotechnology, Box 357730 University of Washington,
RL   Seattle, WA 98195, USA
XX
RN   [3]
RP   1-184666
RA   Rowen L.;
RT   ;
RL   Submitted (28-OCT-1999) to the EMBL/GenBank/DDBJ databases.
RL   Multimegabase Sequencing Center, University of Washington, PO Box 357730,
RL   Seattle, WA 98195, USA
XX
DR   EPD; EP11158; HS_TNFA.
DR   EPD; EP11159; HS_TNFB.
DR   SPTREMBL; O00452; O00452.
DR   SPTREMBL; O14931; O14931.
DR   SPTREMBL; O95866; O95866.
DR   SPTREMBL; O95868; O95868.
DR   SPTREMBL; O95869; O95869.
DR   SPTREMBL; O95870; O95870.


  [Part of this file has been deleted for brevity]

     aaaccagttt accaccactc ctaacactaa acttaaatct gactctaaat gtaagtccaa    181740
     tctgagccac aagcctaaag ttgaacttta tcctgcttta tgaattattc atccattcct    181800
     ccatttagtg agtatctgcg tgcctaacac atgctgggca ttgtcctaag gcaggaggga    181860
     catggaggca aagggatcag agaaggtacc agcacctgtg gagcttgtat tccagtgagg    181920
     ccagacggaa aagaaagaaa ctgaagaaga aattggtact atgagaaaat aagacaggct    181980
     gatgttgtaa gagtggcagg gagctacttt taaatacagt agtcagcaaa atcctctttg    182040
     agtgtttggg tggcactgga gctgagaccc aaatgacaaa aaatagtgac caggtaaaag    182100
     tttgggagca aagcatttca ggtaaaggga gcagctactg caaaggctgg aaggcggaac    182160
     caagctgggg gtgttgacga caaacagaag gccagtgtgg ctggagcaga gagagagact    182220
     gggaggcggg tgggagatga ggtcagagag gagggcaggg gccaggtcat gcagggccat    182280
     gcaagaaggg taaagcctct agatttcatc cagccacagg aagcctttaa aggtcgtcag    182340
     agtgtgtggt gcgtgcgtgt gtgtgtgtgt gtgtgtgtgt gttgcagggg agagaggggg    182400
     agggagagag agagagagag agagaagagg gaggtgagca gaggtgattg gatttttttt    182460
     tcttttgaca tggtgtcttg ctctgtggcc taggctggag tgcagtggca ccatcatagc    182520
     ccactgcaac ctcaaaacca tgggctcaag tcatccttcc acctcagctt cccaagtatc    182580
     taggactaca ggtgtgtgcc actgtgcctg gctaatttta aaaaatattt taaaattttt    182640
     gttgagacag ggtctatgct gctcaggctg gtctcgaact cctggtttca agtgatctgc    182700
     ccatcttggc ctcccaaagt ttttttttgt tagtttgaga ggcggtttcg ctcgttgccc    182760
     aggctggagt gcaatgactg atctcatctc actgcaacct ctgcctcctg ggttcaagcg    182820
     attctcctgc ttcagcctcc caagtagctg ggattacagg tgcatgccac cattcccggc    182880
     taattttttg tatttagtag agatggggtt tcaccatgtt agtcaggctg atctcaaact    182940
     cctgacctca ggtgatccgc ctgcctcagc ctcccaaagt tttgggatta caggtgtgag    183000
     ccaccatgct gggccagcct cccaaagttt tgggattaca ggcatgagtc accacactgg    183060
     ccctggattt tttttctttc ttttttttgg agacggagtc tcactctgtt gcccaggctg    183120
     gagtgcaatg gcgtaatctc agctcactgc aacctctgct gcccgggttc aaacgattct    183180
     cctgtcttag cctcctgagt agctgggatt ataggtgcat gccaccatgc ctggctaatt    183240
     tttgtacttt tagtagagaa agtacaccat cttggccagg ctggtctcga actcctgacc    183300
     tcaggtgatc cacttgcgtc ggcctcccaa agtgctggga ttacaggcgt gagacaccgc    183360
     acccagcctt tttttttttt tttcttttaa gacagaatcg ctctgtcacc caggctggag    183420
     tgcagtggca caatctcggc tcactgcaac ctctgcctcc caggtttaag caatccacct    183480
     atgtcagtct cccaagtagc tgggattata ggtgcatgtc accatgcctg gctaattttt    183540
     gtacttttag tatagaaagt acaccatgtt ggccaggctg gtcttgaact cctgacctca    183600
     agtgatccgc ctgcctcagc ctcccgaagt gctggaatta cagacatgtg ccactgcacc    183660
     cggcctggtt ttttttttct aagagatgga gtctcacttt tctgcccagg ttggagtgca    183720
     atggcaccat catagctcac tgcagccttc aactcttggc ctcaggcaat ccttgcacct    183780
     tagcctcgca gtgttgggat tacaggcatg agccactgag ccttgcctgg actttttttt    183840
     ttttttgaga tggcgtctcg ctctgttgcc caggttggag tgctacggca tgatcttggc    183900
     tcactgcaac ttccacctcc caggttcaag cgattctctt gcctcggccc cccgagtagc    183960
     tgggattaca ggcatgcgcc accgtgcctg gctaattttg gtatttttag tagagatagg    184020
     gtttcatcat gttgggcagg ctggtcttga actcctgacc tcgtgatcca cccacctcgg    184080
     cctcccaaag tgctgggatt ataggcatag ccaacgcgcc cagcctggac ttgtttttaa    184140
     aagatcactg tggctcctgt gtttaggctg gctggtagga gacaggtggc agtggcattg    184200
     atggtgaaga gaaaatagtg gcagccatgg agatggagag aagtagacaa gtttgggata    184260
     tattatacat tccaggggta gaaacaacag gactagatga tggattgatg ggtgggagat    184320
     gtagatactg ggagagaagc aggattctga tggatggaaa aactaaaaaa ttctattttg    184380
     ggtgtggtaa gtctaagtct attagacatg caagtagaga tgtcactggg cagatacaca    184440
     tctggatttc aggggcaagg tccaagctag agaaagaaac ctgggcatgg tcagcatgag    184500
     gatggtgttt aaagccatgg aacttatctt gtgcatccct ataagacccc tttgaggcac    184560
     ttgtttcccc tcacaatgga tgcagtgcat cttccattct gaattccaga ggcaacaacc    184620
     tcctgctcct agaagctaaa ctctccagac ttagtcttct gaattc                   184666
//

Database entry: tembl:AP000504

ID   AP000504   standard; DNA; HUM; 100000 BP.
XX
AC   AP000504; BA000025;
XX
SV   AP000504.1
XX
DT   28-SEP-1999 (Rel. 61, Created)
DT   22-AUG-2001 (Rel. 68, Last updated, Version 3)
XX
DE   Homo sapiens genomic DNA, chromosome 6p21.3, HLA Class I region, section
DE   3/20.
XX
KW   .
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-100000
RA   Hirakawa M., Yamaguchi H., Imai K., Shimada J.;
RT   ;
RL   Submitted (21-SEP-1999) to the EMBL/GenBank/DDBJ databases.
RL   Mika Hirakawa, Japan Science and Technology Corporation (JST), Advanced
RL   Databases Department; 5-3, Yonbancho, Chiyoda-ku, Tokyo 102-0081, Japan
RL   (E-mail:mika@tokyo.jst.go.jp, URL:http://www-alis.tokyo.jst.go.jp/,
RL   Tel:81-3-5214-8491, Fax:81-3-5214-8470)
XX
RN   [2]
RA   Shiina S., Tamiya G., Oka A., Inoko H.;
RT   "Homo sapiens 2,229,817bp genomic DNA of 6p21.3 HLA class I region";
RL   Unpublished.
XX
DR   SWISS-PROT; O00299; CLI1_HUMAN.
DR   SWISS-PROT; O43196; MSH5_HUMAN.
DR   SWISS-PROT; O95445; APOM_HUMAN.
DR   SWISS-PROT; O95865; DDH2_HUMAN.
DR   SWISS-PROT; O95867; NG24_HUMAN.
DR   SWISS-PROT; P13862; KC2B_HUMAN.
XX
CC   This sequence is conducted by Tokai University as a JST sequencing
CC   Team.
CC   Principal Investigator: Hidetoshi Inoko Ph.D
CC   Phone:+81-463-93-1121, Fax:+81-463-94-8884,
CC   The sequence is submitted by Human Genome Sequencing in ALIS
CC   project of JST
CC   Japan Science and Technology Corporation (JST)
CC   5-3, Yonbancyo, Chiyoda-ku, Tokyo, 102-0081 Japan
CC   For further infomation about this sequences, please visit our
CC   sequence archive Web site (http://www-alis.tokyo.jst.go.jp/HGS/top.


  [Part of this file has been deleted for brevity]

     gggtggatca tgaggtcaag agatcgagac tatcctggct aacatgatga aaccccgtct     97080
     ctactaaaaa tacaaaaaat tagctgggca tggtggcggg cacctgtagt cccagctact     97140
     cgggaggctg agtcaggaga atggtgtgaa cccaggagac ggagcttgca gtgagctgag     97200
     gtcgcaccac tgcactccag cctgggtgat agagcgagac tctgtctcaa aaaaaaaaaa     97260
     aaaaaaaaaa aaaacaaaaa ttagccgggt gtggtggcag gcaacttaat cccagctact     97320
     tgggaggcag aggcaggaga atcgtttgaa cctgggaggc ggaggttgaa gagaatagaa     97380
     gctctgctgg tccagagaag gattgggcca gggctctggg agaccaggga gaaagagggc     97440
     acatgtggtc cctgttgact gtgagggtgg gaatctgagg aaggctttgg ctcattgccc     97500
     cttgggtttg tccacagcca tccttcccct gcggagtatg tcgaggtgct ccaggagcta     97560
     cagcggctgg agagtcgcct ccagcccttc ttgcagcgct actacgaggt tctgggtgct     97620
     gctgccacca cggactacaa taacaatgtg agccctttga tggccctgcc ctttctcctc     97680
     agccccagta ctcccaaaac agaacaggct gaaatacaga taactctttc cctccctgga     97740
     aaaacattgc aacagggcca ggtgcagtgg ctcacgcctg taatcccagc actttgggag     97800
     gccaaggtgg gcggatcatc tgagatcggg agtttgagac cagcctggcc aacatggtgc     97860
     aaccccatct ctactgaaaa tataaacatt agctggatgt agtggtgcac acctgtaatc     97920
     ccagctactc aggaggctga ggcaggagaa tcgctagaac tcgggaggag ggggttgcag     97980
     tgagccgaga ttgcactact gcactctagc ctgggtgaca gagcgagact gtctcaaaaa     98040
     acaaaacaaa acaaaaaaac acacattgca acaaaacaat ttctctctaa acctgtaagt     98100
     gattttgtcc tcccttacag agaaggtgat aatctttgct gtaagcactg tcctcgtatc     98160
     gtaccccttg tgcccctgaa tgaatttaga aaatgtaaag tacaggagat cagtatatga     98220
     tgacttactg attcatagta gtgttttaat aggatgttcc ttatgtgaat aagatataat     98280
     ttatttgcaa agatttggtc tacatgtaaa cttccaagga tataactgaa agttttggag     98340
     gacatggtat tctcagtagg cattattgct tttattagtg agatggactc cagcttgata     98400
     ttttctgcct ttttgtgttt ggctggttgt gcgcagcacg agggccggga ggaggatcag     98460
     cggttgatca acttggtagg ggagagcctg cgactgctgg gcaacacctt tgttgcactg     98520
     tctgacctgc gctgcaatct ggcctgcacg cccccacgac acctgcatgt ggtccggcct     98580
     atgtctcact acaccacccc catggtgctc cagcaggcag ccattcccat acaggtgggt     98640
     tagggggagt ctggcctgag ggagagtgag gggtgttgat agagtgaccc agggtagcta     98700
     ctgggcctga aggaggttag gaaaggagga gactggaaac atggtgatga aggctggaga     98760
     tactttagag gtttatcatg aggttttctt ggttaggctc ttgtattttt ctcacatctg     98820
     cctgtccatc tgtctttttc agatcaatgt gggaaccact gtgaccatga caggaaatgg     98880
     gactcggccc cccccaactc ccaatgcaga ggcacctccc cctggtcctg ggcaggcctc     98940
     atccgtggct ccgtcttcta ccaatgtcga gtcctcagct gagggggctc ccccgccagg     99000
     tccagctccc ccgccagcca ccagccaccc gagggtcatc cggatttccc accagagtgt     99060
     ggaacccgtg gtcatgatgc acatgaacat tcaaggtgag aatagttgct ggcgagaaga     99120
     gcaggatcag catgatgagg gaggttcatg ctgaggtgtg agggaacagg gtggggaagg     99180
     gagaggcaca tgctggtggt ggtagcctgg ggaccagagc agaagcttaa gtagacagat     99240
     gtggggggtg tgggggttgg tttgtctttg gaggtgtgtt tgtgtggtga agggagtacc     99300
     tctccctgtt tagatggagg gaaaggcagg ctttctgatt gggggattat gggcctgaag     99360
     tatgcctgat ctcagaagga tatagttagg ccttggccct acctacctca gggccactgt     99420
     ctctgtctcc ctgcccagat tctggcacac agcctggtgg tgttccgagt gctcccactg     99480
     gccccctggg accccctggt catggccaaa ccctgggtaa gagtgagggc atcagggcag     99540
     gctgagctct gggtagagaa agggaagggc tgagtgggtg ggttgaaggg gtccaggttc     99600
     aaggttacat cagacccgcc ccccaggctc caccctcatc cagctgccct ccctgccccc     99660
     tgagttcatg cacgccgtcg cccaccagat cactcatcag gccatggtgg cagctgttgc     99720
     ctccgcggcc gcaggtaatg acctggaagg ggaggcttgg gaggtagggc acagtccatg     99780
     gtggcagctg gctggcaagg gcctggccct cagccctctt cggtctgtct cttctgccac     99840
     ccacaggaca gcaggtgcca ggcttcccaa cagctccaac ccgggtggtg attgcccggc     99900
     ccactcctcc acaggctcgg ccttcccatc ctggagggcc cccagtctct gggacactgg     99960
     tgagcaaggg tcggggagtt ctagtgcgta acagtctagg                          100000
//

Output file format

In normal operation, a dotplot image is displayed.

With the -data qualifier a file of the positions of the matches in the minimal non-overlapping set of matches is output.

Output files for usage example

Graphics File: dotpath.ps

[dotpath results]

Notes

None

References

None

Warnings

A small word size combined with very large sequences produces a very large number of matches, and the program can run out of memory. If this happens, try again with a larger word size.
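
To see why the word size matters so much, the Python sketch below estimates how many word matches would be expected by chance alone. It assumes uniformly random DNA (each of the four bases equally likely and independent), which real genomic sequence is not: repeats will push the true number higher. The function name expected_random_word_hits is hypothetical, and the memory used per match by dotpath is not stated here, so only counts are estimated.

```python
def expected_random_word_hits(len_a: int, len_b: int, wordsize: int,
                              alphabet_size: int = 4) -> float:
    """Expected number of chance word matches between two random sequences.

    Every pair of word start positions matches with probability
    1 / alphabet_size ** wordsize under the uniform-random assumption.
    """
    positions_a = max(len_a - wordsize + 1, 0)
    positions_b = max(len_b - wordsize + 1, 0)
    return positions_a * positions_b / alphabet_size ** wordsize


# Sequence lengths of the two entries used in the usage example.
for w in (4, 8, 12, 20):
    hits = expected_random_word_hits(184666, 100000, w)
    print(f"wordsize {w:2d}: about {hits:.3g} chance matches")
```

For the two example sequences this estimate is about 7 x 10^7 chance matches at the default word size of 4, and below 0.02 at the word size of 20 used in the sample session.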

Diagnostic Error Messages

None

Exit status

It always exits with status 0.

Known bugs

None

See also

Program name   Description
dotmatcher     Displays a thresholded dotplot of two sequences
dottup         Displays a wordmatch dotplot of two sequences
polydot        Displays all-against-all dotplots of a set of sequences

This program is closely based on dottup; the difference is that, by default, it displays only the minimal set of non-overlapping matches.

This program uses the same algorithm as diffseq for finding a minimal set of very good matches between two sequences. diffseq may be more convenient if you are looking at the differences between two nearly identical sequences.

Author(s)

Gary Williams (gwilliam © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

Written 14 Aug 2000.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None