megamerger

 

Function

Merge two large overlapping nucleic acid sequences

Description

megamerger takes two overlapping sequences and merges them into one sequence. It could thus be regarded as the opposite of what splitter does.

The sequences can be very long. The program does a match of all sequence words of size 20 (by default). It then reduces this to the minimum set of overlapping matches by sorting the matches in order of size (largest size first) and then for each such match it removes any smaller matches that overlap. The result is a set of the longest ungapped alignments between the two sequences that do not overlap with each other. If the two sequences are identical in their region of overlap then there will be one region of match and no mismatches.
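
The word-matching and reduction steps described above can be sketched as follows. This is only an illustration of the description in this document, not the EMBOSS source: the function names word_matches and reduce_matches, the 0-based coordinates and the tuple layout are all assumptions made for the example.

```python
from collections import defaultdict


def word_matches(a, b, wordsize=20):
    """Return maximal ungapped exact matches as (length, start_a, start_b).

    Coordinates are 0-based. Hypothetical helper, not part of EMBOSS.
    """
    # Index every word of sequence a by its start position.
    index = defaultdict(list)
    for i in range(len(a) - wordsize + 1):
        index[a[i:i + wordsize]].append(i)

    matches = set()
    for j in range(len(b) - wordsize + 1):
        for i in index.get(b[j:j + wordsize], ()):
            # Skip hits that only continue a match starting further left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the word hit to the right while the bases agree.
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.add((length, i, j))
    return matches


def reduce_matches(matches):
    """Keep the largest matches first; drop any match overlapping a kept one."""
    kept = []
    for length, i, j in sorted(matches, reverse=True):
        overlaps = any(
            (i < ki + kl and ki < i + length)
            or (j < kj + kl and kj < j + length)
            for kl, ki, kj in kept
        )
        if not overlaps:
            kept.append((length, i, j))
    # Report the surviving matches in order along sequence a.
    return sorted(kept, key=lambda m: m[1])


if __name__ == "__main__":
    seq_a = "CCCCACGTGATTACAGG"
    seq_b = "ACGTGATAACAGGTTTT"
    print(reduce_matches(word_matches(seq_a, seq_b, wordsize=4)))
```

With two sequences that differ at a single base in their overlap, the sketch yields two matching regions separated by a one-base gap, which corresponds to a mismatch region in the megamerger report.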

It should be possible to merge sequences that are megabytes long. Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman and Wunsch algorithm but uses much more memory.

The sequences should ideally be identical in their region of overlap. If there are any mismatches between the two sequences then megamerger will still attempt to create a merged sequence, but you should check that the result is what you require.

A report of the actions of megamerger is written out. Any action that requires a choice between regions of the two sequences where they have a mismatch is marked with the word WARNING!. The sequence in these regions is written out in uppercase. All other regions of the output sequence are written in lowercase.

Where there is a mismatch, the sequence chosen to supply the mismatch region in the final merged sequence is the one whose mismatch region is furthest from the start or end of that sequence.
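
The selection rule for mismatch regions can be illustrated with a minimal sketch. The names distance_to_nearest_end and choose_sequence are hypothetical, coordinates are taken to be 1-based and inclusive as in the report, and the tie-break (favouring the first sequence when the distances are equal) is an assumption; this document does not say how ties are resolved.

```python
def distance_to_nearest_end(start, end, seq_length):
    """Distance from a 1-based inclusive region to the nearest sequence end."""
    return min(start - 1, seq_length - end)


def choose_sequence(region_a, len_a, region_b, len_b, prefer_a=False):
    """Return "a" or "b": the sequence that supplies the mismatch region.

    prefer_a mimics the effect described for the -prefer qualifier.
    """
    if prefer_a:
        return "a"
    da = distance_to_nearest_end(*region_a, len_a)
    db = distance_to_nearest_end(*region_b, len_b)
    # Assumption: on a tie, the first sequence is used.
    return "a" if da >= db else "b"


if __name__ == "__main__":
    # First mismatch from the sample report:
    # AP000504 847-847 (100000 bp) against AF129756 6882-6882 (184666 bp).
    print(choose_sequence((847, 847), 100000, (6882, 6882), 184666))
```

For the first mismatch in the sample report, the region lies 846 bases from the start of AP000504 but 6881 bases from the start of AF129756, so the sketch selects the second sequence, in agreement with the report.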

Usage

Here is a sample session with megamerger:


% megamerger tembl:ap000504 tembl:af129756 
Merge two large overlapping nucleic acid sequences
Word size [20]: 
Output sequence [ap000504.merged]: 
Output file [ap000504.megamerger]: report


Command line arguments

   Standard (Mandatory) qualifiers:
  [-asequence]         sequence   Sequence USA
  [-bsequence]         sequence   Sequence USA
   -wordsize           integer    Word size
  [-outseq]            seqout     Output sequence USA
  [-outfile]           outfile    Output file name

   Additional (Optional) qualifiers:
   -prefer             boolean    When a mismatch between the two sequence is
                                  discovered, one or other of the two
                                  sequences must be used to create the merged
                                  sequence over this mismatch region. The
                                  default action is to create the merged
                                  sequence using the sequence where the
                                  mismatch is closest to that sequence's
                                  centre. If this option is used, then the
                                  first sequence (seqa) will always be used in
                                  preference to the other sequence when there
                                  is a mismatch.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1             integer    Start of the sequence to be used
   -send1               integer    End of the sequence to be used
   -sreverse1           boolean    Reverse (if DNA)
   -sask1               boolean    Ask for begin/end/reverse
   -snucleotide1        boolean    Sequence is nucleotide
   -sprotein1           boolean    Sequence is protein
   -slower1             boolean    Make lower case
   -supper1             boolean    Make upper case
   -sformat1            string     Input sequence format
   -sdbname1            string     Database name
   -sid1                string     Entryname
   -ufo1                string     UFO features
   -fformat1            string     Features format
   -fopenfile1          string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2             integer    Start of the sequence to be used
   -send2               integer    End of the sequence to be used
   -sreverse2           boolean    Reverse (if DNA)
   -sask2               boolean    Ask for begin/end/reverse
   -snucleotide2        boolean    Sequence is nucleotide
   -sprotein2           boolean    Sequence is protein
   -slower2             boolean    Make lower case
   -supper2             boolean    Make upper case
   -sformat2            string     Input sequence format
   -sdbname2            string     Database name
   -sid2                string     Entryname
   -ufo2                string     UFO features
   -fformat2            string     Features format
   -fopenfile2          string     Features file name

   "-outseq" associated qualifiers
   -osformat3           string     Output seq format
   -osextension3        string     File name extension
   -osname3             string     Base file name
   -osdirectory3        string     Output directory
   -osdbname3           string     Database name to add
   -ossingle3           boolean    Separate file for each entry
   -oufo3               string     UFO features
   -offormat3           string     Features format
   -ofname3             string     Features file name
   -ofdirectory3        string     Output directory

   "-outfile" associated qualifiers
   -odirectory4         string     Output directory

   General qualifiers:
   -auto                boolean    Turn off prompts
   -stdout              boolean    Write standard output
   -filter              boolean    Read standard input, write standard output
   -options             boolean    Prompt for standard and additional values
   -debug               boolean    Write debug output to program.dbg
   -verbose             boolean    Report some/full command line options
   -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning             boolean    Report warnings
   -error               boolean    Report errors
   -fatal               boolean    Report fatal errors
   -die                 boolean    Report deaths


Standard (Mandatory) qualifiers:
  Qualifier                    Description          Allowed values      Default
  [-asequence] (Parameter 1)   Sequence USA         Readable sequence   Required
  [-bsequence] (Parameter 2)   Sequence USA         Readable sequence   Required
  -wordsize                    Word size            Integer 2 or more   20
  [-outseq] (Parameter 3)      Output sequence USA  Writeable sequence  <sequence>.format
  [-outfile] (Parameter 4)     Output file name     Output file         <sequence>.megamerger

Additional (Optional) qualifiers:
  Qualifier                    Allowed values        Default
  -prefer                      Boolean value Yes/No  No
      When a mismatch between the two sequences is discovered, one or other
      of the two sequences must be used to create the merged sequence over
      this mismatch region. The default action is to create the merged
      sequence using the sequence where the mismatch is closest to that
      sequence's centre. If this option is used, then the first sequence
      (seqa) will always be used in preference to the other sequence when
      there is a mismatch.

Advanced (Unprompted) qualifiers: (none)

Input file format

megamerger reads any two Sequence USAs.

Input files for usage example

'tembl:ap000504' is a sequence entry in the example nucleic acid database 'tembl'

Database entry: tembl:ap000504

ID   AP000504   standard; DNA; HUM; 100000 BP.
XX
AC   AP000504; BA000025;
XX
SV   AP000504.1
XX
DT   28-SEP-1999 (Rel. 61, Created)
DT   22-AUG-2001 (Rel. 68, Last updated, Version 3)
XX
DE   Homo sapiens genomic DNA, chromosome 6p21.3, HLA Class I region, section
DE   3/20.
XX
KW   .
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-100000
RA   Hirakawa M., Yamaguchi H., Imai K., Shimada J.;
RT   ;
RL   Submitted (21-SEP-1999) to the EMBL/GenBank/DDBJ databases.
RL   Mika Hirakawa, Japan Science and Technology Corporation (JST), Advanced
RL   Databases Department; 5-3, Yonbancho, Chiyoda-ku, Tokyo 102-0081, Japan
RL   (E-mail:mika@tokyo.jst.go.jp, URL:http://www-alis.tokyo.jst.go.jp/,
RL   Tel:81-3-5214-8491, Fax:81-3-5214-8470)
XX
RN   [2]
RA   Shiina S., Tamiya G., Oka A., Inoko H.;
RT   "Homo sapiens 2,229,817bp genomic DNA of 6p21.3 HLA class I region";
RL   Unpublished.
XX
DR   SWISS-PROT; O00299; CLI1_HUMAN.
DR   SWISS-PROT; O43196; MSH5_HUMAN.
DR   SWISS-PROT; O95445; APOM_HUMAN.
DR   SWISS-PROT; O95865; DDH2_HUMAN.
DR   SWISS-PROT; O95867; NG24_HUMAN.
DR   SWISS-PROT; P13862; KC2B_HUMAN.
XX
CC   This sequence is conducted by Tokai University as a JST sequencing
CC   Team.
CC   Principal Investigator: Hidetoshi Inoko Ph.D
CC   Phone:+81-463-93-1121, Fax:+81-463-94-8884,
CC   The sequence is submitted by Human Genome Sequencing in ALIS
CC   project of JST
CC   Japan Science and Technology Corporation (JST)
CC   5-3, Yonbancyo, Chiyoda-ku, Tokyo, 102-0081 Japan
CC   For further infomation about this sequences, please visit our
CC   sequence archive Web site (http://www-alis.tokyo.jst.go.jp/HGS/top.


  [Part of this file has been deleted for brevity]

     gggtggatca tgaggtcaag agatcgagac tatcctggct aacatgatga aaccccgtct     97080
     ctactaaaaa tacaaaaaat tagctgggca tggtggcggg cacctgtagt cccagctact     97140
     cgggaggctg agtcaggaga atggtgtgaa cccaggagac ggagcttgca gtgagctgag     97200
     gtcgcaccac tgcactccag cctgggtgat agagcgagac tctgtctcaa aaaaaaaaaa     97260
     aaaaaaaaaa aaaacaaaaa ttagccgggt gtggtggcag gcaacttaat cccagctact     97320
     tgggaggcag aggcaggaga atcgtttgaa cctgggaggc ggaggttgaa gagaatagaa     97380
     gctctgctgg tccagagaag gattgggcca gggctctggg agaccaggga gaaagagggc     97440
     acatgtggtc cctgttgact gtgagggtgg gaatctgagg aaggctttgg ctcattgccc     97500
     cttgggtttg tccacagcca tccttcccct gcggagtatg tcgaggtgct ccaggagcta     97560
     cagcggctgg agagtcgcct ccagcccttc ttgcagcgct actacgaggt tctgggtgct     97620
     gctgccacca cggactacaa taacaatgtg agccctttga tggccctgcc ctttctcctc     97680
     agccccagta ctcccaaaac agaacaggct gaaatacaga taactctttc cctccctgga     97740
     aaaacattgc aacagggcca ggtgcagtgg ctcacgcctg taatcccagc actttgggag     97800
     gccaaggtgg gcggatcatc tgagatcggg agtttgagac cagcctggcc aacatggtgc     97860
     aaccccatct ctactgaaaa tataaacatt agctggatgt agtggtgcac acctgtaatc     97920
     ccagctactc aggaggctga ggcaggagaa tcgctagaac tcgggaggag ggggttgcag     97980
     tgagccgaga ttgcactact gcactctagc ctgggtgaca gagcgagact gtctcaaaaa     98040
     acaaaacaaa acaaaaaaac acacattgca acaaaacaat ttctctctaa acctgtaagt     98100
     gattttgtcc tcccttacag agaaggtgat aatctttgct gtaagcactg tcctcgtatc     98160
     gtaccccttg tgcccctgaa tgaatttaga aaatgtaaag tacaggagat cagtatatga     98220
     tgacttactg attcatagta gtgttttaat aggatgttcc ttatgtgaat aagatataat     98280
     ttatttgcaa agatttggtc tacatgtaaa cttccaagga tataactgaa agttttggag     98340
     gacatggtat tctcagtagg cattattgct tttattagtg agatggactc cagcttgata     98400
     ttttctgcct ttttgtgttt ggctggttgt gcgcagcacg agggccggga ggaggatcag     98460
     cggttgatca acttggtagg ggagagcctg cgactgctgg gcaacacctt tgttgcactg     98520
     tctgacctgc gctgcaatct ggcctgcacg cccccacgac acctgcatgt ggtccggcct     98580
     atgtctcact acaccacccc catggtgctc cagcaggcag ccattcccat acaggtgggt     98640
     tagggggagt ctggcctgag ggagagtgag gggtgttgat agagtgaccc agggtagcta     98700
     ctgggcctga aggaggttag gaaaggagga gactggaaac atggtgatga aggctggaga     98760
     tactttagag gtttatcatg aggttttctt ggttaggctc ttgtattttt ctcacatctg     98820
     cctgtccatc tgtctttttc agatcaatgt gggaaccact gtgaccatga caggaaatgg     98880
     gactcggccc cccccaactc ccaatgcaga ggcacctccc cctggtcctg ggcaggcctc     98940
     atccgtggct ccgtcttcta ccaatgtcga gtcctcagct gagggggctc ccccgccagg     99000
     tccagctccc ccgccagcca ccagccaccc gagggtcatc cggatttccc accagagtgt     99060
     ggaacccgtg gtcatgatgc acatgaacat tcaaggtgag aatagttgct ggcgagaaga     99120
     gcaggatcag catgatgagg gaggttcatg ctgaggtgtg agggaacagg gtggggaagg     99180
     gagaggcaca tgctggtggt ggtagcctgg ggaccagagc agaagcttaa gtagacagat     99240
     gtggggggtg tgggggttgg tttgtctttg gaggtgtgtt tgtgtggtga agggagtacc     99300
     tctccctgtt tagatggagg gaaaggcagg ctttctgatt gggggattat gggcctgaag     99360
     tatgcctgat ctcagaagga tatagttagg ccttggccct acctacctca gggccactgt     99420
     ctctgtctcc ctgcccagat tctggcacac agcctggtgg tgttccgagt gctcccactg     99480
     gccccctggg accccctggt catggccaaa ccctgggtaa gagtgagggc atcagggcag     99540
     gctgagctct gggtagagaa agggaagggc tgagtgggtg ggttgaaggg gtccaggttc     99600
     aaggttacat cagacccgcc ccccaggctc caccctcatc cagctgccct ccctgccccc     99660
     tgagttcatg cacgccgtcg cccaccagat cactcatcag gccatggtgg cagctgttgc     99720
     ctccgcggcc gcaggtaatg acctggaagg ggaggcttgg gaggtagggc acagtccatg     99780
     gtggcagctg gctggcaagg gcctggccct cagccctctt cggtctgtct cttctgccac     99840
     ccacaggaca gcaggtgcca ggcttcccaa cagctccaac ccgggtggtg attgcccggc     99900
     ccactcctcc acaggctcgg ccttcccatc ctggagggcc cccagtctct gggacactgg     99960
     tgagcaaggg tcggggagtt ctagtgcgta acagtctagg                          100000
//

Database entry: tembl:af129756

ID   AF129756   standard; DNA; HUM; 184666 BP.
XX
AC   AF129756;
XX
SV   AF129756.1
XX
DT   12-MAR-1999 (Rel. 59, Created)
DT   29-OCT-1999 (Rel. 61, Last updated, Version 2)
XX
DE   Homo sapiens MSH55 gene, partial cds; and CLIC1, DDAH, G6b, G6c, G5b, G6d,
DE   G6e, G6f, BAT5, G5b, CSK2B, BAT4, G4, Apo M, BAT3, BAT2, AIF-1, 1C7, LST-1,
DE   LTB, TNF, and LTA genes, complete cds.
XX
KW   .
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-184666
RA   Rowen L., Madan A., Qin S., Shaffer T., James R., Ratcliffe A., Abbasi N.,
RA   Dickhoff R., Loretz C., Madan A., Dors M., Young J., Lasky S., Hood L.;
RT   "Sequence of the human major histocompatibility complex class III region";
RL   Unpublished.
XX
RN   [2]
RP   1-184666
RA   Rowen L.;
RT   ;
RL   Submitted (22-FEB-1999) to the EMBL/GenBank/DDBJ databases.
RL   Department of Molecular Biotechnology, Box 357730 University of Washington,
RL   Seattle, WA 98195, USA
XX
RN   [3]
RP   1-184666
RA   Rowen L.;
RT   ;
RL   Submitted (28-OCT-1999) to the EMBL/GenBank/DDBJ databases.
RL   Multimegabase Sequencing Center, University of Washington, PO Box 357730,
RL   Seattle, WA 98195, USA
XX
DR   EPD; EP11158; HS_TNFA.
DR   EPD; EP11159; HS_TNFB.
DR   SPTREMBL; O00452; O00452.
DR   SPTREMBL; O14931; O14931.
DR   SPTREMBL; O95866; O95866.
DR   SPTREMBL; O95868; O95868.
DR   SPTREMBL; O95869; O95869.
DR   SPTREMBL; O95870; O95870.


  [Part of this file has been deleted for brevity]

     aaaccagttt accaccactc ctaacactaa acttaaatct gactctaaat gtaagtccaa    181740
     tctgagccac aagcctaaag ttgaacttta tcctgcttta tgaattattc atccattcct    181800
     ccatttagtg agtatctgcg tgcctaacac atgctgggca ttgtcctaag gcaggaggga    181860
     catggaggca aagggatcag agaaggtacc agcacctgtg gagcttgtat tccagtgagg    181920
     ccagacggaa aagaaagaaa ctgaagaaga aattggtact atgagaaaat aagacaggct    181980
     gatgttgtaa gagtggcagg gagctacttt taaatacagt agtcagcaaa atcctctttg    182040
     agtgtttggg tggcactgga gctgagaccc aaatgacaaa aaatagtgac caggtaaaag    182100
     tttgggagca aagcatttca ggtaaaggga gcagctactg caaaggctgg aaggcggaac    182160
     caagctgggg gtgttgacga caaacagaag gccagtgtgg ctggagcaga gagagagact    182220
     gggaggcggg tgggagatga ggtcagagag gagggcaggg gccaggtcat gcagggccat    182280
     gcaagaaggg taaagcctct agatttcatc cagccacagg aagcctttaa aggtcgtcag    182340
     agtgtgtggt gcgtgcgtgt gtgtgtgtgt gtgtgtgtgt gttgcagggg agagaggggg    182400
     agggagagag agagagagag agagaagagg gaggtgagca gaggtgattg gatttttttt    182460
     tcttttgaca tggtgtcttg ctctgtggcc taggctggag tgcagtggca ccatcatagc    182520
     ccactgcaac ctcaaaacca tgggctcaag tcatccttcc acctcagctt cccaagtatc    182580
     taggactaca ggtgtgtgcc actgtgcctg gctaatttta aaaaatattt taaaattttt    182640
     gttgagacag ggtctatgct gctcaggctg gtctcgaact cctggtttca agtgatctgc    182700
     ccatcttggc ctcccaaagt ttttttttgt tagtttgaga ggcggtttcg ctcgttgccc    182760
     aggctggagt gcaatgactg atctcatctc actgcaacct ctgcctcctg ggttcaagcg    182820
     attctcctgc ttcagcctcc caagtagctg ggattacagg tgcatgccac cattcccggc    182880
     taattttttg tatttagtag agatggggtt tcaccatgtt agtcaggctg atctcaaact    182940
     cctgacctca ggtgatccgc ctgcctcagc ctcccaaagt tttgggatta caggtgtgag    183000
     ccaccatgct gggccagcct cccaaagttt tgggattaca ggcatgagtc accacactgg    183060
     ccctggattt tttttctttc ttttttttgg agacggagtc tcactctgtt gcccaggctg    183120
     gagtgcaatg gcgtaatctc agctcactgc aacctctgct gcccgggttc aaacgattct    183180
     cctgtcttag cctcctgagt agctgggatt ataggtgcat gccaccatgc ctggctaatt    183240
     tttgtacttt tagtagagaa agtacaccat cttggccagg ctggtctcga actcctgacc    183300
     tcaggtgatc cacttgcgtc ggcctcccaa agtgctggga ttacaggcgt gagacaccgc    183360
     acccagcctt tttttttttt tttcttttaa gacagaatcg ctctgtcacc caggctggag    183420
     tgcagtggca caatctcggc tcactgcaac ctctgcctcc caggtttaag caatccacct    183480
     atgtcagtct cccaagtagc tgggattata ggtgcatgtc accatgcctg gctaattttt    183540
     gtacttttag tatagaaagt acaccatgtt ggccaggctg gtcttgaact cctgacctca    183600
     agtgatccgc ctgcctcagc ctcccgaagt gctggaatta cagacatgtg ccactgcacc    183660
     cggcctggtt ttttttttct aagagatgga gtctcacttt tctgcccagg ttggagtgca    183720
     atggcaccat catagctcac tgcagccttc aactcttggc ctcaggcaat ccttgcacct    183780
     tagcctcgca gtgttgggat tacaggcatg agccactgag ccttgcctgg actttttttt    183840
     ttttttgaga tggcgtctcg ctctgttgcc caggttggag tgctacggca tgatcttggc    183900
     tcactgcaac ttccacctcc caggttcaag cgattctctt gcctcggccc cccgagtagc    183960
     tgggattaca ggcatgcgcc accgtgcctg gctaattttg gtatttttag tagagatagg    184020
     gtttcatcat gttgggcagg ctggtcttga actcctgacc tcgtgatcca cccacctcgg    184080
     cctcccaaag tgctgggatt ataggcatag ccaacgcgcc cagcctggac ttgtttttaa    184140
     aagatcactg tggctcctgt gtttaggctg gctggtagga gacaggtggc agtggcattg    184200
     atggtgaaga gaaaatagtg gcagccatgg agatggagag aagtagacaa gtttgggata    184260
     tattatacat tccaggggta gaaacaacag gactagatga tggattgatg ggtgggagat    184320
     gtagatactg ggagagaagc aggattctga tggatggaaa aactaaaaaa ttctattttg    184380
     ggtgtggtaa gtctaagtct attagacatg caagtagaga tgtcactggg cagatacaca    184440
     tctggatttc aggggcaagg tccaagctag agaaagaaac ctgggcatgg tcagcatgag    184500
     gatggtgttt aaagccatgg aacttatctt gtgcatccct ataagacccc tttgaggcac    184560
     ttgtttcccc tcacaatgga tgcagtgcat cttccattct gaattccaga ggcaacaacc    184620
     tcctgctcct agaagctaaa ctctccagac ttagtcttct gaattc                   184666
//

Output file format

Output files for usage example

File: report

# Report of megamerger of: AP000504 and AF129756

AP000504 overlap starts at 1
AF129756 overlap starts at 6036

Using AF129756 1-6035 as the initial sequence

Matching region AP000504 1-846 : AF129756 6036-6881
Length of match: 846

WARNING!
Mismatch region found:
Mismatch AP000504 847-847
Mismatch AF129756 6882-6882
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 848-1794 : AF129756 6883-7829
Length of match: 947

WARNING!
Mismatch region found:
Mismatch AP000504 1795-1795
Mismatch AF129756 7830-7830
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 1796-2272 : AF129756 7831-8307
Length of match: 477

WARNING!
Mismatch region found:
Mismatch AP000504 2273-2273
Mismatch AF129756 8307
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 2274-2465 : AF129756 8308-8499
Length of match: 192

WARNING!
Mismatch region found:
Mismatch AP000504 2466-2466
Mismatch AF129756 8500-8500
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 2467-2654 : AF129756 8501-8688
Length of match: 188

WARNING!
Mismatch region found:
Mismatch AP000504 2655-2658
Mismatch AF129756 8688


  [Part of this file has been deleted for brevity]


WARNING!
Mismatch region found:
Mismatch AP000504 95451-95451
Mismatch AF129756 101481-101481
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 95452-96649 : AF129756 101482-102679
Length of match: 1198

WARNING!
Mismatch region found:
Mismatch AP000504 96650-96650
Mismatch AF129756 102680-102680
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 96651-97272 : AF129756 102681-103302
Length of match: 622

WARNING!
Mismatch region found:
Mismatch AP000504 97273-97274
Mismatch AF129756 103302
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 97275-97715 : AF129756 103303-103743
Length of match: 441

WARNING!
Mismatch region found:
Mismatch AP000504 97716-97716
Mismatch AF129756 103744-103744
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 97717-97826 : AF129756 103745-103854
Length of match: 110

WARNING!
Mismatch region found:
Mismatch AP000504 97827-97827
Mismatch AF129756 103855-103855
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence

Matching region AP000504 97828-100000 : AF129756 103856-106028
Length of match: 2173

AP000504 overlap ends at 100000
AF129756 overlap ends at 106028

Using AF129756 106029-184666 as the final sequence

File: ap000504.merged

>AP000504 AP000504.1 Homo sapiens genomic DNA, chromosome 6p21.3, HLA Class I region, section 3/20.
gaattctctccctcccatctgtggctgagattaaagatctgcacctgaagcactgaagaa
tgtgtgggtaattaaattaccctgccgattcctggagatgctgattacctggagatgacc
tcagagattatcctaaattaactcctacaagacacacattgcagcgggaggtgaggtagg
ggaaggattgtgcacctgaggctagcaaaggtttccactctgtttagagatgatgtcacc
agtgcgtttacatttgcgttgtgttcacacattgagtgctactatgtacaagaccatgtg
tcagacactgaggtaacaggtatagctactagatagagttcatagctatggaggcaagca
gattcactgactgctaattctaacatgatgtgacagtgcaactagaaaaacataaacaag
cactatgtgagcacaaagaaggtgcacatcaactccttacaggtacctgtaaaagccaaa
gggtaacagttggattgcaccttgaagaggatgcacttttttttttttttaagacagaat
ctcactctgttgcccaggctggagtgcagtggggcaatctgggctcacttcaacctctac
ctcccgagttcaagcaattctcctgcctcagcctcctgagtagctggtactagaggcatg
cgccatgatgctgggctaatttttgtattttcggtagacgtgaagtttcaccaagttggc
caggctggtcttgaactcctgacctcaaatcatccacccacctcagcctcccaaaatgct
gagactacaggcgtgagccaccgcgcctgacctggatgtaagattttgataggtacagaa
caaggaaaagactttccaggccgggcacagtggcttatgcctgtaatcccagcactttgg
gaggccgaggtgggcagatcacgaggtcagcagttcaagaccagcgtggccaacatggtg
aaatcccatctctactaaaaatacaaaaattagccaggcatggtggtgggtccctgtaat
cccagctactccggaggctgaggcaggagaattgcttgaacctgggaggtggaggttgca
gtgagcaaagaccgcgccactgcactccagcctgggtgacagagggagactccgtctcaa
aaaaaaaaaaaaaagactttccagaaggagcagcataaacacaggcatgacatgtttcca
taatggcaagtggccctaaatgactagaatataaggtagatccagtaggaaaggacttag
aaggggctttggaaggtgagtctggaaattaaaactggggtaaacgtgatggaccctgaa
catcattatactgcttaagatgctaatcttaatcctgaaggtaatgggaaaacctcctaa
ggtttatgttattttctttctacttaggctatttaaaaagtggagtgacggccaggcgca
gtgactcatgcctgtaatcccagcactttgggaggccgaggtgggcggatcaccaggagt
tcgagaccagcctgaccaacatggtgaaaccccgcctctactgaaaatacaaaaattagc
caggtgtggtggtgggcgcctgtaatcccagctacttgggaggctgaggcagaagaattg
cttgaacccgggaagtggaggttgcagtgagcagagatcgtgccattgtactccagcctg
ggcaacaagagcgaaactcagtcacaaaaaaaaaaaaaaaaaaaaaggagtgacatgctt
agatctctgttttggaatgacaggttttttgtttctagcatcaatccaaggttcatggct
tgagaaggtgtactgccagcaatgccattaaccagcaaagggaatgcaggaagaggaaca
gatctggtgggcatcagtttggatgctctgagtttgagctgcctgtgaaaactgcaggtg
gtgatatgcaattaacattcacatacggagttcaaaactagagacacaaatttgagagtc
atcacagaaatgtgaagtgtgttttctataactaaagataaccatgctaacatagccatg
tgttacattagcattttttttttttttgagacggagtctcactctgttgcccaggctgaa
gtgcagtgcacaatcttggctcactgcaacctccacctcctgggttcaagcgattctcct
gccttagtctcctgagtagctggaattacaggcacctaccaacacgcttggctaattttt
gcattttagtagagatggggctttaccatgttggccagctggtctcaaactcctgacctc
aagtgattcacccaccttggccccccaaagtgctgggattacaggtgtgagccactgtgc
ccggccttacattttgtgttttttcctgctgcttgtatgtgtgcaagtctgtgtatcatc
aatgggtatatgtgtacctgcgctgacaacaaaaaatgagatgcatatcagctactacac
aaagctgttataaggatgaaatgcagttagccagtgctcagtaaagggcagttgctttac
tactactaggtggggtggtgtatgtgagaatctgtatactgccattagtaggctttagta
tgtagtgtgcatatggaattcatgcattagtgtgtagtatgtgtgggacccactcacctg
agcagcttctctccccacttacagtggcatctgttgaggattcctgtgagggataaggca
gggagtgaacttgttacaaggcagggacagggaatggaatgtgtttatgtgtctaagctg
aggcatccaggtcagaggtgctggttgttgaggaagctggcctgggagggcacaaaggca
gccaaagctggtgcctggccacaaatatgagctgggattaccgtacatggagatggggga
agggatggacactcacagggacacttagccagaaaaatacacaaagcagacctagttaaa


  [Part of this file has been deleted for brevity]

accccctaaataaaacttctcctctaccccaacccaaccctgtttctagggctaatcttg
aaaccagtttaccaccactcctaacactaaacttaaatctgactctaaatgtaagtccaa
tctgagccacaagcctaaagttgaactttatcctgctttatgaattattcatccattcct
ccatttagtgagtatctgcgtgcctaacacatgctgggcattgtcctaaggcaggaggga
catggaggcaaagggatcagagaaggtaccagcacctgtggagcttgtattccagtgagg
ccagacggaaaagaaagaaactgaagaagaaattggtactatgagaaaataagacaggct
gatgttgtaagagtggcagggagctacttttaaatacagtagtcagcaaaatcctctttg
agtgtttgggtggcactggagctgagacccaaatgacaaaaaatagtgaccaggtaaaag
tttgggagcaaagcatttcaggtaaagggagcagctactgcaaaggctggaaggcggaac
caagctgggggtgttgacgacaaacagaaggccagtgtggctggagcagagagagagact
gggaggcgggtgggagatgaggtcagagaggagggcaggggccaggtcatgcagggccat
gcaagaagggtaaagcctctagatttcatccagccacaggaagcctttaaaggtcgtcag
agtgtgtggtgcgtgcgtgtgtgtgtgtgtgtgtgtgtgtgttgcaggggagagaggggg
agggagagagagagagagagagagaagagggaggtgagcagaggtgattggatttttttt
tcttttgacatggtgtcttgctctgtggcctaggctggagtgcagtggcaccatcatagc
ccactgcaacctcaaaaccatgggctcaagtcatccttccacctcagcttcccaagtatc
taggactacaggtgtgtgccactgtgcctggctaattttaaaaaatattttaaaattttt
gttgagacagggtctatgctgctcaggctggtctcgaactcctggtttcaagtgatctgc
ccatcttggcctcccaaagtttttttttgttagtttgagaggcggtttcgctcgttgccc
aggctggagtgcaatgactgatctcatctcactgcaacctctgcctcctgggttcaagcg
attctcctgcttcagcctcccaagtagctgggattacaggtgcatgccaccattcccggc
taattttttgtatttagtagagatggggtttcaccatgttagtcaggctgatctcaaact
cctgacctcaggtgatccgcctgcctcagcctcccaaagttttgggattacaggtgtgag
ccaccatgctgggccagcctcccaaagttttgggattacaggcatgagtcaccacactgg
ccctggattttttttctttcttttttttggagacggagtctcactctgttgcccaggctg
gagtgcaatggcgtaatctcagctcactgcaacctctgctgcccgggttcaaacgattct
cctgtcttagcctcctgagtagctgggattataggtgcatgccaccatgcctggctaatt
tttgtacttttagtagagaaagtacaccatcttggccaggctggtctcgaactcctgacc
tcaggtgatccacttgcgtcggcctcccaaagtgctgggattacaggcgtgagacaccgc
acccagcctttttttttttttttcttttaagacagaatcgctctgtcacccaggctggag
tgcagtggcacaatctcggctcactgcaacctctgcctcccaggtttaagcaatccacct
atgtcagtctcccaagtagctgggattataggtgcatgtcaccatgcctggctaattttt
gtacttttagtatagaaagtacaccatgttggccaggctggtcttgaactcctgacctca
agtgatccgcctgcctcagcctcccgaagtgctggaattacagacatgtgccactgcacc
cggcctggttttttttttctaagagatggagtctcacttttctgcccaggttggagtgca
atggcaccatcatagctcactgcagccttcaactcttggcctcaggcaatccttgcacct
tagcctcgcagtgttgggattacaggcatgagccactgagccttgcctggactttttttt
ttttttgagatggcgtctcgctctgttgcccaggttggagtgctacggcatgatcttggc
tcactgcaacttccacctcccaggttcaagcgattctcttgcctcggccccccgagtagc
tgggattacaggcatgcgccaccgtgcctggctaattttggtatttttagtagagatagg
gtttcatcatgttgggcaggctggtcttgaactcctgacctcgtgatccacccacctcgg
cctcccaaagtgctgggattataggcatagccaacgcgcccagcctggacttgtttttaa
aagatcactgtggctcctgtgtttaggctggctggtaggagacaggtggcagtggcattg
atggtgaagagaaaatagtggcagccatggagatggagagaagtagacaagtttgggata
tattatacattccaggggtagaaacaacaggactagatgatggattgatgggtgggagat
gtagatactgggagagaagcaggattctgatggatggaaaaactaaaaaattctattttg
ggtgtggtaagtctaagtctattagacatgcaagtagagatgtcactgggcagatacaca
tctggatttcaggggcaaggtccaagctagagaaagaaacctgggcatggtcagcatgag
gatggtgtttaaagccatggaacttatcttgtgcatccctataagacccctttgaggcac
ttgtttcccctcacaatggatgcagtgcatcttccattctgaattccagaggcaacaacc
tcctgctcctagaagctaaactctccagacttagtcttctgaattc

A merged sequence is written out.

Where there has been a mismatch between the two sequences, the merged sequence is written out in uppercase and the sequence whose mismatch region is furthest from the edges of the sequence is used in the merged sequence.
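
Because mismatch regions are written in uppercase and everything else in lowercase, the regions that need checking can be located in the merged sequence by scanning for uppercase runs. The following is a minimal sketch; the function name uppercase_regions is hypothetical, and it expects the sequence as a single string with the FASTA header line and line breaks already removed.

```python
import re


def uppercase_regions(sequence):
    """Return 1-based inclusive (start, end) spans of uppercase runs."""
    return [(m.start() + 1, m.end())
            for m in re.finditer(r"[A-Z]+", sequence)]


if __name__ == "__main__":
    # Toy merged sequence: one single-base and one two-base mismatch region.
    print(uppercase_regions("acgtAcgtTTga"))
```

The reported positions are coordinates in the merged sequence, not in either input sequence.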

The name and description of the first input sequence are used for the name and description of the output sequence.

A report of the merger is written out.
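
The report is line-oriented, so it can be summarised with a short script. The sketch below is based only on the line formats visible in the sample report in this document; the function name parse_report and the tuple layout are assumptions. A mismatch line that gives a single position rather than a range (for example "Mismatch AF129756 8307") is recorded with an end of None. In the sample, such a position coincides with the end of the preceding matching region, which suggests that the sequence contributes no bases to that mismatch region.

```python
import re

MATCH_RE = re.compile(
    r"^Matching region (\S+) (\d+)-(\d+) : (\S+) (\d+)-(\d+)$")
MISMATCH_RE = re.compile(r"^Mismatch (\S+) (\d+)(?:-(\d+))?$")


def parse_report(text):
    """Return (matches, mismatches) parsed from megamerger report text.

    matches:    (name_a, start_a, end_a, name_b, start_b, end_b)
    mismatches: (name, start, end), with end None if no range was given
    """
    matches, mismatches = [], []
    for line in text.splitlines():
        line = line.strip()
        m = MATCH_RE.match(line)
        if m:
            matches.append((m.group(1), int(m.group(2)), int(m.group(3)),
                            m.group(4), int(m.group(5)), int(m.group(6))))
            continue
        m = MISMATCH_RE.match(line)
        if m:
            end = int(m.group(3)) if m.group(3) else None
            mismatches.append((m.group(1), int(m.group(2)), end))
    return matches, mismatches


if __name__ == "__main__":
    sample = """\
Matching region AP000504 1796-2272 : AF129756 7831-8307
Length of match: 477

WARNING!
Mismatch region found:
Mismatch AP000504 2273-2273
Mismatch AF129756 8307
Mismatch is closer to the ends of AP000504, so use AF129756 in the merged sequence
"""
    print(parse_report(sample))
```

Lines such as "Mismatch region found:" and "Mismatch is closer to the ends of ..." are ignored because they do not have a position after the sequence name.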

Data files

None.

Notes

If you run out of memory, use a larger wordsize.
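
One plausible reason why a larger word size helps is that longer words produce fewer raw word hits between the two sequences, especially in repetitive regions. This is an assumption about the cause, not something stated in this document. The sketch below simply counts word hits for different word sizes; the function name count_word_hits is hypothetical.

```python
from collections import Counter


def count_word_hits(a, b, wordsize):
    """Count (position in a, position in b) pairs whose words are identical."""
    words_a = Counter(a[i:i + wordsize]
                      for i in range(len(a) - wordsize + 1))
    return sum(words_a[b[j:j + wordsize]]
               for j in range(len(b) - wordsize + 1))


if __name__ == "__main__":
    repeat = "ACGT" * 5  # a short, highly repetitive toy sequence
    for size in (4, 8):
        print(size, count_word_hits(repeat, repeat, size))
```

On the repetitive toy sequence, doubling the word size reduces the number of hits that would have to be held and sorted.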

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name  Description
cons          Creates a consensus from multiple alignments
merger        Merge two overlapping nucleic acid sequences

Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman and Wunsch algorithm but uses much more memory.

A graphical dotplot of the matches used in this merge can be displayed using the program dotpath.

Author(s)

Gary Williams (gwilliam@rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

Written Aug 2000 by Gary Williams.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None.