newcpgseek
CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.
Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands.
It has been estimated that about half of all mammalian genes have a CpG-rich region around their 5' end. It is said that all mammalian house-keeping genes have a CpG island!
Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes less clear in more distant taxonomic groups.
Finding a CpG island upstream of predicted exons or genes is good contributory evidence in support of the prediction.
CpG islands are usually defined as "length over 200bp with %GC over 50% and observed/expected CpG more than 0.6". However, this program uses a running sum rather than a window to produce a score: if there is no CpG at position i, the runSum counter is decremented, but if there is a CpG, then runSum += CPGSCORE. Spans above the threshold are searched recursively. If the score is higher than a threshold (17 at the moment), a putative island is declared.
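The running-sum idea can be illustrated with a short Python sketch. This is a simplified illustration, not the program's actual source: the function name `running_sum_islands` is hypothetical, the recursive search of spans is left out, and the sketch assumes that the sum is reset when it falls to zero and that a span runs from its first CpG to the position where the sum peaked. The defaults of 17 for both the CpG score and the threshold are taken from the text above.

```python
def running_sum_islands(seq, cpg_score=17, threshold=17):
    """Simplified running-sum scan (assumed behaviour, no recursion).

    Returns a list of (begin, end, score) tuples with 1-based,
    inclusive coordinates for spans whose peak score exceeds threshold.
    """
    seq = seq.upper()
    islands = []
    run_sum, peak = 0, 0
    start = peak_end = None  # 0-based start of span and position of the peak

    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            if start is None:
                start = i            # a new span opens at its first CpG
            run_sum += cpg_score     # reward each CpG
            if run_sum > peak:
                peak, peak_end = run_sum, i + 1
        elif start is not None:
            run_sum -= 1             # penalise each non-CpG position
            if run_sum <= 0:
                # The span has decayed: report it if it scored high enough.
                if peak > threshold:
                    islands.append((start + 1, peak_end + 1, peak))
                run_sum, peak = 0, 0
                start = peak_end = None

    # Report a span that is still open at the end of the sequence.
    if start is not None and peak > threshold:
        islands.append((start + 1, peak_end + 1, peak))
    return islands
```

Under these assumptions a single isolated CpG scores exactly 17 and is not reported, because the score must be strictly higher than the threshold, whereas two CpGs separated by a few bases are enough to produce a putative island.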
This program reads in one or more sequences and finds regions where there is a high absolute frequency of CpG dimers as well as a high proportion of CpG compared to GpC.
% newcpgseek
Reports CpG rich regions
Input sequence(s): tembl:rnu68037
CpG score [17]:
Output file [rnu68037.newcpgseek]:
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Sequence database USA
   -score              integer    CpG score
  [-outfile]           outfile    Output file name

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Standard (Mandatory) qualifiers | | | |
[-sequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
-score | CpG score | Integer from 1 to 200 | 17 |
[-outfile] (Parameter 2) | Output file name | Output file | <sequence>.newcpgseek |
Additional (Optional) qualifiers | | | |
(none) | | | |
Advanced (Unprompted) qualifiers | | | |
(none) | | | |
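For scripted use, the interactive prompts can be avoided by giving the standard qualifiers explicitly and adding -auto. The following Python sketch only assembles such a command line from the qualifiers listed above; the helper names `build_newcpgseek_command` and `run_newcpgseek` are hypothetical, and running the command assumes that newcpgseek is installed and on the PATH.

```python
import subprocess


def build_newcpgseek_command(sequence, outfile, score=17):
    """Return an argument list for a non-interactive newcpgseek run.

    Only qualifiers documented above are used: -sequence, -score,
    -outfile and -auto (turn off prompts).
    """
    if not 1 <= score <= 200:
        # The qualifier table gives the allowed range as 1 to 200.
        raise ValueError("score must be an integer from 1 to 200")
    return [
        "newcpgseek",
        "-sequence", sequence,
        "-score", str(score),
        "-outfile", outfile,
        "-auto",
    ]


def run_newcpgseek(sequence, outfile, score=17):
    """Run newcpgseek; assumes the EMBOSS program is on the PATH."""
    subprocess.run(build_newcpgseek_command(sequence, outfile, score), check=True)
```

Calling `build_newcpgseek_command("tembl:rnu68037", "rnu68037.newcpgseek")` gives the non-interactive equivalent of the example session, with the default score of 17.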
ID   RNU68037   standard; RNA; ROD; 1218 BP.
XX
AC   U68037;
XX
SV   U68037.1
XX
DT   23-SEP-1996 (Rel. 49, Created)
DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE   Rattus norvegicus EP1 prostanoid receptor mRNA, complete cds.
XX
KW   .
XX
OS   Rattus norvegicus (Norway rat)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
XX
RN   [1]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   "Cloning of the rat EP1 prostanoid receptor";
RL   Unpublished.
XX
RN   [2]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   ;
RL   Submitted (26-AUG-1996) to the EMBL/GenBank/DDBJ databases.
RL   Biochemistry & Molecular Biology, Merck Frosst Center for Therapeutic
RL   Research, P. O. Box 1005, Pointe Claire - Dorval, Quebec H9R 4P8, Canada
XX
DR   SWISS-PROT; P70597; PE21_RAT.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1218
FT                   /db_xref="taxon:10116"
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT   CDS             1..1218
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P70597"
FT                   /note="family 1 G-protein coupled receptor"
FT                   /product="EP1 prostanoid receptor"
FT                   /protein_id="AAB07735.1"
FT                   /translation="MSPYGLNLSLVDEATTCVTPRVPNTSVVLPTGGNGTSPALPIFSM
FT                   TLGAVSNVLALALLAQVAGRLRRRRSTATFLLFVASLLAIDLAGHVIPGALVLRLYTAG
FT                   RAPAGGACHFLGGCMVFFGLCPLLLGCGMAVERCVGVTQPLIHAARVSVARARLALALL
FT                   AAMALAVALLPLVHVGHYELQYPGTWCFISLGPPGGWRQALLAGLFAGLGLAALLAALV
FT                   CNTLSGLALLRARWRRRRSRRFRENAGPDDRRRWGSRGLRLASASSASSITSTTAALRS
FT                   SRGGGSARRVHAHDVEMVGQLVGIMVVSCICWSPLLVLVVLAIGGWNSNSLQRPLFLAV
FT                   RLASWNQILDPWVYILLRQAMLRQLLRLLPLRVSAKGGPTELSLTKSAWEASSLRSSRH
FT                   SGFSHL"
XX
SQ   Sequence 1218 BP; 162 A; 397 C; 387 G; 272 T; 0 other;
     atgagcccct acgggcttaa cctgagccta gtggatgagg caacaacgtg tgtaacaccc        60
     agggtcccca atacatctgt ggtgctgcca acaggcggta acggcacatc accagcgctg       120
     cctatcttct ccatgacgct gggtgctgtg tccaacgtgc tggcgctggc gctgctggcc       180
     caggttgcag gcagactgcg gcgccgccgc tcgactgcca ccttcctgtt gttcgtcgcc       240
     agcctgcttg ccatcgacct agcaggccat gtgatcccgg gcgccttggt gcttcgcctg       300
     tatactgcag gacgtgcgcc cgctggcggg gcctgtcatt tcctgggcgg ctgtatggtc       360
     ttctttggcc tgtgcccact tttgcttggc tgtggcatgg ccgtggagcg ctgcgtgggt       420
     gtcacgcagc cgctgatcca cgcggcgcgc gtgtccgtag cccgcgcacg cctggcacta       480
     gccctgctgg ccgccatggc tttggcagtg gcgctgctgc cactagtgca cgtgggtcac       540
     tacgagctac agtaccctgg cacttggtgt ttcattagcc ttgggcctcc tggaggttgg       600
     cgccaggcgt tgcttgcggg cctcttcgcc ggccttggcc tggctgcgct ccttgccgca       660
     ctagtgtgta atacgctcag cggcctggcg ctccttcgtg cccgctggag gcggcgtcgc       720
     tctcgacgtt tccgagagaa cgcaggtccc gatgatcgcc ggcgctgggg gtcccgtgga       780
     ctccgcttgg cctccgcctc gtctgcgtca tccatcactt caaccacagc tgccctccgc       840
     agctctcggg gaggcggctc cgcgcgcagg gttcacgcac acgacgtgga aatggtgggc       900
     cagctcgtgg gcatcatggt ggtgtcgtgc atctgctgga gccccctgct ggtattggtg       960
     gtgttggcca tcgggggctg gaactctaac tccctgcagc ggccgctctt tctggctgta      1020
     cgcctcgcgt cgtggaacca gatcctggac ccatgggtgt acatcctgct gcgccaggct      1080
     atgctgcgcc aacttcttcg cctcctaccc ctgagggtta gtgccaaggg tggtccaacg      1140
     gagctgagcc taaccaagag tgcctgggag gccagttcac tgcgtagctc ccggcacagt      1200
     ggcttcagcc acttgtga                                                    1218
//
NEWCPGSEEK of RNU68037 from 1 to 1218 with score > 17

 Begin    End  Score        CpG  %CG  CG/GC
*   96   1032    630         87 66.1   0.65
  1072   1100     26          3 62.1   0.00
  1183   1193     26          2 72.7   2.00
-------------------------------------------
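A report in this layout is easy to read back into a script. The sketch below is one possible parser, written only against the example output shown here: the function name `parse_newcpgseek_report` and the dictionary keys are hypothetical, and the leading `*` on a row is simply recorded as a flag because its meaning is not described in this document.

```python
import re

# One data row: optional '*', then Begin, End, Score, CpG, %CG, CG/GC.
ROW = re.compile(
    r"^\s*(\*?)\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$"
)


def parse_newcpgseek_report(text):
    """Return one dict per data row; header and rule lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # title line, column header, blank line or dashes
        star, begin, end, score, cpg, pct_cg, cg_gc = match.groups()
        rows.append({
            "starred": star == "*",
            "begin": int(begin),
            "end": int(end),
            "score": int(score),
            "cpg": int(cpg),
            "pct_cg": float(pct_cg),
            "cg_gc": float(cg_gc),
        })
    return rows
```

Applied to the example output, this yields three rows, the first of which spans positions 96 to 1032 with a score of 630.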
Program name | Description |
---|---|
cpgplot | Plot CpG rich areas |
cpgreport | Reports all CpG rich regions |
geecee | Calculates fractional GC content of nucleic acid sequences |
newcpgreport | Report CpG rich areas |
As there is no official definition of what a CpG island is, and, worse, of where islands begin and end, we have to live with two definitions and thus two methods. These are:
1. newcpgseek and cpgreport both declare a putative island if the score is higher than a threshold (17 at the moment). They now also display the actual CpG count, the % CG and the observed/expected ratio in the region where the score is above the threshold. This scoring method, based on sum/frequencies, overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport but the output is different and more readable.
2. newcpgreport and cpgplot use a sliding window within which the Obs/Exp ratio of CpG is calculated. The important thing to note in this method is that an island, in order to be reported, is defined as a region that satisfies the following constraints:
   Obs/Exp ratio > 0.6
   % C + % G > 50%
   Length > 200
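The three constraints of the window-based method can be checked for a candidate region with a few lines of Python. This is a sketch rather than the code used by newcpgreport or cpgplot: the function names are hypothetical, and the expected CpG count is assumed to be the commonly used (number of C × number of G) / length.

```python
def cpg_stats(region):
    """Return (percent C+G, Obs/Exp CpG ratio) for a region.

    Assumes expected CpG = (count of C * count of G) / length.
    """
    region = region.upper()
    length = len(region)
    if length == 0:
        return 0.0, 0.0
    c_count = region.count("C")
    g_count = region.count("G")
    observed = region.count("CG")  # CpG dinucleotides cannot overlap
    pct_gc = 100.0 * (c_count + g_count) / length
    expected = c_count * g_count / length
    obs_exp = observed / expected if expected else 0.0
    return pct_gc, obs_exp


def meets_island_criteria(region):
    """Apply the three constraints: Obs/Exp > 0.6, %C + %G > 50, length > 200."""
    pct_gc, obs_exp = cpg_stats(region)
    return len(region) > 200 and pct_gc > 50.0 and obs_exp > 0.6
```

Note that all three comparisons are strict, so a region of exactly 200 bases fails the length constraint even if it is very rich in CpG.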
For all practical purposes you should probably use newcpgreport. It is actually used to produce the human cpgisland database you can find on the EBI's ftp server as well as on the EBI's SRS server.
geecee measures CG content in the entire input sequence and is not to be used to detect CpG islands. It can be useful for detecting sequences that MIGHT contain an island.