cpgplot
Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands.
It has been estimated that about half of all mammalian genes have a CpG-rich region around their 5' end. It is said that all mammalian house-keeping genes have a CpG island!
Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes equivocal in more distant taxonomic groups.
Finding a CpG island upstream of predicted exons or genes is good contributory evidence in support of the prediction.
By default, this program defines a CpG island as a region where, averaged over 10 windows, the calculated percentage composition (% C + % G) is over 50%, the calculated Obs/Exp (i.e. Observed/Expected) ratio is over 0.6, and both conditions hold for a minimum of 200 bases. These conditions can be modified by setting the values of the appropriate parameters.
The Observed number of CpG patterns in a window is simply the count of the number of times a 'C' is found followed immediately by a 'G'.
The Expected number of CpG patterns is calculated for each window as the number of CpG dinucleotides you would expect to see in that window based on the frequency of C's and G's in that window. Thus, the Expected number of CpG's in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length.
Expected = (number of C's * number of G's) / window length
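The Observed and Expected definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not cpgplot's own source code; the function name `cpg_window_stats` and the returned dictionary keys are hypothetical.

```python
def cpg_window_stats(window: str) -> dict:
    """Compute CpG statistics for a single window of sequence.

    Follows the definitions in the text:
      Observed = number of times 'C' is immediately followed by 'G'
      Expected = (number of C's * number of G's) / window length
    """
    seq = window.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = (n_c * n_g) / length if length else 0.0
    # Assumption: report 0.0 when Expected is zero to avoid division by zero.
    obs_exp = observed / expected if expected > 0 else 0.0
    percent_cg = 100.0 * (n_c + n_g) / length if length else 0.0
    return {
        "observed": observed,
        "expected": expected,
        "obs_exp": obs_exp,
        "percent_cg": percent_cg,
    }
```

For the 8-base window `ACGTCGGA` there are 2 C's, 3 G's and 2 CpG dinucleotides, so Expected is 2 * 3 / 8 = 0.75 and the percentage C + G is 62.5.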
This program reads in one or more sequences and calculates the Obs/Exp ratio and the percentage of C plus G over a window which is moved along the sequence. These calculated values can be plotted, together with the regions which match this program's definition of a CpG island.
    % cpgplot tembl:rnu68037 -graph cps
    Plot CpG rich areas
    Window size [100]:
    Minimum length of an island [200]:
    Minimum observed/expected [0.6]:
    Minimum percentage [50.]:
    Output file [rnu68037.cpgplot]:
    Output features [rnu68037.gff]:
    Created cpgplot.ps
       Standard (Mandatory) qualifiers:
      [-sequence]          seqall     Sequence database USA
       -window             integer    The percentage CG content and the Observed
                                      frequency of CG are calculated within a
                                      window whose size is set by this parameter.
                                      The window is moved down the sequence and
                                      these statistics are calculated at each
                                      position that the window is moved to.
       -minlen             integer    This sets the minimum length that a CpG
                                      island has to be before it is reported.
       -minoe              float      This sets the minimum average observed to
                                      expected ratio of CpG, over a set of 10
                                      windows, that is required before a CpG
                                      island is reported.
       -minpc              float      This sets the minimum average percentage of
                                      G plus C, over a set of 10 windows, that is
                                      required before a CpG island is reported.
      [-outfile]           outfile    This sets the name of the file holding the
                                      report of the input sequence name, CpG
                                      island parameters and the output details of
                                      any CpG islands that are found.
      [-graph]             xygraph    Graph type
      [-outfeat]           featout    File for output features

       Additional (Optional) qualifiers: (none)

       Advanced (Unprompted) qualifiers:
       -[no]obsexp         boolean    If this is set to true then the graph of
                                      the observed to expected ratio of CpG
                                      within a window is displayed.
       -[no]cg             boolean    If this is set to true then the graph of
                                      the regions which have been determined to
                                      be CpG islands is displayed.
       -[no]pc             boolean    If this is set to true then the graph of
                                      the percentage C plus G within a window is
                                      displayed.

       Associated qualifiers:

       "-sequence" associated qualifiers
       -sbegin1            integer    Start of each sequence to be used
       -send1              integer    End of each sequence to be used
       -sreverse1          boolean    Reverse (if DNA)
       -sask1              boolean    Ask for begin/end/reverse
       -snucleotide1       boolean    Sequence is nucleotide
       -sprotein1          boolean    Sequence is protein
       -slower1            boolean    Make lower case
       -supper1            boolean    Make upper case
       -sformat1           string     Input sequence format
       -sdbname1           string     Database name
       -sid1               string     Entryname
       -ufo1               string     UFO features
       -fformat1           string     Features format
       -fopenfile1         string     Features file name

       "-outfile" associated qualifiers
       -odirectory2        string     Output directory

       "-graph" associated qualifiers
       -gprompt3           boolean    Graph prompting
       -gtitle3            string     Graph title
       -gsubtitle3         string     Graph subtitle
       -gxtitle3           string     Graph x axis title
       -gytitle3           string     Graph y axis title
       -goutfile3          string     Output file for non interactive displays
       -gdirectory3        string     Output directory

       "-outfeat" associated qualifiers
       -offormat4          string     Output feature format
       -ofopenfile4        string     Features file name
       -ofextension4       string     File name extension
       -ofdirectory4       string     Output directory
       -ofname4            string     Base file name
       -ofsingle4          boolean    Separate file for each entry

       General qualifiers:
       -auto               boolean    Turn off prompts
       -stdout             boolean    Write standard output
       -filter             boolean    Read standard input, write standard output
       -options            boolean    Prompt for standard and additional values
       -debug              boolean    Write debug output to program.dbg
       -verbose            boolean    Report some/full command line options
       -help               boolean    Report command line options. More
                                      information on associated and general
                                      qualifiers can be found with -help -verbose
       -warning            boolean    Report warnings
       -error              boolean    Report errors
       -fatal              boolean    Report fatal errors
       -die                boolean    Report deaths
Standard (Mandatory) qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-sequence] (Parameter 1) | Sequence database USA | Readable sequence(s) | Required |
-window | The percentage CG content and the Observed frequency of CG are calculated within a window whose size is set by this parameter. The window is moved down the sequence and these statistics are calculated at each position that the window is moved to. | Integer 1 or more | 100 |
-minlen | This sets the minimum length that a CpG island has to be before it is reported. | Integer 1 or more | 200 |
-minoe | This sets the minimum average observed to expected ratio of CpG, over a set of 10 windows, that is required before a CpG island is reported. | Number from 0.000 to 10.000 | 0.6 |
-minpc | This sets the minimum average percentage of G plus C, over a set of 10 windows, that is required before a CpG island is reported. | Number from 0.000 to 100.000 | 50. |
[-outfile] (Parameter 2) | This sets the name of the file holding the report of the input sequence name, CpG island parameters and the output details of any CpG islands that are found. | Output file | <sequence>.cpgplot |
[-graph] (Parameter 3) | Graph type | EMBOSS has a list of known devices, including postscript, ps, hpgl, hp7470, hp7580, meta, colourps, cps, xwindows, x11, tektronics, tekt, tek4107t, tek, none, null, text, data, xterm, png, xml | EMBOSS_GRAPHICS value, or x11 |
[-outfeat] (Parameter 4) | File for output features | Writeable feature table | unknown.gff |
Additional (Optional) qualifiers | Description | Allowed values | Default |
(none) | | | |
Advanced (Unprompted) qualifiers | Description | Allowed values | Default |
-[no]obsexp | If this is set to true then the graph of the observed to expected ratio of CpG within a window is displayed. | Boolean value Yes/No | Yes |
-[no]cg | If this is set to true then the graph of the regions which have been determined to be CpG islands is displayed. | Boolean value Yes/No | Yes |
-[no]pc | If this is set to true then the graph of the percentage C plus G within a window is displayed. | Boolean value Yes/No | Yes |
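To run cpgplot without interactive prompts, the qualifiers listed above can be given explicitly on the command line together with `-auto`. The following Python sketch assembles such a command line; the helper name `build_cpgplot_command` is hypothetical, and the defaults simply mirror the values shown in the qualifier table.

```python
def build_cpgplot_command(sequence, outfile, outfeat, graph="cps",
                          window=100, minlen=200, minoe=0.6, minpc=50.0):
    """Return an argument list for a non-interactive cpgplot run.

    Only qualifiers documented for cpgplot are used. The list form is
    suitable for passing to subprocess.run() without shell quoting issues.
    """
    return [
        "cpgplot",
        "-sequence", sequence,
        "-window", str(window),
        "-minlen", str(minlen),
        "-minoe", str(minoe),
        "-minpc", str(minpc),
        "-outfile", outfile,
        "-graph", graph,
        "-outfeat", outfeat,
        "-auto",  # general qualifier: turn off prompts
    ]
```

Assuming EMBOSS is installed and `cpgplot` is on the `PATH`, the list could be run with `subprocess.run(build_cpgplot_command("tembl:rnu68037", "rnu68037.cpgplot", "rnu68037.gff"), check=True)`.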
    ID   RNU68037   standard; RNA; ROD; 1218 BP.
    XX
    AC   U68037;
    XX
    SV   U68037.1
    XX
    DT   23-SEP-1996 (Rel. 49, Created)
    DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
    XX
    DE   Rattus norvegicus EP1 prostanoid receptor mRNA, complete cds.
    XX
    KW   .
    XX
    OS   Rattus norvegicus (Norway rat)
    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
    OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
    XX
    RN   [1]
    RP   1-1218
    RA   Abramovitz M., Boie Y.;
    RT   "Cloning of the rat EP1 prostanoid receptor";
    RL   Unpublished.
    XX
    RN   [2]
    RP   1-1218
    RA   Abramovitz M., Boie Y.;
    RT   ;
    RL   Submitted (26-AUG-1996) to the EMBL/GenBank/DDBJ databases.
    RL   Biochemistry & Molecular Biology, Merck Frosst Center for Therapeutic
    RL   Research, P. O. Box 1005, Pointe Claire - Dorval, Quebec H9R 4P8, Canada
    XX
    DR   SWISS-PROT; P70597; PE21_RAT.
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..1218
    FT                   /db_xref="taxon:10116"
    FT                   /organism="Rattus norvegicus"
    FT                   /strain="Sprague-Dawley"
    FT   CDS             1..1218
    FT                   /codon_start=1
    FT                   /db_xref="SWISS-PROT:P70597"
    FT                   /note="family 1 G-protein coupled receptor"
    FT                   /product="EP1 prostanoid receptor"
    FT                   /protein_id="AAB07735.1"
    FT                   /translation="MSPYGLNLSLVDEATTCVTPRVPNTSVVLPTGGNGTSPALPIFSM
    FT                   TLGAVSNVLALALLAQVAGRLRRRRSTATFLLFVASLLAIDLAGHVIPGALVLRLYTAG
    FT                   RAPAGGACHFLGGCMVFFGLCPLLLGCGMAVERCVGVTQPLIHAARVSVARARLALALL
    FT                   AAMALAVALLPLVHVGHYELQYPGTWCFISLGPPGGWRQALLAGLFAGLGLAALLAALV
    FT                   CNTLSGLALLRARWRRRRSRRFRENAGPDDRRRWGSRGLRLASASSASSITSTTAALRS
    FT                   SRGGGSARRVHAHDVEMVGQLVGIMVVSCICWSPLLVLVVLAIGGWNSNSLQRPLFLAV
    FT                   RLASWNQILDPWVYILLRQAMLRQLLRLLPLRVSAKGGPTELSLTKSAWEASSLRSSRH
    FT                   SGFSHL"
    XX
    SQ   Sequence 1218 BP; 162 A; 397 C; 387 G; 272 T; 0 other;
         atgagcccct acgggcttaa cctgagccta gtggatgagg caacaacgtg tgtaacaccc        60
         agggtcccca atacatctgt ggtgctgcca acaggcggta acggcacatc accagcgctg       120
         cctatcttct ccatgacgct gggtgctgtg tccaacgtgc tggcgctggc gctgctggcc       180
         caggttgcag gcagactgcg gcgccgccgc tcgactgcca ccttcctgtt gttcgtcgcc       240
         agcctgcttg ccatcgacct agcaggccat gtgatcccgg gcgccttggt gcttcgcctg       300
         tatactgcag gacgtgcgcc cgctggcggg gcctgtcatt tcctgggcgg ctgtatggtc       360
         ttctttggcc tgtgcccact tttgcttggc tgtggcatgg ccgtggagcg ctgcgtgggt       420
         gtcacgcagc cgctgatcca cgcggcgcgc gtgtccgtag cccgcgcacg cctggcacta       480
         gccctgctgg ccgccatggc tttggcagtg gcgctgctgc cactagtgca cgtgggtcac       540
         tacgagctac agtaccctgg cacttggtgt ttcattagcc ttgggcctcc tggaggttgg       600
         cgccaggcgt tgcttgcggg cctcttcgcc ggccttggcc tggctgcgct ccttgccgca       660
         ctagtgtgta atacgctcag cggcctggcg ctccttcgtg cccgctggag gcggcgtcgc       720
         tctcgacgtt tccgagagaa cgcaggtccc gatgatcgcc ggcgctgggg gtcccgtgga       780
         ctccgcttgg cctccgcctc gtctgcgtca tccatcactt caaccacagc tgccctccgc       840
         agctctcggg gaggcggctc cgcgcgcagg gttcacgcac acgacgtgga aatggtgggc       900
         cagctcgtgg gcatcatggt ggtgtcgtgc atctgctgga gccccctgct ggtattggtg       960
         gtgttggcca tcgggggctg gaactctaac tccctgcagc ggccgctctt tctggctgta      1020
         cgcctcgcgt cgtggaacca gatcctggac ccatgggtgt acatcctgct gcgccaggct      1080
         atgctgcgcc aacttcttcg cctcctaccc ctgagggtta gtgccaaggg tggtccaacg      1140
         gagctgagcc taaccaagag tgcctgggag gccagttcac tgcgtagctc ccggcacagt      1200
         ggcttcagcc acttgtga                                                    1218
    //
    CPGPLOT islands of unusual CG composition
    RNU68037 from 1 to 1218

         Observed/Expected ratio > 0.60
         Percent C + Percent G > 50.00
         Length > 200

     Length 406 (104..509)
     Length 329 (596..924)
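The island lines in a report like the one above are regular enough to be picked out with a regular expression. The sketch below assumes that every island is reported in the form `Length N (start..end)`, as in this example; the function name `parse_cpgplot_report` is hypothetical and other report layouts are not handled.

```python
import re

# Matches island lines such as "Length 406 (104..509)".
# The threshold line "Length > 200" has no "(start..end)" part, so it is skipped.
ISLAND_LINE = re.compile(r"Length\s+(\d+)\s+\((\d+)\.\.(\d+)\)")


def parse_cpgplot_report(text: str) -> list:
    """Return a list of (length, start, end) tuples, one per reported island."""
    return [tuple(int(value) for value in match.groups())
            for match in ISLAND_LINE.finditer(text)]
```

Applied to the report shown above, this yields the two islands at 104..509 and 596..924.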
    ##gff-version 2.0
    ##date 2005-07-15
    ##Type DNA RNU68037
    RNU68037 cpgplot misc_feature 104 509 0.000 + . Sequence "RNU68037.1"
    RNU68037 cpgplot misc_feature 596 924 0.000 + . Sequence "RNU68037.2"
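The feature output can be read back with a few lines of Python. This sketch assumes the whitespace-separated GFF version 2 layout shown above (sequence name, source, feature, start, end, score, strand, frame, attributes); the function name `parse_gff_features` and the dictionary keys are hypothetical.

```python
def parse_gff_features(text: str) -> list:
    """Parse GFF feature lines into dictionaries, skipping '#' comment lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or "##" header directive
        # Split into at most 9 fields so the attribute column stays intact.
        fields = line.split(None, 8)
        if len(fields) < 8:
            continue  # not a complete feature line
        features.append({
            "seqname": fields[0],
            "source": fields[1],
            "feature": fields[2],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "score": float(fields[5]),
            "strand": fields[6],
            "frame": fields[7],
            "attributes": fields[8] if len(fields) > 8 else "",
        })
    return features
```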
Program name | Description |
---|---|
cpgreport | Reports all CpG rich regions |
geecee | Calculates fractional GC content of nucleic acid sequences |
newcpgreport | Report CpG rich areas |
newcpgseek | Reports CpG rich regions |
As there is no official definition of what a CpG island is and, worse, of where islands begin and end, we have to live with two definitions and thus two methods. These are:
1. newcpgseek and cpgreport - both declare a putative island if the score is higher than a threshold (17 at the moment). They now also display the actual CpG count, the % CG and the observed/expected ratio in the region where the score is above the threshold. This scoring method, based on sum/frequencies, overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport, but its output is different and more readable.
2. newcpgreport and cpgplot use a sliding window within which the Obs/Exp ratio of CpG is calculated. The important thing to note in this method is that an island, in order to be reported, is defined as a region that satisfies the following constraints:
Obs/Exp ratio > 0.6; % C + % G > 50%; Length > 200.
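The sliding-window method can be illustrated with a simplified Python sketch. It is not the algorithm implemented in cpgplot or newcpgreport: it omits the averaging over 10 windows, moves the window one base at a time, and defines an island as the span from the start of the first passing window to the end of the last passing window in a run. The function name `find_islands` and these simplifications are assumptions made for illustration.

```python
def find_islands(sequence, window=100, min_len=200, min_oe=0.6, min_pc=50.0):
    """Return 1-based (start, end) spans that satisfy the three constraints.

    Simplified sketch: each window is tested on its own, with no averaging
    over neighbouring windows.
    """
    seq = sequence.upper()
    passing = []
    for start in range(0, len(seq) - window + 1):
        w = seq[start:start + window]
        n_c, n_g = w.count("C"), w.count("G")
        expected = n_c * n_g / window
        obs_exp = w.count("CG") / expected if expected else 0.0
        percent_cg = 100.0 * (n_c + n_g) / window
        passing.append(obs_exp > min_oe and percent_cg > min_pc)

    islands = []
    run_start = None
    # A trailing False acts as a sentinel that closes a run at the sequence end.
    for i, ok in enumerate(passing + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            first = run_start + 1      # 1-based start of the first passing window
            last = (i - 1) + window    # 1-based end of the last passing window
            if last - first + 1 >= min_len:
                islands.append((first, last))
            run_start = None
    return islands
```

Because every base covered by a passing window is included, the reported span extends beyond the CpG-rich stretch itself by up to nearly one window length on each side.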
For all practical purposes you should probably use newcpgreport. It is actually used to produce the human cpgisland database you can find on the EBI's ftp server as well as on the EBI's SRS server.
geecee measures CG content in the entire input sequence and is not to be used to detect CpG islands. It can be useful for detecting sequences that MIGHT contain an island.