2.1.1 SNP data sets
2.1.1.1 MBML2 SNP data set
2.1.1.2 HQMBML2 SNP data set
2.1.1.3 MBML10 SNP data set
2.1.2 PRP data
2.1.3 Pseudochromosome sequences
2.1.4 Repetitive bases
3 Program usage
3.2 Polymorphism Tools
3.2.1 Coding SNPs by Gene
3.2.2 SNPs by Region
3.2.3 SNPs Between Accessions (SBE)
3.2.4 SNPs by Allele Frequency (SAF)
3.2.5 PRPs by Locus
3.2.6 PRPs Between Accessions
3.2.7 Pseudochromosome Search
3.3 Assay Development
3.3.1 Assay Development Formatter (ADF)
3.3.2 Primer Designer
3.3.3 CAPS Search
3.4 Repetitive Features
3.4.1 Repetitive 25mer
3.5.1 Gbrowse Viewer
Genome annotations available from The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org) will be supported (the TAIR7 annotation is currently used). Predicted products from annotated coding genes are mapped to protein domains from the Interpro protein domain databases (http://www.ebi.ac.uk/interpro). Additional annotations for noncoding RNAs from RFAM, miRBase (http://microrna.sanger.ac.uk/sequences/index.shtml), and the Arabidopsis Small RNA Project (ASRP, http://asrp.cgrb.oregonstate.edu/) are also supported.
A single resequencing project, "Array_20", is currently supported by POLYMORPH. For a given project, biases in the data that are associated with the resequencing method and that may affect end users are mentioned where appropriate.
"Array_20" is a project for which 20 A. thaliana accessions were resequenced with whole-genome resequencing arrays (Clark et al. 2007). Primary data was gathered by hybridizing isothermally amplified genomic DNA from each accession to high-density oligonucleotide microarrays suitable for base calling. Over 99.99% of bases in the 119 Mb euchromatic A. thaliana reference genome sequence from accession Col-0 (The Arabidopsis Genome Initiative, 2000) were queried with these arrays for each accession. For quality control, and to aid in polymorphism discovery in other accessions, genomic DNA from Col-0 was also hybridized to these arrays.
SNP, polymorphic region prediction (PRP), pseudochromosome sequence, and repetitive base data that were generated as part of the "Array_20" project can be accessed with POLYMORPH.
A collection of 648,570 nonredundant SNPs predicted at a false discovery rate of about 2% (i.e., about 1 false SNP in 50). This data set includes all SNPs predicted by (1) a model-based algorithm (MB) or (2) a machine learning method (ML) for which the cumulative confidence for an ML prediction was greater than 98% [see Clark et al. (2007) supplemental sections 5 and 6 for details].
A nonredundant subset of MBML2 consisting of 250,727 SNPs identified at a false discovery rate of about 0.2% (i.e., about 1 false call in 500). These SNPs are highly suitable for genetic mapping, where genotyping data is especially important. This data set includes all SNPs predicted by both (1) the MB algorithm and (2) the ML method for which the cumulative confidence for the ML prediction was greater than 98% [see column "MB intersection ML" in Table 1 of Clark et al. (2007)].
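The false discovery rates quoted for the MBML2 and HQMBML2 data sets can be turned into rough expected counts of false calls. The following is a minimal arithmetic sketch; the function name is hypothetical and the rates are the approximate values given above, so the results are estimates only.

```python
def expected_false_calls(n_snps: int, fdr: float) -> float:
    """Expected number of false SNP predictions at a given false discovery rate."""
    return n_snps * fdr


# Approximate rates taken from the data set descriptions above.
mbml2 = expected_false_calls(648_570, 0.02)     # about 1 false SNP in 50
hqmbml2 = expected_false_calls(250_727, 0.002)  # about 1 false call in 500

print(f"MBML2:   roughly {round(mbml2)} false calls expected")
print(f"HQMBML2: roughly {round(hqmbml2)} false calls expected")
```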
This data set contains all SNP predictions described by Clark et al. (2007), including a subset not included in the MBML2 data. SNP predictions unique to MBML10 have very high FDRs (generally greater than 10-20%). Owing to the high error rate, these predictions are not displayed directly by POLYMORPH; however, some POLYMORPH tools use information from this data set (e.g., the Primer Designer tool uses information from all SNP predictions to attempt to design primers that perfectly match target sequences across all accessions).
A NOTE TO USERS: Only about 35% of known SNPs are represented in the MBML2 data set. Genomic intervals for which no SNPs were predicted may be monomorphic with the Col-0 reference genome sequence, but may also harbor SNPs undetectable with hybridization-based methods. Such regions include those for which polymorphism levels are very high, probe hybridization properties are poor, or that correspond to repetitive bases comprising about 18% of the A. thaliana genome (see section 2.1.4 below). Pseudochromosome sequences available for each accession are an additional source of information about predictions (see tool "Pseudochromosomes").
For displaying the "Array_20" SNP data in the GBrowse viewer tool (see below), SNPs are color-coded by prediction method (MB or ML) and by the cumulative FDR where a prediction was generated with the ML method. Highest quality SNPs are coded in green, high quality SNPs in yellow, and SNPs predicted by only one of the two methods in orange (ML) or red (MB).
A data set consisting of 13,470 PRPs described by Clark et al. (2007). PRPs, or polymorphic region predictions, delineate regions of high polymorphism or deletion that extend over approximately 300 bp or more. Experimental characterization of a fraction of PRPs indicated that these predictions correspond largely to simple deletions or tracts of high indel content, although some result simply from extended clusters of SNPs (Clark et al. 2007). PRPs consist of core and boundary regions, both of which are displayed in the Gbrowse implementation in POLYMORPH [see Clark et al. (2007) supplemental section 8 for PRP details]. In the Gbrowse viewer, core regions are displayed as red rectangles, with extending black lines denoting boundary regions where polymorphism returns to levels more typical of the genome average.
Pseudochromosome sequences were constructed from the array data for each accession [see Clark et al. (2007) main text and supplemental section 9]. These sequences consist of reference base calls [those identical to the Col-0 reference genome sequence; The Arabidopsis Genome Initiative (2000)] and SNPs from the MBML2 data set. For display in POLYMORPH, the '~' symbol represents positions for which no base call could be made. Stretches of '~' symbols result in part from poor intrinsic hybridization properties, but also from experimental noise and polymorphism (e.g., where a region is deleted or highly polymorphic in a given accession). In general, the experimental Col-0 pseudochromosome sequence can be used to assess the hybridization properties for a given region. Repetitive bases, which were largely excluded in making SNP predictions, are indicated by the '=' symbol (see section 2.1.4).
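To illustrate how the '~' and '=' symbols can be used when reading a pseudochromosome sequence, the following sketch tallies called, uncalled, and repetitive positions in a sequence string. The function name and the example string are hypothetical; only the two symbols come from the description above.

```python
from collections import Counter

NO_CALL = "~"      # position for which no base call could be made
REPETITIVE = "="   # repetitive base, largely excluded from SNP prediction


def summarize_pseudochromosome(seq: str) -> dict:
    """Count called, uncalled, and repetitive positions in a pseudochromosome string."""
    counts = Counter(seq)
    total = len(seq)
    no_call = counts[NO_CALL]
    repetitive = counts[REPETITIVE]
    called = total - no_call - repetitive
    return {
        "length": total,
        "called": called,
        "no_call": no_call,
        "repetitive": repetitive,
        "fraction_called": called / total if total else 0.0,
    }


# Toy sequence, not real data.
summary = summarize_pseudochromosome("ACGT~~AC==GT")
print(summary)
```

A low fraction of called bases in the Col-0 pseudochromosome for a region would suggest poor hybridization properties there, in line with the guidance above.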
A repetitive position (or base) was defined as one for which an oligonucleotide of length 25 from the reference genome sequence and centered on the given position matched elsewhere in the reference genome with high complementarity. Exact, short, and inexact positions (collectively called "repetitive 25mers") were defined according to the degree of complementarity to sequences at multiple sites in the genome [see Clark et al. (2007) supplemental Section 4 for details].
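As a simplified illustration of the definition above, the sketch below marks positions whose centered k-mer occurs more than once in a sequence. It covers only exact matches on the forward strand; the short and inexact classes, and matching by complementarity, are described in Clark et al. (2007) and are not reproduced here. The function name is hypothetical.

```python
from collections import Counter


def exact_repetitive_positions(genome: str, k: int = 25) -> set:
    """Return 0-based positions whose centered k-mer occurs more than once.

    Simplifying assumptions: exact matches only, forward strand only, odd k.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that the k-mer can be centered")
    half = k // 2
    # Count every k-mer in the sequence.
    kmer_counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    repetitive = set()
    for center in range(half, len(genome) - half):
        kmer = genome[center - half:center + half + 1]
        if kmer_counts[kmer] > 1:
            repetitive.add(center)
    return repetitive


# Toy example with k=3: "ACG" and "CGT" each occur twice.
print(sorted(exact_repetitive_positions("ACGTTACGTA", k=3)))
```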
All tools in POLYMORPH can be selected from drop-down menus at the top of each page. Tools related to polymorphism are located under the "Polymorphism Tools" menu. The "Assay Development" menu provides tools for primer design, CAPS marker searches, and a tool to generate the assay development format (ADF; Warthmann, Fitz, Weigel, 2007) in wide use for designing high-throughput genotyping assays. Information about repetitive sequences can be accessed via the "Repetitive Features" menu, and a Gbrowse viewer (Stein et al. 2002) displaying various aspects of the reference and resequenced genomes is available under "Viewers". Applications are interconnected where possible: each query result includes links that allow subsequent queries with other tools. In particular, result pages have links to the Gbrowse viewer displaying the genomic interval and feature.
Data are organized into Projects; when a Project is selected, all parameters are adapted to the specifics of the project. Projects used within POLYMORPH can consist of different data sets from different species. The project database contains the reference annotation for the model species in question, as well as polymorphism data gathered from a given resequencing study. Projects can provide several subsets of data having, for example, differing confidences of polymorphism prediction.
Upon selection of one or several accessions and a gene name (e.g., an AGI locus code for A. thaliana, such as "At1g01090"), a search will return the coding sequence as well as the amino-acid sequence of the gene. Known nucleotide and amino acid changes for the selected project and dataset are given in the search results as a sequence string and in a table format specifically for polymorphic positions (amino acid changes are sorted by accession and position in a gene). For constructing sequence strings, where positions are not polymorphic in a given data set, bases are filled in from the known reference sequence. Therefore, additional changes may be present that were undetectable with a given method, and that are not displayed. For each SNP, hypertext links to the Gbrowse viewer and the assay development formatter (ADF) are provided. The "Pseudochromosomes" tool provides additional data for interpreting this output.
For "SNPs by Region", the user specifies a region along a chromosome (Chromosome, Start Pos - End Pos) for which SNPs are identified that distinguish one or more of the selected accessions. All SNPs are listed along with a type annotation such as "coding" or "intergenic" according to the currently supported annotation. A pseudo-multiple alignment with all SNPs highlighted in the reference sequence is displayed where the requested interval is less than 30,000 bp. The "Pseudochromosomes" tool provides additional data for interpreting this output.
This is a customized implementation of MSQT/SBE (Warthmann, Fitz, Weigel, 2007). SNPs can be extracted according to whether they distinguish distinct sets of user-specified accessions within a project. The result is a list of SNPs that distinguish all selected accessions in "1st accession set" from all accessions selected in "2nd accession set". For each SNP, hypertext links are provided to (a) the POLYMORPH/Gbrowse viewer, to quickly inspect a SNP in its genomic context, and (b) the Assay Development Format application (POLYMORPH/ADF). Clicking the ADF link invokes the application for the given SNP and the two selected groups of accessions. By default, ADF will use 250 bp of sequence to either side of the SNP.
If one or more accessions in the "3rd accessions set" were selected before clicking "compute SNPs", they are passed to ADF as an additional parameter, and all sequence changes in these accessions are also annotated in the resulting ADF string. This selection does not influence which distinguishing SNPs are returned.
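The selection of SNPs that distinguish two accession sets can be sketched as follows. This is not the MSQT/SBE implementation: the data layout, the function name, and the rule that a site is skipped when any selected accession lacks a call are assumptions made for illustration. The accession names are used only as labels, and the genotypes are toy data.

```python
def distinguishing_snps(genotypes, set1, set2):
    """Return sites at which no allele is shared between the two accession sets.

    genotypes maps (chromosome, position) -> {accession: base}.
    Assumption: sites with a missing call in any selected accession are skipped.
    """
    hits = []
    for site, calls in sorted(genotypes.items()):
        alleles1 = {calls.get(acc) for acc in set1}
        alleles2 = {calls.get(acc) for acc in set2}
        if None in alleles1 or None in alleles2:
            continue  # incomplete data for the selected accessions
        if alleles1.isdisjoint(alleles2):
            hits.append(site)
    return hits


# Toy genotypes, not real data.
toy = {
    (1, 1000): {"Bur-0": "A", "Cvi-0": "A", "Ler-1": "G", "Col-0": "G"},
    (1, 2000): {"Bur-0": "A", "Cvi-0": "C", "Ler-1": "C", "Col-0": "C"},
    (1, 3000): {"Bur-0": "T", "Cvi-0": "T", "Ler-1": "C"},  # Col-0 not called
}
result = distinguishing_snps(toy, ["Bur-0", "Cvi-0"], ["Ler-1", "Col-0"])
print(result)
```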
This application enables SNP selection by quality and allele frequency criteria. The user specifies a sequence range along a chromosome and lower and upper limits for allele frequencies. The user can specify a threshold for missing data by setting "min accessions called". E.g., if "min accessions called" is set to 18, SAF will only consider SNPs at positions where genotype data is available for at least 18 accessions.
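The filtering criteria described above can be sketched as a single predicate. The function name and data layout are hypothetical, and the allele frequency is assumed here to be the frequency of the non-reference allele among called accessions; the tool itself may define it differently.

```python
def passes_saf_filter(calls, ref_base, min_freq, max_freq, min_called):
    """Decide whether one SNP position passes the frequency and missing-data filters.

    calls is a list with one entry per accession: a base, or None if not called.
    Assumption: allele frequency = non-reference calls / called accessions.
    """
    called = [base for base in calls if base is not None]
    if len(called) < min_called:
        return False  # too much missing data at this position
    freq = sum(base != ref_base for base in called) / len(called)
    return min_freq <= freq <= max_freq


# Toy position: 15 reference calls, 4 non-reference calls, 1 accession uncalled.
toy_calls = ["A"] * 15 + ["G"] * 4 + [None]
print(passes_saf_filter(toy_calls, "A", 0.1, 0.5, min_called=18))
```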
SAF returns a list of SNPs, including surrounding sequences in ADF format, the length of which can be specified by setting "outer window size" (default is 500 bp to either side of the SNP). In this ADF output, information for ALL SNPs in the MBML10 data set is used. In some cases, many SNPs flank a target SNP in one or more accessions, which can make primer design for genotyping applications difficult. To evaluate SNPs for assay design, the number of SNPs in both inner and outer windows is provided as output, and the sizes (length in bp) of the inner and outer windows can be preset (input fields "inner window size" and "outer window size"). As the quality and utility of SNPs identified from resequencing data vary with coverage and repetitive content, the numbers of repetitive and callable bases within the specified outer window are also returned in the output. [USER NOTE: In general, assay development is recommended for SNPs for which (1) the percentage of flanking bases called is high, and (2) the percentage of flanking repetitive bases is low.]
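Counting flanking SNPs in the inner and outer windows can be sketched as below. The function name is hypothetical, window sizes are taken as bp to either side of the target SNP, and the outer count is assumed to include the SNPs of the inner window; the actual output of SAF may be organized differently.

```python
def flanking_snp_counts(target, snp_positions, inner, outer):
    """Count other SNPs within the inner and outer windows around a target SNP.

    inner and outer are window sizes in bp to either side of the target.
    Assumption: the outer count includes SNPs that also fall in the inner window.
    """
    distances = [abs(pos - target) for pos in snp_positions if pos != target]
    inner_count = sum(1 for d in distances if d <= inner)
    outer_count = sum(1 for d in distances if d <= outer)
    return inner_count, outer_count


# Toy positions; the target itself (1000) is not counted.
counts = flanking_snp_counts(1000, [990, 1000, 1020, 1300, 1600], inner=25, outer=500)
print(counts)
```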
Polymorphic region predictions (PRPs) are a type of prediction identifying regions of high polymorphism or deletion for which specific polymorphism data cannot be recovered with a given resequencing technology [see Clark et al. (2007) main text and supplemental section 8 for a discussion]. This prediction type is relevant for hybridization-based data sets, but in future may also be applicable for resequencing data generated by other methods.
PRPs can be searched by accession based on a chromosome interval (the start and end coordinates must be specified). PRPs are returned in a table format with an entry for each accession and the PRP coordinates. A Gbrowse link is provided for each result. When the option "Design primer" is set to "Yes", a link will be provided for generating primers to facilitate experimental investigation of a given prediction (see Primer Designer). Primers are designed to flank the boundary regions of PRPs within 250 bp upstream and downstream of the polymorphic region.
Identifies PRPs (see above) that distinguish sets of accessions in a way analogous to that for SNPs (see "SNPs between accessions"). PRPs that occur in different accessions but within overlapping genomic regions are defined as polymorphic blocks. Accessions having a PRP in a specific polymorphic block are then distinct from the set of accessions lacking a PRP in that block. The "PRP Between Accessions" search will return all polymorphic blocks of the selected region or gene that distinguish the selected sets of accessions. The two accession sets must be distinct. The minimum length of the polymorphic blocks can be specified.
Pseudochromosome sequences can be retrieved by gene, or by chromosome region where the region is no longer than 30 kb. An alignment of pseudochromosome sequences for selected accessions is displayed. In each alignment, the top sequence, denoted "Ref", is the reference genome sequence for a given project. Under parameters, the user can specify whether to display only bases identical to the Ref sequence, or whether to also display SNPs (the SNP data set can be specified). A legend is provided that defines symbols used for constructing pseudochromosome sequences from given resequencing data types.
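The idea of polymorphic blocks can be sketched by merging overlapping PRP intervals and recording which accessions contribute to each block. The function names, the data layout, and the exact rule for "distinguishing" (one set entirely inside the block, the other entirely absent) are assumptions for illustration, not the tool's implementation.

```python
def polymorphic_blocks(prps):
    """Merge overlapping PRPs from different accessions into blocks.

    prps is a list of (accession, start, end) with inclusive coordinates.
    """
    blocks = []
    for acc, start, end in sorted(prps, key=lambda p: (p[1], p[2])):
        if blocks and start <= blocks[-1]["end"]:
            # Overlaps the current block: extend it and record the accession.
            blocks[-1]["end"] = max(blocks[-1]["end"], end)
            blocks[-1]["accessions"].add(acc)
        else:
            blocks.append({"start": start, "end": end, "accessions": {acc}})
    return blocks


def distinguishing_blocks(blocks, set1, set2, min_length=0):
    """Keep blocks in which one accession set has a PRP and the other does not."""
    s1, s2 = set(set1), set(set2)
    kept = []
    for block in blocks:
        if block["end"] - block["start"] + 1 < min_length:
            continue
        accs = block["accessions"]
        if (s1 <= accs and not s2 & accs) or (s2 <= accs and not s1 & accs):
            kept.append(block)
    return kept


# Toy PRPs, not real data.
toy_prps = [("Bur-0", 100, 500), ("Cvi-0", 400, 900), ("Ler-1", 2000, 2400)]
blocks = polymorphic_blocks(toy_prps)
long_blocks = distinguishing_blocks(blocks, ["Bur-0", "Cvi-0"], ["Ler-1"], min_length=500)
print(long_blocks)
```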
The ADF format was adopted from MSQT (Warthmann, Fitz, Weigel, 2007). It facilitates the design of primers and/or probes for genotyping assays, and is a widely accepted format for high throughput genotyping assay development. It also serves as input for other POLYMORPH applications (e.g., "Primer designer" and "CAPS marker designer"), where it can be directly pasted or is handed over as a parameter. ADF output is created by the Assay Development Formatter, for which 2 groups of accessions and the location (chromosome and position) of a SNP must be specified. Optionally, a third group of accessions can be specified (see tool "SNPs between accessions" for a discussion of this feature). The length of the surrounding sequence returned by the tool can be adjusted (e.g., if "ADF extension" is set to the default value of 250 bp, a string of 501 bp is returned for assay design).
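The relationship between the "ADF extension" setting and the length of the returned string is simple arithmetic: the SNP position plus the extension on each side. The sketch below uses a hypothetical function name and assumes 1-based inclusive coordinates; the clipping at chromosome ends is likewise an assumption.

```python
def adf_window(snp_pos, extension=250, chrom_length=None):
    """Return (start, end) of the sequence window around a SNP, 1-based inclusive.

    Assumption: the window is clipped at the chromosome ends.
    """
    start = max(1, snp_pos - extension)
    end = snp_pos + extension
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


start, end = adf_window(10_000)   # default extension of 250 bp
print(start, end, end - start + 1)
```

With the default extension the window spans 250 + 1 + 250 = 501 positions, matching the string length stated above.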
ADF is available in several variants:
Primers are designed from a given sequence using Primer3 (Rozen and Skaletsky 2000). Primers can be designed using as input a multi-FASTA alignment or single ADF sequence pasted into the text field "User specified sequence". The ADF sequence must adhere to the rules described in the paragraph above.
In addition, primers can be designed against a specified gene or chromosome region in one or more accessions using the "Search Locus" feature. The sequence of the specified gene/region will be loaded from the POLYMORPH database, and polymorphic sites will be masked prior to primer design. Some masking properties can be adjusted by selecting the desired options under "Exclude polymorphic regions". When possible, perfect match primers will be returned for use with all selected accessions. If the option "Mask: 'SNPs, PRPs, repetitive bases'" is used, primers will be designed on the pseudochromosome sequence of the selected region. This option only functions on a per accession basis. If "Mask:'SNPs'" or "Mask:'SNPs and PRPs'" is selected, the respective polymorphic sites from each selected accession will be masked in the reference sequence. These options function with any number of accessions and allow the design of primers that work in all selected accessions. Several POLYMORPH applications link directly to the "Primer designer" with the appropriate input parameters.
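Masking SNP positions from several accessions in the reference sequence before primer design can be sketched as follows. The function name, the data layout, and the use of "N" as the mask character are assumptions; the masking performed by POLYMORPH may differ in detail.

```python
def mask_polymorphic_sites(ref_seq, region_start, snps_by_accession, accessions, mask_char="N"):
    """Mask SNP positions of all selected accessions in a reference region.

    ref_seq covers the region beginning at region_start (1-based chromosome coordinate).
    snps_by_accession maps accession -> iterable of SNP positions on the chromosome.
    """
    masked = list(ref_seq)
    for acc in accessions:
        for pos in snps_by_accession.get(acc, ()):
            idx = pos - region_start
            if 0 <= idx < len(masked):  # ignore SNPs outside the region
                masked[idx] = mask_char
    return "".join(masked)


# Toy data, not real SNP positions.
toy_snps = {"Bur-0": [103], "Cvi-0": [108, 200]}
masked = mask_polymorphic_sites("ACGTACGTAC", 101, toy_snps, ["Bur-0", "Cvi-0"])
print(masked)
```

Primers placed entirely in unmasked sequence would then be expected to match every selected accession at all known SNP sites, subject to the limits on SNP detection noted earlier.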
Independent of the input method, some primer design "Parameters" can be adjusted. A detailed description of Primer3 parameters can be found at http://frodo.wi.mit.edu. Users can copy an ADF sequence generated by POLYMORPH into any of the many Primer3 websites when more advanced features of the Primer3 program are required for successful primer design.
Improvements for primer designer currently in development: Uniqueness of primer pairs within the genome.
This tool locates SNPs in a gene or chromosome region where a restriction site is changed in at least one of a selected set of accessions. The tool can work with any combination of restriction enzymes. It returns a table listing SNP positions, the relevant restriction enzyme, and lists of accessions that have and do not have the target restriction site. Accessions with an ambiguous call (N) at the SNP position are considered to have the reference base, but please note that many ambiguous calls are likely to be undetected polymorphisms. If the "High quality CAPS marker" box is checked, "CAPS Search" will only consider accessions with a non-ambiguous call (either SNP or reference) at the given position.
To retrieve CAPS markers that distinguish two sets of accessions, switch the CAPS mode from "Table" to "Distinct CAPS". The two distinct sets of accessions must be specified (the appropriate SNP positions are automatically determined via POLYMORPH/SBE). This mode automatically uses "High quality CAPS marker", discarding all sites having an ambiguous call in any of the selected accessions. The third CAPS mode, "Gel", returns the putative fragment lengths produced by a restriction enzyme from a given chromosome region (this mode is still in development).
CAPS marker designer uses the program SNP2CAPS.pl (Thiel et al. 2004), which is available for download, and a version of SNP2CAPS.pl modified by our own group (FindCAPS.pl), which works slightly faster on short input sequences with a single SNP position.
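Whether a SNP changes a restriction site can be checked by comparing the reference and alternative sequences around the SNP. The sketch below is not SNP2CAPS.pl or FindCAPS.pl; the function names are hypothetical, and it searches the forward strand only, which is sufficient for palindromic recognition sites such as EcoRI (GAATTC).

```python
def site_overlaps(seq, site, idx):
    """True if the recognition site occurs in seq overlapping 0-based index idx."""
    lo = max(0, idx - len(site) + 1)
    hi = min(idx, len(seq) - len(site))
    return any(seq[s:s + len(site)] == site for s in range(lo, hi + 1))


def caps_status(ref_seq, idx, alt_base, site):
    """Classify the effect of a SNP at idx on a restriction site.

    Assumption: forward-strand search only (adequate for palindromic sites).
    """
    alt_seq = ref_seq[:idx] + alt_base + ref_seq[idx + 1:]
    ref_cut = site_overlaps(ref_seq, site, idx)
    alt_cut = site_overlaps(alt_seq, site, idx)
    if ref_cut and not alt_cut:
        return "site lost"
    if alt_cut and not ref_cut:
        return "site gained"
    return "no change"


ECORI = "GAATTC"
print(caps_status("TTGAATTCAA", 4, "G", ECORI))
print(caps_status("TTGAGTTCAA", 4, "A", ECORI))
```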
Oligonucleotides of length 25 bp (25mers) have been commonly used in microarray studies to examine gene expression and/or to detect polymorphisms (e.g., with resequencing arrays as in the Array_20 project). Where 25mers can cross-hybridize to multiple genomic locations, biological inference is limited, and cross-hybridization can lead to false prediction of polymorphic features from array data. "Repetitive 25mers" correspond to the repetitive bases defined for the Array_20 project (see section 2.1.4), and the repetitive 25mers in a given region can be queried using the "Repetitive 25mer" tool. In addition to exact repeats, POLYMORPH provides repeats with up to one mismatch (option "kmer set:'Allow Mismatch'") and repeats with up to 2 differences at each end (option "kmer set:'Allow End Gaps'").
Gbrowse is an integral part of POLYMORPH and can be accessed via the menu or through one of the links on any POLYMORPH result page. To facilitate data access in the viewer, the Gbrowse tracks are divided into the categories Annotation, RNA Annotation, Protein Domains, Features, SNP and Array. RNA Annotation combines known non-coding RNAs from TAIR, RFAM and ASRP. All other features from the TAIR7 release and the TIGR5 genome release of A. thaliana are found in the Annotation category. The Protein Domains section provides a track for most member databases of Interpro (Pfam, Prosite, Prints, Prodom, TIGR, Panther, Smart, PirSF, Gene3D, Superfamily). All annotation tracks link to the web portal providing the original annotation data. SNPs from all 19 accessions of the Array_20 project are combined in the SNP track. All other polymorphic features, additional quality data and repetitive oligos (kmers) are in the Features category. Tracks showing polymorphisms or other project-specific data are linked to the respective POLYMORPH search tool. Finally, we provide two tracks with probe sets of A. thaliana Affymetrix ATH1 arrays and Tiling arrays. Affymetrix ATH1 arrays are linked to the AtGenExpress tissue-specific expression analysis at WeigelWorld. All tracks are linked to the web portal providing the original data or to the related POLYMORPH search tool.
Warthmann, N., Fitz, J., and Weigel, D. 2007. MSQT for choosing SNP assays from multiple DNA alignments.
Clark, R.M., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthmann, N., Hu, T.T., Fu, G., Hinds, D.A., Chen, H., Frazer, K.A., Huson, D.H., Scholkopf, B., Nordborg, M., Ratsch, G., Ecker, J.R., and Weigel, D. 2007. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science 317(5836): 338-342.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P.S., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J.D., Sigrist, C.J., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H., and Yeats, C. 2007. New developments in the InterPro database. Nucleic Acids Res 35(Database issue): D224-228.
Rozen, S. and Skaletsky, H. 2000. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132: 365-386.
Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., et al. 2002. The generic genome browser: a building block for a model organism system database. Genome Res 12: 1599-1610.
Thiel, T., Kota, R., Grosse, I., Stein, N., and Graner, A. 2004. SNP2CAPS: a SNP and INDEL analysis tool for CAPS marker development. Nucleic Acids Res 32(1): e5.