Universal Microarrays: An Algorithmic Approach

Universal Microarrays: An Algorithmic Approach Universal Microarrays: An Algorithmic Approach

Tag set design algorithms, thermodynamic model, Affymetrix GenFlex, validation and verification

May 9, 2002

Prepared by Kaleigh Smith

1 Introduction to Universal Microarrays

Universal DNA microarrays are organism-independent tools with the ability to recognize specific target genomic sequences or nucleic acid reaction products within a large and complex mixture. The microarray itself is a high-density matrix of oligonucleotides each designed to signal the presence or absence of a specific genomic sequence. An oligonucleotide is a short (20 - 25bp) sequence of amino acids, usually DNA though artificial bases are also used. When immobilized to the surface of a microarray, an oligonucleotide is named a probe, anti-tag or tag-probe. The Watson-Crick complement of a probe is a tag.

The basis of microarray technology is the hybridization between Watson-Crick complementary tag and probe pairs into stable duplex structures. Undesirable hybridization behaviours such an cross-hybridization, dimerization and secondary structure formation lead to inaccurate experimental results. Hybridization therefore introduces complications to the design of microarrays. The problem in designing microarrays is to select a maximally large set of oligonucleotide probes to be fixed to the surface of the array while strictly ensuring that each probe signals the presence or absence of its intended target with very high specificity. We will discuss issues with microarray design and several existing algorithms in further sections of this document.

Figure 1: The purpose and general design of a universal microarray. Note that there may be hundreds of thousands of probe locations on a chip and at each location there may be millions of copies of the probe.

Universal microarrays are a more flexible type of microarray than the popular organism-specific DNA microarrays. With organism-specific microarrays, the probes are designed from sections of target genomic sequences (usually genes). These probes are either immobilized on the surface of the array or are placed in wells on the array. DNA or RNA is isolated from a biological sample, processed into target sequence segments and then labeled to render them optically detectable. These targets are washed over the microarray (or chip) and hybridize with their complementary probes.

Universal microarrays differ because their probes are not targeted to an organism-specific genomic sequence or product. Instead, probe-complementary tags are ligated to the target-complementary sequences. These ligated molecules then hybridize with the tagged target sequences in the sample solution. There are then a variety of methods for fluorescently (or otherwise) labeling the tag-containing molecule [2,,,,,21]. The resulting tagged and labeled molecules are brought in contact with the universal microarray and are captured by each tag's complementary probe. The makeup of the sample can then be identified as the researcher knows which tag molecule hybridized to which target sequence.

Figure 2: A typical assay using a universal microarray.

These microarrays are used for SNP detection, genotyping of SNPs, gene expression profiling, and to facilitate custom design hybridization-based assays. Affymentrix writes of their commercial universal microarray GenFlex Tag Array, äny reaction in which a nucleic acid tag can be incorporated into the products may be amenable to analysis with the array"[1]. In each application, the universal microarray is used to determine the identity or abundance of target nucleic acid samples.

*For more details on the manufacture, data processing and quality control of organism-specific and universal microarrays, please see summany document by Fu Chen.*

2 Modeling Universal Microarrays

A universal microarray can be represented by the set of unique probes immobilized on its surface, or equivalently, by the set of unique tags complementary to those probes. An oligonucleotide, and thus a tag and a probe, is commonly represented as a string of letters over the DNA alphabet. It is useful to abstract to these representations when considering the problem of designing a universal microarray. Such abstractions also help to define and measure hybridization behaviours of oligonucleotides.

Definition 1 [Tag] A tag t = < t₁t₂ ... t_n > is an oligonucleotide (sequence of bases) over a DNA or extended DNA alphabet with n between 15 and 25. The left to right ordering of a string corresponds to 5' to 3' orientation of an oligonucleotide. In string notation, let S_dnaⁿ = { A, C, G, T }. A tag t of length n is a string t = t₁ t₂ źt_n Î S_dnaⁿ.

Example: t = ACTTGCAATCGTAATCGC = 5˘-ACTTGCAATCGTAATCGC-3˘.

The reverse t^r of t is the string t_n t_n-1 źt₁.

Definition 2 [Anti-tag] A complement tag, or anti-tag [`t] is the Watson-Crick complement of tag t. Each base of t is replaced by its Watson-Crick complement and the orientation of the derived sequence is reversed. In string notation, the Watson-Crick complement [`t] of t is the string obtained from t^r by interchanging A Ť T and C Ť G. We say that two strings s and t are a Watson-Crick complement pair iff [`s] = t.

Example: If t = 5˘-ACTTGCAATCGTAATCGC-3˘ then [`t] = 5˘-GCGATTACGATTGCAAGT-3˘

2.1 Tag Behaviour

Perfect Hybridization A Watson-Crick complement pair of tags s and t exhibit perfect hybridization to form a stable duplex. t = ACTTGCAATCGTAATCGC, [`t] = GCGATTACGATTGCAAGT hybridize as follows:

5' - A-C-T-T-G-C-A-A-T-C-G-T-A-A-T-C-G-C - 3'
     | | | | | | | | | | | | | | | | | |
3' - T-G-A-A-C-G-T-T-A-G-C-A-T-T-A-G-C-G - 5'

Self-Hybridization A tag t may self-hybridize to form a secondary structure. As we model short DNA sequences, we will consider only the most simple folded structure, the hairpin loop. In string notation, a tag t self-hybridizes to form a perfect hairpin structure with a loop of size k and a stem of length l iff t_top = [`(t_bottom)] where t_top = t₁ t₂ źt_l and t_bottom = t_s+k+1 t_s+k+2 źt_n.An example, t = AAAAAAAAGGGTTTTTTTT has the potential to form a stable hairpin structure:

  
5' - A-A-A-A-A-A-A-A-G
     | | | | | | | |   G
3' - T-T-T-T-T-T-T-T-G

Self-Dimerization A tag t perfectly self-dimerizes, that is two copies of t hybridize perfectly iff t^r = [`t]. For instance, if two copies of t = TTTCCCATGGGAAA are present in a solution, they will hybridize to form a stable duplex:

  
5' - T-T-T-C-C-C-A-T-G-G-G-A-A-A - 3'
     | | | | | | | | | | | | | |
3' - A-A-A-G-G-G-T-A-C-C-C-T-T-T - 5'

Imperfect Hybridization Two tags s and t hybridize imperfectly iff s ť t. For instance, the edit distance between s and t may be very small, or s and t may contain a long common substring.

Cross-Hybridization Tags t and s exhibit cross-hybridization if t ť s, so that t hybridizes imperfectly to [`s] and s hybridizes imperfectly to [`t]. Consider s = TTACCGTTAAGTAC, t = TTACCCTTAAGTAC and [`t] = GTACTTAAGGGTAA. s will hybridize imperfectly to [`t] (and t to [`s]):

  
5' - T-T-A-C-C-G-T-T-A-A-G-T-A-C - 3'
     | | | | |   | | | | | | | |
3' - A-A-T-G-G-G-A-A-T-T-C-A-T-G - 5'

Dimerization Tags t and s exhibit dimerization if t ť [`s], so that t and s hybridize imperfectly. Consider s = TTACCGTTAAGTAC and t = GTACTTTAACGATAA:

  
5' - T-T-A-C-C-G-T-T-A-A-G-T-A-C - 3'
     | | |   | | | | | |   | | |
3' - A-A-T-A-G-C-A-A-T-T-T-A-T-G - 5'

2.2 Tag Sets

For a set of tags T = { t₁, ź, t_m }, we use [`T] to denote the set of anti-tags obtained from T: [`T] = { [`(t_i)] : 1 Ł i Ł m }. A specific microarray chip C is represented by an ordered pair C = áT, Ań where T is a set of tags { t₁, ź, t_m }, A is a set of tags { a₁, ź, a_n }, t_i Î S_dnaⁿ, a_i Î S_dnaⁿ and there is a bijection s: T Ž A s.t. [`t] = s(t). That is, every tag t Î T has exactly one Watson-Crick complement s(t) = a Î A, so m = n.

Definition 3 [Good Tag Set] A good tag set is set of tags T such that

no tag t Î T exhibits self-hybridization or self-dimerization;
no tag t Î T exhibits strong hybridization to a sequence occurring in natural DNA that may be present in the sample mixture;
no pair of tags t and s exhibit dimerization;
no pair of tags t and s exhibit cross-hybridization;
all pairs of complementary tag/anti-tag should form stable duplexes at similar temperatures.

Figure 3: A well designed universal microarray should exhibit only the first characteristic. Notice that solid arrows indicate all tags participate in the hybridization. Dotted arrows indicate a fraction of the available tags participate in the hybridization.

2.3 Modeling and Measuring DNA Hybridization

In order to construct a good tag set, we require an accurate method for measuring/predicting the hybridization behaviours of DNA oligonucleotides described in the previous two sections.

A combinatorial model for DNA hybridization behaviour simply counts the number of base pairs that form Watson-Crick pairs and those that mismatch. Such a model is not based on the biochemical properties that govern the hybridizations, but is rather based on the similarity of the two tags in question. Such a model commonly uses measurements of distance between strings to predict the occurrence of imperfect hybridizations (Hamming distance, longest common substring or subsequence).

Several models have been proposed in the literature to analyze DNA hybridization by estimating thermodynamic properties governing the formation of a duplex. Thus, hybridization behaviour corresponds to duplex stability and is popularly measured in terms of melting temperature or free energy. The melting temperature T_M of two tags is the temperature at which half of each strand's nucleotides are in double-helical duplex form and half are in single-stranded form. The most simple model for estimating the T_M of a duplex is the 2-4 model consisting of a simple rule based on the number of hydrogen bonds between a two nucleotides. With the 2-4 model, the T_M of a sequence and its Watson-Crick complement is T_M = 2 (number of A-T base pairs) + 4(number of C-G base pairs) [19].

A thermodynamic nearest-neighbour model of DNA hybridization offers a more specific and biologically based model for short DNA sequences. Several sets of NN base pair parameters have been introduced, and we use the most well known and accurate parameters determined by SantaLucia [14]. Though this model can be applied to predict the T_M of two tags, we focus on its ability to predict the free energy, dG of a duplex. dG has been shown to be more accurate and robust to errors than the dH and dS properties that are used to determine T_M [14].

3 Universal Microarray Design

Several of the following algorithms were described in [3] and are included here to examine the different algorithms that have been created for designing universal microarrays.

3.1 l-Free Code

We have seen that two very similar tags will cross-hybridize to each other's anti-tags. It is very likely that such tags will share a long common substring, though there may be a possibility for cross-hybridization when there are many shorter common substrings and weakly mismatched base pairs. Still, long common substrings are a satisfactory indicator of cross-hybridization. The problem of good tag set design can then be to find the largest possible set of tags while ensuring that no two tags are overly similar. As described in [4], this can be accomplished by constructing the largest possible l-free code. A set of strings is a l-free code iff no two strings in the set have a common substring of length greater than l.

Input: Given integers n, l and an alphabet of bases S.
Output: A maximally large set T of length n strings over S such that no two strings s, t Î T have a common substring of length greater than l.

This problem can be solved by parsing a de Bruijn sequence of order l to construct an optimal l-free code of size 4^l/(n - l+ 1) [18].

Definition 4 [De Bruijn sequence] The shortest circular sequence of length l^|S| such that every string of length l on the alphabet S of size |S| occurs as a contiguous subrange of the sequence described by S. For example, a de Bruijn sequence of order 2 on the alphabet {a, b, c} is given by {aacbbccab}. [22]

This approach can be extended to create a tag set that avoids dimerization. This is accomplished by further filtering the constructed code T to ensure that for all tags t Î T, there is no tag s Î T such that t contains a substring x and s contains the x's complement [`x]. It is also clear that another filter may be applied to T to remove any tags that contain both substring x and the reverse of [`x], as such a tag would likely exhibit self-dimerization.

3.2 Greedy Approach

A completely combinatorial method, such as the l-free code approach, will not capture the biochemical properties of DNA hybridization. A more precise model of DNA behaviour is required to confidently construct a good tag set. As we have seen above, there are various thermodynamic models to predict the stability of DNA duplexes. The problem of good tag set design is then to construct a maximally large set of tags such that each pair of tags in the set satisfies some thermodynamic threshold against cross-hybridization. Again, it is clear how this formulation of the good tag set problem can be extended with additional stability thresholds (t's) to exclude dimerization, self-dimerization and secondary structure formation.

Input: Given integer n, value t and an alphabet of bases S.
Output: A maximally large set T of length n strings over S such that no two strings s, t Î T form a duplex with thermodynamic stability measurement greater than t.

A greedy algorithm to solve this problem would begin with an empty set of tags T and add a new tag, randomly chosen or chosen according to some heuristic, iff it is not predicted to cross-hybridize, dimerize or form self-structures. A greedy strategy for avoiding cross-hybridization is described in [18]. Unfortunately, a greedy approach does not ensure a maximally large set of tags and additional constraints would be necessary to bias the algorithm towards adding tags that form strong duplexes at similar temperatures.

3.3 GenFlex Chip Design

Affymetrix is the largest commercial manufacturer of universal microarrays. Their tag set design algorithm is described in [8] and is similar to [18]. It proceeds as follows:

Generate an initial set of length 20 oligonucleotide (tag) sequences with similar melting temperatures.
Apply a filter to this set of tags that is "based on Affymetrix expression rules and probe array experience... to optimize and standardize the hybridization characteristics of the set." [2]
Remove sequences that are ïdentical or nearly-identical to each other, to control sequences... or to sequences in the public databases at the time of array design"[2]

Affymetrix's approach to universal microarray technology includes adding additional mismatch probes (one base pair different from an actual probe) without complementary tags to the array to measure the specificity of each 'signalling' or actual probe. Affymetrix also includes control probes that are the complements of each actual probe and mismatch probe to detect possible dimerization. The critical observer will notice that the naive tag design algorithm necessitates the additional mismatch and control probes to identify and subtract unwanted reactions such as cross-hybridization and dimerization.

3.4 Hybrid Combinatorial and Thermodynamic Method

Universal DNA Tag Systems: A Combinatorial Design Scheme [3], described a tag set design algorithm that combines a combinatorial approach with the 2-4 thermodynamic model to predict t_M.

Good Individual Tags
Each tag selected for the final tag set must be guaranteed to hybridize with its complementary anti-tag when the experiment is performed at some given temperature t. This is done by designating temperature thresholds C and H, C < t < H. For each tag t in the set of good tags T, t_M(t, [`t]) ł H.

Good Pairwise Tags
Just as each intended hybridization must occur for the tag set to be good, no unintended hybridizations can be allowed to occur. Using the same temperature thresholds C and H as described above, for each pair of tags t and s in the set of good tags T, t_M(s, [`t]) < C (and t_M(t, [`s]) < C). The authors then make the well-supported assumption that t_M(s, [`t]) < C (and t_M(t, [`s]) < C) occurs iff t and s have a common substring x and t_M(x, [`x]) ł C.

It follows that the good tag set design problem defined in [3] is:
Input: Given threshold values C and H and an alphabet of bases S.
Output: A maximally large set T of tags over S such that:

For each tag t Î T, t_M(t, [`t]) ł H (according to 2-4 rule), and
For each pair of tags t, s Î T and each oligonucleotide x that occurs as a substring of both s and t, t_M(x, [`x]) < C (also according to 2-4 rule).

A string s has weight w(s), w(s) = 1/2(t_M(s)) when t_M is calculated according to the 2-4 rule. So, with respect to the previous definitions, 2c = C and 2h = H.

Definition 5 [c-h code] A tag set T is a valid c-h code iff each tag t Î T has w(t) ł h and and every substring of every tag t in T of weight c or more occurs in no other tags t˘ Î T, t š t˘ and occurs at most once in t. [3]

The authors of [3] define the Combinatorial Tag Design Problem as the problem of constructing a maximally large c-h code for appropriate parameters c and h. The authors then combine de bruijn sequences and novel combinatorial approaches to create an algorithm to generate an optimal c-h code and to provide bounds on the size of T.

4 Microarray Verification/Validation Algorithms

Much work has been done in designing algorithms for the manufacture of universal microarrays. These algorithms have been designed to minimize the cost of manufacturing by minimizing the number of steps (usually measured in masks/steps to be applied) and by reducing the potential for manufacturing error. Any error introduced during the manufacture of DNA microarrays will introduce experimental error and inconsistency. We will briefly examine the algorithms and procedures that have been designed to detect errors in manufacturing of DNA chips.

Many such quality control algorithms proceed by introducing additional probes onto the array. Hubbell and Pevzner in [12] describe an idea to add identical control probes in several locations over the microarray. These control probes must be immobilized to the microarray in different ways, using as many different manufacturing steps as possible. If there is no manufacturing error at all of these steps, all control probes are expected to display near-identical intensity readings. However, if one control probe is incorrectly created, its intensity will be dissimilar. It is this principle that allows control probes to detect manufacturing error in microarrays. It may strengthen the approach to have several different control probes, as this increases the number of steps in the manufacturing process that may be monitored[16].

The problem now becomes the design of such control probes. The authors of [16] define a problem similar to that of universal microarray tag design. In addition to selecting a set of control probes that have good tag set properties among their set, the control probes must also maintain good tag set properties when combined with the set of experimental tags. The problem is further complicated by the need to ensure that some non-zero number of probes participate in every manufacturing step of the microarray. A solution to this problem, and a more complete description of the issues arising from microarray verification can be found in [16].

References

[1]: Affymetrix GeneChip GenFlex Tag Array specification details document
[2]: Affymetrix GeneChip GenFlex Tag Array Technical Note No. 1
[3]: BenDor, A. et al. (2000) Universal Tag-Antitag Systems: A Combinatorial Design Scheme. J. Comput. Biol. 7:503-519.
[4]: Brenner, S. Methods for Sorting Polynucleotides Using Oligonucleotide Tags, US Patent 5,604,097, 1997.
[5]: Cambridge Healthtech Institute Genomic glossaries - microarray categories. http://www.genomicglossaries.com/content/Microarrays_category.ASP
[6]: Eragen Biosciences AEGIS (tm) Technology http://www.eragen.com/diagnostics/na_tech_aegis.html
[7]: Ellis, M. "Spot-On" SNP Genotyping. Genome Research, vol. 10, pp. 895-897, 2000.
[8]: Fan, J., et al. Parallel genotyping of Human SNPs Using Generic High-density Oligonucleotide Tag Arrays. Genome Research, vol. 10. pp. 853-860. 2000.
[9]: Hallett, M., Smith, K. Quality Control for DNA Microarrays - work in progress.
[10]: Richard Karp, [talk by] Some Combinatorial and Algorithmic Problems Related to DNA Microarrays http://www.msri.org/publications/ln/msri/2000/molbio/karp/1/banner/43.html
[11]: Pastinen, T., Raitio, M, et al. A System for Specific, High-throughput Genotyping by Allele-specific Primer Extension on Microarrays. Genome Research, Volume 10, pp. 1031-1042. 2000.
[12]: Hubbell, E, Pevzner, P. Fidelity Probes for DNA Arrays. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pp. 113-117, Heidelberg, Germany, 1999. AAAI Press.
[13]: Raddatz, G, Dehio, M, Meyer, T, Dehio, C. PrimeArray: Genome-scale Primer Design for DNA-microarray Construction. Bioinformatics, Volume 17, Number 1, pp. 98-99, 2001.
[14]: SantaLucia, J.Jr., Allawi, H., Seneviratne, A. (1996) Improved Nearest-Neighbor Parameters for Predicting DNA Duplex Stability American Chemical Society Biochemistry 96, 35, 3555-3562.
[15]: Sakakibara, Y., Suyama, A. Intelligent DNA Chips: Logical Operation of Gene Expression Profiles on DNA Computers. Genome Informatics 11: 33-42, 2000.
[16]: Sengupta, R., Tompa, M. (2000) Quality control in manufacturing oligo arrays: a combinatorial design approach. Technical report, Dept. of computer science and engineering, University of Washington.
[17]: Shi, Leming. DNA Microarray (Genome Chip) - Monitoring the Genome on a Chip. http://www.gene-chips.com/ (c) 1998-2002
[18]: Morris, M., Shoemaker, D, Dvis, R, and Mittmann, M. Methods and Compositions for Selecting Tag Nucleic Acids and Probe Arrays, European Patent Application 97302313, 1997.
[19]: Strachen, T., Read, A. (1996) Human Molecular Genetics. Bios Scientific Publishers
[20]: Stony Brook Computer Science, Advanced Algorithms (Computational Biology) http://www.cs.sunysb.edu/ skiena/648/
[21]: Taton, T., Mirkin, C., Letsinger, R. Scanometric DNA Array Detection with Nanoparticle Probes. Science Volume 289, pp. 1757-1760, September 8, 2000.
[22]: Wolfram Research World of Mathematics, Author: Eric W. Weisstein Š 1999 CRC Press LLC, Š 1999-2002 Wolfram Research, Inc. http://mathworld.wolfram.com/deBruijnSequence.html
[23]: Zuker, M., Mathews, D. H., Turner, D. H. (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide.

File translated from T_EX by T_TH, version 2.25.
On 13 May 2002, 10:37.