TUTORIAL: DOT-MATRIX SIMILARITY COMPARISONS

Fristensky, B. (1986) Improving the efficiency of dot-matrix similarity searches through use of an oligomer table. Nucleic Acids Research 14:597-610

DXHOM documentation: $doc/fsap/hom.asc

1. Copy sample files to $home/tutorial/dotmatrix

{goad:/home/plants/frist}cd
{goad:/home/plants/frist}cd tutorials
{goad:/home/plants/frist/tutorial}mkdir dotmatrix
{goad:/home/plants/frist/tutorial}cp $birch/tutorials/bioLegato/dotmatrix/*.gen dotmatrix
{goad:/home/plants/frist/tutorial}cd dotmatrix
{goad:/home/plants/frist/tutorial/dotmatrix}ls -l

total 58
-rw-------   1 frist    drr         5404 Nov  6 17:22 ARBLKSP.gen
-rw-------   1 frist    drr         4494 Nov  6 17:21 GMCAB2.gen
-rw-------   1 frist    drr         4238 Nov  6 17:21 GMCAB3.gen
-rw-------   1 frist    drr         4338 Nov  6 17:29 KPLACBG.gen
-rw-------   1 frist    drr         3674 Nov  6 17:21 PEACAB15.gen
-rw-------   1 frist    drr         3757 Nov  6 17:22 WHTCAB.gen

Read in the GenBank files using File --> Open.

2. A simple comparison between two sequences

Hint: Similarity comparisons require you to select two or more sequences at a time.

Select sequence(s) - either:

hold down the 'CONTROL' key and click on each sequence name
If the names of the sequences are adjacent in the name list, click on he topmost name, hold down the SHIFT key, and then click on the bottom-most name.

Choose a program from one of the menus

Select GMCAB2 and GMCAB3 by dragging the cursor across the two names, and choose Similarity --> DXHOM. The DXHOM menu will appear:

Output from GMCAB2 vs. GMCAB3 using DXHOM defaults.

3. How search parameters affect the output

a. Compression: Zooming in and out - Each row and column in the matrix represents one or more nucleotide/amino acid positions. By default, A compression factor of 10 is used, with 70 charcters per line, meaning that 700 nucleotides can be represented per line or column. In the example above, the X-axis sequence GMCAB2 is 1354 bp long, so DXHOM must break the output up into two pages, one in which GMCAB2 positons 1..700 are comared with all of GMCAB3, and another in which 701..1354 are compared with GMCAB3. The entire comparison can be fit into a single page by changing the compression factor to 20. [Output with "compression factor" (COMPRESS) =20].
b. The signal to noise ratio is controlled by local search window size and similarity cut off value.

The parameter "min. % similarity printed (MINPER)" sets the minimum score for a character to be printed in the matrix. The lower the percent match allowed, the higher the sensitivity, but the greater the background noise due to random chance. [Output with min. % similarity printed=50].

The window size is expressed as the distance from the center of the search window. That is , the search window contains d bases on either side of the center of the k-tuple match. So if the distance is 10, a window of 21 bases (ie. 10 on either side of the central nucleotide) is compared. The wider the search window, the lower the probability of a match by random chance. Thus, lower MINPER values require larger search window sizes. The result from above can be "cleaned up" by doubling the window size. [Output with window Dist. from center of window= 20 ie. window size = 41].

Some reduction in background noise can also be gained by using a k-tuple of 4. This will also speed up the search by a factor of 4, but may be less sensitive.

Finally, the signal to noise ratio can also be improved by simply doing the plot at a lower compression. One way is to use a wider output line (scroll left to right to see entire matrix)

Dist. from center of window: 10
MINPER = 50
COMPRESS = 10
width of output line: (LINEWIDTH) = 130

4. It is often necessary to compare both strands of one sequence with one strand of the other sequence.

To illustrate the importance of comparing BOTH strands, the Bluescript vector, containing a beta-lactamase gene for ampicillin resistance, will be compared with the beta lactamase gene from Klebsiella pneumonia. Since Bluescript is 2958 bp long, we need to use COMPRESS= 30 and LINEWIDTH= 100 to make the entire matrix fit onto one page.

NOTE: If you plan to compare both strands of a reference sequence with one strand of another sequence, in each plot the reference sequence, or its inverse complement, MUST be placed on the X-axis.
DXHOM is designed to preserve the coordinate system of the reference sequence, regardless of which strand is printed. That is, if you do the first search with the original strand Bluescript as the reference sequence, the coordinates of Bluescript would be written left to right on the X-axis, 1 to 2958. It is INCORRECT to simply generate the inverse complement of Bluescript and repeat the search, numbering 1.. 2958. The numbering at the top should be 2958 downto 1. In that way, a given coordinate always refers to the same part of the sequence, regardless of the strand.

Although there's no straightforward way to get biolegato to tell DXHOM which sequence goes onto the X-axis, when multiple sequences are selected, they are sent to programs in the order they appear in the biolegato sequence list, top to bottom. Thus, if you want ARBLKSP to be on the X-axis, it must appear above KPLACBG in the biolegato window.

Compare the original strand of Bluescript (ARBLKSP) with the Klebsiella gene (KPLACBG).

Dist. from center of window: 10
min % similarity: 60
COMPRESS=30
LINEWIDTH=100

As seen in the output, there are no significant diagonals, indicating that there are no
significant similarities between these two strands. Now, let's generate the opposite
strand of Bluescript for comparison.

Since many people have dificulty conceptualizing the differences between a strand and its inverse complement, we'll take a moment to review how these concepts are defined.

As shown at right, example is written 5' to 3' left to right. Select example, and choose DNA/RNA --> blrevcomp. For any sequence, blrevcomp can create the complementary strand, or flip the strand, or create the inverse-complement.
Let's start by creating the complementary strand. Choose complement only.
Since example is written 5' to 3' left to right, it should be obvious that example-comp as displayed in the figure, must be written 3' to 5', left to right. (Note that blrevcomp automatically adds a suffix to the sequence name to distinguish between the original strand and its diferent transformations.) Next, select example-comp and run blrevcomp again, this time choosing flip (reverse only).
Now, example-comp-flip is the same strand as example-comp, but written 5' to 3' left to right.
blrevcomp can achieve the same result in one step if you select example and choose reverse-complement in the blrevcomp menu. As we'd predict, example-opp is identical to example-comp-flip.

a. Create the inverse complement of ARBLKSP.

Select ARBLKSP and choose DNARNA --> blrevcomp.

The inverse complement, ARBLKSP-opp, is inserted at the bottom of the biolegato window.

When biolegato runs DXHOM, it places top-most sequence on the X axis and the lower sequence on the Y axis. DXHOM is written in such a way that it can recalculate the coordinates of the X-axis sequence if that sequence is an inverse complement. For that reason, we need to move ARBLKSP-opp up in the biolegato window.

Select ARBLKSP-opp and choose Edit --> Cut. To insert it below ARBLKSP, click on 'ARBLKSP' and choose Edit --> Paste.

b. Compare the inverse-complement of ARBLKSP with KPNLACBG.

Select ARLKSP-opp and KPNLACBG. Choose DNA/RNA --> DXHOM.

Click on the opposite button to tell DXHOM that the X-axis strand is the opposite strand. That will tell DXHOM to write the sequence coordinates in descending order. This also requires that we start (STARTX) at the end of the sequence and finish (FINISHX) at the begining of the sequence. Drag the STARTX slider to the right, and the FINISHX slider to the left.

If you don't do this step, the sequence position numbers in the output will be incorrect!

As illustrated in the map of Bluescript shown above, the DXHOM output verifies that the opposite strand of Bluescript has a beta-lactamase gene, going roughly from 2680 downto 1960. [DXHOM output using opposite strand of ARBLKSP].