Copyright © 1998 T. H. Merrett 308-420A Secondary Storage Algorithms and Data Structures Week 4 Building and Searching Full Tries I Building a full trie b : number of bits in key t : number of levels of trie nodes in each layer of pages h : height of trie in pages b=ht 2^t: max number of trie nodes in a page (2 bits per node, so max bitstring in page has 2^(t-2) bytes) page nodes are bitpairs index to page will contain: T : topcount - number of trie edges entering this level prior to this page B : botcount - number of trie edges leaving this level prior to this page N : node-count - number of nodes (non-00 bitpairs) stored in this page A : disk address (page number) also needed while constructing: Tc: cumulative T - number of trie edges entering this level thru this page Bc: cumulative B - number of trie edges leaving this level thru this page (will become T, B, resp. for next page this level) the keys must be input in ascending lexicographic order e.g., b=8, h=2, t=4, 2^t=16 nodes (4 bytes) per page key 00000011 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 1, 4) 10 10 10 10 showPage(1): (T, B, Tc, Bc, N)= (0, 0, 1, 1, 4) 10 10 01 01 key 00101100 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 2, 5) 10 10 11 10 10 showPage(1): (T, B, Tc, Bc, N)= (0, 0, 2, 2, 8) 10 01 10 01 01 10 01 10 key 10000000 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 3, 8) 11 10 10 11 10 10 10 10 showPage(1): (T, B, Tc, Bc, N)= (0, 0, 3, 3, 12) 10 01 10 10 01 10 01 10 10 01 10 10 key 10000101 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 3, 8) 11 10 10 11 10 10 10 10 showPage(1): (T, B, Tc, Bc, N)= (0, 0, 3, 4, 14) 10 01 10 10 01 11 01 10 10 10 01 10 10 01 key 10001000 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 3, 8) 11 10 10 11 10 10 10 10 writePage(1): (T, C, N, A)= (0, 0, 8, 0) 10 01 10 01 01 10 01 10 showPage(1): (T, B, Tc, Bc, N)= (2, 2, 3, 5, 9) 11 11 10 10 10 10 10 01 10 key 10100000 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 4, 9) 11 10 10 11 11 10 10 10 10 showPage(1): (T, B, Tc, Bc, N)= (2, 2, 4, 6, 13) 11 10 11 10 10 10 10 10 10 10 01 10 10 key 10101100 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 4, 9) 11 10 10 11 11 10 10 10 10 showPage(1): (T, B, Tc, Bc, N)= (2, 2, 4, 7, 16) 11 11 11 10 10 01 10 10 10 10 10 10 01 10 10 10 key 11000000 showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 5, 11) 11 10 11 11 11 10 10 10 10 10 10 writePage(1): (T, B, N, A)= (2, 2, 16, 1) 11 11 11 10 10 01 10 10 10 10 10 10 01 10 10 10 showPage(1): (T, B, Tc, Bc, N)= (4, 7, 5, 8, 4) 10 10 10 10 no more keys writePage(0): (T, B, N, A)= (0, 0, 11, 2) 11 10 11 11 11 10 10 10 10 10 10 writePage(1): (T, B, N, A)= (4, 7, 4, 3) 10 10 10 10 So the final trie pages are 10 01 11 11 11 10 10 01 11 10 10 01 10 11 10 01 10 10 10 10 10 10 11 11 10 10 01 10 10 01 10 10 10 10 10 10 10 10 10 And the final page indexes are T 0 1 0 2 4 5 B 0 5 0 2 7 8 N 11 -1 8 16 4 -1 A 2 -1 0 1 3 -1 (One page (2) for root level, three (0, 1, 3) for second level; -1s signal end of level, and the T and B for these columns are the cumulative values for that level: total entries= cumulative B for last level (8); total nodes=11+8+16+4 - compare 8 bits * 8 keys: 39*2 bits vs. 64: already 2 bits per node is much less than double the original number of bits, and a not much larger set of keys will actually be compressed.) To visualize, rearrange pages by level, and interleave with T, B, N T 0 11 T 1 B 0 10 11 B 5 N 11 11 11 10 10 10 10 10 10 T 0 10 01 T 2 11 11 T 4 10 T 5 B 0 10 01 B 2 11 10 10 01 B 7 10 B 8 N 8 01 10 N 16 10 10 10 10 10 N 4 10 01 10 10 01 10 10 10 10 (we don't need the page addresses to search by hand) II Searching the full trie Searching this trie for 10001000: we use "x" to show which "1" is hit 130 line T 0 1x T 1 B 0 10 x1 B 5 N 11 11 x1 10 10 10 x0 10 10 T 0 10 01 T 2 1x 11 T 4 10 T 5 B 0 10 01 B 2 11 x0 10 01 B 7 10 B 8 N 8 01 10 N 16 10 10 x0 10 10 N 4 10 01 10 10 01 x0 10 10 10 Gets to bottom of trie, so success. Now, how does the computer do it? Let's look at the search process on a single page. The bitpairs are just found in a sequence with no way of telling which is the end of a level, e.g., the root page 11 10 11 11 11 10 10 10 10 10 10 So we maintain a counter, skip, which tells us how many pairs are on each level, and we can use this to increment a "cursor", last, which gives the last node on the present level. Here they are for the root page skip= 1; last= 0; - Each time we see "11", we increment skip. - When we come to last, we set the next value for last by adding skip to it. Try it for the root page (the bitpairs are counted from position pos=0) pos bitpair skip last 1 0 0 11 2 2 1 10 2 11 3 5 3 11 4 4 11 5 5 10 10 6 10 7 10 8 10 9 10 10 10 15 So the last bitpair in each level is at positions 0, 2, 5, 10, respectively. (We don't need the 15 because we are at the end of the page.) But we are not just reading through the trie: we are searching. In each level, we need a bitpair to compare with the current input bit. Call the position of this bitpair next. We need to count the "1" bits in each bitpair up to the bitpair at next, in order to find next for the next level. We use a counter, srch, to do this. - Initially, for the root page, next= 0; srch= 0; - Before next, increment srch for every "1" bit; - at next, increment srch for the "1" bits including the one we marked "x", above, but not past it. - At last, reset next= last + srch, and reset srch= 0. Meanwhile, maintain skip and last as above. Here it is for the first four bits of our example search key, 10001.. pos bitpair input srch next skip last 1 0 0 1 0 0 11 0 2,0 2 2 2 1 10 1 2 11 0 2,0 4 3 5 3 11 2 4 4 11 0 3,0 8 5 5 10 10 6 10 1 7 10 2 8 10 1 3 13 9 10 10 10 15 So the bitpair to be checked in each level against the input bit is at positions 0, 2, 4, 8, respectively. (If the next level of nodes were also stored on this page, we can see that the next bitpair to be checked would be at position 13, and the end of the level would be at position 15.) Note that input bit 0 matches bitpairs 10 and 11, and input bit 1 matches bitpairs 01 and 11. Now we must consider changing pages. We might have to sequence through all the bitpairs of the trie, but T and B save us from looking at any but the relevant pages on each level. We need to use the (T, B, N, A) index to do two things: 1) find the address, A, of the next page, and 2) recalculate next and last so they work correctly for the page instead of for the whole trie. Continuing the example, we can see how to do each of these. 1) srch, counting the "1" bits, also counts the descendent subtries from the current level. In this case, it has value 3, which means that we need to go to the third subtrie from the beginning of the next level. T for that level tells us which page this is on: we scan the T index until an entry >= 3, then pick the page before. (In general, when descending from a page other than a leftmost page of its level, such as the root page, we scan until >=B'+3, where B' is the B value for the page we are just leaving.) 2) The value we need for next is B' + srch - T - 1, where T is the T-value for the descendent page we have now selected, and srch gets reset to 0. In the example, we have next= 0 + 3 - 2 - 1= 0; srch= 0. For last, we reset skip to T" - T, where T" is the T-value for the page following the selected page on the same level (or the end-of-level value if the selected page is the last on the level). Then last= skip - 1. For the example, skip= 4 - 2= 2; last= 1. The final consideration is, if we have found the key in the trie, how do we find the corresponding record in the data? The data was loaded in the same order as the keys, and so the remainder of each record occupies a consecutive, fixed-length position in the data file. A calculation almost the same as the one that found next for the next level of trie pages will give the position of the data record. In the example, our succesful search concluded at the third bit of the last level of nodes on the second page of the last level of pages. Since B for this page is 2, the record we want is the fifth, which is the one at position 4 of the data file, if we count from 0. The position is thus B + srch - 1 The overall logic for the search algorithm should be three nested levels of loops, the outer loop over all page levels, the middle loop over all node levels within the page, and the inner loop a combination of four steps: loop until next; code for next; loop until last; code for last. for (int pgLev=0; pgLev<h; pgLev++) if (pgLev > 0) { find and read next page; reset skip, last, next, srch } for (int ndLev= 0; ndLev<t; ndLev++) while (...isBefore(next) ) { increment skip, srch } fail if no match, or succeed if last input bit, or get next input bit and increment skip, srch while (...isBefore(last) ) { increment skip } if (ndLev < t - 1) { reset next, srch, last }