Copyright © T. H. Merrett 308-420A Secondary Storage Algorithms and Data Structures Week 6 Assignment 6 The telephone book is represented as two files, a full trie indexing the phone numbers (/course/cs420/data/triePhone), and a data file containing the other fields (/course/cs420/data/dataNameAddr). Write a Java program to search the trie and retrieve the corresponding data for the phone numbers 2883502, 2883508, 3364566, 8445663 and 8453932. (You may treat each of these as a separate request.) Each file is a RandomAccessFile. The data file consists of blocks of records containing the fields Name (40 bytes) and Address (33 bytes). The record and blocksize information (73 bytes and 100 records/block) are stored in the index trie. Your program should use this stored information, and read a full block at a time into a buffer, extracting individual records from the buffer. The trie indexing the phone numbers has two main components, with numeric parameters before and after each. The first component is the set of pages containing the bitpairs that represent the nodes of the trie. The second component is the indexing information for these pages: the top counters, the bottom counters, the numbers of nodes, and the addresses on disk. Here is the layout. byte* type meaning 0 int number of pages in the trie (nTrie=42) 4 int blocksize of trie pages in bytes (BTrie=64) 8 short height of trie in page levels (h=4) 10 short number of node levels per page level (t=8) 12 bp[] nTrie blocks of BTrie bytes each: the trie ("array of bitpairs, bp[]") 2700 int number of entries in page index (indexSize=46) 2704 int[] Tcount[indexSize] 2888 int[] Bcount[indexSize] 3072 int[] nodeCt[indexSize] 3256 int[] Addr[indexSize] 3440 short blocksize of data pages in records (bData=100) 3442 short size of data records in bytes (RData=73) *Do not hard-code these addresses into your program, but calculate them from the stored parameters, and the number 12, which is the space needed by the first four numbers. (To confirm your understanding of the above layout, check that the stored parameters give the byte addresses listed, and answer the bonus questions at the end. Note that there is an extra Tcount and Bcount for the end of each level of trie pages; the corresponding values of nodeCt and Addr are -1, which can be used to detect the end of each level.) Read the four arrays, Tcount, Bcount, nodeCt, and Addr, into main memory at the outset and keep them there (736 bytes). Read the pages of trie nodes into main memory only as needed (64 bytes each), but you may keep one page for each of the four levels in main memory (256 bytes). You will need to support the search logic you have been given for tries with a mechanism to handle arrays of bitpairs, to compare bitpairs with bits in the query, and to convert numeric query keys to bitstrings. Bits are stored eight to the byte, and bitpairs four to the byte: use Java's byte masking operations; but also take the time to build classes for bits, bitpairs, sequences of both, and a cursor mechanism to keep track of where you are. Make the input and the trie page data as close to arrays of bits or bitpairs, respectively, as you can. It will require some time to do this, so start this assignment early. The trie stores four bitpairs in each byte with the first pair in the least significant two bits and the fourth pair in the most significant. Thus the mask is updated as follows to scan the bitpairs in the correct order: mask= mask==192 ? 3 : mask*4. For a bonus mark, answer the questions: what is the compression achieved by storing the 885 phone numbers in triePhone instead of as 7-byte character strings? 4-byte integers? 3-byte integers? What is the compression of the whole file (including waste due to blocking the data)? Note that there are 9294 trie nodes, and hence bitpairs: if there were no wastage in either trie pages or data pages, what would the compression be over the original ascii file? What is the load factor of the 42 trie pages? Optional: change your program to use internet sockets via HostClient, as you did with Assignment 1. Compare the running times. Try this from home. You may find the following four classes handy: bit, bitSeq, bitPair, bitPairSeq. They are in ~cs612/java, along with the other java classes for the course. Shared marks for shared work: assignments should be your own work; marks for joint assignments will be divided by the number of collaborators. But please feel free to work in groups to learn or for extra work that is not for marks.