Copyright © 1998 T. H. Merrett
308-420A Secondary Storage Algorithms and Data Structures
Week 3: Cost Analysis for Trees

- Phi-ary trees. (In these notes, we write "p" instead of "phi".) These are
complete trees with fanout p at each node. They can be characterized as
follows, where h, the height, is the number of levels in the tree and is also
the number of accesses required to search for a single record at a leaf, if
each node is a block on secondary storage.

    Level   Number of nodes     Cumulative
            at this level       number of nodes
    0       1                   1
    1       p                   1 + p
    2       p^2                 1 + p + p^2
    :       :                   :
    h-1     p^(h-1)             Sum (k=0..h-1) p^k

The total number of nodes, Sum (k=0..h-1) p^k, is

    n = (p^h - 1)/(p - 1).

It follows that

    h = log_p (n(p - 1) + 1).

(Eq. 1.3-34 on p.107 is wrong.)

Since the number of accesses needed to reach level k is k+1 (k=0..h-1), the
average or expected number of accesses required to find a single record at
any level is

    (Sum (k=0..h-1) (k+1)p^k)/n.

Since (k+1)p^k is the derivative of p^(k+1) with respect to p, and since
differentiation can be taken outside the summation, the numerator is just the
derivative of p(Sum (k=0..h-1) p^k) = (p^(h+1) - p)/(p - 1):

          hp^(h+1) - (h+1)p^h + 1
    num = -----------------------
                 (p - 1)^2

We can check two special cases. If p=2, for a binary tree, the expected
search depth is

            (h - 1)2^h + 1
    num/n = -------------- ~ h - 1.
               2^h - 1

This says that, on average, a search in a binary tree will take us down to
the level just above the leaf nodes. This makes sense because, in a complete
binary tree, the number of leaves is one more than the number of internal
nodes: half the nodes are leaves, so we expect the average search to end
just before the leaf level.

The other special case is very large p. In this limit,

             hp^(h-1)
    num/n -> -------- = h.
             p^(h-1)

This says that for large p, the average search goes even deeper than for
p=2, ultimately reaching the leaves. In fact, p does not have to be very
large for this to be a close approximation.
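The closed form for num can be checked against the direct summation. A small
sketch (function names are ours, not from the notes):

```python
# Check the closed form for the expected search depth of a complete
# p-ary tree of height h against the direct summation.

def expected_depth(p, h):
    """Closed form: num/n, num = (h*p^(h+1) - (h+1)*p^h + 1)/(p-1)^2."""
    n = (p**h - 1) // (p - 1)                      # total number of nodes
    num = (h * p**(h + 1) - (h + 1) * p**h + 1) // (p - 1)**2
    return num / n

def expected_depth_direct(p, h):
    """Direct sum: (Sum (k+1)*p^k) / (Sum p^k), for k = 0..h-1."""
    nodes = [p**k for k in range(h)]               # nodes per level
    return sum((k + 1) * c for k, c in enumerate(nodes)) / sum(nodes)

# p = 2 should give roughly h - 1; large p should give roughly h.
print(expected_depth(2, 10))     # ~ 9
print(expected_depth(100, 3))    # ~ 3
```

Running this confirms both special cases: with p=2 and h=10 the expected
depth is about 9.01, and with p=100 and h=3 it is already about 2.99.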
This makes sense, too, because the bushier the tree, the more the leaves
dominate the node count. It also implies that we hurt performance only
negligibly if we move from a homogeneous B-tree, with data in the internal
nodes, to an inhomogeneous B-tree, with all data at the leaves. (And, of
course, we gain by vastly increasing the fanout and hence decreasing the
depth.)

- B-trees. We use techniques similar to the above to find the upper and
lower bounds on the height, h, of a homogeneous B-tree, as functions of N,
the number of records stored. The order of the B-tree is f, so that the
fanout, s, at any node satisfies f/2 <= s <= f. There are two extreme cases.
The best case gives the lower bound on h: the B-tree is as full as possible,
and each node contains f-1 keys and has f descendants. The worst case, with
the B-tree as empty as the rules allow, gives the upper bound. Below are the
contents and fanout of each node, where f2 = ceil(f/2).

                  Best Case                   Worst Case
    Level   No. Nodes   No. Records     No. Nodes    No. Records
    0       1           f-1             1            1
    1       f           f-1             2            f2-1
    2       f^2         f-1             2*f2         f2-1
    :       :           :               :            :
    h-1     f^(h-1)     f-1             2*f2^(h-2)   f2-1

Thus, for these two cases, the total number of records is

    (best case)   N = (f-1)(1 + f + f^2 + .. + f^(h-1)) = f^h - 1
    (worst case)  N = 1 + 2(f2-1)(1 + f2 + .. + f2^(h-2)) = 2*f2^(h-1) - 1

and the range of values for h is

    log_f (N+1) <= h <= 1 + log_f2 ((N+1)/2).

(x_y means x subscripted y.)

This tells us the upper and lower bounds on the maximum search cost in a
homogeneous B-tree. The minimum search cost is 1 access. What is the
expected search cost?

For an inhomogeneous B-tree, the total number of records, N, is given by the
number of records in the leaf nodes. (And f and f2 are much larger than in
the homogeneous case.) What are the maximum and minimum search costs?

- The above discussion gives the cost of searching B-trees. What about the
cost of building them?
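The two bounds on h are easy to evaluate directly. A sketch (the function
name is ours), using f2 = ceil(f/2) as above:

```python
import math

def height_bounds(N, f):
    """Real-valued bounds on the height h of a homogeneous B-tree of
    order f holding N records: log_f(N+1) <= h <= 1 + log_f2((N+1)/2)."""
    f2 = -(-f // 2)                        # f2 = ceil(f/2)
    lower = math.log(N + 1, f)             # best case: N = f^h - 1
    upper = 1 + math.log((N + 1) / 2, f2)  # worst case: N = 2*f2^(h-1) - 1
    return lower, upper

# The best case attains the lower bound exactly: N = 100^3 - 1, h = 3.
print(height_bounds(100**3 - 1, 100))
# The worst case attains the upper bound exactly: N = 2*50^2 - 1, h = 3.
print(height_bounds(2 * 50**2 - 1, 100))
```

Evaluating both extreme-case record counts for f = 100 and h = 3 shows the
corresponding bound coming out at exactly 3 in each case.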
The maximum number of splits for any insertion is h, the height of the tree,
but usually an insertion requires no splits, or occasionally one. Since each
split creates exactly one new node, building a tree of n nodes requires n-1
splits in all, so the average number of splits per insertion, after building
a tree of N records in n nodes, is

    (n - 1)/N ~ 1/(b*alpha),

where b is the number of records a block can hold and alpha is the load
factor.
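A quick numeric check of this estimate (the figures below are illustrative,
not from the notes):

```python
import math

def avg_splits(N, b, alpha):
    """Average splits per insertion for a tree of N records, where each
    block holds up to b records and is filled to load factor alpha."""
    n = math.ceil(N / (b * alpha))   # number of nodes (blocks) used
    return (n - 1) / N               # one split per node beyond the first

# With b = 100 records per block at load factor alpha = 0.7, this is
# close to 1/(b*alpha) = 1/70: about one split per 70 insertions.
print(avg_splits(10**6, 100, 0.7))
```

So even though a single unlucky insertion can cascade h splits up to the
root, splits are rare on average.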