Copyright © 1996 T. H. Merrett 308-420 Secondary Storage Algorithms and Data Structures Week 9 Design Problem
In an air traffic control system, aircraft file flight plans, which specify a sequence of rectangular cells they will fly through, chosen from a rectilinear grid of identically sized cells covering the country. They also file departure and arrival times and airports. This makes it possible to predict roughly when each aircraft will be in each cell. Aircraft are identified by flight numbers.
Thus, flight plans involve fields FlightNo, DepPort, DepTime, ArrPort, ArrTime, and a sequence of Cells.
While an aircraft is in flight, it is monitored, and every ten seconds its Longitude, Latitude, and Bearing are recorded. These must be checked against the predictions of the flight plan, and a warning issued if the aircraft is in the wrong cell at the wrong time.
If this happens, the aircraft must be checked against any other aircraft near it for position and bearing. If two aircraft will come within a preset distance of each other on their present bearings, the time at which this will happen must be calculated. This process must be complete before the next position and bearing data are recorded.
(We assume that all aircraft fly at the same altitude.)
Using the categories on which the course is based, and making appropriate calculations approximately, design file(s) to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 2 $\mu$sec. (microseconds) transfer time per byte and 20 msec. (milliseconds) average seek time.
308-420A Design Problem Week 9

SIZES (9 marks)
Suppose 10,000 aircraft in flight at any time; maybe 20,000 planes total.
Cell size 10km * 10km (0.8 min. to cross); avg. flight crosses 100 cells.
Keep no history of position or bearing.

FPlan(FlightNo, DepPort, DepTime, ArrPort, ArrTime)
          6        3        4        3        4     = 20 bytes * 20,000 = 400K
CellSeq(FlightNo, CellX, CellY, InTime, OutTime)
          6         4      4      4       4         = 22 bytes * 100 * 20,000 = 40M
PosBear(FlightNo, Long, Lat, Bear)    (could be part of FPlan, but it's not plan)
          6         4    4    4       = 20 bytes * 10,000 = 200K

Every 10 sec., record PosBear, check it against CellSeq (InTime), and
possibly check it against PosBear (Long, Lat).

AVS (18 marks)
FPlan    lo V: replaced daily; created all at once (plans don't change except
               for weather .. _all_ plans .. or major equipment failure)
         hi A: used only to calculate times for CellSeq; basically SEQUENTIAL
          ? S: doesn't matter
CellSeq  lo V: _replaced_ daily, created all at once
         hi A: look up _all_ flights every 10 sec., but only 1 of 100 cells:
               1% is big for R=20 (breakeven activity = 20/10020 = 0.2%)
         lo S: only InTime
PosBear  hi S: look up Long, Lat
         lo V: rewrite only, every 10 sec.: this does not affect structure.
               Even if we add/delete on takeoff and landing, do this at
               rewrite time.
         lo A: exceptional lookups (which must be fast): hi if 20 flts in
               10,000 are close

CHOICE (5 marks)
FPlan:   SEQ (small, but not necessarily RAM)
CellSeq: seq. is too slow (40 sec.). Need another ordered method, but NOT
         direct, because of the worst case. B-tree clustered on InTime.
PosBear: B-tree again, because of the worst case; Z-order on Long, Lat.

TIMING (8 marks)
(Can we look up 10,000 plans in CellSeq in 10 sec.?)
N = 10^6 records, f = b ~ 200 (4K blocks); worst-case height is
    $1 + \lceil \log_{100}(N/2) \rceil = 1 + \lceil \log_{100} 5\times 10^5 \rceil = 4$
so 4 * (20ms + 4K * 2 $\mu$s) = 4 * 28ms = 112ms per access.
Clustered by time, so the 10,000 flights fit in 50 blocks,
so 50 * 112ms ≈ 6 sec: we made it! (Could go up to 16,666 flts.)
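As a quick check of the CellSeq arithmetic above, here is a short Python sketch (not part of the original sheet; all variable names are ours) that reproduces the worst-case B-tree height and the total lookup time under the sheet's assumptions (4K blocks, 2 microsec/byte transfer, 20 ms average seek, ~200 records per block):

```python
import math

# Disk and file parameters from the solution sheet
seek = 0.020               # average seek time, seconds
transfer_per_byte = 2e-6   # 2 microseconds per byte
block = 4096               # 4K blocks

N = 10**6                  # CellSeq records for 10,000 in-flight aircraft
fanout_worst = 100         # half-full B-tree nodes (f = b ~ 200)

# Worst-case B-tree height: 1 + ceil(log_fanout(N/2))
height = 1 + math.ceil(math.log(N / 2, fanout_worst))

access = seek + block * transfer_per_byte   # one block access, ~28 ms
per_lookup = height * access                # ~112 ms per lookup

records_per_block = 200
blocks_needed = 10_000 // records_per_block # clustered on InTime: 50 blocks
total = blocks_needed * per_lookup          # ~5.6 sec, i.e. the sheet's ~6 sec

print(height, round(per_lookup * 1000), round(total, 1))
```

The total comes out a little under 6 seconds, comfortably inside the 10-second window, which is what makes the clustered B-tree acceptable.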
Copyright © 1998 T. H. Merrett 308-420 Secondary Storage Algorithms and Data Structures Week 9 Design Problem
A world-wide database for travellers records information on 500,000 restaurants, 100 restaurant chains, and 1000 restaurant districts.
Some, but certainly not all, of the restaurants belong to chains. Some, but certainly not all, of the restaurants are in restaurant districts.
The fields stored should include RestName, Address, TC (whether the restaurant is in town or in the country), Cuisine (e.g., Afghanistani, Greek, Hamburger), PriceRange (low/medium/high), ChainName and HeadQuarters of the chain, District, City of the district, and South, West, North and East streets bounding the district.
The database should be able to answer queries such as ``Find all restaurants in Montréal [Address] serving medium-priced Lebanese food'', or ``Find the district with the most low-priced restaurants'', or ``Find all the districts which have a McDonald's''. Other queries should also be possible: the files should be designed for flexibility.
Furthermore, the agency maintaining the database should be able to add new restaurants, districts and chains.
Using the categories on which the course is based, and making appropriate calculations approximately, design files to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 2 microsec. transfer time per byte and 20 msec. (milliseconds) average seek time.
308-420A Design Problem Week 9

FILE SIZES (9 marks)
Rest(RestName, Address, TC, Cuisine, PriceRange, ChainName, District)
        20       40     1     20        2           20        20
     = 120 bytes * 500K = 60MB; 6K pages => 10^4 pages
Chain(ChainName, HQ)
         20      20    = 40 bytes * 100 = 4KB -- RAM!
Dist(District, City, South, West, North, East)
        20      20    20     20    20    20
     = 120 bytes * 1000 = 120KB -- 20 pages

AVS (18 marks)
A: e.g., getting a list of restaurants for a traveller is presumably low
   activity: say, 50 restaurants out of 500K << 1%. But queries such as
   "find districts with low-priced restaurants" could be high activity.
   It says: be flexible.
V: "add new restaurants, chains, districts": take it as high volatility.
S: "flexible": a large variety of queries, so high.

CHOICE (5 marks)
   Seq  B-tree  B-tree+Z  KDB  trie  kd-trie  hash  lin.hash  tidy  SMP  DMP
   A!     S!       S!      V?   S!      S!     S!      V!
Try DMP for both Rest and Dist.

COSTS (8 marks)
Address is badly specified. Suppose Address = City (e.g., "Montreal" in the
query).
MP: don't use RestName (it's a key, or at least an unlikely source of high
activity); don't use TC or PriceRange (not enough selectivity); we are left
with Address (2000 values), Cuisine (1000), District (1000), Chain (100).

I   2-D DMP: Address, Cuisine
       10^4 = 2x * 1x     x = 100/sqrt(2)
       f1 = 140, f2 = 70     n = 9800,  B = 6123
    or f1 = 150, f2 = 75     n = 11250, B = 5334
    Q1 one access: 20ms + 6K * 2 $\mu$s = 32 ms
    Q2 seq: 20ms + 6*10^7 * 2 $\mu$s = 120 sec
    Q3 seq: ditto

II  3-D DMP: Address, Cuisine, District
       10^4 = 2x * 1x * 1x     x = 10 * cuberoot(5)
       f1 = 34, f2 = 17, f3 = 17     n = 9826, B = 6107
    Q1 17 accesses: 17 * 32ms = 544 ms
    Q2 seq: 120 sec
    Q3 seq: 120 sec

III 4-D DMP: Address, Cuisine, District, Chain
       10^4 = 20x * 10x * 10x * x     x = fourthroot(5)
       f1 = 20, f2 = 15, f3 = 15, f4 = 2     n = 9000, B = 6667
    Q1 15*2 = 30 accesses: 30 * 32ms = 960 ms
    Q2 seq: 120 sec
    Q3 10^4/2 accesses: 10^4 * 32ms / 2 = 160 sec -- slower than seq
       (activity 50%)

So of these three, the 2-D DMP seems best for the three given queries.
Of course, different assumptions about the Vi may change this.
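The multipaging arithmetic above can be checked with a short Python sketch (not part of the original sheet; names are ours). It follows the sheet's rounding style (round x down, then double for the 2:1 axis), so page sizes may differ from the sheet by a byte or two:

```python
import math

file_bytes = 60 * 10**6    # Rest is 60 MB
target = 10**4             # aim for ~10^4 pages

# 2-D DMP on Address, Cuisine: 10^4 = 2x * x
x2 = math.sqrt(target / 2)
f2 = int(x2); f1 = 2 * f2          # 70 and 140 axis divisions
n2 = f1 * f2                       # 9800 pages
B2 = file_bytes // n2              # ~6.1K page size

# 3-D DMP on Address, Cuisine, District: 10^4 = 2x * x * x
x3 = (target / 2) ** (1 / 3)
g2 = int(x3); g1 = 2 * g2; g3 = g2 # 34, 17, 17
n3 = g1 * g2 * g3                  # 9826 pages
B3 = file_bytes // n3

# Q1 on the 2-D design costs one page access: seek + one page transfer
q1 = 0.020 + B2 * 2e-6             # ~32 ms
print(n2, B2, n3, B3, round(q1 * 1000, 1))
```

This confirms the sheet's page counts (9800 and 9826) and the ~32 ms single-access cost for Q1 on the 2-D design.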
Copyright © 1999 T. H. Merrett 308-420 Secondary Storage Algorithms and Data Structures Week 9 Design Problem
The hyperglobe is a proposal for organizing knowledge according to two geographical coordinates (longitude and latitude) and one temporal coordinate (date), indicating the origin of the topic. The ``knowledge'' is represented as large chunks of unstructured text or other unformatted data.
For instance, at the location of Hastings, England, at 1066 AD, there might be a short book (100 Kbytes) on the Norman invasion; at the location of Montreal for 1987 AD there might be a score (10 Kbytes) and an audio record (40 Mbytes) of the premiere of Patriquin's Earthpiece; at the location of Moscow for 1991 AD, there might be a short CNN television report on the Soviet coup (1 Gbyte).
To resolve longitude and latitude to about 0.1 seconds of arc (about 3 metres at the equator) requires 3-byte fields. To resolve dates to quarter-days also requires a 3-byte field.
There may not be room on secondary storage for the chunks of unstructured data, and they may even be stored in some other format, such as on optical disk, or as documents in some remote online library. So they will be represented in the main file by an identifier of, let us say, 4 bytes.
The documents contain cross references to each other by this identifier, by which we must be able to locate them rapidly in the system. The geographical and date coordinates should be rapidly retrievable given the identifier. A user should be able to examine documents for all dates at a given location or for all locations at a given date.
Secondary storage should hold a description of the name and of the location of any data that is stored elsewhere.
New documents are frequently added, but once there do not often change.
Suppose there are 10,000 different geographical references, 10,000 different dates, and 10,000,000 different documents.
Using the categories on which the course is based, and making appropriate calculations approximately, design file(s) to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 20 nsec. (nanoseconds) transfer time per byte and 5 msec. (milliseconds) average seek time.
308-420A Design Problem Week 9

FILES & SIZES (9 marks)
Coords(Long, Lat, Date, DocID)
        3     3    3      4      = 13 bytes * 10M = 130MB
Remote(DocID, Title, Location)
         4     40      56        = 100 bytes * 1M = 100MB
Text(DocID, Doc)       (suppose the 1st byte of the ID indicates text, audio, ..)
       4   100K(avg)             = 100K * 8M = 800GB
Video(DocID, Doc)
       4    1G(avg)              = 1G * 1000 = 1TB
etc. (these should add up to 10M docs)

AVS (18 marks)
               A               V           S
          lo   hi  100%     lo   hi     lo   hi
Coords     Y  (Y)                             Y
Remote     Y  (Y)                 Y      Y
Text       Y  (Y)                 Y      Y
Video      Y  (Y)                 Y      Y
etc.       Y  (Y)                 Y      Y

Activity: Coords is lo because we need to reference individual docs by DocID.
  hi? The breakeven activity is 13/250013 = 1/20000; the activity given 1
  location (or date) is 1/1000: lo, but these are averages, and some cases
  could be above breakeven. For the others we must assume activity
  proportional to that of Coords.
Volatility: all files grow only; volatility depends on the effect on the
  organizing fields.
Symmetry: Coords is hi (2-D): Date, {Long, Lat}. NB DocID is a key, and may
  be a 3rd dimension. The others are lo: DocID.
Volatility (again): suppose DocID is assigned on filing, as last DocID + 1.
  For Coords, growth may affect the file structure (e.g., SMP won't do).
  For the others, growth need not affect the file structure apart from
  appending at the end; since Seq is ruled out by A, V is hi.

CHOICE (5 marks)
Coords:
    Seq B B+Z trie kdtrie hash kdhash linhash tidy SMP DMP
 A:  X
 V:  X X X
 S:  X X X X X
So: B+Z, kdtrie (logarithmic), DMP (direct). Choose DMP: 2-D, with DocID
given by a secondary index. This can be a 1:1 tidy "function", because
DocIDs are consecutive integers:
    Key(Link) = 4 bytes * 10M = 40MB
The others:
    Seq B B+Z trie kdtrie hash kdhash linhash tidy SMP DMP
 A:  X
 V:  X X X
 S:  X X X X X
So: B-tree or trie (logarithmic), linear hashing (direct). Choose linhash.
But note that Text, Video, etc. have _variable-length_ records, so they will
need indexing files, like Key.

TIMING (8 marks)
1. Finding 1 doc by DocID:
   access Key:     5ms + B * 20ns = 5ms
   access Coords:  ditto          = 5ms
   access Remote:  ditto          = 5ms
   access Text or Video or etc.:  5ms (index) + 5ms + 20ns * docsize
   total: 15ms (Remote), or 20+2 ms (Text), or 20 sec (Video)

2. Finding all docs for a place (or a date: same numbers):
   access Coords: suppose B = 1443, so n = 90,000 = 300 * 300 (square, by
   spec); 1 row is 1/300 of this: 300 pages,
   so 300 * (5ms + 1443 * 20ns) = 1.5 sec
   (NB seq. access would be 130M * 20ns = 2.6 sec)
   access Remote or Text or Video or etc. (resp. n = 10^5, 10^9, 10^9;
   r = 10^4):
       n(1 - (1 - 1/n)^r) = r - r^2/2n + r^3/6n^2 - ..
                          = resp. n*0.08, r, r
       = (resp. 8000, 10000, 10000 accesses) * (5ms + 1K * 20ns)
   NB this applies only to Remote, because Text, Video, etc. have _records_
   much bigger than 1K, and we should analyze their indexing files instead.
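The timings above can be checked with a short Python sketch (not part of the original sheet; names are ours). It uses the sheet's figures: 5 ms average seek, 20 ns/byte transfer, and the Cardenas formula n(1 - (1 - 1/n)^r) for the expected number of blocks hit by r random records in an n-block file:

```python
seek = 0.005           # 5 ms average seek
t = 20e-9              # 20 ns per byte

# 1. One doc by DocID: Key, Coords, then either Remote, or index + data.
#    Block transfer times are negligible next to the seek.
remote_total = 3 * seek                   # Key + Coords + Remote: 15 ms
text_total = 4 * seek + 100e3 * t         # + index + data: 20 ms + 2 ms
video_total = 4 * seek + 1e9 * t          # ~20 sec, dominated by the data

# 2. All docs at one location. Coords is a 300 x 300 grid of 1443-byte
#    pages; one grid row is 300 pages.
row = 300 * (seek + 1443 * t)             # ~1.5 sec
seq = 130e6 * t                           # full sequential scan: 2.6 sec

# Remote, hit by r = 10^4 random records out of n = 10^5 blocks. The exact
# Cardenas value is ~9500; the sheet's truncated series gives ~8000.
n, r = 10**5, 10**4
blocks_hit = n * (1 - (1 - 1 / n) ** r)

print(round(remote_total * 1000), round(text_total * 1000),
      round(video_total), round(row, 2), round(seq, 2), round(blocks_hit))
```

The exact formula comes out somewhat above the sheet's series estimate, but either way roughly every qualifying Remote record costs a block access, which is the point being made.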
Copyright © 2000 T. H. Merrett 308-420 Secondary Storage Algorithms and Data Structures Week 9 Design Problem
There are potentially about 18 million postal codes in Canada, each defining a region on the map which we can suppose to be a polygon of about six sides. Thus, there are about 53 million edges in the map of all postal codes (almost every edge is shared by the two regions it separates).
This question is concerned with a secondary storage representation of this postal code map, based on the edges. Each region is a cycle of edges, and each vertex (a corner where edges meet) is also a cycle, of the edges meeting there. So the edges, vertices, and regions are described by a structure in which each edge is stored four times, about 211 million records in all. Each edge appears in two vertex cycles (the vertex at the start of the edge and the vertex at the end) and in two region cycles (the region to the left of the edge and the region to the right). Each edge has an identifier and a direction (so ``start'', ``end'', ``left'', and ``right'' are established).
[The above two paragraphs motivate the following. You need not have read them completely.]
A cycle is stored as a set of edge pairs: an edge, and the edge following it in the cycle. The file to represent the cycles will have four fields, edgeID, edgeDirection, successorID, and successorDirection. It will have 211 million records.
In addition, you need to store information for each edge naming its two vertexes and its two regions, and coordinates for each vertex.
There are two kinds of operations that should be supported by your file design. The first is drawing the map of all the postal code regions, or any part of the map. The second is changing the map by means of _splice_ operations: splice(edgeID1, edgeDirection1, edgeID2, edgeDirection2) finds each of the two edges in the file and replaces it by the other. With enough splice operations, any rearrangement of the map is possible.
Using the categories on which the course is based, and making appropriate calculations approximately, design file(s) to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 20 nsec. (nanoseconds) transfer time per byte and 5 msec. (milliseconds) average seek time.
308-420A Design Problem Week 9

1) Let's make it very simple: two files, one for drawing and one for splice.
(It wasn't specified what splice changes, so suppose it changes only cycles:
this is not actually true - splice also creates or destroys faces.)
We'll also start by supposing that application 1 only ever draws the whole
file.

FILES & SIZES (9 marks)
cycles(edgeID, edgeDir, succID, succDir)   (4 bytes for an ID, since 2^26 = 64*10^6)
          4       1        4       1
   = 10 bytes * 211*10^6 = 2.1*10^9
edges(edgeID, startX, startY, endX, endY, left, right [, start, end])
         4       4       4     4     4      6     6       4      4
   = ((5 or 7) * 4 + 2*6) * 53*10^6 = 2*10^9

AVS (18 marks)
Application 1 (draw the whole file): uses only edges - 1 pass to draw the
  edges in any order; 100% activity; does not change; no field singled out.
Application 2 (splice): 1 splice costs 4 accesses to cycles on each of
  edgeID and succID, and we assumed it does not affect edges.
  Activity 8/53M; 2-D symmetry (edge/dir, succ/dir); splice changes key
  fields.
Access-transfer ratio: rho = 5ms/20ns = 250,000.
Breakeven activities: R/(R + rho) = (10, 40)/250,000 = (1/25K, 1/6K)
  for (cycles, edges).

        A   V   S
  e1    h   l   l (in fact 0-D)
  c2    l   h   h (2-D: edgeID, succID)

CHOICE (5 marks)
cycles:
    Seq B B+Z trie kdtrie hash kdhash linhash tidy SMP DMP
 A:  l
 V:  h h h h h
 S:  h h h h h h h
 remaining: ? ? ? -- try DMP on edge and succ (edgeID+edgeDir and
 succID+succDir, even though each is only a quarter of the key)
edges:
    Seq B B+Z trie kdtrie hash kdhash linhash tidy SMP DMP
 A:  h h h
 V:
 S:  - - - - -
 remaining: ? ? ? ? -- try Seq

TIMING (8 marks)
e1 draw all: 2*10^9 bytes in 1 pass: 5ms + 2*10^9 * 20*10^-9 sec = 40 sec
c2 splice:
  a) find the 4 matching edges under edgeID-edgeDir:
     2*10^9 bytes / B = 2K => n = 10^6 blocks => fe = fs = 1000
     (Ve = Vs = 10*53)
     so 1000 * (250,000 + 2K) * 20*10^-9 sec = 5 sec
  b) find the 4 matching edges under succID-succDir: also 5 sec
  c) do the same for the edges they were swapped for: 10 sec
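The splice timing above can be checked with a short Python sketch (not part of the original sheet; names are ours). It uses the sheet's trick of expressing the access-transfer ratio rho in transfer-equivalent bytes, so each of the f blocks in a DMP row costs (rho + B) byte-times:

```python
t = 20e-9              # 20 ns per byte
rho = 0.005 / t        # 5 ms seek / 20 ns per byte = 250,000 "bytes"

file_bytes = 2 * 10**9 # cycles file
B = 2048               # 2K blocks
n = file_bytes // B    # ~10^6 blocks
f = 1000               # fanout on each of the two DMP axes (fe = fs)

# Scanning one row of the DMP grid: f blocks, each a seek plus a transfer
one_axis = f * (rho + B) * t     # ~5 sec

# Steps a) + b) + c): edgeID axis, succID axis, then both again for the
# edges swapped in -- four row scans in all
total = 4 * one_axis             # ~20 sec per splice
print(n, round(one_axis, 2), round(total, 1))
```

So a single splice is on the order of 20 seconds under this design, which is what the 5 + 5 + 10 breakdown above adds up to.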
2) Improvement: allow drawing of a partial map.

AVS (18 marks)
edges changes to let us specify (i) any set of regions or (ii) a range of
coords; so A can be as low as 6/53M < 1/6K: both lo and hi activity.
Still low V. Symmetry: either a region, say left, or a coord pair, say
(startX, startY). (This assumes edges are short, so that left and right are
not far apart, and ditto start and end.)

        A    V   S
  e1   l,h   l   h, 2-D (edge, startXY)

CHOICE (5 marks)
    Seq B B+Z trie kdtrie hash kdhash linhash tidy SMP DMP
 A:  l h h h
 V:  - -
 S:  h h h h h h
 remaining: ? ? ? -- try SMP on left, (startX, startY)
 NB Z-order the start coords for small range queries.

TIMING (8 marks)
Vl = 18M, Vs = 36M, if the coord resolution is high enough.
2*10^9 bytes / B = 1K => n = 2*10^6 => fl = fs/2 = 1000
e1 draw all: same as before - just read the SMP file in 1 pass, no extra
   overhead
   draw a selected X-Y range (high A): select on-the-fly in a sequential
   scan
   draw selected regions (high A): merge on left (or right) with the
   region set
   draw selected regions: 2000 * (250,000 + 2K) * 20*10^-9 sec = 10 sec,
   so 10 sec for every 1/1000 of the file
   draw selected coord pairs: 5 sec for every 1/2000 of the file