 |
School of Computer Science
Computer Science COMP 420 (Fall term)
Secondary Storage Algorithms and Data Structures
Instructor: T. H.
Merrett
"Secondary Storage Algorithms and Data Structures", COMP 420, introduces data
structures and algorithms
for data on secondary storage, notably queries and processing. The course
looks at sequential files, tree-structured files, and direct-access files
and at algorithms for applications requiring various levels of volatility,
symmetry, and activity.
We also examine file access across the Internet, using sockets and TCP/IP.
File and socket programming is in Java. (The class package
documentation is especially useful.)
Based on the categories covered, we discuss design techniques for specific
applications (which have in the past included student records and employment
databases, natural language retrieval, astronomical queries, musicology
queries, air traffic control, restaurant guide, city property registry,
world geographical database of towns, personnel database, hospital ward
monitor, and a medical insurance database).
The course also briefly introduces databases, using McGill's advanced
relational database system, relix.
Fundamental to this course is that secondary storage is memory organized
differently from main memory (RAM), so that data structures, algorithms,
complexity considerations and languages must all be rebuilt differently
from the familiar computer science for RAM.
COMP 420 Secondary Storage Algorithms and Data Structures
COURSE SUMMARY
Week Topic Asst.
1 Overview of Aldat [PDF slides 1797K]
Java classes and sequential I/O 1
Java and UNIX
Files, records, blocks, RAM and secondary storage
[Slides ps 291K pdf 214K] [Exercises ps 133K pdf 61K]
[Text ps 303K pdf 203K]
2 Sequential Files:
[Slides ps 109K pdf 68K] [Exercises ps 124K pdf 52K]
[Text ps 150K pdf 133K]
searching 2
merging 3
sorting
Random access in Java 4
3 Logarithmic Files:
[Slides ps 314K pdf 245K] [Exercises ps 206K pdf 102K]
[Text ps 530K pdf 302K]
B-trees: searching, inserting, splitting, overflows 5
Volatility, activity
Tries: text; variable-resolution [Notes ps 340K pdf 202K]
(Supplementary notes: Building and Searching Full Tries
Slides on the pointerless representation for tries)
Symmetry: quad trie, kd-tree, kd-trie, z-order 6
5 Direct Files:
[Slides ps 499K pdf 674K] [Exercises ps 122K pdf 125K]
[Text ps 444K pdf 302K]
Hashing: division/remainder, multiplicative
collision resolution - linear probing, 7
separate chaining
(Supplementary notes: Hash Delete with Linear Probing)
Volatility: virtual hashing, linear hashing
Activity: tidying, distributions
Symmetry: multipaging - static and dynamic
(Supplementary notes: Multipage Search and Insert)
7 Cost Analysis
Usage distributions: sequential/direct breakeven
Probe and load factors: B-trees and hashing
Hashing overflows - theory and measurement 8
(Supplementary notes: Cost Analysis for Trees)
File Design [Exercises ps 179K pdf 74K]
Using Access Categories, Activity, Volatility, Symmetry
and Cost Analysis 9
(Supplementary notes: Design Problem and Answers)
8 Relational Databases:
[Slides ps 191K pdf 261K] [Exercises ps 42K pdf 48K]
Relations
Relational Algebra
relix 10
9 Domain Algebra 11
Misc:
File support for concurrency [Slides ps 136K pdf 101K];
High dimensionality [Slides ps 139K pdf 101K]
Trie Join [PDF slides 70K]
The links from this web page constitute the text for the course. They are reasonably complete, but some omissions and errors have been left deliberately, and
will be addressed in the lectures.
Note on plagiarism