Skip to content
Sahithyan's S3
1
Sahithyan's S3 — Database Systems

Indexing

Indexing and hashing is used to speed up data retrieval in databases. Instead of scanning an entire file, indexes provide fast access paths to records using keys. When a file is modified, every index on the file must be updated.

An attribute or set of attributes used to look up records.

A separate data structure that stores pairs (index records) of the form:

(search-key, pointer to record)

Much smaller than the original file.

Ordered sequential file with a primary index.

Dynamic, self-balancing tree structure for indexing. An alternative to indexed-sequential files.

Automatically reorganizes after inserts/deletes (no file rebuild). Avoids block overflows. Keeps all keys in sorted order. Excellent for range and equality queries.

Slightly higher space and update overhead. Still, dominant indexing method in modern DBMS.

When comparing index types, measure:

  • Access time: how quickly data is found.
  • Insertion/deletion time: update performance.
  • Space overhead: extra storage used by index.
  • Supported access types: exact-match or range-based.

One index entry per search-key value. Points directly to the record. Faster lookups. More space. More maintenance cost.

Indexes only some search-key values. Can only be used when data is physically ordered on key.

Less space. Easier maintenance. Slow lookups.

To find a key KK:

  • Find the index entry with largest key <K\lt K.
  • Scan sequentially in the file.

Keys stored in sorted order.

Aka. clustering index. Order of index matches file order. Usually (but not always) based on the primary key. One per file. Sequential scan is efficient.

Aka. non-clustering index. Based on non-ordering attribute(s). Can be many per file. Must be dense. Sequential scan is inefficient.

Keys mapped to buckets using a hash function.

When a sparse index is built from a index file. Named outer index and inner index respectively.

All indices must be updated during insert/delete.

Used when the inner index is too large to fit in memory.

A bitmap is an array of bits. Can only be used when the attribute has a small domain. One bitmap per attribute value.

Each record is stored in a bucket determined by a hash function:

h(key)bucket addressh(key) → bucket\ address
  • Each bucket is typically one disk block.
  • Used for equality search (not range queries).
  • Example: h(DeptName) = sum(char codes) mod 10.

Occurs when too few buckets or non-uniform data distribution.

  • Overflow chaining: overflow buckets linked in a list.
  • Called closed hashing.
  • Open hashing (without overflow buckets) is unsuitable for DBMS use.

A good hash function:

  • Distributes keys uniformly across buckets.
  • Minimizes overflow chains.
  • Avoids clustering of similar key values.

Poor hash functions map many keys to the same bucket → slow access.