Database Index

A database index is a dedicated data structure that accelerates data retrieval by providing a fast path to the rows that satisfy a query. Rather than scanning every row in a table, the database consults the index first and then fetches only the matching rows. The trade-off is straightforward: faster reads come at the cost of slower writes, more storage, and added maintenance work. When designed well, indexes deliver predictable improvements in latency and throughput for a wide range of workloads, especially as data volumes grow and response time requirements tighten. This article surveys what an index is, how it works, the main index types, design considerations, and how indexing is implemented in major database systems.

Indexing is a core technique across transactional systems that require low-latency lookups and across data warehouses that run analytical queries over large datasets. The effectiveness of an index depends on both the data distribution and the queries issued. In practice, well-chosen indexes can mean the difference between a system that feels instant to users and one that lags under moderate load. See also Database and Query optimization for broader context about data access and plan selection.

History

The idea of indexing in data systems matured alongside the development of structured query processing. Early work focused on simple search structures, but the practical breakthrough came with the B-tree family of data structures, which support logarithmic search, range queries, and efficient updates. The widespread adoption of B-tree and its derivatives in commercial and open-source databases established a reliable foundation for indexing within SQL-based environments. Over time, specialized index types emerged to target particular workload characteristics, from high-cardinality transactional workloads to read-mostly analytic workloads.

How indexes work

An index is built on one or more columns (the index key) and stored separately from the table data. The index organizes keys in a way that makes locating the corresponding rows faster than scanning the entire table. Depending on the index type, the leaf entries store either pointers to the rows or the column values needed to answer the query directly. The database’s query optimizer uses statistics about the index and the data distribution to decide whether using the index is beneficial for a given query. If the index is helpful, the optimizer can choose an index scan or a more complex plan that combines index access with table lookups.
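The optimizer's choice between a full scan and an index lookup can be observed directly. The sketch below uses SQLite through Python's standard `sqlite3` module as a convenient stand-in engine; the `users` table, the `idx_users_email` index, and the `query_plan` helper are hypothetical names chosen for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user4242@example.com",)

# Without an index on email, the only option is to scan the table.
plan_before = query_plan(conn, lookup, args)

# With the index, the engine searches the index and then fetches the row.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)

print(plan_before)
print(plan_after)
```

The first plan reports a scan of `users`; the second reports a search that names `idx_users_email`. Other engines expose the same information through their own `EXPLAIN` variants.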

Key concepts include:
- Cardinality: the number of distinct values in the indexed column(s), which influences the usefulness of an index.
- Selectivity: how effectively the index filters rows for a given query.
- Maintenance: inserts, updates, and deletes require the index to be updated as well, which can impact write throughput.
- Storage overhead: indexes consume space beyond the base table.
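Cardinality and selectivity are easy to compute on a small sample, which helps make the terms concrete. In this minimal sketch, selectivity is taken to mean the fraction of rows that match an equality predicate (so a smaller number means a more selective predicate); some texts use the inverse convention. The column data and function names are made up for illustration.

```python
from collections import Counter

def cardinality(values):
    """Number of distinct values in a column."""
    return len(set(values))

def predicate_selectivity(values, target):
    """Fraction of rows matching `column = target`; lower means more selective."""
    return Counter(values)[target] / len(values)

# A low-cardinality column and a high-cardinality column over 1000 rows.
status = ["active"] * 900 + ["suspended"] * 90 + ["deleted"] * 10
user_id = list(range(1000))

print(cardinality(status), cardinality(user_id))   # 3 1000
print(predicate_selectivity(status, "active"))     # 0.9
print(predicate_selectivity(status, "deleted"))    # 0.01
print(predicate_selectivity(user_id, 42))          # 0.001
```

An equality filter on `user_id` touches one row in a thousand, so an index pays off. A filter on `status = 'active'` matches 90% of rows, so scanning the table is usually cheaper than going through an index.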

For a deeper look at how index structures map keys to data, see Data structure and Index.

Types of indexes

B-tree and B+-tree indexes

The B-tree family is the workhorse of many transactional systems. It supports efficient point lookups and range queries, with logarithmic search time and predictable performance as data grows. In the B+-tree variant, all entries are held in linked leaf nodes, so a range scan can walk the leaves in key order. In many databases, the leaf entries contain pointers to the actual rows or the corresponding primary key values. See B-tree.
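The property that makes B-trees useful for both point and range queries is that keys are kept in sorted order. The sketch below is not a B-tree: it is a sorted list searched with binary search, which shares the logarithmic lookup and in-order range scan but none of the node structure or efficient updates. The `SortedIndex` class and sample data are illustrative assumptions.

```python
import bisect

class SortedIndex:
    """Ordered (key, row_id) entries; a stand-in for B-tree leaf order."""

    def __init__(self, pairs):
        self._entries = sorted(pairs)
        self._keys = [key for key, _ in self._entries]

    def lookup(self, key):
        """Row ids whose key equals `key` (binary search, O(log n))."""
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return [row_id for _, row_id in self._entries[lo:hi]]

    def range(self, low, high):
        """Row ids with low <= key <= high, returned in key order."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return [row_id for _, row_id in self._entries[lo:hi]]

# (age, row_id) pairs taken from a hypothetical table.
ages = [(34, 0), (27, 1), (45, 2), (27, 3), (31, 4)]
index = SortedIndex(ages)

print(index.lookup(27))      # [1, 3]
print(index.range(30, 40))   # [4, 0]
```

A range query is one binary search followed by a contiguous read, which is the access pattern a B+-tree provides by walking its linked leaves.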

Hash indexes

Hash-based indexes provide very fast point lookups for equality predicates but do not support range queries efficiently. They are well-suited for exact-match lookups on high-cardinality columns in write-heavy environments, but they are less flexible for queries that require ordering or range scanning. See Hash index.
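The contrast with ordered indexes shows up in a few lines. This sketch uses a Python dictionary as the hash table; the `HashIndex` class is hypothetical and ignores concerns such as bucket sizing and on-disk layout.

```python
from collections import defaultdict

class HashIndex:
    """Maps each key to the row ids that hold it; keys have no order."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def insert(self, key, row_id):
        self._buckets[key].append(row_id)

    def lookup(self, key):
        """Equality lookup: a single hash probe on average."""
        return list(self._buckets.get(key, []))

    def range(self, low, high):
        """Range lookup: no ordering to exploit, so every key is examined."""
        return sorted(
            row_id
            for key, row_ids in self._buckets.items()
            if low <= key <= high
            for row_id in row_ids
        )

index = HashIndex()
for age, row_id in [(34, 0), (27, 1), (45, 2), (27, 3), (31, 4)]:
    index.insert(age, row_id)

print(index.lookup(27))      # [1, 3]
print(index.range(30, 40))   # [0, 4], found only by checking every key
```

The equality lookup does constant work on average, while the range query degrades to a full pass over the index.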

Bitmap indexes

Bitmap indexes encode the presence of a value as bits and are especially effective for columns with low cardinality in read-mostly workloads, such as certain classification or reporting scenarios in data warehouses. They can speed up complex predicates by combining bitmaps, but they are traditionally less suited to high-update environments. See Bitmap index.
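Combining bitmaps is the core operation, and it can be sketched with Python integers used as bit vectors, where bit i is set when row i holds the value. Real implementations add compression and row-id mapping; the column data and helper names here are assumptions for illustration.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose bit i is set if row i holds it."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

def row_ids(bitmap):
    """Decode a bitmap back into the list of row ids it marks."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region = ["EU", "US", "EU", "APAC", "US", "EU"]
tier = ["gold", "free", "free", "gold", "gold", "gold"]

region_bm = build_bitmaps(region)
tier_bm = build_bitmaps(tier)

# WHERE region = 'EU' AND tier = 'gold' becomes a bitwise AND.
eu_and_gold = region_bm["EU"] & tier_bm["gold"]
# WHERE region = 'US' OR region = 'APAC' becomes a bitwise OR.
us_or_apac = region_bm["US"] | region_bm["APAC"]

print(row_ids(eu_and_gold))   # [0, 5]
print(row_ids(us_or_apac))    # [1, 3, 4]
```

Each distinct value needs its own bitmap, which is why the technique suits low-cardinality columns, and every row change may touch several bitmaps, which is why heavy update workloads are a poor fit.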

Columnar and inverted indexes

Columnar indexes and related structures are common in analytic workloads where queries read large portions of a columnar dataset. Inverted indexes are a staple of search systems, mapping terms to their locations, enabling fast text search. See Columnar index and Inverted index.
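An inverted index can be sketched as a mapping from each term to the set of documents containing it. The version below tokenizes on whitespace only and answers "all terms must match" queries by intersecting posting sets; production search systems add stemming, positions, and ranking. The document ids and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    1: "fast index lookup",
    2: "index maintenance cost",
    3: "fast range scan",
}
inverted = build_inverted_index(docs)

print(search_all(inverted, "fast index"))   # {1}
print(search_all(inverted, "index"))        # {1, 2}
print(search_all(inverted, "bitmap"))       # set()
```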

Partial, expression-based, and functional indexes

Partial indexes cover only the subset of rows that satisfy a predicate, reducing maintenance cost and storage while speeding up queries whose filters match that predicate. Function-based (or expression-based) indexes compute the index key from an expression rather than a raw column value, so queries that filter on the same expression, such as a lowercased email address, can use the index instead of evaluating the expression for every row. See Partial index and Functional index for more detail.
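Both kinds of index can be tried in SQLite, which supports a `WHERE` clause on `CREATE INDEX` and expressions as index keys. The following is a minimal sketch through Python's `sqlite3` module; the `orders` table, its columns, and the index names are hypothetical, and other engines use different syntax and have their own rules for when such an index qualifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_email TEXT, status TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_email, status, total) VALUES (?, ?, ?)",
    [
        (f"Cust{i}@Example.com", "open" if i % 50 == 0 else "closed", i * 1.5)
        for i in range(5000)
    ],
)

# Partial index: only rows with status = 'open' get an index entry.
conn.execute("CREATE INDEX idx_orders_open ON orders (total) WHERE status = 'open'")
# Expression index: the key is lower(customer_email), not the raw column.
conn.execute("CREATE INDEX idx_orders_email_lower ON orders (lower(customer_email))")

def plan(sql):
    """Join the 'detail' column of EXPLAIN QUERY PLAN into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The query repeats the index predicate, so the partial index qualifies.
partial_plan = plan("SELECT id FROM orders WHERE status = 'open' AND total > 100")
# The query filters on the same expression the index was built on.
expr_plan = plan(
    "SELECT id FROM orders WHERE lower(customer_email) = 'cust42@example.com'"
)

print(partial_plan)
print(expr_plan)
```

In SQLite the query has to state the partial index's predicate in a form the planner can match, and it has to use the indexed expression as written, for either index to be considered.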

Clustered vs non-clustered and other specialized forms

A clustered index determines the physical order of data rows in storage, which can improve locality for range scans and sequential access. Non-clustered indexes store separate structures that point to the data. Both forms have trade-offs in write amplification and storage. See Clustered index and Non-clustered index.
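The difference in access paths can be modeled in memory. In this sketch the "clustered" structure is the row list itself, kept sorted by primary key, and the "non-clustered" structure is a separate sorted list that stores primary keys and needs a second lookup to reach the row. This is an illustrative model under those assumptions, not how any particular engine lays out pages.

```python
import bisect

# Clustered: the rows themselves are stored in primary-key order.
rows = sorted([
    (3, "carol", "nl"),
    (1, "alice", "de"),
    (4, "dave", "de"),
    (2, "bob", "fr"),
])
primary_keys = [row[0] for row in rows]

def by_pk_range(low, high):
    """Range scan on the clustering key: one contiguous slice of the rows."""
    lo = bisect.bisect_left(primary_keys, low)
    hi = bisect.bisect_right(primary_keys, high)
    return rows[lo:hi]

# Non-clustered: a separate structure sorted by country, holding primary keys.
country_index = sorted((row[2], row[0]) for row in rows)

def by_country(country):
    """Search the secondary structure, then look each row up by primary key."""
    i = bisect.bisect_left(country_index, (country,))
    matches = []
    while i < len(country_index) and country_index[i][0] == country:
        pk = country_index[i][1]
        matches.append(rows[bisect.bisect_left(primary_keys, pk)])
        i += 1
    return matches

print(by_pk_range(2, 3))   # [(2, 'bob', 'fr'), (3, 'carol', 'nl')]
print(by_country("de"))    # [(1, 'alice', 'de'), (4, 'dave', 'de')]
```

The range scan on the clustering key reads adjacent rows, while each match found through the secondary structure costs an extra lookup that may land anywhere in the table.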

Unique and composite indexes

Unique indexes enforce a constraint that all key values are distinct, while composite indexes cover multiple columns to support queries that filter on several attributes simultaneously. See Unique index and Composite index.
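Both behaviors can be checked in SQLite through Python's `sqlite3` module. The sketch below uses a hypothetical `accounts` table: a unique index rejects a duplicate email, and a composite index on `(tenant, username)` is picked for filters that include the leading column. Whether a filter on the trailing column alone can use a composite index depends on the engine and its statistics; in this setup SQLite does not use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, tenant TEXT, username TEXT, email TEXT)"
)
conn.execute("CREATE UNIQUE INDEX idx_accounts_email ON accounts (email)")
conn.execute("CREATE INDEX idx_accounts_tenant_user ON accounts (tenant, username)")
conn.execute(
    "INSERT INTO accounts (tenant, username, email) "
    "VALUES ('acme', 'ann', 'ann@example.com')"
)

# The unique index turns a duplicate key into a constraint violation.
try:
    conn.execute(
        "INSERT INTO accounts (tenant, username, email) "
        "VALUES ('acme', 'ann2', 'ann@example.com')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

def plan(sql):
    """Join the 'detail' column of EXPLAIN QUERY PLAN into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

both_columns = plan("SELECT id FROM accounts WHERE tenant = 'acme' AND username = 'ann'")
leading_only = plan("SELECT id FROM accounts WHERE tenant = 'acme'")
trailing_only = plan("SELECT email FROM accounts WHERE username = 'ann'")

print(duplicate_rejected)
print(both_columns)
print(leading_only)
print(trailing_only)
```

Column order in a composite index therefore matters: the column that most queries filter on usually goes first.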

Design considerations

Query patterns: Focus on columns that appear in WHERE, JOIN, ORDER BY, and GROUP BY clauses. Use selective predicates to justify index usage. See Query optimization and Statistics for how engines estimate selectivity.
Data distribution: High-cardinality columns often benefit from indexing, while low-cardinality columns may be prime candidates for bitmap or partial indexing in the right workloads.
Write vs read balance: Every index adds maintenance cost for inserts, updates, and deletes. In write-heavy systems, too many indexes can degrade throughput.
Index coverage: A covering index contains all columns needed by a query, allowing the database to satisfy the query entirely from the index without touching the table. See Covering index.
Resource capacity: Factors such as I/O bandwidth, concurrency, and the underlying hardware influence whether an index will deliver the expected performance gains. See Performance engineering.
Maintainability: Regular maintenance tasks such as reindexing, updating statistics, and monitoring fragmentation help sustain performance. See Index maintenance and Statistics.
Partitioning and localization: For very large tables, partitioned indexes and locality-aware designs can improve manageability and performance. See Partitioning and BRIN index.
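The index coverage consideration in the list above is one that query plans make visible. In this sketch, SQLite (through Python's `sqlite3` module) labels a plan as using a covering index when every column the query needs is present in the index. The `events` table and index name are hypothetical, and other engines report coverage differently, for example as an index-only scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX idx_events_user_kind ON events (user_id, kind)")

def plan(sql):
    """Join the 'detail' column of EXPLAIN QUERY PLAN into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# user_id and kind are both in the index: no table access is needed.
covered = plan("SELECT kind FROM events WHERE user_id = 7")
# payload is not in the index: each match requires a table lookup.
not_covered = plan("SELECT payload FROM events WHERE user_id = 7")

print(covered)
print(not_covered)
```

Widening an index to cover a query trades extra storage and write cost for the saved table lookups, so it tends to be reserved for frequent, latency-sensitive queries.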

Performance and maintenance

Index performance depends on how well the chosen structure matches typical workload characteristics. Read latency improves when the index efficiently filters rows, but write latency can increase due to index updates. Database administrators monitor index usage, fragmentation, and bloat, and apply maintenance strategies such as reindexing, updating statistics, and adjusting fill factors. See Cost-based optimizer and Fill factor for further details on tuning and storage considerations.
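As a small illustration of two of these maintenance tasks, the sketch below runs SQLite's `ANALYZE` and `REINDEX` statements through Python's `sqlite3` module. SQLite is used only as a stand-in: it has no fill factor setting, and the commands, statistics tables, and scheduling differ in other engines. The `items` table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category INTEGER, label TEXT)")
conn.execute("CREATE INDEX idx_items_category ON items (category)")
conn.executemany(
    "INSERT INTO items (category, label) VALUES (?, ?)",
    [(i % 10, f"item {i}") for i in range(1000)],
)
conn.commit()

# Refresh optimizer statistics; SQLite stores them in the sqlite_stat1 table.
conn.execute("ANALYZE")
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'idx_items_category'"
).fetchall()

# Rebuild the index from the table contents, then verify the database.
conn.execute("REINDEX idx_items_category")
integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]

print(stats)
print(integrity)
```

The `stat` column summarizes how many rows the index holds and how many rows share each key on average, which is the kind of input a cost-based optimizer uses when estimating selectivity.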

Some systems offer automated indexing features that attempt to select or adapt indexes based on observed workloads. Proponents argue this reduces setup time and responds to changing patterns, while critics caution that automation may misinterpret unusual workloads or create redundant indexes. The debate often centers on the balance between automated convenience and human expertise, as well as the reliability of cost models used to justify index creation. See Automatic indexing and Index maintenance for related discussions.

Indexing in practice by database systems

PostgreSQL: Uses a versatile set of index types (B-tree by default) and supports partial, covering, and expression-based indexes. It relies on VACUUM to reclaim space from dead rows and on ANALYZE to keep the statistics used by the optimizer accurate. See PostgreSQL.
MySQL: In InnoDB, the primary key index is clustered, and secondary indexes store the primary key value of each row rather than a physical row pointer, so the choice of primary key influences data layout and secondary index size. MySQL also supports full-text and spatial indexes. See MySQL.
SQL Server: Distinguishes clustered from non-clustered indexes and offers options like filtered indexes and columnstore indexes for analytics. See SQL Server.
Oracle Database: Provides a broad ecosystem of index types, including B-tree and bitmap indexes, with partitioning options to scale large workloads. See Oracle Database.