BlackWaspTM
Programming Concepts

Database Normalisation (2)

The seventeenth part of the SQL Server Programming Fundamentals tutorial discusses the concept of database normalisation. Normalisation is a database design technique that minimises duplication of information, reducing the risk of introducing data errors.

Delete Anomalies

A delete anomaly occurs when the deletion of information from a table has unintended consequences, generally because the structure of the table makes deleting data difficult. Consider the situation where the company using the sample table has decided that they will no longer require the C# skill. If you were to delete all references to C#, Jim would disappear from the database entirely. Similarly, if Pat were to leave the company, removing her entry would erase the PRINCE2 skill from the database.

Normalisation

Database normalisation (US: normalization) reduces or removes the issues caused by duplicated data by extracting the replicated information into separate tables. Foreign key relationships are then added between the related tables to maintain referential integrity. Once this normalisation has occurred, the key information in each database table is held uniquely.

As an example, we may decide to move the job title values in the previous example to their own table with a primary key. Each of the employee rows would include a reference to that primary key. When querying employees, we would join the two tables together, ensuring that every employee with a particular title will be linked to the same job title row and would therefore return exactly the same information. This means that an update to a job title need only occur in one place to affect every linked employee.

Normalisation should be applied to databases that are updated frequently but seldom queried. Adding tables to a database generally increases the number of tables that must be joined to execute a query. As each join adds an overhead to the query, decreasing performance, it is possible to over-normalise a database. This is particularly noticeable for databases that are used only for reporting purposes. In these situations it is common to denormalise (US: denormalize) the database, introducing duplication that increases query performance.

Normal Forms

The normal forms of relational databases describe the level of vulnerability that a particular design has to data integrity problems. They define a set of rules that should be applied when designing a database. The forms each have an ordinal number, with "first normal form" being the least normalised and most vulnerable to anomalies. Higher ordinals indicate progressively stricter rules. Additionally, there are some named normal forms, such as Boyce-Codd Normal Form, that apply additional rules.

Normal forms should be considered to be guidelines for database design, rather than absolute rules. There are some situations where it is necessary to break with the normal forms. This is generally to improve the performance of queries that otherwise would require a large number of join operations.

Originally, Edgar F. Codd described the first, second and third normal forms. These three forms are those that are the most commonly adhered to and that are always achievable when designing databases. They will be described in the remaining sections of this article. There are further normal forms, including Boyce-Codd Normal Form and fourth, fifth and sixth normal form. However, these stricter forms are not always achievable.

First Normal Form

For a table to be compliant with first normal form (1NF), it must meet certain conditions. The key rules are:

  • The columns of the table must not require any specific ordering.
  • The rows of the table must not require any specific ordering.
  • It must not be possible to create duplicate rows.
  • Every column of every row must contain exactly one element of data.
  • Rows must not contain hidden information such as row identifiers or hidden timestamps.

These rules can be illustrated with some example tables.

Example 1

MessageType
Opened text file.Information
Outputted 15 lines to text file.Information
Failed to write line 16.Error

This table holds information, warning and error messages that are logged by a computer program. In order to read the log correctly, the details must be retrieved in the order in which they were written to the database. This breaks the second rule, which states that rows of the table must not require any specific ordering to be understood. It also breaks the third rule, as there is no possible unique candidate key for the table.

To resolve the ordering issue, the date and time of every log entry can be recorded. The candidate key problem can be fixed by adding a surrogate identifier to the table, shown underlined. In the following table, this column has been added and is using an identity column to generate values.

Log IDLog TimeMessageType
115/12/2008 12:01:23.123456Opened text file.Information
215/12/2008 12:01:23.123472Outputted 15 lines to text file.Information
315/12/2008 12:01:23.130120Failed to write line 16.Error
15 December 2008