 Database NormalisationThe seventeenth part of the SQL Server Programming Fundamentals tutorial discusses the concept of database normalisation. Normalisation is a database design technique that minimises duplication of information, reducing the risk of introducing data errors.
Data Anomalies
When information is held in a list format, the data is likely to include some duplication of information. This is often the case when creating a large list with many columns, such as in a spreadsheet format. Duplication can often occur when a single database table is used to hold lots of related information.
Duplication of information causes two key problems in a relational database. Firstly, the additional information can often lead to increased storage requirements. Secondly, and more importantly, you can experience data integrity problems, the majority of which can lead to one or more types of anomaly. Three anomalies are described in the following sections.
Update Anomalies
An update anomaly occurs when all copies of a single piece of information are not updated correctly. For example, consider the following table:
| Name | Job Title | Skill |
|---|
| Bob | Developer | C# | | Bob | Developer | SQL | | Jim | Developer | C# | | Sue | Developer | VB | | Pat | Project Manager | PRINCE2 |
To allow the table to hold the details of multiple employees, each with a job title and one or more skills, this table includes duplication in every column. This gives many opportunities for update anomalies. For example, if Bob's information is updated so that his job title is "Programmer", this should probably also update all other developer rows to indicate that the new job title replaces the old one for all developers. If not, the update could leave the table in the following state:
| Name | Job Title | Skill |
|---|
| Bob | Programmer | C# | | Bob | Programmer | SQL | | Jim | Developer | C# | | Sue | Developer | VB | | Pat | Project Manager | PRINCE2 |
The information in the table may still be considered accurate. However, a query of the database that intends to return all of the developers has become more difficult. A worse scenario will occur if the next update to change Bob's title back to "Developer" is only executed against one of his two entries, especially if the new job title is misspelled:
| Name | Job Title | Skill |
|---|
| Bob | Developr | C# | | Bob | Programmer | SQL | | Jim | Developer | C# | | Sue | Developer | VB | | Pat | Project Manager | PRINCE2 |
Insert Anomalies
Insert anomalies occur when the structure of a database table makes it difficult to insert information. This can happen when two or more distinct areas of information (or domains) exist within a table. In our previous example table, we may want to create a new employee who does not have a specific technical skill. This forces us to use a null value or dummy value for the skill:
| Name | Job Title | Skill |
|---|
| Bob | Developr | C# | | Bob | Programmer | SQL | | Jim | Developer | C# | | Sue | Developer | VB | | Pat | Project Manager | PRINCE2 | | Lee | Trainee Programmer | null) |
In this situation, the information in the table seems reasonable. However, when Lee obtains his first skill, simply adding a row, as would be done for other developers, causes an anomaly:
| Name | Job Title | Skill |
|---|
| Bob | Developr | C# | | Bob | Programmer | SQL | | Jim | Developer | C# | | Sue | Developer | VB | | Pat | Project Manager | PRINCE2 | | Lee | Trainee Programmer | (null) | | Lee | Trainee Programmer | C# |
Delete Anomalies
A delete anomaly occurs when the deletion of information from a table has unintended consequences, generally because the structure of the table makes deleting data difficult. Consider the situation where the company using the sample table has decided that they will no longer require the C# skill. If you were to delete all references to C#, Jim would disappear from the database entirely. Similarly, if Pat were to leave the company, removing her entry would erase the PRINCE2 skill from the database.
Normalisation
Database normalisation (US: normalization) reduces or removes the issues caused by duplicated data by extracting the replicated information into separate tables. Foreign key relationships are then added between the related tables to maintain referential integrity. Once this normalisation has occurred, the key information in each database table is held uniquely.
As an example, we may decide to move the job title values in the previous example to their own table with a primary key. Each of the employee rows would include a reference to that primary key. When querying employees, we would join the two tables together, ensuring that every employee with a particular title will be linked to the same job title row and would therefore return exactly the same information. This means that an update to a job title need only occur in one place to affect every linked employee.
Normalisation should be applied to databases that are updated frequently but seldom queried. Adding tables to a database generally increases the number of tables that must be joined to execute a query. As each join adds an overhead to the query, decreasing performance, it is possible to over-normalise a database. This is particularly noticeable for databases that are used only for reporting purposes. In these situations it is common to denormalise (US: denormalize) the database, introducing duplication that increases query performance.
Normal Forms
The normal forms of relational databases describe the level of vulnerability that a particular design has to data integrity problems. They define a set of rules that should be applied when designing a database. The forms each have an ordinal number, with "first normal form" being the least normalised and most vulnerable to anomalies. Higher ordinals indicate progressively stricter rules. Additionally, there are some named normal forms, such as Boyce-Codd Normal Form, that apply additional rules.
Normal forms should be considered to be guidelines for database design, rather than absolute rules. There are some situations where it is necessary to break with the normal forms. This is generally to improve the performance of queries that otherwise would require a large number of join operations.
Originally, Edgar F. Codd described the first, second and third normal forms. These three forms are those that are the most commonly adhered to and that are always achievable when designing databases. They will be described in the remaining sections of this article. There are further normal forms, including Boyce-Codd Normal Form and fourth, fifth and sixth normal form. However, these stricter forms are not always achievable.
First Normal Form
For a table to be compliant with first normal form (1NF), it must meet certain conditions. The key rules are:
- The columns of the table must not require any specific ordering.
- The rows of the table must not require any specific ordering.
- It must not be possible to create duplicate rows.
- Every column of every row must contain exactly one element of data.
- Rows must not contain hidden information such as row identifiers or hidden timestamps.
These rules can be illustrated with some example tables.
Example 1
| Message | Type |
|---|
| Opened text file. | Information | | Outputted 15 lines to text file. | Information | | Failed to write line 16. | Error |
This table holds information, warning and error messages that are logged by a computer program. In order to read the log correctly, the details must be retrieved in the order in which they were written to the database. This breaks the second rule, which states that rows of the table must not require any specific ordering to be understood. It also breaks the third rule, as there is no possible unique candidate key for the table.
To resolve the ordering issue, the date and time of every log entry can be recorded. The candidate key problem can be fixed by adding a surrogate identifier to the table, shown underlined. In the following table, this column has been added and is using an identity column to generate values.
| Log ID | Log Time | Message | Type |
|---|
| 1 | 15/12/2008 12:01:23.123456 | Opened text file. | Information | | 2 | 15/12/2008 12:01:23.123472 | Outputted 15 lines to text file. | Information | | 3 | 15/12/2008 12:01:23.130120 | Failed to write line 16. | Error |
Example 2
| Name | Job Title | Skills |
|---|
| Bob | Developer | C#, SQL | | Jim | Developer | C#, VB, SQL | | Sue | Developer | VB |
In the second example we return to the employee list. In this case, the database designer has created a Skills column to contain the list of skills that each employee has. Where an employee has several skills, they are comma-separated. This breaks the rule that every column of every row contains exactly one element of data.
This issue can be resolved by creating two tables instead of one. The first table contains employees' names and job titles with the name as the primary key. The second table contains one row for each skill for each employee with both columns being included in a composite key. On performing a query that requires both sets of data, a join is made based upon the employee name.
Employees Table
| Name | Job Title |
|---|
| Bob | Developer | | Jim | Developer | | Sue | Developer |
Employee Skills Table
| Employee | Skills |
|---|
| Bob | C# | | Bob | SQL | | Jim | C# | | Jim | VB | | Jim | SQL | | Sue | VB |
Second Normal Form
For a table to comply with second normal form (2NF), it must meet all of the requirements of 1NF. Additionally, all non-prime columns, ie. columns that are not part of a candidate key, must depend upon entire candidate keys, not just parts of those keys. This rule only applies to tables that contain composite keys. If all possible candidate keys are composed only of single columns, this rule has no effect.
Example
| Employee | Skills | Salary |
|---|
| Bob | C# | £30,000 | | Bob | SQL | £30,000 | | Jim | C# | £35,000 | | Jim | VB | £35,000 | | Jim | SQL | £35,000 | | Sue | VB | £30,000 |
In this example, a column has been added to the Employee Skills table to hold the employee's salary. This column breaks the 2NF rule, as the salary is dependant upon the employee but not the skill. This is only one column of the two-column composite key.
To modify the design to achieve 2NF, the Salary column should be removed from the Employee Skills table and added to the Employees table instead. This also removes the duplication of salary data that could cause update anomalies.
Employees Table
| Name | Job Title | Salary |
|---|
| Bob | Developer | £30,000 | | Jim | Developer | £35,000 | | Sue | Developer | £30,000 |
Employee Skills Table
| Employee | Skills |
|---|
| Bob | C# | | Bob | SQL | | Jim | C# | | Jim | VB | | Jim | SQL | | Sue | VB |
Third Normal Form
Third normal form (3NF) includes all of the requirements of 2NF. In addition, 3NF states that all non-prime columns must be directly dependant upon the candidate keys. Non-key columns may not be transitively dependant upon prime columns. ie. A column should not be dependant upon a non-key column that is dependant upon a key column.
Example
| Employee | Primary Office | Office Phone |
|---|
| Bob | London | 020 123456 | | Jim | Leeds | 0113 234567 | | Sue | Manchester | 0161 987654 |
In the above sample, the designer has created a new table that links each employee to an office location and that office's main telephone number. This table meets all of the requirements for 2NF. The only candidate key for the table is the Employee column. The Primary Office column is directly dependant upon this key. However, the Office Phone column is not directly related to the employee. It is transitively related to the employee via the primary office. This table does not therefore meet the requirements for 3NF.
To achieve 3NF, the table must be normalised into two tables. The first table will link the employee to their location. The second will hold the details of each location, including the telephone number. To perform a query that returns an employee's name and telephone number, you must join the tables. This solution removes the risk of update anomalies where the same office appears for multiple employees but the telephone number is changed for only one of them.
Employee Offices Table
| Employee | Primary Office |
|---|
| Bob | London | | Jim | Leeds | | Sue | Manchester |
Offices Table
| Office | Office Phone |
|---|
| London | 020 123456 | | Leeds | 0113 234567 | | Manchester | 0161 987654 |
Limitations
Third normal form can be overly complex for some business scenarios. You may decide not to fully normalise your tables to achieve 3NF for performance or other reasons. An example of this can be seen in the following table. This table holds details of customer locations including the customer ID, address and postcode. Technically, the address can be though of as dependent upon the postcode, which itself is related to the customer ID. To achieve 3NF, you could create a table of possible locations with the postal code as its primary key. In many situations, this would be deemed to be unnecessary.
| CustomerID | Address | Postcode |
|---|
| 123 | 020 123456 | W1 1WW | | 456 | 0113 234567 | LS2 2LS | | 789 | 0161 987654 | M3 3MM |
|