BlackWasp
Algorithms
.NET 2.0+

The Soundex Algorithm

Cultural differences and input errors can lead to words being spelled differently to a user's expectations. This makes it difficult to locate information quickly. The Soundex algorithm can alleviate this by assigning codes based upon the sound of words.

Soundex

The Soundex algorithm is used to generate a short code based upon the pronunciation of an English word. The codes always contain a single letter followed by three numeric digits. They can be used to compare two words to determine whether they sound alike. This can be very useful when searching for information in a database or text file, particularly when looking for names that are commonly misspelled.

An example of the use of Soundex could be in the search function of a customer database. When performing a text search for the surname, "Smith", people with the name, "Smythe", would not be found. However, as the Soundex code for both surnames is "S530", a phonetic Soundex-based search would find both customers. The codes and data could also be used to ask the user, "Did you mean Smythe?".

The Algorithm

The Soundex algorithm applies a series of rules to a string to generate the four-character code. The encoding steps are as follows:

  • Ignore all characters in the string being encoded except for the English letters, A to Z.
  • The first letter of the Soundex code is the first letter of the string being encoded.
  • After the first letter in the string, do not encode vowels or the letters H, W and Y. These letters may affect the code by being present but are not encoded directly.
  • Assign a numeric digit between one and six to all letters except the first using the following mappings:
    • 1: B, F, P or V
    • 2: C, G, J, K, Q, S, X, Z
    • 3: D, T
    • 4: L
    • 5: M, N
    • 6: R
  • Where any adjacent digits are the same, remove all but one of those digits unless a vowel, H, W or Y was found between them in the original text. The C# code described below uses a temporary placeholder for these non-encodable letters to avoid incorrect removal of adjacent, matching digits.
  • Force the code to be four characters in length by padding with zero characters or by truncation.

Implementing Soundex in C#

In the remainder of this article we will implement the Soundex algorithm in C#. We will create a class containing two public methods. The first will generate the Soundex code for a given string. The second will compare two strings and return a value that indicates how similar they sound.

To begin, create a new console application project and add a class named "Soundex".

Calculating a Soundex Code

The first step is to add the signature of the GetSoundex method. This public method accepts a single parameter containing the string to be encoded. It returns the Soundex code as a new string. To create the method, add the following code to the Soundex class:

public string GetSoundex(string value)
{
}

The initial letter of a Soundex code is usually presented in upper case. Following this standard, and also making it easier to perform the comparisons for both upper and lower case letters, the input string is capitalised during the initialisation of the algorithm. We also create a new StringBuilder object to hold the Soundex code as it is constructed. The first two lines of the GetSoundex method are therefore:

value = value.ToUpper();
StringBuilder soundex = new StringBuilder();

With the algorithm initialised we can process all of the individual letters in the input string. This is achieved by looping through the characters and calling a method named "AddCharacter" whenever a letter is found. AddCharacter is responsible for adding the initial letter and subsequent digits to the Soundex code StringBuilder object. We will create this method later.

foreach (char ch in value)
{
    if (char.IsLetter(ch))
        AddCharacter(soundex, ch);
}

At the end of the loop, the StringBuilder may contain the placeholder characters described earlier. It is also likely that it will not be exactly four characters in length. The final part of the GetSoundex method rectifies these problems before converting the StringBuilder into a string and returning the result. Add the final three lines to the method:

RemovePlaceholders(soundex);
FixLength(soundex);
return soundex.ToString();

Adding Soundex Character Codes

The GetSoundex method calls three private methods that have yet to be defined. The first of these is the AddCharacter method. This encodes a letter and appends it to the code. We'll split this method into two parts to make the code easy to read. The first part checks if the letter being processed is the first to be found. If it is, the letter becomes the first character in the code. If not, a further method, "AddSoundexDigit", is called to convert the letter to a numeric digit or placeholder.

private void AddCharacter(StringBuilder soundex, char ch)
{
    if (soundex.Length != 0)
        AddSoundexDigit(soundex, ch);
    else
        soundex.Append(ch);
}

Determining Soundex Digits

The AddSoundexDigit method is responsible for encoding letters as digits. Several steps are required. Firstly, the letter is converted to a value between one and six according to the rules described earlier. If the latest letter is not encodable, a full stop (period) character is used as a placeholder. Next, the generated digit is compared to the previous digit. Unique values are added to the Soundex code but duplicates are not. The placeholders ensure that duplicates are not removed when separated by a vowel or other non-encodable value.

private void AddSoundexDigit(StringBuilder soundex, char ch)
{
    char code = GetSoundexDigit(ch);
    if (code != soundex[soundex.Length - 1])
        soundex.Append(code);
}

private char GetSoundexDigit(char ch)
{
    if ("BFPV".Contains(ch))
        return '1';
    else if ("CGJKQSXZ".Contains(ch))
        return '2';
    else if ("DT".Contains(ch))
        return '3';
    else if (ch == 'L')
        return '4';
    else if ("MN".Contains(ch))
        return '5';
    else if (ch == 'R')
        return '6';
    else 
        return '.';
}

Removing Placeholder Characters

The next private method removes the placeholder characters from the StringBuilder object, leaving only the encoded characters. This is achieved with a call to the StringBuilder's Replace method:

private void RemovePlaceholders(StringBuilder soundex)
{
    soundex.Replace(".", "");
}

Setting the Soundex Length

To complete the GetSoundex method requires one final private method that forces the code to be four characters in length. If the length is currently less than four characters, a series of zeroes is added to pad out the string. If not, the length is set to four, truncating the Soundex code if necessary.

private void FixLength(StringBuilder soundex)
{
    int length = soundex.Length;
    if (length < 4)
        soundex.Append(new string('0', 4 - length));
    else
        soundex.Length = 4;
}

Comparing Strings

The second public method of the Soundex class compares two strings to determine if they sound alike. In this case we use a simple algorithm. Firstly, the two strings are encoded using the Soundex algorithm. Next, the pairs of characters at each of the four positions are compared. The method returns the number of matches that are found as an integer. A result of four indicates the best possible match and zero the worst possible match. These values are useful when sorting a list of possible matches with the most likely appearing first.

public int Compare(string value1, string value2)
{
    int matches = 0;
    string soundex1 = GetSoundex(value1);
    string soundex2 = GetSoundex(value2);

    for (int i = 0; i < 4; i++)
        if (soundex1[i] == soundex2[i]) matches++;

    return matches;
}

Testing the Methods

To test the algorithms we can use the Main method of the console application. The following simple test encodes two strings and compares their Soundex codes. Try changing the strings to see the results.

string value1 = "Smith";
string value2 = "Smythe";

Soundex soundex = new Soundex();
Console.WriteLine(soundex.GetSoundex(value1));      // Outputs "S530"
Console.WriteLine(soundex.GetSoundex(value2));      // Outputs "S530"
Console.WriteLine(soundex.Compare(value1, value2)); // Outputs "4"
Link to this Page12 February 2010
TwitterTwitter RSS Feed RSS