.NET 1.1+

Regular Expression Unicode Categories and Blocks

by Richard Carr, published at http://www.blackwasp.co.uk/RegexUnicodeCategories.aspx

The thirteenth part of the Regular Expressions in .NET tutorial continues to describe the pattern characters used when matching and substituting text. This article looks at the matching of characters from specific Unicode general categories and code point blocks.

Unicode

Unicode provides an industry standard encoding system for characters. Unlike simpler systems, such as ASCII, Unicode allows you to represent letters, numbers, white space, punctuation and other symbols for many different languages, both modern and historic. At the time of writing, the standard defines over 120,000 characters, each identified by a numeric code point.

Unicode General Categories

In addition to defining the symbol that will be displayed or printed, Unicode adds extra properties for each code point. One such property is known as the general category. This property organises characters into groups and sub-groups that are independent of the script or language to which a character belongs. For example, upper case letters are always in the "Lu" category and the "Sm" category is applied to code points that represent mathematical symbols. Each code point has exactly one general category, but every two-letter category is a sub-group of a broader one-letter group. "Lu" is part of "L", which covers all letters, so an upper case letter can be matched using either name.
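
As a rough illustration of how sub-categories nest inside broader groups, the sketch below counts the characters in a short string that fall into "Lu", "L" and "Nd". It uses the "\p{name}" pattern that is described under "Matching General Categories and Blocks" later in this article. The class name GeneralCategoryDemo and the helper CountMatches are hypothetical names chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

static class GeneralCategoryDemo
{
    // Hypothetical helper: counts the characters in 'input' that match 'pattern'.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        string input = "Hello World 42";

        // "Lu" is the upper case letter sub-category: 'H' and 'W'.
        Console.WriteLine(CountMatches(input, @"\p{Lu}"));  // 2

        // "L" is the broader group that contains every letter sub-category.
        Console.WriteLine(CountMatches(input, @"\p{L}"));   // 10

        // "Nd" is the decimal digit sub-category: '4' and '2'.
        Console.WriteLine(CountMatches(input, @"\p{Nd}"));  // 2
    }
}
```

The two upper case letters are counted by both "\p{Lu}" and "\p{L}", because the second name is simply the wider group to which the first belongs.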

Unicode Blocks

Unicode also organises code points into contiguous, named ranges known as blocks. Blocks have unique names and do not overlap, so any single code point can appear in only one block. Example blocks include "Arabic", "Bengali" and "Mongolian".

Matching General Categories and Blocks

When using regular expressions, you can match characters based upon their general category or the block in which they appear. To match a single character from either grouping, use the pattern "\p", followed by the name of the category or block within braces. In .NET, block names are prefixed with "Is", as in "\p{IsArabic}". For example, to match mathematical symbols, you could use "\p{Sm}", as in the following example:

// Requires: using System; using System.Text.RegularExpressions;
string input = "5+5=10";

foreach (Match match in Regex.Matches(input, @"\p{Sm}"))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}

/* OUTPUT
  
Matched '+' at index 1
Matched '=' at index 3

*/
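
Blocks can be matched in the same manner. The sketch below is a minimal example, assuming the .NET block names "IsCyrillic" and "IsBasicLatin", which carry the "Is" prefix that .NET expects for blocks. The class name UnicodeBlockDemo and the helper FirstRunInBlock are hypothetical names chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnicodeBlockDemo
{
    // Hypothetical helper: returns the first run of consecutive characters
    // from the named block, or an empty string if there is none.
    // 'blockName' must include the "Is" prefix, for example "IsCyrillic".
    public static string FirstRunInBlock(string input, string blockName)
    {
        Match match = Regex.Match(input, @"\p{" + blockName + "}+");
        return match.Success ? match.Value : "";
    }

    static void Main()
    {
        // "Hello " followed by a six-letter Cyrillic word, written with
        // escape sequences so that the source file remains ASCII.
        string input = "Hello \u041F\u0440\u0438\u0432\u0435\u0442";

        // The run of Cyrillic letters at the end of the string.
        Console.WriteLine(FirstRunInBlock(input, "IsCyrillic"));

        // The run "Hello " including its trailing space, as the space
        // character is also part of the Basic Latin block.
        Console.WriteLine(FirstRunInBlock(input, "IsBasicLatin"));
    }
}
```

Whether the Cyrillic text displays correctly depends upon the console's output encoding, so no expected output is shown here.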

As with the shorthand character classes, you can negate a match by using an upper case "P" in place of the lower case letter. To demonstrate, try running the modified sample code below, which matches every character that is not in the mathematical symbols category.

string input = "5+5=10";

foreach (Match match in Regex.Matches(input, @"\P{Sm}"))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}

/* OUTPUT
  
Matched '5' at index 0
Matched '5' at index 2
Matched '1' at index 4
Matched '0' at index 5
             
*/
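
The negated form can also be written as a negative character class. The sketch below suggests that "\P{Sm}" and "[^\p{Sm}]" select the same characters from the sample input. The class name NegatedCategoryDemo and the helper JoinMatches are hypothetical names chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

static class NegatedCategoryDemo
{
    // Hypothetical helper: joins every match of 'pattern' into one string.
    public static string JoinMatches(string input, string pattern)
    {
        string result = "";
        foreach (Match match in Regex.Matches(input, pattern))
        {
            result += match.Value;
        }
        return result;
    }

    static void Main()
    {
        string input = "5+5=10";

        // Upper case "P" negates the category.
        Console.WriteLine(JoinMatches(input, @"\P{Sm}"));     // 5510

        // The same match expressed as a negative character class.
        Console.WriteLine(JoinMatches(input, @"[^\p{Sm}]"));  // 5510
    }
}
```

The character class form is useful when you need to combine several categories or blocks, or mix them with literal characters, in a single set.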

30 November 2015