Displaying the scope of GREP wildcards

InDesign's GREP supports three classes of wildcards:

  1. the standard GREP wildcards \w, \u, \l, \d, \s, \h, \v, \K, and \R, and their negations (\W, \U, \L, \D, \S, \H, \V); \h and \v were introduced in CS6, and \K and \R have no negated counterpart;
  2. POSIX wildcards (alnum, punct, etc.);
  3. Unicode properties, introduced in CS4. This class contains seven basic properties, each consisting of several subproperties. For instance, the basic property \p{Punctuation} – or its abbreviated form \p{P*} – matches all punctuation. It has seven subproperties: \p{Pi} and \p{Pf} match initial and final quotation marks, cross-linguistically; \p{Ps} and \p{Pe} match start and end punctuation such as brackets, braces, and parentheses; \p{Pd} matches dash punctuation (hyphens and dashes); \p{Pc} matches connector punctuation (underscore and Unicode points 203F, 2040, and 2054); and \p{Po} matches "other" punctuation, which covers periods, colons, commas, etc. As with the standard GREP classes, Unicode properties can be negated by using a capital: \P{P*} matches anything that's not punctuation.
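The way a basic property divides into subproperties can be explored outside InDesign. The following is a minimal sketch in Python, assuming that the general categories in Python's unicodedata module are a reasonable stand-in for what InDesign's GREP matches; InDesign's own tables may differ in detail, and the function names are illustrative only.

```python
import unicodedata


def punctuation_subproperty(ch):
    """Return the two-letter punctuation category of ch (e.g. 'Ps'), or None.

    Assumption: Python's Unicode database stands in for InDesign's GREP tables.
    """
    category = unicodedata.category(ch)
    return category if category.startswith("P") else None


def matches_punctuation(ch):
    """Rough equivalent of \\p{P*}: True for any punctuation character."""
    return punctuation_subproperty(ch) is not None


def matches_not_punctuation(ch):
    """Rough equivalent of the negated form \\P{P*}."""
    return not matches_punctuation(ch)


if __name__ == "__main__":
    # One sample character for several of the subproperties.
    for ch in "\u00ab\u00bb()_.":
        print(f"{ord(ch):04X}", punctuation_subproperty(ch))
```

A character belongs to exactly one subproperty, so testing the first letter of the category corresponds to the basic property and testing both letters corresponds to a subproperty.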

For details, see here:

With the addition of this third class of wildcards it is now possible to do more refined searches more easily. InDesign implements the Unicode properties as defined by Boost and documented in J. Friedl, Mastering Regular Expressions, O'Reilly, 2006, pp. 122 and 123.

The addition of the new class of wildcards means that InDesign's GREP now supports 102 wildcards: 16 standard, 12 POSIX, and 74 Unicode (counting negated classes separately for dramatic effect; if you don't like drama, there are still 60 classes). They are distinguished by their form: \w, [[:alpha:]], and \p{L*}, respectively. With such a plethora of symbols, the question arises of which character class matches which characters.

To check how InDesign's GREP classes and properties capture characters from selected Unicode ranges, I use an InDesign document as a template and a script that prints selected Unicode ranges in that document. The script then lets you select a GREP class and highlight its targets.

Use

The script and the document work in CS4 and later. Download the ZIP file (download links at the bottom of this page), retrieve these files and place them in your script folder:

  1. grep_mapper.idml;
  2. grep_mapper.jsx;
  3. grep_mapper.txt.

Then run the grep_mapper.jsx script. It shows this dialog:

[image: grep mapper interface]

The dialog shows the Unicode ranges listed in grep_mapper.txt – later on I'll show how you can change this list to add or remove Unicode ranges. Select ranges in the usual way: click to select a single line, and use Ctrl/Cmd+click and Shift+click to add to an existing selection. Finally, press OK or Enter/Return to create the selected Unicode ranges in the template.

Suppose you selected all Latin Unicode ranges, the first eight in the list. Press OK and the script opens the IDML file and prints the selected ranges in columns: a four- or five-digit Unicode number followed by the corresponding character. A square indicates that the character is not available in the selected font. (The template uses Cambria; see Some notes on the template, below, for details and changing the font.)

[image: grep mapper document]
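The layout of each printed entry, a four- or five-digit Unicode number followed by the character, can be sketched as follows. This is an illustration in Python, not the code in grep_mapper.jsx; the function names are hypothetical, and the tab separator is an assumption.

```python
def format_entry(code_point):
    """Format one entry: hexadecimal code point, a tab, then the character.

    Four hex digits for U+0000..U+FFFF; at least five for higher planes.
    """
    width = 4 if code_point <= 0xFFFF else 5
    return f"{code_point:0{width}X}\t{chr(code_point)}"


def list_range(first, last):
    """Return the formatted entries for every code point in first..last."""
    return [format_entry(cp) for cp in range(first, last + 1)]


if __name__ == "__main__":
    # Latin extended A, as listed in the configuration file.
    for entry in list_range(0x0100, 0x017F):
        print(entry)
```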

The script then displays a panel with the three GREP classes. To highlight the scope of a certain class or wildcard, select it in the tree view. For example, to display in the InDesign document all characters matched by, say, \p{Lowercase_letter}, expand the Unicode properties group, then expand the Letter group, then click Lowercase_letter:

[image: grep mapper style panel]

From InDesign CC onwards the panel looks a bit different, but its functionality is the same:

[image: grep mapper style panel CC]

When you click a node, the GREP expression is printed in the panel's frame. To see what is matched by the selected GREP expression, double-click the node:

[image: grep mapper style panel]

To display the scope of another class, just double-click the node in the panel.
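What "the scope of a class" means can be illustrated outside InDesign. The sketch below is in Python and assumes that the general category "Ll" in Python's unicodedata module approximates \p{Lowercase_letter}; it does not reproduce InDesign's GREP engine or the script's highlighting, and the function name is hypothetical.

```python
import unicodedata


def scope(first, last, category_prefix):
    """Return the characters in first..last whose general category starts
    with category_prefix, e.g. 'Ll' for lowercase letters, 'L' for all letters.

    Assumption: Python's Unicode database approximates InDesign's tables.
    """
    return [
        chr(cp)
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)).startswith(category_prefix)
    ]


if __name__ == "__main__":
    # Lowercase letters in "C0 controls and basic Latin" (0x0000-0x007F).
    print("".join(scope(0x0000, 0x007F, "Ll")))
```

Passing a one-letter prefix such as "L" corresponds to a basic property like \p{L*}, while a two-letter prefix corresponds to a single subproperty.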

The configuration file

The configuration file can be changed to add or remove Unicode ranges. The first few lines of the configuration file look like this:

C0 controls and basic Latin   (0x0000-0x007F) /Latin
C1 controls and Latin-1 supplement   (0x0080-0x00FF) /Latin-1
Latin extended A   (0x0100-0x017F) /Latin-A
Latin extended B   (0x0180-0x024F) /Latin-B
Latin extended C   (0x2C60-0x2C7F) /Latin-C
Latin extended D   (0xA720-0xA78C) /Latin-D
Latin extended D   (0xA7FB-0xA7FF) /Latin-D
Latin extended additional   (0x1E00-0x1EFF) /Latin add.
Combining diacritical marks   (0x0300-0x036F) /Comb dia
Combining diacritical marks supplement   (0x1DC0-0x1DFF) /Comb d/supp
Combining diacritical marks for symbols   (0x20D0-0x20FF) /Comb d/sym
Combining half marks   (0xFE20-0xFE2F) /Comb half
IPA extensions   (0x0250-0x02AF) /IPA
Phonetic extensions + supplement   (0x1D00-0x1DBF) /IPA ext

Changing Unicode ranges

Each entry in the configuration file consists of three parts:

1. The name of the range, e.g. "C1 controls and Latin-1 supplement". This is displayed in the script's dialog. You can use any wording you like.

2. The Unicode range, e.g. (0x0080-0x00FF). This too is shown in the dialog. You must use the notation 0x0000 for the Unicode values (or 0x00000 for plane 1 and higher planes), and the range must be wrapped in parentheses. The script uses these Unicode numbers to print the range.

3. A label that is printed as a column header in the template. The script expects this at the end of each line, after a forward slash. You can use any text you like, but the shorter the better.

For example, you could combine the first and second lines, and the third and fourth lines, as follows:

C0/1 controls, basic Latin and Latin-1   (0x0000-0x00FF) /Latin 0-1
Latin extended A/B   (0x0100-0x024F) /Latin-A/B
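The three-part entry format described above (name, parenthesised range, label after a forward slash) can be checked with a small parser before you run the script. This is a minimal sketch in Python, not the parsing code in grep_mapper.jsx; the regular expression and the names Entry and parse_entry are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    name: str    # wording shown in the dialog
    first: int   # first code point of the range
    last: int    # last code point of the range
    label: str   # column header printed in the template


# name, then "(0xHHHH-0xHHHH)", then "/" followed by the label.
ENTRY_RE = re.compile(
    r"^(?P<name>.*?)\s*"
    r"\((?P<first>0x[0-9A-Fa-f]{4,6})-(?P<last>0x[0-9A-Fa-f]{4,6})\)\s*"
    r"/(?P<label>.*)$"
)


def parse_entry(line):
    """Split one configuration line into its three parts.

    Raises ValueError if the line does not follow the expected layout.
    """
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Not a valid entry: {line!r}")
    first = int(match.group("first"), 16)
    last = int(match.group("last"), 16)
    if first > last:
        raise ValueError(f"Range runs backwards: {line!r}")
    return Entry(match.group("name"), first, last, match.group("label").strip())


if __name__ == "__main__":
    print(parse_entry("Latin extended A/B   (0x0100-0x024F) /Latin-A/B"))
```

The label is taken to be everything after the first slash that follows the closing parenthesis, so names and labels that themselves contain a slash, such as "Comb d/supp", still parse.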

Some notes on the template

The template is an IDML file that can be used in CS4 and later. It uses the font Cambria, which is supplied with most modern OSs and has a very large character set. There is a companion font, Cambria Math. Alternatives are Everson Mono, a very large Unicode font that is especially good for languages (25 Euro shareware); Junicode (a free font); and Code2000 (US$ 5 shareware).


Useful script?

Consider making a donation. To make a donation, please press the button below. This is Paypal's payment system; you don't need a Paypal account to use it: you can use several types/brands of credit and debit card.

Peter Kahrel's paypal account

Download

All required files are put together in this zip file: grep_mapper.zip


Version history

28 February 2015: Updated the text with a note on two more GREP classes (\K and \R). This has no influence on the scripts, which are therefore unchanged. Added a screenshot of the script's panel in InDesign CC because it looks different from CS6 and earlier. The panel's functionality hasn't changed.

10 July 2014: Added a note about two wildcards introduced in CS6: \h (all horizontal space) and \v (all vertical space and break characters). The scripts haven't changed.

29 June 2013: Rewrote the script and the above text; added support for plane 1 and higher; dropped CS3 support.

Sept. 2009: first posted.

