InDesign CS3's GREP supports two classes of wildcards: the five standard GREP wildcards \w, \u, \l, \d, and \s (and their negations \W, \U, \L, \D, and \S), and 12 POSIX wildcards (alnum, punct, etc.); see http://oreilly.com/catalog/9780596156008/ for details.
InDesign CS4 added a third type of wildcard, the so-called Unicode properties. This new class of wildcards (new to InDesign, anyway) contains seven basic properties, each consisting of several subproperties. For instance, the basic property \p{Punctuation} -- or its abbreviated form \p{P*} -- matches all punctuation. It has seven subproperties: \p{Pi} and \p{Pf} match initial and final quotation marks, cross-linguistically; \p{Ps} and \p{Pe} match start and end punctuation such as brackets, braces, and parentheses; \p{Pd} matches dash punctuation (hyphens and dashes); \p{Pc} finds all connector punctuation (underscore and Unicode points 203F, 2040, and 2054); and \p{Po} captures "other" punctuation, which covers periods, colons, commas, etc. (see grep_mapper.pdf for a complete overview). As with the standard GREP classes, Unicode properties can be negated by using a capital: \P{P*} matches anything that's not punctuation.
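To get a feel for the punctuation subproperties outside InDesign, here is a minimal sketch in plain JavaScript. It tests each of Unicode's punctuation subproperties (including \p{Pd}, dash punctuation) against a few sample characters. Note the assumptions: JavaScript's regular expression engine is not the Boost engine that InDesign uses, so edge cases may differ; JavaScript writes the basic property as \p{P} rather than \p{P*}; and the helper names (punctuationClass, isPunctuation, isNotPunctuation) are made up for this illustration.

```javascript
// Illustration only: JavaScript regexes (with the "u" flag), not InDesign's
// Boost engine. JavaScript spells the basic property \p{P}, not \p{P*}.
const punctuationSubproperties = {
  Pi: /\p{Pi}/u, // initial quotation marks
  Pf: /\p{Pf}/u, // final quotation marks
  Ps: /\p{Ps}/u, // start (open) punctuation
  Pe: /\p{Pe}/u, // end (close) punctuation
  Pd: /\p{Pd}/u, // dash punctuation
  Pc: /\p{Pc}/u, // connector punctuation
  Po: /\p{Po}/u, // other punctuation
};

// Hypothetical helper: name of the first subproperty matching ch, or null.
function punctuationClass(ch) {
  for (const [name, re] of Object.entries(punctuationSubproperties)) {
    if (re.test(ch)) return name;
  }
  return null;
}

// The basic property and its negation (capital P).
const isPunctuation = (ch) => /\p{P}/u.test(ch);
const isNotPunctuation = (ch) => /\P{P}/u.test(ch);

// Print subproperty and basic-property membership for some samples.
for (const ch of ["\u201C", "\u201D", "(", ")", "-", "_", ".", "a"]) {
  console.log(JSON.stringify(ch), punctuationClass(ch), isPunctuation(ch));
}
```

A letter such as "a" falls in none of the subproperties, so the helper returns null for it, while the negated form \P{P} matches it.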
With the addition of this third class of wildcards it is now possible to do more refined searches more easily. InDesign implements the Unicode properties as defined by Boost and documented in J. Friedl, Mastering Regular Expressions, O'Reilly, 2006, pp. 122 and 123. As far as I know they've not been documented for InDesign; CS4's support for them was pointed out to me by Laurent Tournier.
The addition of the new class of wildcards means that InDesign's GREP now supports 96 wildcards: 10 standard, 12 POSIX, and 74 Unicode (counting negated classes separately for dramatic effect; if you don't like drama, there are still 56 classes). They are distinguished by their form: \w, [[:alpha:]], and \p{L*}, respectively. With such a plethora of symbols, the question arises of which character class matches which characters.
To check how InDesign's GREP classes and properties capture characters from selected Unicode ranges, I use an InDesign document as a template and a script that prints selected Unicode ranges in that document. The template contains the 59 wildcards in as many paragraph styles -- indeed, each paragraph style has defined in it a GREP style, which, when selected, highlights the characters matched by the GREP character class defined in it. In a way, this is an extension of Gerald Singelmann's charts first published at indesign-faq.de.
CS3 and CS4 are dealt with in different parts as CS4 introduced new GREP classes and GREP styles. We'll start with CS4.
Download the ZIP file (download links at the bottom of this page), retrieve these files and place them in your script folder:
Then run the grep_mapper.jsx script. It shows this dialog:
![[image: grep mapper interface]](images/grep_mapper_interface.gif)
The dialog shows the Unicode ranges listed in grep_mapper.txt -- later on I'll show how you can change this list to add or remove Unicode ranges. Select ranges in the usual way: click to select a single line, and Ctrl/Cmd+click or Shift+click to add to an existing selection. Finally, press OK or Enter/Return to create the selected Unicode ranges in the template.
Suppose you selected all Latin Unicode ranges, the first eight in the list. Press OK and the script opens the template and prints the selected ranges in columns: a four-digit Unicode number followed by the corresponding character. A square indicates that the character is not available in the selected font. (The template uses Everson Mono; see Some notes on the template, below, for details and changing the font.)
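The listing format described above (a four-digit Unicode number followed by the character) can be sketched in a few lines of plain JavaScript. This is not the code of grep_mapper.jsx; the function names formatCodePoint and listRange and the single-space separator are assumptions made for the illustration.

```javascript
// Hypothetical helper: format one code point as a four-digit uppercase hex
// number followed by the character itself, e.g. 0x41 -> "0041 A".
function formatCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  return hex + " " + String.fromCodePoint(cp);
}

// Hypothetical helper: one formatted line per code point in first..last
// (both ends inclusive).
function listRange(first, last) {
  const lines = [];
  for (let cp = first; cp <= last; cp++) {
    lines.push(formatCodePoint(cp));
  }
  return lines;
}

// Print the first five capital letters of basic Latin.
console.log(listRange(0x0041, 0x0045).join("\n"));
```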
![[image: grep mapper document]](images/grep_mapper_template_1.gif)
To display the scope of a certain class or wildcard, open the paragraph style panel (press F11). The paragraph styles are assembled in three groups that correspond with the three types of wildcard: Unicode property, GREP, and POSIX.
![[image: grep mapper style panel]](images/grep_mapper_style_panel_1.gif)
The Unicode style group contains seven style groups, each representing a basic property. To show them, click the Unicode style group:
![[image: grep mapper style panel]](images/grep_mapper_style_panel_2.gif)
To highlight in the InDesign document all characters matched by, say, \p{Lowercase_letter}, expand the Unicode group, then expand the \p{Letter} group, then click \p{Lowercase_letter}:
![[image: grep mapper style panel]](images/grep_mapper_template_2.gif)
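What the highlighting shows is, in effect, the subset of a Unicode range that a given class matches. The same idea can be sketched in plain JavaScript, under the assumption that JavaScript's \p{Ll} behaves like InDesign's \p{Lowercase_letter} for the range in question (the two engines are different, so this is not guaranteed). The function name matchedInRange is hypothetical.

```javascript
// Hypothetical helper: collect every character in first..last (inclusive)
// that the regex matches. Pass a regex WITHOUT the "g" flag, otherwise
// test() keeps state between calls.
function matchedInRange(first, last, re) {
  const hits = [];
  for (let cp = first; cp <= last; cp++) {
    const ch = String.fromCodePoint(cp);
    if (re.test(ch)) hits.push(ch);
  }
  return hits;
}

// Lowercase letters in "C0 controls and basic Latin" (0x0000-0x007F).
const lower = matchedInRange(0x0000, 0x007f, /\p{Ll}/u);
console.log(lower.length, lower.join(""));
```

Swapping in another regex, for instance /\p{Lu}/u or /\d/, gives the scope of that class over the same range.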
To display the scope of another class, just click that paragraph style in the panel. Other types of GREP can be tested as well. To see how the standard GREP wildcards fare, open the GREP group to show the paragraph styles in that group. The result is this:
![[image: grep mapper style panel]](images/grep_mapper_style_panel_3.gif)
The way you select paragraphs in the template depends on how many pages were created in it. If there's just one page, then the most convenient way to select all paragraphs in the frame is simply to select the frame (as in the screen shot, above), then select a paragraph style. But this selection method doesn't work when there is more than one page in the document. If that's the case, you need to select all text in the story (click in the text and press Ctrl/Cmd+A) before selecting a paragraph style.
Download the ZIP file (download links at the bottom of this page), retrieve these files and place them in your script folder:
Then run the grep_mapper.jsx script. It shows the same dialog as shown above (shown abbreviated here). Select any range that you want printed in the document (click, Shift+click, and Ctrl/Cmd+click work as usual), then click OK or press Enter/Return. The grep_mapper_cs3 document is loaded and the script prints the selected Unicode ranges in it.
![[image: grep mapper interface]](images/grep_mapper_interface_short.gif)
To show the scope of some GREP class, start the grep_show.jsx script. It shows this dialog:
![[image: grep mapper interface CS3]](images/grep_show_cs3.gif)
Pick any GREP class, then press OK. All characters matched by the selected GREP class are highlighted in the document. To display the scope of another GREP class, just run the script again.
What follows goes for both CS3 and CS4. The configuration file can be changed to add or remove Unicode ranges. The first few lines of the configuration file look like this:
    C0 controls and basic Latin (0x0000-0x007F) /Latin
    C1 controls and Latin-1 supplement (0x0080-0x00FF) /Latin-1
    Latin extended A (0x0100-0x017F) /Latin-A
    Latin extended B (0x0180-0x024F) /Latin-B
    Latin extended C (0x2C60-0x2C7F) /Latin-C
    Latin extended D (0xA720-0xA78C) /Latin-D
    Latin extended D (0xA7FB-0xA7FF) /Latin-D
    Latin extended additional (0x1E00-0x1EFF) /Latin add.
    Combining diacritical marks (0x0300-0x036F) /Comb dia
    Combining diacritical marks supplement (0x1DC0-0x1DFF) /Comb d/supp
    Combining diacritical marks for symbols (0x20D0-0x20FF) /Comb d/sym
    Combining half marks (0xFE20-0xFE2F) /Comb half
    IPA extensions (0x0250-0x02AF) /IPA
    Phonetic extensions + supplement (0x1D00-0x1DBF) /IPA ext
Each entry in the configuration file consists of three parts:
1. The name of the range, e.g. "C1 controls and Latin-1 supplement". This is displayed in the script's dialog. You can use any wording you like.
2. The Unicode range, e.g. (0x0080-0x00FF). This is shown in the dialog, too. You must use the notation 0x0000 for the Unicode numbers, and the ranges must be wrapped in parentheses. The script uses these Unicode numbers to print the range.
3. A label that is printed as a column header in the template. The script expects this at the end of each line, after a forward slash. You can use any text you like, but the shorter the better.
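To make the three-part format concrete, here is a minimal sketch of how one entry could be split into its parts in plain JavaScript. It is not taken from grep_mapper.jsx; the regular expression, the function name parseEntry, and the property names are assumptions based only on the format described above.

```javascript
// Assumed entry format: name, then "(0xNNNN-0xNNNN)", then "/" and the label.
// The name is matched lazily so that it stops before the parenthesised range;
// the label is everything after the first slash that follows the range, so
// labels such as "Comb d/supp" survive intact.
const ENTRY = /^(.*?)\s*\((0x[0-9A-Fa-f]+)-(0x[0-9A-Fa-f]+)\)\s*\/(.*)$/;

// Hypothetical helper: parse one configuration line, or return null if the
// line does not follow the expected format.
function parseEntry(line) {
  const m = ENTRY.exec(line.trim());
  if (!m) return null;
  return {
    name: m[1],                // shown in the dialog
    first: parseInt(m[2], 16), // start of the range, as a number
    last: parseInt(m[3], 16),  // end of the range, as a number
    label: m[4].trim(),        // column header in the template
  };
}

console.log(parseEntry("C1 controls and Latin-1 supplement (0x0080-0x00FF) /Latin-1"));
```

Returning null for malformed lines makes it easy to spot an entry whose parentheses or 0x notation were mistyped while editing the list.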
For example, you could combine the first and second, and the third and fourth lines as follows:
    C0/1 controls, basic Latin and Latin-1 (0x0000-0x00FF) /Latin 0-1
    Latin extended A/B (0x0100-0x024F) /Latin-A/B
The template uses the font Everson Mono (http://www.evertype.com/emono/), a very large Unicode font (25 Euro shareware). Alternatives are Code2000 (http://www.code2000.net/code2000_page.htm, US$ 5 shareware) and Arial Unicode. The latter is a free download from Microsoft, but it's ageing: it supports Unicode only up to version 2.1. Everson Mono and Code2000 are much more up to date.
To use another font in the template, edit the paragraph style [Basic Paragraph]. This change is propagated to all other paragraph styles.
All required files are put together in this zip file (CS3 and CS4): grep_mapper.zip
Installing and running scripts