Language-aware paragraph sorter
Languages differ in the way they treat accented letters when sorting lists. The script takes these differences into account by using the sort orders defined in a separate, editable file (see below for details). The script sorts paragraphs (not tables). You can choose to sort formatted paragraphs (all formatting is respected) or unformatted ones -- the latter is much quicker but should be used only on texts without any formatting. The script can also create retrograde lists (words are sorted from the end of the word rather than from the beginning), sort numerically, and sort by character style.
Use
- The download is a ZIP file containing the script (sort.jsxbin) and the sort orders (sortorders.txt). Place both in your script directory.
- Select the text to be sorted. To sort a selection of paragraphs, select those paragraphs. To sort a whole story, select a text frame containing all or part of the story, or place an insertion point in the story.
- Run the script. It shows this dialog:

The script tries to determine the currently selected language. If it can, the language shows in the dialog; if it can't, it displays [No Language]. To select a different language, pick it from the dropdown. (For changing and adding sort orders, see "Background", below.) If the picked language can be found in the file sortorders.txt and has a sort order, this is shown at Sort order. You can change this to suit your own tastes. Check Save (sort orders/words to ignore) on exit to save any changes you've made to a sort order or when you've added a sort order (see below for details).
- The sorter can do a Retrograde sort, in which paragraphs are sorted back to front. This groups rhyming words together, as it were.
- At Numeric sort you indicate that the sorter should sort a list numerically. In each paragraph the sorter picks the first number, which doesn't necessarily have to be at the beginning of the paragraph. Wrapping parentheses and brackets are ignored.
- To remove duplicate items after the list has been sorted, check Delete duplicates.
- To maintain any formatting in the text to be sorted, check Formatted text. If your text does not contain any formatting, uncheck this option: text sorts much more quickly when Formatted text is disabled. (When you select some paragraphs, this option is not available for technical reasons. This is not an issue, because if you can select the paragraphs to be sorted, the list is short enough for the formatted sort to be quick enough.)
- In English lists, names with patronyms (MacDonald, McCormick, O'Neil) are usually sorted at the beginning of the M and the O, respectively. The difference between Mc and Mac is neutralised. This option is enabled by checking Mac, Mc, O' first.
- Letter by letter ignores spaces and hyphens; Word by word sorts by the word. The table shows the difference between the two methods (the list is from R. M. Ritter, The Oxford Guide to Style, Oxford University Press, 2002, pp. 581-2).
| Word by word | Letter by letter |
|---|---|
| High, J. | High, J. |
| high (light-hearted) | high (light-hearted) |
| high chair | highball |
| high-fliers | highbrow |
| high heels | high chair |
| High-Smith, P. | Highclere Castle |
| high water | high-fliers |
| High Water (play) | high heels |
| highball | highlights |
| highbrow | Highsmith, A. |
| Highclere Castle | High-Smith, P. |
| highlights | high water |
| Highsmith, A. | High Water (play) |
| highways | highways |
The way these two types of sort perform is determined to a large extent by the order in which the space, hyphen, and parenthesis are given in the sort string.
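The difference between the two methods can be illustrated with two sort keys. This is only a minimal sketch in Python, not the script's own method (the script relies on the position of the space and hyphen in the sort string, as noted above); the function names `word_by_word_key` and `letter_by_letter_key` are made up for this illustration, and punctuation other than spaces and hyphens is not handled.

```python
import re

def letter_by_letter_key(text):
    # Letter by letter: drop spaces and hyphens, then compare what is left.
    return re.sub(r"[ \-]", "", text).casefold()

def word_by_word_key(text):
    # Word by word: compare the first words, then the second words, and so on.
    # A shorter list that is a prefix of a longer one sorts first, so
    # "high water" (first word "high") precedes "highball".
    return [word for word in re.split(r"[ \-]+", text.casefold()) if word]

entries = ["highball", "high water", "high chair", "highways",
           "high-fliers", "highlights", "high heels", "highbrow"]

by_word = sorted(entries, key=word_by_word_key)
by_letter = sorted(entries, key=letter_by_letter_key)
```

With these entries (a subset of the table above), `by_word` starts with all the "high ..." items before "highball", whereas `by_letter` interleaves them: "highball", "highbrow", "high chair", and so on.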
- At Ignore words you can indicate words at the beginning of paragraphs that the sorter should ignore. This is useful for sorting titles of books, films, etc., in which articles such as the, a, and an are ignored. To activate this feature, check Ignore words. These words are saved in the file sortorders.txt (for details see below).
- The script can sort using character styles in a document. If the document contains any character styles, these are displayed at Character styles. To sort using a style, pick it in the dropdown and check Sort on selected character style. This is useful when you need to maintain a list whose sort order can't be determined automatically.
- Press OK to do the sort.
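Two of the options described above, Retrograde sort and Numeric sort, amount to choosing a different sort key for each paragraph. The following Python sketch shows the idea; it is an illustration under assumptions, not the script's code. The names `retrograde_key` and `numeric_key` are hypothetical, only whole numbers are recognised, and placing paragraphs without a number at the end is a choice made here, not something the script is known to do.

```python
import re

def retrograde_key(paragraph):
    # Compare paragraphs from their last character backwards.
    return paragraph.casefold()[::-1]

def numeric_key(paragraph):
    # Use the first run of digits found anywhere in the paragraph, so
    # wrapping parentheses or brackets such as "(10)" or "[3]" do not matter.
    match = re.search(r"\d+", paragraph)
    if match:
        return (0, int(match.group()))
    return (1, 0)  # assumption: paragraphs without a number go last

rhymes = sorted(["station", "cat", "nation", "hat"], key=retrograde_key)
numbered = sorted(["(10) ten", "Item 2", "[3] three"], key=numeric_key)
```

Here `rhymes` is `["nation", "station", "cat", "hat"]` and `numbered` is `["Item 2", "[3] three", "(10) ten"]`.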
Download script (a ZIP file) --- View/download sortorders.txt (it's also in the ZIP file)
Background
Languages divide into different types according to how they treat accented letters and digraphs when sorting lists. (If you see garbled characters in this text, enable Unicode/UTF-8 in your browser, probably in View > Character Encoding or something similar, and select a Unicode font such as Lucida Sans Unicode or Microsoft's Times or Arial in the browser's options.) The types are:
- Accented characters are grouped at the end of the alphabet. They are in effect considered as separate letters. This is the case in the Scandinavian languages. In Danish the sort order is ABC . . . XYZÆØÅ.
- Accented letters follow the unaccented ones, and these letters, too, are considered separate letters. Polish uses the sort order AĄBCĆ . . . XYZŹŻ.
- Accents are ignored. In German and Italian, for example, words are sorted as if the accents weren't there.
- Some letter combinations are treated as one character. In Czech, because the digraph ch is pronounced as a voiceless h, it is sorted after h; in Spanish, ll is treated as a single l and ch is treated as c.
- Some languages mix two or more of these types; in Czech, some accented letters follow the unaccented ones, some accents are ignored, and, as mentioned in the previous point, ch is treated as a type of h. In Icelandic, Þ (thorn) is sorted at the end of the alphabet; Ð (eth) follows D; other accents are ignored.
- A special, tricky, case is French, where accent differences are compared from the end of the word rather than from the beginning, so that the last accented letter in a word decides the order. For example, the words cote, coté, côte, and côté are sorted by the script as shown in the first column, but should be ordered as in the second column:
Result Should be
-----------------
cote cote
coté côte
côte coté
côté côté
(Source: SortingAndCollating.pdf). The script doesn't handle these cases so French lists may need some manual post-ordering. I've no idea about the frequency of such cases -- it might not be a real problem. Follow the link, below, for detailed documentation.
- Finally, a completely neutral (or "diacritic-insensitive") sort order can be used to ignore each and every accent. This is useful, for instance, for sorting a name list (an address list, an index of authors) for an English-language publication in which several different types of accented letter are used. In such cases, all accents in Czech, Polish, Danish, etc. names are ignored.
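The neutral sort and the French case from the list above can both be sketched by splitting each word into its base letters and its accents. This Python example is an illustration only and assumes Unicode decomposition (NFD) is an acceptable way to separate accents; the script itself uses the sort strings described below. The names `split_accents`, `neutral_key`, `forward_key`, and `french_key` are invented for this sketch.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, list with the accents carried by each letter)."""
    base, accents = [], []
    for char in unicodedata.normalize("NFD", word.casefold()):
        if unicodedata.combining(char) and accents:
            accents[-1] += char      # attach the mark to the preceding letter
        else:
            base.append(char)
            accents.append("")
    return "".join(base), accents

def neutral_key(word):
    # Diacritic-insensitive: only the base letters count.
    # Note: letters such as Ł or Ø have no decomposition and stay as they are.
    return split_accents(word)[0]

def forward_key(word):
    # Base letters first, ties broken by accents read left to right.
    base, accents = split_accents(word)
    return (base, accents)

def french_key(word):
    # Base letters first, ties broken by accents read from the END of the word.
    base, accents = split_accents(word)
    return (base, accents[::-1])

words = ["côté", "coté", "côte", "cote"]
left_to_right = sorted(words, key=forward_key)   # cote, coté, côte, côté
french = sorted(words, key=french_key)           # cote, côte, coté, côté
```

The two results correspond to the "Result" and "Should be" columns in the French example above.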
The sorter handles all these possibilities (except, as mentioned, some French cases). The script looks for a text file "sortorders.txt" which should be located in the script folder. An attempt is made to determine the currently selected language and to show its sort order (if the file can't be found the script defaults to [No Language] and diacritic-insensitive sort order):

The different types of sort order are accounted for by using a different format for each type of letter. All types are displayed in the screen shot, which shows the sort order for Czech (a different language can be picked from the dropdown). The formats are as follows:
- To have an accented letter sort after the unaccented one, it is positioned after it: RŘ specifies that R and Ř are different letters and that R immediately precedes Ř. Naturally, this also covers the Scandinavian languages: the accented letters are simply placed after all the other letters.
- To ignore an accent, the accented letter is placed in square brackets. A[Á] indicates that A and Á should be treated as the same letter. Any number of similar letters can be added to the list: E[ÉĚ] stipulates that E, É, and Ě should be treated as the same letter.
- To treat a letter combination as a single letter, the combination is placed in curly brackets and placed after the letter it should follow in alphabetisation. For example, {CH} indicates that ch should be treated as a single letter. In Czech, this combination is sorted after the h. In the sort string, {CH} should therefore be placed after the H.
- To neutralise all accents, pick [No Language] from the dropdown.
- Enter just capitals: the script handles lower-case letters automatically.
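To make the formats above concrete, here is a minimal Python sketch of how such a sort string could be read and applied. It is an assumption about one possible implementation, not the script's actual code: `parse_sort_order` and `sort_key` are hypothetical names, and sending characters missing from the string to the end is a choice made for this example.

```python
def parse_sort_order(order):
    """Map each letter or letter combination (lower-cased) to its rank."""
    ranks, rank, i = {}, 0, 0
    while i < len(order):
        char = order[i]
        if char == "[":                   # accents to ignore: share the previous rank
            end = order.index("]", i)
            for variant in order[i + 1:end]:
                ranks[variant.lower()] = rank
            i = end + 1
        elif char == "{":                 # combination treated as a single letter
            end = order.index("}", i)
            rank += 1
            ranks[order[i + 1:end].lower()] = rank
            i = end + 1
        else:                             # ordinary letter: next rank
            rank += 1
            ranks[char.lower()] = rank
            i += 1
    return ranks

def sort_key(word, ranks):
    """Translate a word into a list of ranks, preferring the longest unit."""
    longest = max(len(unit) for unit in ranks)
    fallback = max(ranks.values()) + 1    # assumption: unknown characters sort last
    key, i, text = [], 0, word.lower()
    while i < len(text):
        for size in range(longest, 0, -1):
            unit = text[i:i + size]
            if unit in ranks:             # e.g. "ch" is tried before "c"
                key.append(ranks[unit])
                i += len(unit)
                break
        else:
            key.append(fallback + ord(text[i]))
            i += 1
    return key

CZECH = "0123456789 A[Á]BCČD[Ď]E[ÉĚ]FGH{CH}I[Í]JKLMN[Ň]O[Ó]PQRŘSŠT[Ť]U[ÚŮ]VWXY[Ý]ZŽ"
czech_ranks = parse_sort_order(CZECH)
words = ["ibis", "chata", "hrad", "čaj", "cihla", "dům"]
in_order = sorted(words, key=lambda w: sort_key(w, czech_ranks))
```

With the Czech string, `in_order` is cihla, čaj, dům, hrad, chata, ibis: č follows c as a separate letter, and ch sorts after h. Words that differ only in an ignored accent get the same key, so they keep their original relative order.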
The example here shows some lines from the file "sortorders.txt", showing how sort orders are encoded.
<This file uses UTF-8 encoding>
Polish 0123456789 AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUÚVWXYZŹŻ
Czech 0123456789 A[Á]BCČD[Ď]E[ÉĚ]FGH{CH}I[Í]JKLMN[Ň]O[Ó]PQRŘSŠT[Ť]U[ÚŮ]VWXY[Ý]ZŽ
Icelandic 0123456789 A[Á]BCDÐE[É]FGHI[Í]JKLMNO[Ó]PQRSTU[Ú]VWXY[Ý]ZÞÆÖ
[No Language] 0123456789 A[ÁÀÂÄÅĀĄĂÆ]BC[ÇĆČĊ]D[ĎĐ]E[ÉÈÊËĘĒĔĖĚ]FG
[ĢĜĞĠ]H[ĤĦ]I[ÍÌÎÏĪĨĬĮİ]J[ĵ]K[ķ]L[ŁĹĻĽ]MN[ÑŃŇŅŊ]O[ÓÒÔÖŌŎŐØŒ]
PQR[ŔŘŖ]S[ŚŠŜŞȘß]T[ŢȚŤŦ]ÞU[ÚÙÛÜŮŪŲŨŬŰŲ]VW[Ŵ]XY[ŸÝŶ]Z[ŹŻŽ]
Each line consists of two parts: the name of the language in InDesign's format, followed by a tab, followed by the sort order. The file is in UTF-8 format and must stay in that format.
(Note: the characters at [No Language] are on three lines for display purposes here only: they should remain on one line in the file.)
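Reading a file in this format takes only a few lines. The Python sketch below assumes exactly what is described above (language name, a tab, then the sort order, one entry per line) and skips any line without a tab, such as the encoding note on the first line. The function name `read_sort_orders` is hypothetical, and how the words to ignore are stored in the file is not covered here.

```python
def read_sort_orders(text):
    """Return {language name: sort string} from the contents of the file."""
    orders = {}
    for line in text.splitlines():
        if "\t" not in line:          # header note or blank line
            continue
        language, order = line.split("\t", 1)
        orders[language] = order      # keep the string as is, including its space
    return orders

sample = (
    "<This file uses UTF-8 encoding>\n"
    "Polish\t0123456789 AĄBCĆ\n"
    "[No Language]\t0123456789 A[ÁÀ]B\n"
)
orders = read_sort_orders(sample)
```

When reading the real file from disk, open it explicitly as UTF-8 (for example with `open("sortorders.txt", encoding="utf-8")`) so that the accented letters survive intact.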
Adding and changing sort orders
It is easy to add a new sort order or to change an existing one. To change a sort order, pick the language and make any changes in the displayed string. Make sure that the Save sort order box is checked to save the changes. To add a new language, pick it from the dropdown and enter the sort order after Sort order:. The new data are stored in the sort-order file. You can edit the file in a text editor, but remember to save it in UTF-8 format.
When changing or adding sort orders, bear two things in mind:
- Add letters as capitals only. The script will take care of all corresponding lower-case letters.
- If you omit a character, don't worry: it will just be sorted incorrectly; it will not disappear from your documents.
Ignore words
Words that are to be ignored at the beginning of paragraphs are listed together with the sort order. They can be entered using two formats. The simplest is just to list the words:
the a an
Write the words separated by spaces. To enter an apostrophe, just type the straight apostrophe (or 'straight single quote') on the keyboard. The script changes it into a smart curly quote at runtime.
Another way to list the words is to write a regular expression. Some examples:
^(the |an? )
^(de[rsmn] |die )
Note that in this case it is necessary to add a space after each item. The script recognises an item as a regular expression by ^(. Make sure you enter the expression correctly -- the script doesn't check the expression's syntax at all.
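The two formats can be handled along the following lines. This Python sketch is an illustration, not the script's code: `ignore_pattern` and `strip_ignored` are hypothetical names, and matching without regard to case is an assumption made here so that "The" at the start of a title is caught by "the".

```python
import re

def ignore_pattern(entry):
    """Turn an Ignore words entry (word list or regular expression) into a pattern."""
    entry = entry.replace("'", "\u2019")     # straight apostrophe -> curly quote
    if entry.startswith("^("):               # recognised as a regular expression
        return re.compile(entry, re.IGNORECASE)
    # Plain word list: each word must be followed by a space.
    words = "|".join(re.escape(word) + " " for word in entry.split())
    return re.compile("^(" + words + ")", re.IGNORECASE)

def strip_ignored(paragraph, pattern):
    # Remove one ignored word from the start of the paragraph, if present.
    return pattern.sub("", paragraph, count=1)

english = ignore_pattern("the a an")
german = ignore_pattern("^(de[rsmn] |die )")

titles = ["The Hobbit", "An American in Paris", "Another Country", "Emma"]
sorted_titles = sorted(titles, key=lambda t: strip_ignored(t, english).casefold())
```

Here "An American in Paris" is sorted under American and "The Hobbit" under Hobbit, while "Another Country" is left alone because "An" is not followed by a space.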
Further information
There's a lot of information on sorting. SortingAndCollating.pdf is a good general overview.
Acknowledgement
With thanks to Igor Freiberger and Jaroslav Průka for comments on the (Brazilian) Portuguese and Czech sort orders.
Updated May 2009.