Language-aware paragraph sorter

Languages differ in the way that they treat accented letters when sorting lists. The script takes these differences into account by using the sort orders defined in a separate, editable, file (see below for details). The script sorts paragraphs (not tables; for tables, see here). The script can also create retrograde lists (words are sorted from the end of the word rather than from the beginning, sorting together rhyming words, so to speak), sort numerically, and sort by character style.

The script can be configured to sort paragraphs following the sort-order rules of any language. In addition, it can be set to sort paragraphs entirely accent-neutrally, which is needed, for instance, in English-language texts that can contain many accented characters, such as bibliographies and indexes.
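
The retrograde sort mentioned above can be pictured with a short Python sketch. This is only an illustration of the idea, not the script's own code: the function name retrograde_key and the word list are made up for the example, and case is assumed to be ignored.

```python
def retrograde_key(paragraph: str) -> str:
    """Sort key that reads the paragraph from its last character to its first."""
    # Assumption: case is ignored, so "Light" and "light" end up together.
    return paragraph.casefold()[::-1]

words = ["nation", "bright", "station", "light", "ration", "night"]
retro_sorted = sorted(words, key=retrograde_key)
print(retro_sorted)
```

Because the key is the reversed word, words with the same ending (the "rhyming" words) end up next to each other.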

Use

  1. The download is a ZIP file containing the script (sort.jsxbin) and the sort orders (sortorders.txt). Place both in your script directory.
  2. Select the text to be sorted. To sort a selection of paragraphs, select those paragraphs. To sort a whole story, select a text frame containing all or part of the story, or place an insertion point in the story.
  3. Run the script. It shows this dialog: Sort dialog
    The script tries to determine the currently selected language. If it can, the language shows in the dialog; if it can't, it displays [No Language]. To select a different language, pick it from the dropdown. (For changing and adding sort orders, see Background, below.) If the picked language can be found in the file sortorders.txt and has a sort order, this is shown at Sort order. You can change this to suit your own tastes. Check Save (sort orders/words to ignore) on exit to save any changes you've made to a sort order or when you've added a sort order (see below for details).
  4. The sorter can do a Retrograde sort, so that paragraphs are sorted back to front, sorting together, as it were, rhyming words.
  5. To remove duplicate items after the list has been sorted, check Delete duplicates.
  6. Letter by letter ignores spaces and hyphens, Word by word sorts by the word. The table shows the difference between the two methods (the list is from R. M. Ritter, The Oxford Guide to Style, Oxford University Press, 2002, pp. 581-2).
    Word by word            Letter by letter
    ----------------------------------------
    High, J.                High, J.
    high (light-hearted)    high (light-hearted)
    high chair              highball
    high-fliers             highbrow
    high heels              high chair
    High-Smith, P.          Highclere Castle
    high water              high-fliers
    High Water (play)       high heels
    highball                highlights
    highbrow                Highsmith, A.
    Highclere Castle        High-Smith, P.
    highlights              high water
    Highsmith, A.           High Water (play)
    highways                highways
    The way these two types of sort perform is determined to a large extent by the order in which the space, hyphen, and parentheses are given in the sort string. Note: In both columns the first two lines look out of place but that's how they are listed.
  7. At Ignore words you can indicate words at the beginning of paragraphs that the sorter should ignore. This is useful for sorting titles of books, films, etc., in which articles such as the, a, and an are ignored. To activate this feature, check Ignore words. These words are saved in the file sortorders.txt (for details see below).
  8. The script can sort using character styles in a document. If the document contains any character styles, these are displayed at Character styles. To sort using a style, pick it in the dropdown and check Sort on selected character style. This is useful when you need to maintain a list whose sort order can't be determined automatically. For example, to sort by surname a list of names written as first name followed by surname, apply a character style to the surname and sort by that style.
  9. Press OK to do the sort.
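
The difference between Letter by letter and Word by word (step 6) can be sketched in Python. This is a rough illustration under stated assumptions, not the script's method: the two key functions are hypothetical, case is ignored, and punctuation such as commas and parentheses is left out of the example because the script's treatment of it depends on the sort string.

```python
import re

def letter_by_letter_key(entry: str) -> str:
    """Ignore spaces and hyphens, so the entry is compared as one run of letters."""
    return re.sub(r"[ \-]", "", entry.casefold())

def word_by_word_key(entry: str) -> list[str]:
    """Compare the first word, then the second, and so on.

    Assumption: a space or a hyphen ends a word.
    """
    return re.split(r"[ \-]+", entry.casefold())

entries = ["highways", "high water", "highball", "high chair", "high-fliers"]
by_word = sorted(entries, key=word_by_word_key)
by_letter = sorted(entries, key=letter_by_letter_key)
print(by_word)
print(by_letter)
```

In the word-by-word result every entry whose first word is "high" precedes "highball"; in the letter-by-letter result "high chair" falls after "highball" because it is compared as "highchair".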

Tip: to sort a text completely accent-neutrally, select [No Language]:

sort accent neutral


Download script (a ZIP file) – View/download sortorders.txt (it's also in the ZIP file)


Background

Languages fall into different types according to how they treat accented letters and digraphs when sorting lists. (If you see garbled characters in this text, enable Unicode/UTF-8, probably in View > Character Encoding or something similar, and select a Unicode font such as Lucida Sans Unicode or Microsoft's Times or Arial in the options section of your browser.) The types are:

  1. Accented characters are grouped at the end of the alphabet. They are in effect considered as separate letters. This is the case in the Scandinavian languages. In Danish the sort order is ABC . . . XYZÆØÅ.
  2. Accented letters follow the unaccented ones, and these letters, too, are considered separate letters. Polish uses the sort order AĄBCĆ . . . XYZŹŻ.
  3. Accents are ignored. In German and Italian, for example, words are sorted as if the accents weren't there.
  4. Some letter combinations are treated specially. Here, two subcases must be distinguished: (a) in Czech, the digraph ch is treated as a separate letter and is sorted after h; (b) in Spanish, ll is treated as a single letter but is considered a variant of l.
  5. Some languages mix two or more of these types; in Czech, some accented letters follow the unaccented ones, some accents are ignored, and, as mentioned in the previous point, ch is sorted after h. In Icelandic, Þ (thorn) is sorted at the end of the alphabet; Ð (eth) follows D; other accents are ignored.
  6. A special, tricky, case is French, where accents are compared from the end of the word, so the last accented letter that differs determines how a word is sorted. For example, the words cote, coté, côte, and côté are sorted by the script as shown in the first column, but should be ordered as in the second column:
    Result	Should be
    -----------------
    cote	cote
    coté	côte
    côte	coté
    côté	côté
    (Source: SortingAndCollating.pdf.) The script doesn't handle these cases so French lists may need some manual post-ordering. I've no idea about the frequency of such cases – it might not be a real problem. Follow the link for detailed documentation.
  7. Finally, a completely neutral (or "diacritic-insensitive") sort order can be used to ignore each and every accent. This is useful, for instance, for sorting a name list (an address list, an index of authors) for an English-language publication in which several different types of accented letter are used. In such cases, all accents in Czech, Polish, Danish, etc. names are ignored.
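
Cases 6 and 7 above can be illustrated with a small Python sketch. It is not how the script works: it assumes that accents can be separated from their base letters with Unicode decomposition (the standard unicodedata module), and the names split_accents, neutral_key, and french_key are invented for the example. Letters that have no decomposition, such as Ł or Ø, are left untouched by this approach, whereas a sort string can list such variants explicitly.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, accent flags), one flag per base letter."""
    base, flags = [], []
    for ch in unicodedata.normalize("NFD", word.casefold()):
        if unicodedata.combining(ch):
            if flags:
                flags[-1] = 1          # the preceding base letter carries an accent
        else:
            base.append(ch)
            flags.append(0)
    return "".join(base), flags

def neutral_key(word):
    """Diacritic-insensitive key: the base letters only."""
    return split_accents(word)[0]

def french_key(word):
    """Base letters first; ties are broken by accents read from the end of the word."""
    base, flags = split_accents(word)
    return base, flags[::-1]

neutral_sorted = sorted(["Zola", "Čapek", "Ábel", "Brown"], key=neutral_key)
french_sorted = sorted(["côté", "coté", "côte", "cote"], key=french_key)
print(neutral_sorted)
print(french_sorted)
```

With french_key the four words come out in the order of the "Should be" column, because the accent on the final letter is compared before the accent on the second letter.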

The sorter handles all these possibilities (except, as mentioned, some French cases). The script looks for a text file "sortorders.txt" which should be located in the script folder. An attempt is made to determine the currently selected language and to show its sort order (if the file can't be found the script defaults to [No Language] and diacritic-insensitive sort order):

Sort paragraphs InDesign

The different types of sort order are accounted for by using a different format for each type of letter. All types are displayed in the screen shot, which shows the sort order for Czech (a different language can be picked from the dropdown). The formats are as follows:

  1. To have an accented letter sort after the unaccented one, it is positioned after it: RŘ specifies that R and Ř are different letters and that R immediately precedes Ř. Naturally, this also covers the Scandinavian languages: the accented letters are simply placed after all the other letters.
  2. To ignore an accent, the accented letter is placed in square brackets immediately following the unaccented letter. A[Á] indicates that Á should be treated as A. Any number of similar letters can be added to the list: E[ÉĚ] stipulates that É and Ě should be treated as E.
  3. To treat a letter combination as a single letter, we need to distinguish again the two subcases mentioned above.
    • To have the combination sort after a letter, the combination is placed in curly brackets after that letter. For example, {CH} indicates that ch should be treated as a single letter. In Czech, this combination is sorted after H. In the sort string, {CH} should therefore be placed after the H, as in the screenshot.
    • To indicate that a combination should be treated as an alternative of another letter, combine the [] and {} notations: in Spanish, LL is treated as a variant of L for sorting purposes, and this is indicated in the sort string as L[{LL}].
  4. To neutralise all accents, pick [No Language] from the dropdown.
  5. Enter just capitals: the script handles lower-case letters automatically.
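
To make the notation above concrete, here is a minimal Python sketch of how such a sort string could be read and applied. It is an illustration only and makes several assumptions that the script may not share: the functions parse_sort_order and sort_key are hypothetical, letter combinations are matched longest first, and characters missing from the sort string are placed after all listed ones. The Czech string is the one shown in the sortorders.txt sample.

```python
def parse_sort_order(order: str) -> dict:
    """Map each letter or letter combination in a sort string to a rank."""
    ranks = {}
    rank = -1
    in_variants = False              # True while inside [...]
    i = 0
    while i < len(order):
        ch = order[i]
        if ch == "[":
            in_variants = True
            i += 1
        elif ch == "]":
            in_variants = False
            i += 1
        else:
            if ch == "{":            # {CH}: a combination treated as one letter
                end = order.index("}", i)
                token = order[i + 1:end]
                i = end + 1
            else:
                token = ch
                i += 1
            if not in_variants:      # variants share the rank of the preceding letter
                rank += 1
            ranks[token] = rank
    return ranks

def sort_key(word: str, ranks: dict) -> list:
    """Translate a word into a list of ranks (assumption: longest match wins)."""
    longest = max(map(len, ranks))
    text = word.upper()              # the sort string holds capitals only
    key, i = [], 0
    while i < len(text):
        for size in range(longest, 0, -1):
            token = text[i:i + size]
            if token in ranks:
                break
        else:
            token = text[i]          # not listed in the sort string
        key.append(ranks.get(token, len(ranks)))   # assumption: unknowns sort last
        i += len(token)
    return key

CZECH = "0123456789 A[Á]BCČD[Ď]E[ÉĚ]FGH{CH}I[Í]JKLMN[Ň]O[Ó]PQRŘSŠT[Ť]U[ÚŮ]VWXY[Ý]ZŽ"
ranks = parse_sort_order(CZECH)
words = ["chata", "cesta", "hrad", "čaj", "idea", "dům"]
czech_sorted = sorted(words, key=lambda w: sort_key(w, ranks))
print(czech_sorted)
```

In this sketch "chata" lands between "hrad" and "idea" because {CH} follows H in the string, and "čaj" follows "cesta" because Č is listed as a separate letter after C.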

Changing and adding sort orders

It is easy to add a new sort order or to change an existing one. To change a sort order, pick the language and make any changes in the displayed string. Make sure that the Save sort order box is checked. Press OK to save the changes.

To add a new language, pick it from the dropdown and enter the sort order at Sort order. You can copy and paste a string here. Make sure that the Save sort order box is checked, then press OK to store the new data.

When changing or adding sort orders, bear two things in mind:

  1. add letters as capitals only. The script will take care of all corresponding lower-case letters;
  2. if you omit a character, don't worry: it will just be sorted incorrectly; it will not disappear from your documents.

Ignore words

Words that are to be ignored at the beginning of paragraphs are listed together with the sort order. They can be entered using two formats. The simplest is just to list the words:

the a an 

Write each word separated by a space. To enter an apostrophe, just type the straight apostrophe (or 'straight single quote') on the keyboard. The script changes it into a smart curly quote at runtime.

Another way to list the words is to write a regular expression. Some examples:

^(the|an?d?)\s
^(de[rsmn]|die|das)\s

The first expression matches the, a, an, and and; the second one, der, des, dem, den, die, and das.

The script recognises an item as a regular expression by the circumflex ^. Make sure you enter the expression correctly – the script doesn't check the expression's syntax at all.
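
The two formats can be pictured with the following Python sketch. It is a guess at the general idea rather than the script's implementation: the helper names are hypothetical, matching is assumed to be case-insensitive, and the conversion of straight apostrophes to curly quotes is not modelled.

```python
import re

def build_ignore_pattern(setting: str) -> re.Pattern:
    """Turn an Ignore words entry (word list or regular expression) into a pattern."""
    setting = setting.strip()
    if setting.startswith("^"):
        # A leading circumflex marks the entry as a regular expression.
        return re.compile(setting, re.IGNORECASE)
    # Otherwise it is a space-separated word list.
    words = "|".join(re.escape(w) for w in setting.split())
    return re.compile(rf"^(?:{words})\s", re.IGNORECASE)

def ignore_key(paragraph: str, pattern: re.Pattern) -> str:
    """Sort key with an ignored word removed from the start of the paragraph."""
    return pattern.sub("", paragraph, count=1).casefold()

titles = ["The Birds", "A Clockwork Orange", "Alien", "An Affair"]
from_list = build_ignore_pattern("the a an ")
from_regex = build_ignore_pattern(r"^(the|an?d?)\s")
sorted_titles = sorted(titles, key=lambda t: ignore_key(t, from_list))
print(sorted_titles)
```

Both patterns give the same keys for these titles; "Alien" is untouched because the pattern requires white space after the ignored word.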

sortorders.txt

The sort orders associated with certain languages are stored in a file called sortorders.txt, which lives in the same folder as the script. You can edit it in a text editor (note that the file is in UTF-8 format and must stay in that format), e.g. to remove languages or to edit an existing order. You could do that in the script's interface, but editing the file in an editor is a bit easier.

The example here lists some lines from the file, showing how sort orders are encoded (the first entry, [No Language], has been truncated here).

<This file uses UTF-8 encoding>
[No Language]	0123456789 A[ÁÀÂÄÅĀĄĂÆ]BC[ÇĆČĊ]D . . . Z[ŹŻŽ]
Polish	0123456789 AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUÚVWXYZŹŻ
Czech	0123456789 A[Á]BCČD[Ď]E[ÉĚ]FGH{CH}I[Í]JKLMN[Ň]O[Ó]PQRŘSŠT[Ť]U[ÚŮ]VWXY[Ý]ZŽ
German: Reformed	0123456789 A[Ä]BCDEFGHIJKLMNO[Ö]PQRS{SS}TU[Ü]VWXYZ
de_DE_2006	0123456789 A[Ä]BCDEFGHIJKLMNO[Ö]PQRS{SS}TU[Ü]VWXYZ
German: Traditional	0123456789 A[Ä]BCDEFGHIJKLMNO[Ö]PQRS{SS}TU[Ü]VWXYZ
Icelandic	0123456789 A[Á]BCDÐE[É]FGHI[Í]JKLMNO[Ó]PQRSTU[Ú]VWXY[Ý]ZÞÆÖ

Each line consists of two parts: the name of the language in InDesign's internal format, followed by a tab, followed by the sort order. Note that even in English versions of InDesign, its internal format is not always the same as the way names are represented in the interface. For instance, German: Reformed corresponds with German: 1996 Reform and de_DE_2006 with German: 2006 Reform.

For that reason it is best to add new languages in the script's interface; you can then later edit the file in an editor if necessary.
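
If you want to inspect the file outside InDesign, the line format described above (language name, tab, sort order) is easy to read. The Python sketch below is only an illustration: read_sort_orders is a hypothetical helper, it assumes that the first line is the encoding note shown in the sample, and it does not handle the ignore words, whose storage format is not shown here.

```python
def read_sort_orders(text: str) -> dict:
    """Parse the contents of sortorders.txt into {language name: sort order}."""
    orders = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("<"):
            continue                       # skip blank lines and the encoding note
        name, sep, order = line.partition("\t")
        if sep:                            # ignore lines without a tab separator
            orders[name] = order
    return orders

sample = (
    "<This file uses UTF-8 encoding>\n"
    "Czech\t0123456789 A[Á]BCČD[Ď]E[ÉĚ]FGH{CH}I[Í]JKLMN[Ň]O[Ó]PQRŘSŠT[Ť]U[ÚŮ]VWXY[Ý]ZŽ\n"
    "de_DE_2006\t0123456789 A[Ä]BCDEFGHIJKLMNO[Ö]PQRS{SS}TU[Ü]VWXYZ\n"
)
orders = read_sort_orders(sample)
print(sorted(orders))
```

For the real file, read it with UTF-8 decoding, for example open("sortorders.txt", encoding="utf-8").read(), and pass the result to the function.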

Further information

There's a lot of information on sorting. SortingAndCollating.pdf is a good general overview. For some interesting technicalities, see this NISO report. See also Marc Autret's post on sorting in JavaScript, here, and a discussion in Adobe's scripting forum, here.

Acknowledgement

With thanks to Igor Freiberger and Jaroslav Průka for comments on the (Brazilian) Portuguese and Czech sort orders.


Useful script? Saved you lots of time?

Consider making a donation. To make a donation, please press the button below. This is Paypal's payment system; you don't need a Paypal account to use it: you can use several types/brands of credit and debit card.

Peter Kahrel's paypal account



Version history

8 December 2013: fixed a problem with sorting letter combinations after another letter. CH in Czech is sorted as a separate letter after H, while other languages may have letter combinations that are variants of a letter, as in Spanish, which treats LL as L. This required an addition to the sort-string syntax; see the text for details.

18 October 2012: fixed problem that could occur if the sort-order text file could not be found.

4 October 2012: index markers and XML tags interfered with the sort order; fixed.

3 October 2012: (1) deletion of duplicate paragraphs now works (optionally) when sorting whole stories and selections; (2) there was a problem if a selection of paragraphs included the story's last paragraph – fixed.

3 June 2012:

(1) The script's interface is now (finally) language independent. This means that the sortorders.txt file has changed, so if you start using this latest version of the script, you must use the new version of the sort-order file. Note that the changes are in the language names (which now use InDesign's internal localisation-independent names), so if you made many changes you can transfer those changes.

(2) The script now handles numbers correctly: lines that start with numbers, but also lists such as Figure 1.1, Figure 1.2, Figure 2.1, etc.

(3) The option to sort numerically is no longer necessary, and since descending sorts weren't going to make it anyway the interface was rearranged. Minor changes to the behaviour of the interface.

(4) Added some usage notes to the description of the sortorders.txt file.

31 May 2012: uploaded the wrong file; corrected.

30 May 2012: fixed a problem with the sorting of some accented characters.

11 February 2012: the rewrite caused some problems with some details of letter-by-letter and word-by-word sorting. Fixed.

16 July 2011: a rewrite from scratch, apart from the interface. The script is now much faster, so much so that I removed the Formatted text option. The script now sorts everything as if it was formatted. The option to sort English and Irish patronyms (Mac, Mc, O') is no longer there: the script now does that by default.

1 January 2011: fixed problem with sorting upper-case letters with diacritics.

Updated January 2010.