coreutils: Special handling of file extensions

 
 30.3.3 Special handling of file extensions
 ------------------------------------------
 
 GNU Coreutils version sort implements specialized handling of strings
 that look like file names with extensions.  This enables slightly more
 natural ordering of file names.
 
    The following additional rules apply when comparing two strings that
 both begin with a character other than ‘.’.  They also apply when both
 strings begin with ‘.’ but neither is ‘.’ or ‘..’.
 
   1. A suffix (i.e., a file extension) is defined as: a dot, followed by
      an ASCII letter or tilde, followed by zero or more ASCII letters,
      digits, or tildes; all repeated zero or more times, and ending at
      string end.  This is equivalent to matching the extended regular
      expression ‘(\.[A-Za-z~][A-Za-z0-9~]*)*$’ in the C locale.  The
      longest such match is used, except that a suffix is not allowed to
      match an entire nonempty string.
 
   2. The suffixes are temporarily removed, and the strings are compared
      without them, using version sort (see ⇒Version-sort ordering
      rules) without special priority (see ⇒Special priority in
      GNU Coreutils version sort).
 
   3. If the suffix-less strings do not compare equal, this comparison
      result is used and the suffixes are effectively ignored.
 
   4. If the suffix-less strings compare equal, the suffixes are restored
      and the entire strings are compared using version sort.
 
    Examples for rule 1:
 
    • ‘hello-8.txt’: the suffix is ‘.txt’
 
    • ‘hello-8.2.txt’: the suffix is ‘.txt’ (‘.2’ is not included because
      the dot is not followed by a letter)
 
    • ‘hello-8.0.12.tar.gz’: the suffix is ‘.tar.gz’ (‘.0.12’ is not
      included)
 
    • ‘hello-8.2’: no suffix (suffix is an empty string)
 
    • ‘hello.foobar65’: the suffix is ‘.foobar65’
 
    • ‘gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2’: the suffix is ‘.fc9.tar.bz2’
      (‘.7rc2’ is not included as it begins with a digit)
 
    • ‘.autom4te.cfg’: the suffix is ‘.cfg’ (the entire string matches
      the expression, but a suffix may not be the entire string)
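
    As a rough illustration of rule 1, the following Python sketch
 extracts the suffix with the regular expression given above.  The
 helper name ‘split_suffix’ is hypothetical and not part of Coreutils;
 the sketch assumes that searching from the second character onward is
 an adequate way to keep the suffix from covering an entire nonempty
 string.

```python
import re

# Rule 1's expression, anchored with \Z because Python's "$" would
# also match just before a trailing newline.
SUFFIX_RE = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)*\Z")


def split_suffix(name):
    """Return (stem, suffix) for NAME (hypothetical helper)."""
    if not name:
        return "", ""
    # Start the search at index 1 so the suffix can never be the whole
    # nonempty string.  The leftmost match that reaches the end of the
    # string is also the longest one.
    start = SUFFIX_RE.search(name, 1).start()
    return name[:start], name[start:]


for name in ("hello-8.txt", "hello-8.2.txt", "hello-8.0.12.tar.gz",
             "hello-8.2", "hello.foobar65",
             "gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2"):
    print(name, "->", repr(split_suffix(name)[1]))
```

    The loop prints the suffix found for each name, which can be checked
 against the examples for rule 1.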
 
    Examples for rule 2:
 
    • Comparing ‘hello-8.txt’ to ‘hello-8.2.12.txt’, the ‘.txt’ suffix is
      temporarily removed from both strings.
 
    • Comparing ‘foo-10.3.tar.gz’ to ‘foo-10.tar.xz’, the suffixes
      ‘.tar.gz’ and ‘.tar.xz’ are temporarily removed from the strings.
 
    Example for rule 3:

    • When comparing the strings ‘hello-8.2.txt’ and ‘hello-8.10.txt’,
      the suffixes (‘.txt’) are temporarily removed.  The remaining
      strings (‘hello-8.2’ and ‘hello-8.10’) are compared as previously
      described (‘hello-8.2’ comes first).  Because they differ, this
      result is used and the suffixes are ignored.  (In this case the
      suffix-removal algorithm does not have a noticeable effect on the
      resulting order.)

    Example for rule 4:

    • Comparing ‘hello.foobar65’ to ‘hello.foobar4’, the suffixes
      (‘.foobar65’ and ‘.foobar4’) are temporarily removed.  The
      remaining strings are identical (‘hello’).  The suffixes are then
      restored, and the entire strings are compared (‘hello.foobar4’
      comes first).
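
    Rules 2 to 4 can be sketched in Python as follows.  This is only an
 illustration under stated assumptions: ‘version_cmp’ is a simplified
 stand-in for the plain version-sort comparison (digit runs compared
 numerically; in non-digit runs ‘~’ sorts first, then end of string,
 then ASCII letters, then other characters), special priority is not
 modeled, and all function names are hypothetical.

```python
import re
from functools import cmp_to_key

SUFFIX_RE = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)*\Z")


def split_suffix(name):
    """Rule 1: return (stem, suffix); never the whole nonempty name."""
    start = SUFFIX_RE.search(name, 1).start() if name else 0
    return name[:start], name[start:]


def _is_digit(c):
    return "0" <= c <= "9"


def _rank(s, k):
    """Assumed rank of s[k] inside a non-digit run (0 = end or digit)."""
    if k >= len(s) or _is_digit(s[k]):
        return 0
    c = s[k]
    if c == "~":
        return -1
    if c.isascii() and c.isalpha():
        return ord(c)
    return ord(c) + 256


def version_cmp(a, b):
    """Simplified plain version comparison: returns -1, 0 or 1."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Non-digit run: compare one character at a time.
        while ((i < len(a) and not _is_digit(a[i]))
               or (j < len(b) and not _is_digit(b[j]))):
            ra, rb = _rank(a, i), _rank(b, j)
            if ra != rb:
                return -1 if ra < rb else 1
            i += 1
            j += 1
        # Digit run: compare numerically; a missing run counts as 0.
        si, sj = i, j
        while i < len(a) and _is_digit(a[i]):
            i += 1
        while j < len(b) and _is_digit(b[j]):
            j += 1
        na, nb = int(a[si:i] or "0"), int(b[sj:j] or "0")
        if na != nb:
            return -1 if na < nb else 1
    return 0


def file_version_cmp(a, b):
    """Rules 2 to 4, without special priority (assumption)."""
    result = version_cmp(split_suffix(a)[0], split_suffix(b)[0])  # rule 2
    if result != 0:
        return result            # rule 3: the suffixes are ignored
    return version_cmp(a, b)     # rule 4: compare the entire strings


names = ["hello-8.2.txt", "hello-8.10.txt", "hello-8.txt"]
print(sorted(names, key=cmp_to_key(file_version_cmp)))
```

    Keeping the plain comparison separate from the suffix handling
 mirrors the wording of the rules: the same comparison is applied first
 to the suffix-less strings and, only on a tie, to the entire strings.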
 
    How does the suffix-removal algorithm affect ordering results?

    Consider the comparison of ‘hello-8.txt’ and ‘hello-8.2.txt’.

    Without the suffix-removal algorithm, the strings will be broken down
 into the following parts:
 
      hello-  vs  hello-  (rule 2, all non-digits)
      8       vs  8       (rule 3, all digits)
      .txt    vs  .       (rule 2)
      empty   vs  2
      empty   vs  .txt
 
    The comparison of the third parts (‘.txt’ vs ‘.’) will determine that
 the shorter string comes first, resulting in ‘hello-8.2.txt’ appearing
 first.
 
    Indeed, this is how Debian’s ‘dpkg’ orders the two strings.
 
    A more natural result is that ‘hello-8.txt’ should come before
 ‘hello-8.2.txt’, and this is where the suffix-removal comes into play:
 
    The suffixes (‘.txt’) are removed, and the remaining strings are
 broken down into the following parts:
 
      hello-  vs  hello-  (rule 2, all non-digits)
      8       vs  8       (rule 3, all digits)
      empty   vs  .       (rule 2)
      empty   vs  2
 
    As empty strings sort before non-empty strings, ‘hello-8’ comes
 first, and therefore ‘hello-8.txt’ is ordered before ‘hello-8.2.txt’.
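
    The two breakdowns above can be reproduced with a short Python
 sketch.  The helper names ‘parts’ and ‘show’ are hypothetical, and the
 sketch only splits the strings into runs of digits and non-digits; it
 does not perform the comparison itself.

```python
import re
from itertools import zip_longest


def parts(s):
    """Split S into alternating runs of non-digits and digits."""
    return re.findall(r"[0-9]+|[^0-9]+", s)


def show(a, b):
    """Print the parts of A and B side by side, padding with 'empty'."""
    for pa, pb in zip_longest(parts(a), parts(b), fillvalue="empty"):
        print(f"{pa:<8}vs  {pb}")


show("hello-8.txt", "hello-8.2.txt")  # without suffix removal
print()
show("hello-8", "hello-8.2")          # after removing the '.txt' suffix
```

    In the first call the third pair is ‘.txt’ against ‘.’; in the
 second it is an empty part against ‘.’, which is why the order of the
 two names changes.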
 
    A real-world example would be listing files such as
 ‘gcc_10.fc9.tar.gz’ and ‘gcc_10.8.12.7rc2.fc9.tar.bz2’.  Debian’s
 algorithm would list ‘gcc_10.8.12.7rc2.fc9.tar.bz2’ first, while ‘ls -v’
 will list ‘gcc_10.fc9.tar.gz’ first.
 
    These priorities make sense for ‘ls -v’: Versioned files will be
 listed in a more natural order.
 
    For ‘sort -V’ these priorities might seem arbitrary.  However,
 because the sorting code is shared between the ‘ls’ and ‘sort’ programs,
 the ordering rules are the same.