asveikau a day ago

Sorting is language-specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?

Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
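
To make that concrete, here is a minimal Python sketch. It assumes a hand-written alphabet string per language instead of real locale data, and `alphabet_key` is a hypothetical helper name, not a library function. Real collation (strcoll, ICU) has far more rules than this.

```python
import unicodedata


def alphabet_key(alphabet):
    """Build a sort key that ranks characters by their position in `alphabet`.

    Characters missing from the alphabet sort after all listed ones,
    ordered by code point.
    """
    rank = {ch: i for i, ch in enumerate(unicodedata.normalize("NFC", alphabet))}

    def key(word):
        # Normalize so that precomposed and decomposed forms compare alike.
        word = unicodedata.normalize("NFC", word)
        return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in word]

    return key


# Hand-written lowercase alphabets (an assumption for illustration only).
SPANISH = "abcdefghijklmnñopqrstuvwxyz"
TURKISH = "abcçdefgğhıijklmnoöprsştuüvyz"

spanish_words = ["oso", "ñandú", "nube", "zorro"]
turkish_words = ["iz", "ısı", "jet", "hal"]

# Plain sorted() compares code points, so ñ and ı land after z.
by_code_point_es = sorted(spanish_words)
by_code_point_tr = sorted(turkish_words)

# The per-language keys put ñ between n and o, and ı between h and i.
by_spanish = sorted(spanish_words, key=alphabet_key(SPANISH))
by_turkish = sorted(turkish_words, key=alphabet_key(TURKISH))
```

The same word list needs a different key per language, which is the whole problem in miniature.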

  • dmurray a day ago

    And that's why there are a hundred different possible values for LC_COLLATE, and it's completely normal that two popular Unix distributions picked different default values for that setting...right?

    It would have been reasonable to conclude the article a third of the way through, and say "sorting is locale-dependent, if what you value is consistent behaviour between different OSs (instead of sorting based on the user's preferences) you need to implement the sorting yourself."

    • harrall a day ago

      Set LC_ALL=C, which gives you consistent sorting behavior.

      The article does mention it but in passing.
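
      For anyone wondering what that ordering looks like, here is a rough Python sketch. It assumes UTF-8 file names and simply compares the encoded bytes, which is approximately what a strcmp-style comparison in the C locale does. The sample names are made up.

```python
# Made-up sample names (an assumption for illustration only).
names = ["~last", "apple", "Zebra", "Ñandú", "_tmp", "10", "9"]

# Compare the raw UTF-8 bytes: digits, then uppercase, then "_",
# then lowercase, then "~", then anything outside ASCII.
c_order = sorted(names, key=lambda s: s.encode("utf-8"))

for name in c_order:
    print(name)
```

      It is consistent across machines, but "10" sorts before "9" and "Zebra" before "apple", so it is rarely what a human expects.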

  • tracker1 9 hours ago

    Beyond that, there's the question of what you are sorting and why... should File1.foo come before File005.foo or file020.foo? I've honestly thought about creating my own file manager just to sort files case-insensitively, with sequences of digits padded to the same length, and with case used only as a tie-breaker: when two names are otherwise identical, lowercase goes before uppercase at the first original difference.

    My worry is that it would perform badly on really large directories... That said, for where it's a pain, it would be helpful to say the least.
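
    A minimal Python sketch of that ordering, under a few assumptions: only ASCII digits count as numbers, "padded to the same length" is modelled by comparing digit runs as integers, and `natural_key` is a hypothetical name.

```python
import re


def natural_key(name):
    """Sort key: case-insensitive text, digit runs compared by numeric value.

    Names that are equal under that comparison fall back to a tie-breaker
    that puts lowercase before uppercase at the first differing character.
    """
    # re.split with a capture group alternates text, digits, text, ...
    # so odd indexes are always digit runs and even indexes are always text.
    parts = re.split(r"([0-9]+)", name)
    primary = [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]
    # swapcase() flips case so that, in code-point order, an originally
    # lowercase ASCII letter sorts before its uppercase form.
    return (primary, name.swapcase())


files = ["file020.foo", "File1.foo", "File005.foo", "file1.foo", "File10.foo"]
ordered = sorted(files, key=natural_key)
```

    On the performance worry: `sorted` computes the key once per name, so the regex cost grows linearly with the directory size and the comparisons stay at the usual O(n log n).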

  • encom a day ago

    Before the Danish language adopted the letter "å" (in 1948), the vowel was written as "aa". In the Danish alphabet, "å" is the last letter. Therefore a list of three Danish city names would be correctly sorted as:

      * Albertslund
      * Odense
      * Aarhus
    
    This feels like material for another Tom Scott video.
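
    A rough Python sketch of that rule, for illustration. It assumes every "aa" should be treated as "å", which is a simplification of the real Danish collation rules, and `danish_key` is a hypothetical name. Ærøskøbing is added to the list to show where "æ" lands.

```python
# Danish alphabet: a-z followed by æ, ø, å.
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH)}


def danish_key(name):
    """Rank letters by Danish alphabet position, treating "aa" as "å"."""
    folded = name.casefold().replace("aa", "å")
    # Characters outside the alphabet sort after å, by code point.
    return [RANK.get(ch, len(DANISH) + ord(ch)) for ch in folded]


cities = ["Aarhus", "Ærøskøbing", "Odense", "Albertslund"]

by_code_point = sorted(cities)
by_danish = sorted(cities, key=danish_key)
```

    Sorting by code point puts Aarhus first, while the Danish key puts it last.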
    • tpmoney a day ago

      Not Tom Scott, but Dylan Beattie has done a handful of interesting talks[1] effectively on "there's no such thing as plain text" which in part covers this sort of thing. In fact, I think your Danish cities list is actually one of his examples.

      [1]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo

      • encom 15 hours ago

        Finally had time to watch it, that was excellent. Thanks for the link.

        Pike matchbox.

    • qw 20 hours ago

      And to make it more interesting, Sweden also has the letter "å", but it's in the 27th place in the alphabet (followed by "ä" and "ö"). In the Danish/Norwegian alphabet, the letter "å" is the last letter of the alphabet.

    • plufz a day ago

      Haha. As if "tooghalvfems" wasn't enough.

o11c a day ago

Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).

The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).

There are a lot of sparse arrays and UTF-32 character data in compiled locales.

Incidentally, the command to dump a locale's data is:

  LC_ALL=whatever locale -ck `locale | sed 's/=.*//; /LANG\|LC_ALL/d'`

kbd a day ago

In my Zsh startup on Mac I had to worry about collation, as I expected ~ to sort last (I have a directory prefixed with ~ to load plugins that need to be loaded last). Idk why a locale of utf-8 has it sorting differently, but I needed LC_COLLATE=C to have it sort as expected:

    # source all shell config
    export LC_COLLATE=C # ensure consistent sort, ~ at end
    for file in ~/bin/shell/**/*.(z|)sh; do
      source "$file";
    done

kenada a day ago

When I updated the Darwin SDK and source releases in nixpkgs last year, I tried using the FreeBSD locale data. It worked in a technical sense, but it broke things that depended on the quirks in Apple’s locale data. That statement about compatibility is unfortunately true.

1a527dd5 a day ago

Ask anyone who did a Postgres upgrade. The words "collate" and "glibc" are enough to cause me to pause now. Learnt loads, never going to really use it again, but man do I understand the pain that causes now.

bluedino a day ago

Now I'm remembering all the fun we had a long time ago with PHP websites that used an AS/400 as a data source. The two didn't sort the same way, and the mom-and-pop web dev shop that was hired to create the website didn't understand the issue, hacked around it, and failed.

skopje a day ago

So the ISO way is the right way, right?

  • dataflow a day ago

    I wondered the same. What's the right ordering?

loeg 2 days ago

(2020)

greesil a day ago

It's not a stable sort?