Coding: Fleeting Thoughts

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

chridd
Has a vermicelli title
Posts: 846
Joined: Tue Aug 19, 2008 10:07 am UTC
Location: ...Earth, I guess?

Re: Coding: Fleeting Thoughts

Postby chridd » Mon Aug 12, 2019 4:41 pm UTC

Hmb<NUL>& wouldnb<NUL><CAN>t that just mess thingsb<NUL><DC4>like, everythingb<NUL><DC4>up further?
~ chri d. d. /tʃɹɪ.di.di/ (Phonotactics, schmphonotactics) · she · Forum game scores
mittfh wrote:I wish this post was very quotable...

commodorejohn
Posts: 1197
Joined: Thu Dec 10, 2009 6:21 pm UTC
Location: Placerville, CA

Re: Coding: Fleeting Thoughts

Postby commodorejohn » Mon Aug 12, 2019 5:11 pm UTC

Not seeing a downside there.
"'Legacy code' often differs from its suggested alternative by actually working and scaling."
- Bjarne Stroustrup
www.commodorejohn.com - in case you were wondering, which you probably weren't.

Flumble
Yes Man
Posts: 2263
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Mon Aug 12, 2019 5:18 pm UTC

chridd wrote:Hmb<NUL>& wouldnb<NUL><CAN>t that just mess thingsb<NUL><DC4>like, everythingb<NUL><DC4>up further?
What the hell kind of encoding are you using where codepoints below 128 (in particular , and ') require three array elements?

But nah, it wouldn't mess up commodorejohn's UTF-32 string. :mrgreen: ...except for exposing the user to their combining characters and zero-width joiners and mapping 137,994 valid codepoints to 128 may-be-valid codepoints.

chridd
Has a vermicelli title
Posts: 846
Joined: Tue Aug 19, 2008 10:07 am UTC
Location: ...Earth, I guess?

Re: Coding: Fleeting Thoughts

Postby chridd » Mon Aug 12, 2019 9:07 pm UTC

Those are …, ’, and —, and it's UTF-8
(…although I think I messed up and used ‘ instead of ’)
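
For anyone wondering how three-byte UTF-8 characters end up as things like b, NUL, &: the garbled text in the earlier post looks consistent with each UTF-8 byte having its top bit cleared. That mechanism is an assumption inferred from the visible output, not something stated in the thread, and `strip_high_bit` is a made-up helper name. A minimal sketch:

```python
def strip_high_bit(text: str) -> bytes:
    """Encode as UTF-8, then clear the top bit of every byte (assumed mangling)."""
    return bytes(b & 0x7F for b in text.encode("utf-8"))

# Horizontal ellipsis, left single quotation mark, em dash.
for ch in ("\u2026", "\u2018", "\u2014"):
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8.hex()} -> {strip_high_bit(ch)!r}")

# U+2026: e280a6 -> b'b\x00&'
# U+2018: e28098 -> b'b\x00\x18'
# U+2014: e28094 -> b'b\x00\x14'
```

0x18 is CAN and 0x14 is DC4, which matches the control characters shown above, and also matches the left quote (0x98) rather than the right one (0x99).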

Xanthir
My HERO!!!
Posts: 5426
Joined: Tue Feb 20, 2007 12:49 am UTC
Location: The Googleplex

Re: Coding: Fleeting Thoughts

Postby Xanthir » Wed Aug 14, 2019 5:07 pm UTC

The problem is never with Unicode. Unicode, and its canonical-in-practice encoding UTF-8, is easy and simple to use, and solved all of our problems.

The problem is always, *always*, with legacy bullshit that mangles encodings and tries to treat things as ASCII or Latin-1 (or, occasionally, weirder things). Then you get garbled bullshit. If everyone just uses a proper string type that understands codepoints, and uses UTF-8 to encode their strings to bytes, everything works perfectly.
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))
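
To make the "legacy code treats things as Latin-1" failure concrete, here is a small sketch in Python (chosen only for illustration; the sample text is made up). It shows a clean UTF-8 round trip, the mojibake you get when the same bytes are decoded as Latin-1, and the fact that this particular mistake is reversible as long as nothing else has touched the text.

```python
text = "Jos\u00e9, a\u00f1o, na\u00efve"   # José, año, naïve

# Correct handling: encode to UTF-8 bytes, decode as UTF-8.
data = text.encode("utf-8")
assert data.decode("utf-8") == text

# Legacy-style mistake: the same bytes read as Latin-1 become mojibake,
# because each two-byte UTF-8 sequence is shown as two separate characters.
garbled = data.decode("latin-1")
print(garbled)

# Undo the mistake by reversing the wrong decode, then decoding properly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == text)
```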

Sizik
Posts: 1260
Joined: Wed Aug 27, 2008 3:48 am UTC

Re: Coding: Fleeting Thoughts

Postby Sizik » Wed Aug 14, 2019 6:22 pm UTC

Xanthir wrote:The problem is never with Unicode. Unicode, and its canonical-in-practice encoding UTF-8, is easy and simple to use, and solved all of our problems.

The problem is always, *always*, with legacy bullshit that mangles encodings and tries to treat things as ASCII or Latin-1 (or, occasionally, weirder things). Then you get garbled bullshit. If everyone just uses a proper string type that understands codepoints, and uses UTF-8 to encode their strings to bytes, everything works perfectly.

From an input/data processing standpoint, yes. From a typesetting/font rendering perspective, it's more the fact that human language is complicated, and you have to deal with a lot of rules about how we've chosen to encode it in its various forms.
she/they
gmalivuk wrote:
King Author wrote:If space (rather, distance) is an illusion, it'd be possible for one meta-me to experience both body's sensory inputs.
Yes. And if wishes were horses, wishing wells would fill up very quickly with drowned horses.

ucim
Posts: 6888
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: Coding: Fleeting Thoughts

Postby ucim » Wed Aug 14, 2019 6:40 pm UTC

Xanthir wrote:The problem is never with Unicode.
Uh... that's a rather limited view. Perhaps you can enlighten me.

One example, "e with an acute accent" or é. It's treated as a different character from e without an accent, or e. But it's not. It's the character 'e', modified by putting an accent on it. Some languages use diacritical marks like this to indicate pronunciation; they don't change the spelling, they modify the appearance of the word (at the letter level). And (at least in Spanish) this is generally only done with lowercase glyphs. Capital letters don't often use the accent at all.

OTOH, you have "n with a tilde", or ñ. This (in Spanish at least) is a different letter. It is not a modified 'n' (although its history indicates that the tilde itself once represented "the rest of a(n understood) word" and was used with many different letters; now only the ñ remains.)

José is spelled with the letter e (and often misspelled, even by me, as Jose)

año is spelled with the letter ñ. Spell it with an n and you get a completely different word.

Granted, papa and papá are also different words (Pope or potato, and father) but they are spelled the same. The final letter in both cases is an a. And not to put too fine a point on it, el papa is the pope, and la papa is the potato. Nonetheless, the a without the accent is the same letter as the a with an accent. The accent is a modifier, which is separate in concept from a letter.

Why (other than using up real estate) did Unicode choose to create different codepoints (which many fonts don't even bother to fill!) for letters with and without common diacritics, instead of creating a diacritic codepoint (for each diacritic) and a code for "put this on top of that"? It's not like they <koff> emoji <koff> couldn't do that.

I await enlightenment (but hopefully not like First Cleric!)

Jose
Order of the Sillies, Honoris Causam - bestowed by charlie_grumbles on NP 859 * OTTscar winner: Wordsmith - bestowed by yappobiscuts and the OTT on NP 1832 * Ecclesiastical Calendar of the Order of the Holy Contradiction * Heartfelt thanks from addams and from me - you really made a difference.

speising
Posts: 2364
Joined: Mon Sep 03, 2012 4:54 pm UTC
Location: wien

Re: Coding: Fleeting Thoughts

Postby speising » Wed Aug 14, 2019 7:01 pm UTC

ucim wrote:
Xanthir wrote:The problem is never with Unicode.
Uh... that's a rather limited view. Perhaps you can enlighten me.



Why (other than using up real estate) did Unicode choose to create different codepoints (which many fonts don't even bother to fill!) for letters with and without common diacritics, instead of creating a diacritic codepoint (for each diacritic) and a code for "put this on top of that"? It's not like they <koff> emoji <koff> couldn't do that.

I await enlightenment (but hopefully not like First Cleric!)

Jose

https://en.m.wikipedia.org/wiki/Combini ... ical_Marks ?

ucim
Posts: 6888
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: Coding: Fleeting Thoughts

Postby ucim » Thu Aug 15, 2019 4:11 am UTC

Thanks. So if I understand, Unicode did it right, and then it did it wrong? Is this a case of too many standards? (Answer in the comic itself!) And how well supported is this? (Maybe the Windows history of having its own extended ASCII code for e-with-an-accent contributed to this?)

Jose

phlip
Restorer of Worlds
Posts: 7573
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia

Re: Coding: Fleeting Thoughts

Postby phlip » Thu Aug 15, 2019 7:14 am UTC

Unicode has many masters. Both in the sense that there are a number of different goals it's trying to achieve, and in the sense that it's old and the people who run it have changed, and the attitudes of those people have changed over the years.

One of those goals is that Unicode should be round-trip-invariant to every other charset... you should be able to take a string in any charset, convert it to Unicode, then convert it back to that charset, and have the resulting string be unchanged. Which means that Unicode needs to include every character that's been included in any previous charset, and they need to be distinguishable. If some charset has a character for "é" then Unicode needs to have a character for "é". If old DOS charsets have box-drawing characters, Unicode needs to have box-drawing characters. If some Japanese phone text-message charset has emoji, then Unicode needs to have emoji.

Now, sure, maybe they could have written the decoding tables to say that an "é" in some random charset should encode to "e plus a combining diacritic" in Unicode, and then that combination should encode back to "é" in the random charset. But there are definitely tradeoffs in adding that complexity, especially back in the day when Unicode was starting.

And even then, there are existing character sets that include both pre-composed and combining diacritics, that include both "é" as a single codepoint, and "e plus a combining diacritic" in the source charset. So these have to decode into different Unicode strings, otherwise you wouldn't be able to round-trip encode them back to different strings.
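
The round-trip goal can be checked mechanically for single-byte charsets. The sketch below uses Python's built-in `latin-1` and `cp437` codecs as stand-ins for "some random charset"; `round_trips` is a hypothetical helper, and the check only covers these two codecs, not every legacy charset.

```python
def round_trips(charset: str) -> bool:
    """True if every byte 0..255 survives bytes -> str -> bytes unchanged."""
    for value in range(256):
        raw = bytes([value])
        if raw.decode(charset).encode(charset) != raw:
            return False
    return True

for charset in ("latin-1", "cp437"):
    print(charset, round_trips(charset))

# Old DOS box-drawing bytes map to dedicated Unicode code points
# (U+2554, U+2550, U+2557), which is why they can be encoded back.
box = bytes([0xC9, 0xCD, 0xBB]).decode("cp437")
print([f"U+{ord(ch):04X}" for ch in box])
```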

Unicode has a whole system for monitoring and manipulating these things, and ways for programs to consider "é" and "e plus a combining diacritic" as the same character.
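
The system in question is Unicode normalization. A minimal sketch using Python's standard `unicodedata` module (`same_text` is a hypothetical helper name): the two spellings of é compare unequal as raw code point sequences, but become equal once both sides are normalized to the same form.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# Raw comparison sees two different code point sequences.
print(precomposed == combining, len(precomposed), len(combining))

# NFC composes where possible, NFD decomposes where possible.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text("Jos\u00e9", "Jose\u0301"))
```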

enum ಠ_ಠ {°□°╰=1, °Д°╰, ಠ益ಠ╰};
void ┻━┻︵​╰(ಠ_ಠ ⚠) {exit((int)⚠);}
[he/him/his]

speising
Posts: 2364
Joined: Mon Sep 03, 2012 4:54 pm UTC
Location: wien

Re: Coding: Fleeting Thoughts

Postby speising » Thu Aug 15, 2019 12:55 pm UTC

Also, there are languages where é really *is* a separate letter. The real fault lies with those who use this character when writing Spanish, instead of the combining variant. (Well, usually you don't really have a choice on a normal keyboard.)

Xenomortis
Not actually a special flower.
Posts: 1455
Joined: Thu Oct 11, 2012 8:47 am UTC

Re: Coding: Fleeting Thoughts

Postby Xenomortis » Thu Aug 15, 2019 2:04 pm UTC

Sounds like all those other "not-English" languages are a mistake. :D

Flumble
Yes Man
Posts: 2263
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Thu Aug 15, 2019 2:35 pm UTC

That's very naïve, señor.

Xanthir wrote:The problem is never with Unicode. Unicode, and its canonical-in-practice encoding UTF-8, is easy and simple to use, and solved all of our problems.

Unicode has a clear goal and all its codepoints are catalogued well, but it is the source of (or at least is a good target to get the blame for) a lot of problems, er, challenges. Databases and search engines need to work with both precomposed characters and characters with combining diacritics, plus fuzzy equalities, while text drawing engines need to handle not only hundreds of font rules, but also Unicode rules for directionality and combination.
While most of the time you can rely on an existing engine (with millions of man-hours) to do the work for you, sometimes you want/have to do it yourself and you fall headfirst into the Abyss of Text.
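
As a taste of the "fuzzy equalities" part, here is a deliberately crude accent-insensitive search key. `fold` is a hypothetical helper, not how any particular database does it. It also shows why this is a policy decision rather than a solved detail: it treats "año" and "ano" as equal, which is exactly the distinction made earlier in the thread for Spanish ñ.

```python
import unicodedata

def fold(text: str) -> str:
    """Crude search key: decompose, drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Precomposed and combining spellings, and the unaccented misspelling, all match.
print(fold("Jos\u00e9") == fold("Jose\u0301") == fold("JOSE"))

# Side effect: ñ loses its tilde too, so two different Spanish words collide.
print(fold("a\u00f1o"), fold("ano"))
```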

Xanthir
My HERO!!!
Posts: 5426
Joined: Tue Feb 20, 2007 12:49 am UTC
Location: The Googleplex

Re: Coding: Fleeting Thoughts

Postby Xanthir » Thu Aug 15, 2019 4:06 pm UTC

In context, I was operating under the assumption that virtually anyone's complaints about Unicode are actually about encoding-confusion producing mojibake. People blame Unicode whenever a legacy program decides that only ASCII exists, or that it should output in Windows-1252 encoding, or something like that.

That all said, yeah, turns out sorting strings is hard. Blame human language, not Unicode. ^_^
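
A quick illustration of why sorting is hard, sketched in Python with made-up sample words. Sorting by raw code point puts capitals first and accented letters after "z". The `rough_key` function below is only a hypothetical approximation; real collation follows the Unicode Collation Algorithm with per-language tailoring (Spanish puts ñ after n, for example), which this does not attempt.

```python
import unicodedata

words = ["zebra", "\u00e9clair", "apple", "Eclipse"]

# Default sort compares code points: 'E' < 'a' < 'z' < 'é'.
by_code_point = sorted(words)

def rough_key(word: str):
    """Approximate key: ignore accents and case, break ties on the original."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

roughly_alphabetical = sorted(words, key=rough_key)

print(by_code_point)          # Eclipse, apple, zebra, éclair
print(roughly_alphabetical)   # apple, éclair, Eclipse, zebra
```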

Link
Posts: 1419
Joined: Sat Mar 07, 2009 11:33 am UTC
Location: ᘝᓄᘈᖉᐣ

Re: Coding: Fleeting Thoughts

Postby Link » Fri Aug 16, 2019 12:02 am UTC

Combining characters (diacritics, emoji modifiers, etc.) are indeed one of the major pitfalls I was running into. Such things tend to break not-wholly-unreasonable assumptions such as "reversing the order of code points reverses the string", which is annoyingly nontrivial to correct. Then there's also the issue of there being multiple encodings, though frankly I have few qualms about telling non-UTF-8 users to get fucked.
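
The reversal pitfall is easy to reproduce. The sketch below (Python, hypothetical function names) reverses "José" written with a combining accent: the naive version detaches the accent from its letter, while the second version keeps each combining mark glued to the base character it follows. It is only a partial fix, since it relies on the canonical combining class and so does not handle emoji modifiers, zero-width joiner sequences, or full grapheme cluster segmentation.

```python
import unicodedata

def naive_reverse(text: str) -> str:
    """Reverse the sequence of code points."""
    return text[::-1]

def reverse_keeping_marks(text: str) -> str:
    """Reverse base characters, keeping combining marks attached to their base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # nonzero combining class: attach to previous
        else:
            clusters.append(ch)     # start a new cluster
    return "".join(reversed(clusters))

word = "Jose\u0301"   # "José" with e + COMBINING ACUTE ACCENT

broken = naive_reverse(word)          # accent now comes first, with no base before it
fixed = reverse_keeping_marks(word)   # "e" + accent, then "s", "o", "J"
print([f"U+{ord(c):04X}" for c in broken])
print([f"U+{ord(c):04X}" for c in fixed])
```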

Of course, Unicode itself is a Very Good Thing™, which is why I want to at least semi-properly support it. (Except I'm trying very hard to ignore that right-to-left languages are a thing for the time being, because ugh.)

Very tangentially related FT: why the fuck do C++ strings not have a constructor of signature std::string(char), while they do have an assignment operator std::string &operator=(char)?

Tub
Posts: 475
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Fri Aug 16, 2019 8:30 am UTC

For some reason, nobody considers encoding A as <uppercase> + <a>, even though they are considered "the same character" in any language I know. But with accents, that's different.

Sorting and case-insensitive comparisons are solved problems - if you're willing to use existing libraries. Of course, nothing will save you from using those solutions in a stupid way, like PHP did - function and class names are case insensitive, but the comparisons were made based on your current locale. Thus, PHP scripts could have entirely different semantics based on your locale. :roll:
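
For contrast, here is what a locale-independent case-insensitive comparison looks like in Python, whose `str` methods follow the Unicode case mappings and ignore the process locale. This is an illustration of the general idea, not a reproduction of the PHP behavior; `same_ignoring_case` is a hypothetical helper.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison using Unicode case folding, independent of locale."""
    return a.casefold() == b.casefold()

# Case folding handles more than ASCII: German sharp s folds to "ss".
print(same_ignoring_case("STRASSE", "stra\u00dfe"))

# "I" always lowercases to "i" here, even though Turkish would want a
# dotless i (U+0131). That is what keeps identifiers stable across locales.
print("TITLE".lower())

# Capital I with dot above lowercases to "i" plus a combining dot (U+0307).
print([f"U+{ord(c):04X}" for c in "\u0130".lower()])
```

The tradeoff is that language-specific rules, like the Turkish ones, have to be applied explicitly by something that knows the text's language.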

Reversing a string is a problem I haven't seen outside of programming interviews. Correctly handling a RTL marker on the other hand...

User avatar
ucim
Posts: 6888
Joined: Fri Sep 28, 2012 3:23 pm UTC
Location: The One True Thread

Re: Coding: Fleeting Thoughts

Postby ucim » Fri Aug 16, 2019 2:39 pm UTC

Tub wrote:For some reason, nobody considers encoding A as <uppercase> + <a>, even though they are considered "the same character" in any language I know.
Yes, you are right. That is an inconsistency. However, one exception is far easier to handle in code than the cascade of exceptions and exceptions to the exceptions we have now.

But in any case, thanks (to all!) for the insight and history lesson. I see why it "couldn't have been any other way".

How are roman numerals handled in arithmetic? If I type 6+3 I get 9. If I type (the roman numeral codepoints) VI + III do I get IX?
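
As far as Python is concerned (other languages may differ), the answer is "sort of": the Roman numeral code points carry numeric values as character properties, so you can look them up and add them yourself, but `int()` will not parse them, and nothing turns the sum back into Ⅸ for you. A small sketch using the standard `unicodedata` module:

```python
import unicodedata

six, three, nine = "\u2165", "\u2162", "\u2168"   # Ⅵ, Ⅲ, Ⅸ

# Each code point has a numeric value in the Unicode character database.
total = unicodedata.numeric(six) + unicodedata.numeric(three)
print(total == unicodedata.numeric(nine))

# int() only accepts decimal digits, so the numeral is rejected.
try:
    int(six)
    rejected = False
except ValueError:
    rejected = True
print(rejected)

# Compatibility normalization rewrites the numeral as plain Latin letters.
as_letters = unicodedata.normalize("NFKC", six)
print(as_letters)
```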

(And we won't even go into the fact that font designers can make a three that looks like a six. Is that a bug or a feature?)

Jose

