Unicode string literals in PolyML
A small hack with a large ergonomic payoff
2024-04-28
As I wrote previously, I’ve put SML# away and gone back to recreational programming in plain old Standard ML. I really love the spartan feeling of the language and the fact that it doesn’t allow for too much fanciness (module-level magic aside), while still enabling a nice, high-level functional-programming experience.
One “spartan” aspect of Standard ML which does not constitute a nice experience is its insistence on all string literals containing only ASCII characters. This is limiting and frustrating, especially for those of us who use non-English alphabets daily. Thanks to the clever design of UTF-8, though, a Standard ML string that holds UTF-8 encoded bytes is displayed nicely on your screen when printed.
Here is a demo:
$ cat emoji.txt
🙈🙉🙊
$ poly
Poly/ML 5.7.1 Release
> val textFile = TextIO.openIn "emoji.txt";
val textFile = ?: TextIO.instream
> val (SOME s) = TextIO.inputLine textFile;
val s = "\240\159\153\136\240\159\153\137\240\159\153\138\n": string
> print s;
🙈🙉🙊
val it = (): unit
As you can see, if we can get the Unicode text into a string, we can display it thanks to the pervasive Unicode support in our OS. That’s good enough for me. However, it’s not easy to get that Unicode text into a string, other than by reading it from a file:
val monkeys = "🙈🙉🙊";
poly: : error: unprintable character \240 found in string
poly: : error: unprintable character \159 found in string
poly: : error: unprintable character \153 found in string
poly: : error: unprintable character \136 found in string
poly: : error: unprintable character \240 found in string
poly: : error: unprintable character \159 found in string
poly: : error: unprintable character \153 found in string
poly: : error: unprintable character \137 found in string
poly: : error: unprintable character \240 found in string
poly: : error: unprintable character \159 found in string
poly: : error: unprintable character \153 found in string
poly: : error: unprintable character \138 found in string
Static Errors
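On an unpatched Poly/ML, one workaround is to keep raw non-ASCII bytes out of the source entirely and either spell them out with decimal escapes (the same form the REPL printed for s above) or compute them. A minimal sketch, assuming a hypothetical helper utf8 that is not part of the Basis library and does not validate its input:

```sml
(* Hypothetical helper: encode one Unicode code point as a UTF-8 string.
   Surrogates and out-of-range values are not checked. *)
fun utf8 cp =
  let
    (* Continuation byte: 10xxxxxx carrying the low six bits of x. *)
    fun cont x = Char.chr (128 + x mod 64)
  in
    if cp < 0x80 then String.str (Char.chr cp)
    else if cp < 0x800 then
      String.implode [Char.chr (192 + cp div 64), cont cp]
    else if cp < 0x10000 then
      String.implode [Char.chr (224 + cp div 4096), cont (cp div 64), cont cp]
    else
      String.implode [Char.chr (240 + cp div 262144),
                      cont (cp div 4096), cont (cp div 64), cont cp]
  end

(* The three monkeys, U+1F648 to U+1F64A, with no raw UTF-8 in the source. *)
val monkeys = String.concat (map utf8 [0x1F648, 0x1F649, 0x1F64A])

(* The same string written by hand with decimal escapes. *)
val monkeys' = "\240\159\153\136\240\159\153\137\240\159\153\138"
```

Both forms are accepted by a stock compiler, but neither is pleasant to write by hand.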
I’ve spent some time digging into the PolyML code to make the above possible. With my patch for Unicode string literals applied, the declaration that failed above becomes syntactically legal!
$ poly
Poly/ML 5.9.1 Release (Git version v5.9.1-64-ga71e81c1)
> val monkeys = "🙈🙉🙊";
val monkeys = "🙈🙉🙊": string
> String.size monkeys;
val it = 12: int
As you can see from the result of the call to String.size (three emoji are still counted as 12 bytes), my patch does not magically make PolyML strings Unicode-aware, but the ergonomic improvement is fantastic nevertheless.
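Since String.size keeps counting bytes, counting characters remains the program's job. A minimal sketch, assuming well-formed UTF-8 input; utf8Length is a hypothetical helper, not part of the Basis library or of the patch. It counts every byte that is not a continuation byte (10xxxxxx, i.e. 128 to 191):

```sml
(* A UTF-8 continuation byte has the bit pattern 10xxxxxx (128..191). *)
fun isContinuation c =
  let val b = Char.ord c in b >= 128 andalso b < 192 end

(* Hypothetical helper: count code points by skipping continuation bytes.
   Assumes the string is well-formed UTF-8. *)
fun utf8Length s =
  CharVector.foldl (fn (c, n) => if isContinuation c then n else n + 1) 0 s

(* Written with decimal escapes so it also loads on an unpatched Poly/ML. *)
val monkeys = "\240\159\153\136\240\159\153\137\240\159\153\138"

val bytes = String.size monkeys        (* 12 *)
val codePoints = utf8Length monkeys    (* 3 *)
```

This counts code points, not what a reader would perceive as characters; combining sequences would still be counted piece by piece.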
The .patch file linked above also contains tests, but if you’re only interested in the implementation code, here is the relevant diff:
diff --git a/basis/String.sml b/basis/String.sml
index a2b2a7ab..9bca902a 100644
--- a/basis/String.sml
+++ b/basis/String.sml
@@ -158,7 +158,7 @@ local
fun isHexDigit c =
isDigit c orelse (#"a" <= c andalso c <= #"f")
orelse (#"A" <= c andalso c <= #"F")
- fun isGraph c = #"!" <= c andalso c <= #"~"
+ fun isGraph c = #"!" <= c andalso c <= chr 255
fun isPrint c = isGraph c orelse c = #" "
fun isPunct c = isGraph c andalso not (isAlphaNum c)
(* NOTE: The web page includes 0 <= ord c but all chars satisfy that. *)
@@ -719,7 +719,7 @@ local
case getc str of (* Read the first character. *)
NONE => SOME("", str) (* Just end-of-stream. *)
| SOME(ch, str') =>
- if ch < chr 32 orelse chr 126 < ch
+ if ch < chr 32 orelse chr 255 < ch
then NONE (* Non-printable character. *)
else if ch = #"\\"
then (* escape *)
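To see the effect of the first hunk in isolation, the old and new definitions of isGraph can be compared side by side. This is only a sketch: isGraphOld and isGraphNew are names made up here, with bodies copied from the two sides of the diff, and the snippet loads on any Poly/ML, patched or not.

```sml
(* Both definitions copied from the diff, renamed so they can coexist. *)
fun isGraphOld c = #"!" <= c andalso c <= #"~"
fun isGraphNew c = #"!" <= c andalso c <= Char.chr 255

(* 240 is the lead byte of each emoji in the demo above. *)
val lead = Char.chr 240
val oldAccepts = isGraphOld lead   (* false *)
val newAccepts = isGraphNew lead   (* true *)

(* Side effect of the wider range: DEL (127) now counts as graphical too. *)
val delAccepted = isGraphNew (Char.chr 127)   (* true *)
```

As far as the diff shows, the check is on byte values only, so it does not verify that the high bytes form well-formed UTF-8.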