
Unicode string literals in PolyML

A small hack with a large ergonomic payoff

As I wrote previously, I’ve set SML# aside and gone back to recreational programming in plain old Standard ML. I really love the spartan feeling of the language and the fact that it doesn’t allow for too much fanciness (module-level magic aside), while still enabling a nice, high-level functional-programming experience.

One “spartan” aspect of Standard ML which does not constitute a nice experience is its insistence on all string literals containing only ASCII characters. This is limiting and frustrating, especially for those of us who use non-English alphabets daily. Thanks to the clever design of UTF-8, though, a Standard ML string that happens to hold UTF-8 bytes is displayed correctly when printed to a UTF-8 terminal.

Here is a demo:

$ cat emoji.txt
🙈🙉🙊


$ poly
Poly/ML 5.7.1 Release

> val textFile = TextIO.openIn "emoji.txt";
val textFile = ?: TextIO.instream
> val (SOME s) = TextIO.inputLine textFile;
val s = "\240\159\153\136\240\159\153\137\240\159\153\138\n": string
> print s;
🙈🙉🙊
val it = (): unit

As you can see, if we can get the Unicode text into a string, we can display it thanks to the pervasive Unicode support in our OS. That’s good enough for me. However, it’s not easy to get that Unicode text into a string other than by reading it from a file; writing it directly as a literal fails:

val monkeys = "🙈🙉🙊";
poly: : error: unprintable character \240 found in string
poly: : error: unprintable character \159 found in string
poly: : error: unprintable character \153 found in string
poly: : error: unprintable character \136 found in string
poly: : error: unprintable character \240 found in string
poly: : error: unprintable character \159 found in string
poly: : error: unprintable character \153 found in string
poly: : error: unprintable character \137 found in string
poly: : error: unprintable character \240 found in string
poly: : error: unprintable character \159 found in string
poly: : error: unprintable character \153 found in string
poly: : error: unprintable character \138 found in string
Static Errors
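
For comparison, here is a minimal sketch of the workaround that stays within standard string-literal syntax: spelling out each UTF-8 byte as a decimal `\ddd` escape, which is exactly the form the REPL printed for `s` earlier. The names `seeNoEvil`, `hearNoEvil` and `speakNoEvil` are made up for this illustration.

```sml
(* Each emoji is four UTF-8 bytes, written as decimal \ddd escapes.
   The value names are hypothetical, chosen only for this sketch. *)
val seeNoEvil   = "\240\159\153\136"  (* U+1F648 *)
val hearNoEvil  = "\240\159\153\137"  (* U+1F649 *)
val speakNoEvil = "\240\159\153\138"  (* U+1F64A *)

val monkeys = seeNoEvil ^ hearNoEvil ^ speakNoEvil

(* On a UTF-8 terminal this should display the three monkeys. *)
val () = print (monkeys ^ "\n")
```

It works, but nobody wants to look up and type twelve byte values to write three characters.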

I’ve spent some time digging into the PolyML code to make the failing declaration above legal. With my patch for Unicode string literals applied, it is accepted as written:

$ poly
Poly/ML 5.9.1 Release (Git version v5.9.1-64-ga71e81c1)
> val monkeys = "🙈🙉🙊";
val monkeys = "🙈🙉🙊": string
> String.size monkeys;
val it = 12: int

As you can see from the result of the call to String.size, my patch does not magically make PolyML strings Unicode-aware: the string is still a sequence of bytes, and three emoji at four UTF-8 bytes each give a size of 12. The ergonomic improvement is fantastic nevertheless.
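
If you do need the number of characters rather than bytes, it can be computed by hand. The following is only a sketch: `isContinuation` and `utf8Length` are hypothetical helpers (they are not part of the patch or of the Basis Library), and they assume the string holds well-formed UTF-8.

```sml
(* In UTF-8, continuation bytes have the bit pattern 10xxxxxx,
   i.e. values 128 through 191. Every other byte starts a code point. *)
fun isContinuation c =
    let val b = Char.ord c
    in b >= 128 andalso b < 192 end

(* Count code points by counting the bytes that are not continuations.
   Assumes well-formed UTF-8; no validation is attempted. *)
fun utf8Length s =
    List.length (List.filter (not o isContinuation) (String.explode s))

(* The three monkeys, written with escapes so this also loads
   in an unpatched compiler. *)
val monkeys = "\240\159\153\136\240\159\153\137\240\159\153\138"
val bytes = String.size monkeys
val codePoints = utf8Length monkeys
```

Note that this counts code points, which is still not the same thing as user-perceived characters once combining marks are involved.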

The .patch file linked above also contains tests, but if you’re only interested in the implementation code, here is the relevant diff:

diff --git a/basis/String.sml b/basis/String.sml
index a2b2a7ab..9bca902a 100644
--- a/basis/String.sml
+++ b/basis/String.sml
@@ -158,7 +158,7 @@ local
         fun isHexDigit c =
             isDigit c orelse (#"a" <= c andalso c <= #"f")
                  orelse (#"A" <= c andalso c <= #"F")
-        fun isGraph c = #"!" <= c andalso c <= #"~"
+        fun isGraph c = #"!" <= c andalso c <= chr 255
         fun isPrint c = isGraph c orelse c = #" "
         fun isPunct c = isGraph c andalso not (isAlphaNum c)
         (* NOTE: The web page includes 0 <= ord c but all chars satisfy that. *)
@@ -719,7 +719,7 @@ local
         case getc str of (* Read the first character. *)
             NONE => SOME("", str) (* Just end-of-stream. *)
           | SOME(ch, str') =>
-                if ch < chr 32 orelse chr 126 < ch
+                if ch < chr 32 orelse chr 255 < ch
                 then NONE (* Non-printable character. *)
                 else if ch = #"\\"
                 then (* escape *)
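
To see which bytes the first hunk reclassifies, the old and new predicates can be compared side by side. This is just a sketch that copies the two definitions from the diff under the made-up names `isGraphOld` and `isGraphNew`; it does not touch the real `Char.isGraph`.

```sml
(* The two definitions of isGraph from the diff, under hypothetical names. *)
fun isGraphOld c = #"!" <= c andalso c <= #"~"
fun isGraphNew c = #"!" <= c andalso c <= chr 255

(* All 256 possible char values. *)
val allChars = List.tabulate (256, chr)

(* The chars whose classification differs between the two versions. *)
val changed = List.filter (fn c => isGraphOld c <> isGraphNew c) allChars

val firstChanged = Char.ord (List.hd changed)
val lastChanged = Char.ord (List.last changed)
```

The changed range starts at 127 rather than 128, so DEL is now treated as a graphic character too, along with every byte from 128 to 255, whether or not it forms valid UTF-8. Since isPrint and isPunct are defined in terms of isGraph, they change accordingly.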