Working Review of "Practical ML Programming with SML#" (Ohori, Ueno), CHAPTER 7
Interoperability with the C language
2023-10-22
I was quite ambivalent going into the chapter on C interoperability. For one, I’ve never done any serious programming in C, and two, I know that C FFIs are notorious for being difficult to operate. Having done the exercises, I can say that SML# does live up to its claims of easy C interoperability, but there are some papercuts that the C-naïve programmer (such as yours truly) will have to sustain to make it all come together.
First, let’s quickly summarize the contents of this chapter. Then, below, I’ll list out my various gripes and difficulties, including my methods of overcoming them.
Chapter summary
7.1 ML and the role of C interoperability
A quick motivating summary of why C interop is important. The authors make some very good points about hardware-specific libraries becoming first available as C libraries. This intro serves to whet your appetite for the subsequent tech demos.
7.2 Datatypes suitable for direct pass-through
Starting out with the easy things first: in this case, a list of SML# datatypes that can be passed directly to C functions as arguments and received as return values. Unsurprisingly, we’re limited to chars and the numeric types, excluding IntInfs (a.k.a. BigInts).
7.3 C import expressions
A short and sweet section that has us define a little C library and then import it directly into the interactive smlsharp interpreter, utilizing the numeric types described in the previous section. Overall, the didactic value would be excellent, if not for the fact that the example won’t work without explicitly specifying LD_LIBRARY_PATH on the command line. This is a pattern that will repeat itself, and on which I’ll have more to say below.
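To give a feel for the kind of C code this section is talking about, here is a minimal sketch of such a "little library". The function names are my own inventions, not the book's. Every function takes and returns only char or numeric values, so nothing needs converting at the boundary. On the SML# side I would expect import types along the lines of int16 -> int16 and (real, real) -> real, but treat that mapping as an assumption rather than a quote from the book.

```c
#include <stdint.h>

/* Hypothetical example functions, not taken from the book. Each one only
   handles plain numeric or char values, the kind of data that can cross
   the FFI boundary without any conversion. */

/* 16-bit integer in, 16-bit integer out. The multiplication itself is
   carried out in int and the result is narrowed back to 16 bits. */
int16_t sqr_i16(int16_t n) {
    return (int16_t)(n * n);
}

/* Two doubles in, one double out. */
double average(double a, double b) {
    return (a + b) / 2.0;
}

/* A single char in, a single char out; only ASCII lowercase is changed. */
char to_upper_ascii(char c) {
    if (c >= 'a' && c <= 'z') {
        return (char)(c - 'a' + 'A');
    }
    return c;
}
```

Compiled into a shared object, functions of this shape are what the import expressions in this section are pointed at.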
7.4 Separate compilation and linking
Kicking it up a notch, we’re now going to develop an SML# module that encapsulates a C library. In this case, it’s the Mersenne Twister random number generator, another Japanese contribution to software engineering. Excellent.
This section feels like a “production-grade” restatement of 7.3. It exposes the entire surface area of the Mersenne Twister C library, and demonstrates how to modify the autogenerated Makefile to ensure that the .so file is linked to the output binary.
I did have several difficulties completing this one, ranging from the innocent (the URL for the MT source files has changed at Hiroshima University), to the annoying (had to modify the autogenerated Makefile by hand to include the .so file), and the already known (need to set up LD_LIBRARY_PATH in order to run the resulting binary).
Overall, the difficulties are not insurmountable, and once the kinks are ironed out, it’s quite a fun experience to build a Mersenne Twister facade structure in SML#. This section gets more treatment below.
7.5 Exporting data with a pointer-based runtime representation
The datatypes in question are: string, τ ref, τ array, tuples, and records. As long as the τ is one of the types from section 7.2, we can freely pass these in to C functions as-is. The caveat here is that strings are interpreted as const char *, as they are immutable in SML#.
The example used to illustrate this is modf from the C standard library, which overwrites its second argument (double *iptr in C, real ref in SML#).
While it’s easy to work with simple types, I think marshaling SML# records into C-land, with their fields being order-dependent, would be a more involved topic.
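For readers who, like me, don't use C daily, here is what the out-parameter behaviour of modf looks like from the C side. This is a minimal sketch: split_fraction is a hypothetical wrapper name of my own, and only modf itself comes from the standard library.

```c
#include <math.h>

/* Hypothetical wrapper around the standard modf. It returns the fractional
   part of x and overwrites *int_part with the integral part. That
   overwritten double is what the real ref stands for on the SML# side. */
double split_fraction(double x, double *int_part) {
    return modf(x, int_part);
}
```

Calling split_fraction(3.25, &whole) returns 0.25 and leaves 3.0 in whole. Both parts carry the sign of the input, so -3.25 splits into -3.0 and -0.25.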
7.6 Importing data with a pointer-based runtime representation
While translating pointer-based SML# data into C is automatically done by the SML# runtime, we can’t create such data in C-land and then use it on the SML# side. This is because pointers created in C lack the necessary GC metadata for the SML# runtime to be able to handle their lifecycle correctly.
Hence, it’s necessary to export the data from C as pointers, and subsequently read the pointed-to data from SML#, using the Pointer structure.
Apart from listing the interface of structure Pointer, this section shows us how to read strings into SML# (via getenv), and also how to use FILE pointers to do byte-based disk I/O. Fun stuff, no snags.
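To show where those pointers come from, here is a small C sketch of the two standard-library patterns involved. The helper names env_value_length and count_bytes are hypothetical. The calls getenv, fopen, fgetc and fclose are the standard ones.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* getenv hands back a pointer into memory owned by the C runtime, or NULL
   when the variable is unset. The caller may read through the pointer but
   must not free it. Returns the value's length, or -1 when unset. */
long env_value_length(const char *name) {
    const char *value = getenv(name);
    return value != NULL ? (long)strlen(value) : -1;
}

/* FILE * is an opaque handle: it is only ever passed back to stdio
   functions, never dereferenced directly. Reads the file one byte at a
   time and returns the byte count, or -1 if the file cannot be opened. */
long count_bytes(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }
    long count = 0;
    while (fgetc(fp) != EOF) {
        count++;
    }
    fclose(fp);
    return count;
}
```

In both cases the memory behind the pointer is created and managed by C. As far as I can tell, that is why the SML# side treats such values as plain pointers and reads through them explicitly.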
7.7 An example of integrating a polymorphic C library
The famous quicksort is up next, but we’re not going to be rehashing the old “quicksort in three lines” trope of functional programming. We’re actually going to implement a polymorphic (and in-place mutating) quicksort in SML#, which will use, under the hood, the stdlib qsort:
void qsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
This is a much more advanced example, which features a polymorphic subject pointer (base), the need to know the size of an array element in C (size_t size), and requires us to pass in a pointer to a comparison function.
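For reference, this is how those three ingredients look when qsort is called from plain C. It is a minimal sketch with hypothetical names (compare_doubles, sort_doubles), not the book's SML# code.

```c
#include <stdlib.h>

/* The comparison callback receives untyped pointers to two elements and
   must cast them back to the real element type itself. It returns a
   negative, zero or positive int. */
static int compare_doubles(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the array in place. qsort sees only an untyped base pointer, so
   the element size has to be passed explicitly alongside the count. */
void sort_doubles(double *values, size_t count) {
    qsort(values, count, sizeof values[0], compare_doubles);
}
```

Nothing in C checks that the element size and the casts inside the callback agree with the actual array. That is the gap a typesafe wrapper closes.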
This section is probably the most satisfying of all. An elegant implementation of a complex FFI interaction is much more impressive than the examples involving marshaling numbers across the C boundary.
In the end, we have a typesafe qsort function which mutates arrays in-place, and which we can use to sort the numbers conveniently generated with the Mersenne Twister from the beginning of the chapter.
7.8 Exercises
After having done the exercises, we’ll have a typesafe quicksort which accepts the ‘standard’ SML comparison functions, as defined in the Basis library.
qsort inputArray Int64.compare
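The crux of the exercise is that a Basis-style comparison returns an order (LESS, EQUAL or GREATER), while C's qsort expects a negative, zero or positive int. Here is that mapping sketched in C terms. Every name below is hypothetical, and the enum is only a stand-in for the SML datatype; this is not my SML# solution.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for SML's order datatype. */
typedef enum { ORDER_LESS, ORDER_EQUAL, ORDER_GREATER } sml_order;

/* Plays the role of Int64.compare: a three-way comparison that returns an
   order value instead of an int. */
sml_order compare_i64(int64_t a, int64_t b) {
    if (a < b) return ORDER_LESS;
    if (a > b) return ORDER_GREATER;
    return ORDER_EQUAL;
}

/* Translates an order value into the int convention that qsort expects. */
int order_to_int(sml_order o) {
    switch (o) {
    case ORDER_LESS:    return -1;
    case ORDER_GREATER: return 1;
    default:            return 0;
    }
}

/* Adapter with the exact callback signature qsort requires. */
static int qsort_compare_i64(const void *a, const void *b) {
    return order_to_int(compare_i64(*(const int64_t *)a,
                                    *(const int64_t *)b));
}

/* Sorts 64-bit integers in place using the adapted comparison. */
void sort_i64(int64_t *values, size_t count) {
    qsort(values, count, sizeof values[0], qsort_compare_i64);
}
```

Comparing explicitly, rather than returning a - b, also avoids overflow for values near the ends of the 64-bit range.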
Now, the gripes
1. LD_LIBRARY_PATH
Maybe I’m missing something here, and it’s my lack of C experience speaking, but the SML# tooling simply doesn’t work with directory-local .so files. Despite the examples all using -L. -lmy to load libmy.so, this simply doesn’t work on my machine. It’s not a big deal for me and doesn’t detract from the experience, but I can’t help but feel that there’s something missing here: either in my knowledge of C-based workflows, or in the difference between my machine and the SML# creators’ machines.
My solution is as follows (using libsqr.so as an example):
% cat libsqr.c
short sqrShort (short n) { return (n * n); }
% gcc -shared -fPIC -o libsqr.so libsqr.c
% smlsharp -L. -lsqr
SML# for x86_64-unknown-linux-gnu with LLVM 12.0.1
# val sqr = _import "sqrShort" : int16 -> int16;
dynamic link failed: libsqr.so: cannot open shared object file: No such file or directory
% LD_LIBRARY_PATH=$(pwd) smlsharp -L. -lsqr
SML# for x86_64-unknown-linux-gnu with LLVM 12.0.1
# val sqr = _import "sqrShort" : int16 -> int16;
val sqr = fn : int16 -> int16
# sqr 16;
val it = 256 : int16
2. Autogenerated Makefiles and C library compilation
I love the fact that SML# leans on make to achieve its separate compilation capabilities. But I’ve found I disagree with the designers on the role of the autogenerated Makefiles in the general programming flow.
In my personal workflows (including this blog), I like to use Makefiles as a top-level driver for development. So, for example, to build and publish the Jekyll output, I’ll run make publish, and that’s that.
When developing a multi-language project, it makes sense to tie together the various build tools offered by the languages (npm, mix, cargo) at a high level, so that running make build builds all the subcomponents of a program, farming out the language-specific details to, say, mix compile.
So what happens when we’re developing a C-based library inside of our SML# project, and we want to use a top-level Makefile to run it all?
So far, I’ve used this pattern:
all0: Makefile.smlsharp all
Makefile.smlsharp: $(shell find . | grep '.smi$$')
	smlsharp -MMm main.smi > $@
include Makefile.smlsharp
This lets me have one standard Makefile at the top level of my project, where I can define non-SML# targets and tasks, and also have smlsharp regenerate Makefile.smlsharp whenever there are changes in the inter-module dependency structures.
In short, Makefile.smlsharp is fully defined by the existing .smi files in the project, hence I think of it as non-essential, ephemeral data that can go into .gitignore. I don’t ever have to look at these files; I just rely on the fact that they can build main along with all its dependencies.
Now, if we add a C-based shared library into our development mix, we have something like the following:
all0: mt19937-64.so Makefile.smlsharp all
mt19937-64.so: mt19937-64.c
	gcc -shared -fPIC $< -o $@
mt19937-64.c:
	curl -s -O http://www.math.sci.hiroshima-u.ac.jp/m-mat/MT/VERSIONS/C-LANG/mt19937-64.c
Makefile.smlsharp: $(shell find . | grep '.smi$$')
	smlsharp -MMm main.smi > $@
include Makefile.smlsharp
But now, our main file will fail to link to mt19937-64.so. Why? Because we need to hand-edit the autogenerated Makefile and change the line
LIBS =
to
LIBS = mt19937-64.so
This requirement effectively changes the status of the autogenerated Makefile from ephemeral and easily-recreated to something non-ephemeral, requiring frequent manual modification.
Since I honestly believe we should be able to treat the autogenerated Makefiles as throwaway artifacts, I’ve made a pull request to SML# that, if accepted, will allow us to define LIBS in the environment, and have the autogenerated Makefile pick it up.
This means that we can put the line
LIBS=mt19937-64.so
in our top-level Makefile and never have to hand-edit the output of smlsharp -MMm.
3. The Pointer structure
All of the examples involving marshaling data into SML# involve the interactive interpreter, and not standalone compilation. So when the time comes to compile your binary with quicksort, you run into this error message:
(name evaluation "190") unbound variable: Pointer.load
Using the very same files in an interactive session works, but compilation fails. Poring over the example code in the chapter gives no clue.
It turns out that you need to _require "ffi.smi" to get access to the Pointer structure. It helps to have the SML# source checked out locally, as the example code is very helpful in figuring out small things like this.
I feel that the lack of _require "ffi.smi" in the chapter text is an omission.
4. The language of import declarations
This gripe is perhaps very minor, but it’s details like this that make some languages much harder to learn than others. In Standard ML, and by extension in SML#, there are two ‘languages’ to master. One is the type-level language, the other is the value-level language. They are similar but not the same.
For example, the type-level definition of a function that takes a 2-tuple of two types and returns nothing is:
(* Type language *)
val someFun : 'a * 'b ref -> unit
Now, the value-level implementation of this has a slightly different syntax:
(* Value language *)
fun someFun (a,b) = ()
Note the differences:
1) The type-level tuple constructor is _ * _. The value-level tuple constructor is ( _ , _ ).
2) The type-level syntax for a type variable is 'a. The value-level variable is a.
3) The name of the unit type is unit. The value of the unit type is ().
On top of this confusion, SML# throws in another language: the language describing the types of imported FFI functions. For the function above, it would be something like:
val c_someFun = _import "someFun" : ('a, 'b ptr) -> ()
This language seems to combine both the value-level and type-level languages. I know there must have been a good reason for introducing it, but I feel having fresh learners learn yet another subtle “DSL” is an educational barrier.
For what it’s worth, Haskell does a better job of making the type-level and value-level languages resemble each other:
-- type language
someFun :: (a, b) -> ()
-- value language
someFun (a,b) = ()
Summing up
This chapter was very interesting to read and implement. For someone who is not used to working with C, there are some gotchas that aren’t clearly marked in the book, and I had to figure things out myself.
As with all things related to binary interoperability, it feels like a lot of the designs were informed by real-world constraints and tastes of the authors.
All-in-all, it feels like developing an SML# wrapper around C libraries should be quite straightforward. This bodes well for future chapters, which involve graphics programming with Cairo and the like. I’m looking forward to what comes next.