Working Review of "Practical ML Programming with SML#" (Ohori, Ueno), CHAPTER 7

Interoperability with the C language
2023-10-22

I was quite ambivalent going into the chapter on C interoperability. For one, I’ve never done any serious programming in C, and two, I know that C FFIs are notorious for being difficult to operate. Having done the exercises, I can say that SML# does live up to its claims of easy C interoperability, but there are some papercuts that the C-naïve programmer (such as yours truly) will have to sustain to make it all come together.

First, let’s quickly summarize the contents of this chapter. Then, below, I’ll list out my various gripes and difficulties, including my methods of overcoming them.

Chapter summary

7.1 ML and the role of C interoperability

A quick motivating summary of why C interop is important. The authors make some very good points about hardware-specific libraries becoming first available as C libraries. This intro serves to whet your appetite for the subsequent tech demos.

7.2 Datatypes suitable for direct pass-through

Starting out with the easy things first – in this case, a list of SML# datatypes that can be directly passed to C functions as arguments and received as return values. Unsurprisingly, we’re limited to chars and the numeric types, excluding IntInfs (a.k.a. BigInts).

7.3 C import expressions

A short and sweet section that has us define a little C library, and then import it directly into the interactive smlsharp interpreter, utilizing the numeric types described in the previous section. Overall, the didactic value would be excellent, if not for the fact that the example won’t work without explicitly specifying LD_LIBRARY_PATH on the command line. This is a pattern that will repeat itself, and about which I’ll have more to say below.

7.4 Separate compilation and linking

Kicking it up a notch, we’re now going to develop an SML# module that encapsulates a C library. In this case, it’s the Mersenne Twister random number generator, another Japanese contribution to software engineering. Excellent.

This section feels like a “production-grade” restatement of 7.3. It exposes the entire surface area of the Mersenne Twister C library, and demonstrates how to modify the autogenerated Makefile to ensure that the .so file is linked to the output binary.

I did have several difficulties completing this one, ranging from the innocent (the URL for the MT source files has changed at Hiroshima University), to the annoying (had to modify the autogenerated Makefile by hand to include the .so file), and the already known (need to set up LD_LIBRARY_PATH in order to run the resulting binary).

Overall, the difficulties are not insurmountable, and once the kinks are ironed out, it’s quite a fun experience to build a Mersenne Twister facade structure in SML#. This section does get a bit of treatment below.

7.5 Exporting data with a pointer-based runtime representation

The datatypes in question are: string, τ ref, τ array, tuples, and records. As long as the τ is one of the types from section 7.2, we can freely pass these in to C functions as-is. The caveat here is that strings are interpreted as const char *, as they are immutable in SML#.

The example used to illustrate this is modf from the C standard library, which overwrites its second argument (double *iptr in C, real ref in SML#). While it’s easy to work with simple types, I think marshaling SML# records into C-land, with their fields being order-dependent, would be a more involved topic.
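
For readers as C-naïve as I am, here is a minimal C sketch of the out-parameter behaviour described above. The wrapper name split_real is hypothetical and not from the book; only modf itself is standard.

```c
#include <math.h>

/* Hypothetical wrapper around the standard modf: returns the fractional
   part of x and writes the integral part through integral_out. The
   caller supplies the storage, which is the role the real ref plays on
   the SML# side. */
double split_real(double x, double *integral_out)
{
    return modf(x, integral_out);
}
```

Calling split_real(3.25, &whole) returns the fractional part and leaves the integral part in whole, which is the mutation the SML# caller observes through its ref.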

7.6 Importing data with a pointer-based runtime representation

While translating pointer-based SML# data into C is automatically done by the SML# runtime, we can’t create such data in C-land and then use it on the SML# side. This is because pointers created in C lack the necessary GC metadata for the SML# runtime to be able to handle their lifecycle correctly.

Hence, it’s necessary to export the data from C as pointers, and subsequently read the pointed-to data from SML#, using the Pointer structure.

Apart from listing the interface of structure Pointer, this section shows us how to read strings into SML# (via getenv), and also how to use FILE pointers to do byte-based disk I/O. Fun stuff, no snags.
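
As a rough analogy, and assuming that importing a string boils down to copying bytes out of memory that C owns, the C side of the getenv example can be sketched like this. The names copy_c_string and getenv_copy are hypothetical; they are not part of SML# or the book.

```c
#include <stdlib.h>
#include <string.h>

/* Copies a NUL-terminated C string into freshly allocated memory.
   Returns NULL if src is NULL or if allocation fails. */
char *copy_c_string(const char *src)
{
    if (src == NULL)
        return NULL;
    size_t len = strlen(src);
    char *dst = malloc(len + 1);
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len + 1);   /* include the terminating NUL */
    return dst;
}

/* Looks up an environment variable and returns a private copy of its
   value, or NULL if the variable is unset. The caller frees the result. */
char *getenv_copy(const char *name)
{
    return copy_c_string(getenv(name));
}
```

The point of the copy is ownership: the pointer handed back by getenv belongs to the C runtime, so whoever wants to keep the data has to move it into memory they manage themselves.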

7.7 An example of integrating a polymorphic C library

The famous quicksort is up next, but we’re not going to be rehashing the old “quicksort in three lines” trope of functional programming. We’re actually going to implement a polymorphic (and in-place mutating) quicksort in SML#, which will use, under the hood, the stdlib qsort:

void qsort(void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));

This is a much more advanced example, which features a polymorphic subject pointer (base), the need to know the size of an array element in C (size_t size), and requires us to pass in a pointer to a comparison function.
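
To make the three ingredients concrete, here is a minimal sketch of how qsort is typically called from plain C. The helper names compare_doubles and sort_doubles are mine, not the book’s.

```c
#include <stdlib.h>

/* Comparator in the shape qsort expects: both arguments point to
   elements, and the result is negative, zero, or positive. */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts n doubles in place. The element size is passed explicitly
   because qsort only ever sees a void pointer. */
void sort_doubles(double *values, size_t n)
{
    qsort(values, n, sizeof values[0], compare_doubles);
}
```

The SML# version has to supply the same three things (the array, the element size, and a callback), which is what makes this section more demanding than the earlier ones.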

This section is probably the most satisfying of all. An elegant implementation of a complex FFI interaction is much more impressive than the examples involving marshaling numbers across the C boundary.

In the end, we have a typesafe qsort function which mutates arrays in-place, and which we can use to sort the numbers conveniently generated with the Mersenne Twister from the beginning of the chapter.

7.8 Exercises

After having done the exercises, we’ll have a typesafe quicksort which accepts the ‘standard’ SML comparison functions, as defined in the Basis library.

qsort inputArray Int64.compare
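
The gap the exercise bridges is that Basis comparison functions return a three-way order value, while qsort wants an int. Below is a sketch of that adapter idea written in C. The enum is only a stand-in for the Basis order datatype, and every name here is hypothetical; this is not how the book or SML# implements it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the Basis library's order datatype (an assumption made
   purely for illustration). */
enum order { LESS, EQUAL, GREATER };

/* Three-way comparison in the style of Int64.compare. */
static enum order compare_int64(int64_t x, int64_t y)
{
    if (x < y) return LESS;
    if (x > y) return GREATER;
    return EQUAL;
}

/* Adapter: converts the three-way result into the int qsort expects. */
static int qsort_adapter(const void *a, const void *b)
{
    switch (compare_int64(*(const int64_t *)a, *(const int64_t *)b)) {
    case LESS:    return -1;
    case GREATER: return 1;
    default:      return 0;
    }
}

/* Sorts n 64-bit integers in place using the adapted comparator. */
void sort_int64(int64_t *values, size_t n)
{
    qsort(values, n, sizeof values[0], qsort_adapter);
}
```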

Now, the gripes

1. `LD_LIBRARY_PATH`

Maybe I’m missing something here, and it’s my lack of C experience speaking, but the SML# tooling simply doesn’t work with directory-local .so files. Despite the examples all using -L. -lmy to load libmy.so, this simply doesn’t work on my machine. It’s not a big deal for me and doesn’t detract from the experience, but I can’t help but feel that there’s something missing here: either in my knowledge of C-based workflows, or in the difference between my machine and the SML# creators’ machines.

My solution is as follows (using libsqr.so as an example):

% cat libsqr.c
short sqrShort (short n) { return (n * n); }

% gcc -shared -fPIC -o libsqr.so libsqr.c

% smlsharp -L. -lsqr
SML#  for x86_64-unknown-linux-gnu with LLVM 12.0.1
# val sqr = _import "sqrShort" : int16 -> int16;
dynamic link failed: libsqr.so: cannot open shared object file: No such file or directory

% LD_LIBRARY_PATH=$(pwd) smlsharp -L. -lsqr
SML#  for x86_64-unknown-linux-gnu with LLVM 12.0.1
# val sqr = _import "sqrShort" : int16 -> int16;
val sqr = fn : int16 -> int16
# sqr 16;
val it = 256 : int16

2. Autogenerated Makefiles and C library compilation

I love the fact that SML# leans on make to achieve its separate compilation capabilities. But I’ve found I disagree with the designers on the role of the autogenerated Makefiles in the general programming flow.

In my personal workflows (including this blog), I like to use Makefiles as a top-level driver for development. So, for example, to build and publish the Jekyll output, I’ll run make publish, and that’s that.

When developing a multi-language project, it makes sense to tie together the various build tools offered by the languages (npm, mix, cargo) at a high level, so that running make build builds all the subcomponents of a program, farming out the language-specific details to, say, mix compile.

So what happens when we’re developing a C-based library inside of our SML# project, and we want to use a top-level Makefile to run it all?

So far, I’ve used this pattern:

all0: Makefile.smlsharp all

Makefile.smlsharp: $(shell find . | grep '\.smi$$')
	smlsharp -MMm main.smi > $@

include Makefile.smlsharp

This lets me keep one standard Makefile at the top level of my project, where I can define non-SML# targets and tasks, and have smlsharp regenerate Makefile.smlsharp whenever the inter-module dependency structure changes.

In short, Makefile.smlsharp is fully defined by the existing .smi files in the project, hence I think of it as non-essential, ephemeral data that can go into .gitignore. I don’t ever have to look at these files, I just rely on the fact that they can build main along with all its dependencies.

Now, if we add a C-based shared library into our development mix, we have something like the following:

all0: mt19937-64.so Makefile.smlsharp all

mt19937-64.so: mt19937-64.c
	gcc -shared -fPIC $< -o $@

mt19937-64.c:
	curl -s -O http://www.math.sci.hiroshima-u.ac.jp/m-mat/MT/VERSIONS/C-LANG/mt19937-64.c

Makefile.smlsharp: $(shell find . | grep '\.smi$$')
	smlsharp -MMm main.smi > $@

include Makefile.smlsharp

But now, our main file will fail to link to mt19937-64.so. Why? Because we need to hand-edit the autogenerated Makefile and change the line

LIBS =

to

LIBS = mt19937-64.so

This requirement effectively changes the status of the autogenerated Makefile from ephemeral and easily-recreated to something non-ephemeral, requiring frequent manual modification.

Since I honestly believe we should be able to treat the autogenerated Makefiles as throwaway artifacts, I’ve made a pull request to SML# that, if accepted, will allow us to define LIBS in the environment, and have the autogenerated Makefile pick it up.

This means that we can put the line

LIBS=mt19937-64.so

in our top-level Makefile and never have to hand-edit the output of smlsharp -MMm.

3. The Pointer structure

All of the examples that marshal data into SML# use the interactive interpreter rather than standalone compilation. So when the time comes to compile your binary with quicksort, you run into this error message:

  (name evaluation "190") unbound variable: Pointer.load

Using the very same files in an interactive session works, but compilation fails. Poring over the example code in the chapter gives no clue.

It turns out that you need to _require "ffi.smi" to get access to the Pointer structure. It helps to have the SML# source checked out locally, as the example code is very helpful in figuring out small things like this.

I feel that the lack of _require "ffi.smi" in the chapter text is an omission.

4. The language of import declarations

This gripe is perhaps very minor, but it’s details like this that make some languages much harder to learn than others. In Standard ML, and by extension in SML#, there are two ‘languages’ to master. One is the type-level language, the other is the value-level language. They are similar but not the same.

For example, the type-level definition of a function that takes a 2-tuple and returns unit is:

(* Type language *)
val someFun : 'a * 'b ref -> unit

Now, the value-level implementation of this has a slightly different syntax:

(* Value language *)
fun someFun (a,b) = ()

Note the differences:

1) The type-level tuple constructor is _ * _. The value-level tuple constructor is ( _ , _ ).

2) The type-level syntax for a type variable is 'a. The value-level variable is a.

3) The name of the unit type is unit. The value for the unit type is ().

On top of this confusion, SML# throws in another language: the language describing the types of imported FFI functions. For the function above, it would be something like:

val c_someFun = _import "someFun" : ('a, 'b ptr) -> ()

This language seems to combine both the value-level and type-level languages. I know there must have been a good reason for introducing it, but I feel having fresh learners learn yet another subtle “DSL” is an educational barrier.

For what it’s worth, Haskell does a better job of making the type-level and value-level languages resemble each other:

-- type language
someFun :: (a, b) -> ()

-- value language
someFun (a,b) = ()

Summing up

This chapter was very interesting to read and implement. For someone who is not used to working with C, there are some gotchas that aren’t clearly marked in the book, and I had to figure things out myself.

As with all things related to binary interoperability, it feels like a lot of the designs were informed by real-world constraints and tastes of the authors.

All in all, it feels like developing an SML# wrapper around C libraries should be quite straightforward. This bodes well for future chapters, which involve graphics programming with Cairo and the like. I’m looking forward to what comes next.