How I learned to never match on os:cmd output

Today I learned…
2019-05-14

A late change in requirements from a customer had me scrambling to replace an HDFS connector script, originally a Python program, with the standard Hadoop tool hdfs.

The application that was launching the connector script was written in Erlang, and was responsible for uploading some files to an HDFS endpoint, like so:

UploadCmd = lists:flatten(io_lib:format("hdfs dfs -put ~p ~p", [Here, There])),
"" = os:cmd(UploadCmd),

This was all fine and dandy when the UploadCmd was implemented in full by me. When I switched out the Python script for the hdfs command, all my tests continued to pass, and the data was indeed being written successfully to my local test HDFS node. So off to production it went.

Several hours later I got notified that there were some problems with the new code. After inspecting the logs, it became clear that the hdfs command was producing unexpected output (WARN: blah blah took longer than expected (..)), which caused the Erlang program to treat the upload operation as failed.

As is the case for reasonable Erlang applications, the writing process would crash upon a failed match, then restart and attempt to continue where it left off — by trying to upload Here to There. Now, this operation kept legitimately failing, because it had in fact succeeded the first time, and HDFS would not allow us to overwrite There (unless we added a -f flag to put).

The solution

The quick-and-dirty solution was to wrap the UploadCmd in a script that captured the exit code, and then printed it out at the end, like so:

sh -c '{UploadCmd}; RES=$?; echo; echo $RES'

Now, your Erlang code can match on the last line of the output and interpret it as an integer exit code. Not the most elegant of solutions, but elegant enough to work around os:cmd/1’s blindness to exit codes.
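
The Erlang side of that wrapper can be sketched roughly as follows. This is a minimal illustration, not the code that actually shipped: the module name cmd_exit and the functions run/1 and parse/1 are made up for this example, and it assumes the command contains no single quotes (it is embedded in sh -c '...') and does not call exit itself.

```erlang
-module(cmd_exit).
-export([run/1, parse/1]).

%% Hypothetical helper: wrap Cmd so that its exit code is printed on the
%% last line of the output, then return {ExitCode, Output}.
%% Assumes Cmd contains no single quotes.
run(Cmd) ->
    Wrapped = lists:flatten(
                io_lib:format("sh -c '~s; RES=$?; echo; echo $RES'", [Cmd])),
    parse(os:cmd(Wrapped)).

%% Split the raw os:cmd/1 output into the exit code (last line) and
%% everything the command printed before it.
parse(Raw) ->
    %% Drop the newline that follows the exit code.
    Trimmed = string:trim(Raw, trailing, "\n"),
    {Output, CodeStr} =
        case string:split(Trimmed, "\n", trailing) of
            [Before, Last] -> {Before, Last};
            [Last] -> {"", Last}
        end,
    {list_to_integer(CodeStr), Output}.
```

With a helper like this, the upload can match on the exit code alone, for example {0, _Output} = cmd_exit:run(UploadCmd), so a stray WARN line no longer crashes the process. The bare echo in the wrapper guarantees the exit code starts on its own line even when the command's output does not end with a newline.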

Lesson learned

The UNIX way states that programs should be silent on success and vocal on error. Sadly, many applications don’t follow the UNIX way, and the bigger the application at hand, the higher the probability that one of its dependencies will use STDOUT or STDERR as its own personal scratchpad.

My lesson: never rely on os:cmd/1 output in production code, unless the command you’re running is fully under your control, and you can be certain that its outputs are completely and exhaustively specified by you.
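
One way to avoid matching on output altogether is to run the command through a port with the exit_status option, which delivers the exit code as a separate message. The sketch below is only an illustration of that idea, not what the original application did; the module name port_cmd and its functions are hypothetical, and it assumes the calling process receives no other port messages while it waits.

```erlang
-module(port_cmd).
-export([run/1]).

%% Hypothetical helper: run Cmd through a port and return
%% {ExitCode, Output}, where Output is a binary holding whatever the
%% command wrote to stdout and stderr.
run(Cmd) ->
    Port = open_port({spawn, Cmd}, [exit_status, stderr_to_stdout, binary]),
    collect(Port, []).

%% Accumulate output chunks until the port reports the exit status.
collect(Port, Acc) ->
    receive
        {Port, {data, Data}} ->
            collect(Port, [Data | Acc]);
        {Port, {exit_status, Status}} ->
            {Status, iolist_to_binary(lists:reverse(Acc))}
    end.
```

The caller can then decide success from the status, for example {0, _Output} = port_cmd:run(UploadCmd), and log the output instead of matching on it.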

I do heavily rely on os:cmd output in test code, and I have no intention of stopping. Early feedback about unexpected output is great in tests.