Worked Example: cat

The standard Unix utility cat takes a number of command-line options, followed by zero or more input files. If no files are provided, or if one of them is a dash (-), then it takes the standard input as the corresponding input instead of reading a file. The contents of the inputs are written, one after the other, to the standard output. If a specified input file does not exist, this is noted on standard error, but cat continues concatenating the remaining inputs. A non-zero exit code is returned if any of the input files do not exist.

This section describes a simplified version of cat, called feline. Unlike commonly-used versions of cat, feline has no command-line options for features such as numbering lines, indicating non-printing characters, or displaying help text. Furthermore, it cannot read more than once from a standard input that's associated with a terminal device.

To get the most benefit from this section, follow along yourself. It's OK to copy-paste the code examples, but it's even better to type them in by hand. This makes it easier to learn the mechanical process of typing in code, recovering from mistakes, and interpreting feedback from the compiler.

Getting started

The first step in implementing feline is to create a package and decide how to organize the code. In this case, because the program is so simple, all the code will be placed in Main.lean. The first step is to run lake new feline. Edit the Lakefile to remove the library, and delete the generated library code and the reference to it from Main.lean. Once this has been done, lakefile.lean should contain:

import Lake
open Lake DSL

package «feline» {
  -- add package configuration options here
}

@[default_target]
lean_exe «feline» {
  root := `Main
}

and Main.lean should contain something like:

def main : IO Unit :=
  IO.println s!"Hello, cats!"

Alternatively, running lake new feline exe instructs lake to use a template that does not include a library section, making it unnecessary to edit the file.

Ensure that the code can be built by running lake build.

Concatenating Streams

Now that the basic skeleton of the program has been built, it's time to actually enter the code. A proper implementation of cat can be used with infinite IO streams, such as /dev/random, which means that it can't read its input into memory before outputting it. Furthermore, it should not work one character at a time, as this leads to frustratingly slow performance. Instead, it's better to read contiguous blocks of data all at once, directing the data to the standard output one block at a time.

The first step is to decide how big of a block to read. For the sake of simplicity, this implementation uses a conservative 20 kilobyte block. USize is analogous to size_t in C—it's an unsigned integer type that is big enough to represent all valid array sizes.

def bufsize : USize := 20 * 1024

Streams

The main work of feline is done by dump, which reads input one block at a time, dumping the result to standard output, until the end of the input has been reached:

partial def dump (stream : IO.FS.Stream) : IO Unit := do
  let buf ← stream.read bufsize
  if buf.isEmpty then
    pure ()
  else
    let stdout ← IO.getStdout
    stdout.write buf
    dump stream

The dump function is declared partial, because it calls itself recursively on input that is not immediately smaller than an argument. When a function is declared to be partial, Lean does not require a proof that it terminates. On the other hand, partial functions are also much less amenable to proofs of correctness, because allowing infinite loops in Lean's logic would make it unsound. However, there is no way to prove that dump terminates, because infinite input (such as from /dev/random) would mean that it does not, in fact, terminate. In cases like this, there is no alternative to declaring the function partial.

The type IO.FS.Stream represents a POSIX stream. Behind the scenes, it is represented as a structure that has one field for each POSIX stream operation. Each operation is represented as an IO action that provides the corresponding operation:

structure Stream where
  flush   : IO Unit
  read    : USize → IO ByteArray
  write   : ByteArray → IO Unit
  getLine : IO String
  putStr  : String → IO Unit

The Lean compiler contains IO actions (such as IO.getStdout, which is called in dump) to get streams that represent standard input, standard output, and standard error. These are IO actions rather than ordinary definitions because Lean allows these standard POSIX streams to be replaced in a process, which makes it easier to do things like capturing the output from a program into a string by writing a custom IO.FS.Stream.

The control flow in dump is essentially a while loop. When dump is called, if the stream has reached the end of the file, pure () terminates the function by returning the constructor for Unit. If the stream has not yet reached the end of the file, one block is read, and its contents are written to stdout, after which dump calls itself directly. The recursive calls continue until stream.read returns an empty byte array, which indicates that the end of the file has been reached.

When an if expression occurs as a statement in a do, as in dump, each branch of the if is implicitly provided with a do. In other words, the sequence of steps following the else are treated as a sequence of IO actions to be executed, just as if they had a do at the beginning. Names introduced with let in the branches of the if are visible only in their own branches, and are not in scope outside of the if.

There is no danger of running out of stack space while calling dump because the recursive call happens as the very last step in the function, and its result is returned directly rather than being manipulated or computed with. This kind of recursion is called tail recursion, and it is described in more detail later in this book. Because the compiled code does not need to retain any state, the Lean compiler can compile the recursive call to a jump.

If feline only redirected standard input to standard output, then dump would be sufficient. However, it also needs to be able to open files that are provided as command-line arguments and emit their contents. When its argument is the name of a file that exists, fileStream returns a stream that reads the file's contents. When the argument is not a file, fileStream emits an error and returns none.

def fileStream (filename : System.FilePath) : IO (Option IO.FS.Stream) := do
  let fileExists ← filename.pathExists
  if not fileExists then
    let stderr ← IO.getStderr
    stderr.putStrLn s!"File not found: {filename}"
    pure none
  else
    let handle ← IO.FS.Handle.mk filename IO.FS.Mode.read
    pure (some (IO.FS.Stream.ofHandle handle))

Opening a file as a stream takes two steps. First, a file handle is created by opening the file in read mode. A Lean file handle tracks an underlying file descriptor. When there are no references to the file handle value, a finalizer closes the file descriptor. Second, the file handle is given the same interface as a POSIX stream using IO.FS.Stream.ofHandle, which fills each field of the Stream structure with the corresponding IO action that works on file handles.

Handling Input

The main loop of feline is another tail-recursive function, called process. In order to return a non-zero exit code if any of the inputs could not be read, process takes an argument exitCode that represents the current exit code for the whole program. Additionally, it takes a list of input files to be processed.

def process (exitCode : UInt32) (args : List String) : IO UInt32 := do
  match args with
  | [] => pure exitCode
  | "-" :: args =>
    let stdin ← IO.getStdin
    dump stdin
    process exitCode args
  | filename :: args =>
    let stream ← fileStream ⟨filename⟩
    match stream with
    | none =>
      process 1 args
    | some stream =>
      dump stream
      process exitCode args

Just as with if, each branch of a match that is used as a statement in a do is implicitly provided with its own do.

There are three possibilities. One is that no more files remain to be processed, in which case process returns the error code unchanged. Another is that the specified filename is "-", in which case process dumps the contents of the standard input and then processes the remaining filenames. The final possibility is that an actual filename was specified. In this case, fileStream is used to attempt to open the file as a POSIX stream. Its argument is encased in ⟨ ... ⟩ because a FilePath is a single-field structure that contains a string. If the file could not be opened, it is skipped, and the recursive call to process sets the exit code to 1. If it could, then it is dumped, and the recursive call to process leaves the exit code unchanged.

process does not need to be marked partial because it is structurally recursive. Each recursive call is provided with the tail of the input list, and all Lean lists are finite. Thus, process does not introduce any non-termination.

Main

The final step is to write the main action. Unlike prior examples, main in feline is a function. In Lean, main can have one of three types:

  • main : IO Unit corresponds to programs that cannot read their command-line arguments and always indicate success with an exit code of 0,
  • main : IO UInt32 corresponds to int main(void) in C, for programs without arguments that return exit codes, and
  • main : List String → IO UInt32 corresponds to int main(int argc, char **argv) in C, for programs that take arguments and signal success or failure.

If no arguments were provided, feline should read from standard input as if it were called with a single "-" argument. Otherwise, the arguments should be processed one after the other.

def main (args : List String) : IO UInt32 :=
  match args with
  | [] => process 0 ["-"]
  | _ =>  process 0 args

Meow!

To check whether feline works, the first step is to build it with lake build. First off, when called without arguments, it should emit what it receives from standard input. Check that

echo "It works!" | ./build/bin/feline

emits It works!.

Secondly, when called with files as arguments, it should print them. If the file test1.txt contains

It's time to find a warm spot

and test2.txt contains

and curl up!

then the command

./build/bin/feline test1.txt test2.txt

should emit

It's time to find a warm spot
and curl up!

Finally, the - argument should be handled appropriately.

echo "and purr" | ./build/bin/feline test1.txt - test2.txt

should yield

It's time to find a warm spot
and purr
and curl up!

Exercise

Extend feline with support for usage information. The extended version should accept a command-line argument --help that causes documentation about the available command-line options to be written to standard output.