Special Types

Understanding the representation of data in memory is very important. Usually, the representation can be understood from the definition of a datatype. Each constructor corresponds to an object in memory that has a header that includes a tag and a reference count. The constructor's arguments are each represented by a pointer to some other object. In other words, List really is a linked list and extracting a field from a structure really does just chase a pointer.

There are, however, some important exceptions to this rule. A number of types are treated specially by the compiler. For example, the type UInt32 is defined as Fin (2 ^ 32), but it is replaced at run-time with an actual native implementation based on machine words. Similarly, even though the definition of Nat suggests an implementation similar to List Unit, the actual run-time representation uses immediate machine words for sufficiently-small numbers and an efficient arbitrary-precision arithmetic library for larger numbers. The Lean compiler translates from definitions that use pattern matching into the appropriate operations for this representation, and calls to operations like addition and subtraction are mapped to fast operations from the underlying arithmetic library. After all, addition should not take time linear in the size of the addends.

The fact that some types have special representations also means that care is needed when working with them. Most of these types consist of a structure that is treated specially by the compiler. With these structures, using the constructor or the field accessors directly can trigger an expensive conversion from an efficient representation to a slow one that is convenient for proofs. For example, String is defined as a structure that contains a list of characters, but the run-time representation of strings uses UTF-8, not linked lists of pointers to characters. Applying the constructor to a list of characters creates a byte array that encodes them in UTF-8, and accessing the field of the structure takes time linear in the length of the string to decode the UTF-8 representation and allocate a linked list. Arrays are represented similarly. From the logical perspective, arrays are structures that contain a list of array elements, but the run-time representation is a dynamically-sized array. At run time, the constructor translates the list into an array, and the field accessor allocates a linked list from the array. The various array operations are replaced with efficient versions by the compiler that mutate the array when possible instead of allocating a new one.

Both types themselves and proofs of propositions are completely erased from compiled code. In other words, they take up no space, and any computations that might have been performed as part of a proof are similarly erased. This means that proofs can take advantage of the convenient interface to strings and arrays as inductively-defined lists, including using induction to prove things about them, without imposing slow conversion steps while the program is running. For these built-in types, a convenient logical representation of the data does not imply that the program must be slow.

If a structure type has only a single non-type non-proof field, then the constructor itself disappears at run time, being replaced with its single argument. In other words, a subtype is represented identically to its underlying type, rather than with an extra layer of indirection. Similarly, Fin is just Nat in memory, and single-field structures can be created to keep track of different uses of Nats or Strings without paying a performance penalty. If a constructor has no non-type non-proof arguments, then the constructor also disappears and is replaced with a constant value where the pointer would otherwise be used. This means that true, false, and none are constant values, rather than pointers to heap-allocated objects.

The following types have special representations:

Type	Logical representation	Run-time Representation
`Nat`	Unary, with one pointer from each `Nat.succ`	Efficient arbitrary-precision integers
`Int`	A sum type with constructors for positive or negative values, each containing a `Nat`	Efficient arbitrary-precision integers
`UInt8`, `UInt16`, `UInt32`, `UInt64`	A `Fin` with an appropriate bound	Fixed-precision machine integers
`Char`	A `UInt32` paired with a proof that it's a valid code point	Ordinary characters
`String`	A structure that contains a `List Char` in a field called `data`	UTF-8-encoded string
`Array α`	A structure that contains a `List α` in a field called `data`	Packed arrays of pointers to `α` values
`Sort u`	A type	Erased completely
Proofs of propositions	Whatever data is suggested by the proposition when considered as a type of evidence	Erased completely

Exercise

The definition of Pos does not take advantage of Lean's compilation of Nat to an efficient type. At run time, it is essentially a linked list. Alternatively, a subtype can be defined that allows Lean's fast Nat type to be used internally, as described in the initial section on subtypes. At run time, the proof will be erased. Because the resulting structure has only a single data field, it is represented as that field, which means that this new representation of Pos is identical to that of Nat.

After proving the theorem ∀ {n k : Nat}, n ≠ 0 → k ≠ 0 → n + k ≠ 0, define instances of ToString, and Add for this new representation of Pos. Then, define an instance of Mul, proving any necessary theorems along the way.

Functional Programming in Lean

Special Types

Exercise