Worked Example: Typed Queries
Indexed families are very useful when building an API that is supposed to resemble some other language. They can be used to write a library of HTML constructors that don't permit generating invalid HTML, to encode the specific rules of a configuration file format, or to model complicated business constraints. This section describes an encoding of a subset of relational algebra in Lean using indexed families, as a simpler demonstration of techniques that can be used to build a more powerful database query language.
This subset uses the type system to enforce requirements such as disjointness of field names, and it uses type-level computation to reflect the schema into the types of values that are returned from a query. It is not a realistic system, however—databases are represented as linked lists of linked lists, the type system is much simpler than that of SQL, and the operators of relational algebra don't really match those of SQL. However, it is large enough to demonstrate useful principles and techniques.
A Universe of Data
In this relational algebra, the base data that can be held in columns have the types Int, String, and Bool, and they are described by the universe DBType:
inductive DBType where
| int | string | bool
abbrev DBType.asType : DBType → Type
| .int => Int
| .string => String
| .bool => Bool
Using asType allows these codes to be used for types. For example:
#eval ("Mount Hood" : DBType.string.asType)
"Mount Hood"
It is possible to compare the values described by any of the three database types for equality. Explaining this to Lean, however, requires a bit of work. Simply using BEq directly fails:
def DBType.beq (t : DBType) (x y : t.asType) : Bool :=
x == y
failed to synthesize instance
BEq (asType t)
Just as in the nested pairs universe, type class search doesn't automatically check each possibility for t's value. The solution is to use pattern matching to refine the types of x and y:
def DBType.beq (t : DBType) (x y : t.asType) : Bool :=
match t with
| .int => x == y
| .string => x == y
| .bool => x == y
In this version of the function, x and y have types Int, String, and Bool in the three respective cases, and these types all have BEq instances. The definition of DBType.beq can be used to define a BEq instance for the types that are coded for by DBType:
instance {t : DBType} : BEq t.asType where
beq := t.beq
This is not the same as an instance for the codes themselves:
instance : BEq DBType where
beq
| .int, .int => true
| .string, .string => true
| .bool, .bool => true
| _, _ => false
The former instance allows comparison of values drawn from the types described by the codes, while the latter allows comparison of the codes themselves.
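For example, the two instances can be exercised directly. These checks are illustrative additions rather than part of the original development; the first comparison works on decoded values, while the second compares the codes themselves:
#eval (3 : DBType.int.asType) == 3
true
#eval DBType.int == DBType.bool
false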
A Repr instance can be written using the same technique. The method of the Repr class is called reprPrec because it is designed to take things like operator precedence into account when displaying values. Refining the type through dependent pattern matching allows the reprPrec methods from the Repr instances for Int, String, and Bool to be used:
instance {t : DBType} : Repr t.asType where
reprPrec :=
match t with
| .int => reprPrec
| .string => reprPrec
| .bool => reprPrec
Schemas and Tables
A schema describes the name and type of each column in a database:
structure Column where
name : String
contains : DBType
abbrev Schema := List Column
In fact, a schema can be seen as a universe that describes rows in a table. The empty schema describes the unit type, a schema with a single column describes that value on its own, and a schema with at least two columns is represented by a tuple:
abbrev Row : Schema → Type
| [] => Unit
| [col] => col.contains.asType
| col1 :: col2 :: cols => col1.contains.asType × Row (col2::cols)
As described in the initial section on product types, Lean's product type and tuples are right-associative. This means that nested pairs are equivalent to ordinary flat tuples.
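For instance, these checks (illustrative additions, using the arbitrary column names "x" and "y") show how Row computes familiar types from small schemas:
example : Row [] := ()
example : Row [⟨"x", .int⟩] := 3
example : Row [⟨"x", .int⟩, ⟨"y", .bool⟩] := (3, true)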
A table is a list of rows that share a schema:
abbrev Table (s : Schema) := List (Row s)
For example, a diary of visits to mountain peaks can be represented with the schema peak:
abbrev peak : Schema := [
⟨"name", DBType.string⟩,
⟨"location", DBType.string⟩,
⟨"elevation", DBType.int⟩,
⟨"lastVisited", .int⟩
]
A selection of peaks visited by the author of this book appears as an ordinary list of tuples:
def mountainDiary : Table peak := [
("Mount Nebo", "USA", 3637, 2013),
("Moscow Mountain", "USA", 1519, 2015),
("Himmelbjerget", "Denmark", 147, 2004),
("Mount St. Helens", "USA", 2549, 2010)
]
Another example consists of waterfalls and a diary of visits to them:
abbrev waterfall : Schema := [
⟨"name", .string⟩,
⟨"location", .string⟩,
⟨"lastVisited", .int⟩
]
def waterfallDiary : Table waterfall := [
("Multnomah Falls", "USA", 2018),
("Shoshone Falls", "USA", 2014)
]
Recursion and Universes, Revisited
The convenient structuring of rows as tuples comes at a cost: the fact that Row treats its two base cases separately means that functions that use Row in their types and are defined recursively over the codes (that is, the schema) need to make the same distinctions. One example of a case where this matters is an equality check that uses recursion over the schema to check rows for equality. This example does not pass Lean's type checker:
def Row.bEq (r1 r2 : Row s) : Bool :=
match s with
| [] => true
| col::cols =>
match r1, r2 with
| (v1, r1'), (v2, r2') =>
v1 == v2 && bEq r1' r2'
type mismatch
(v1, r1')
has type
?m.6559 × ?m.6562 : Type (max ?u.6571 ?u.6570)
but is expected to have type
Row (col :: cols) : Type
The problem is that the pattern col :: cols does not sufficiently refine the type of the rows. This is because Lean cannot yet tell whether the singleton pattern [col] or the col1 :: col2 :: cols pattern in the definition of Row was matched, so the call to Row does not compute down to a pair type. The solution is to mirror the structure of Row in the definition of Row.bEq:
def Row.bEq (r1 r2 : Row s) : Bool :=
match s with
| [] => true
| [_] => r1 == r2
| _::_::_ =>
match r1, r2 with
| (v1, r1'), (v2, r2') =>
v1 == v2 && bEq r1' r2'
instance : BEq (Row s) where
beq := Row.bEq
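For instance, two rows that follow the peak schema can now be compared with ==. This small check is an illustrative addition:
#eval (("Moscow Mountain", "USA", 1519, 2015) : Row peak) == ("Moscow Mountain", "USA", 1519, 2015)
true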
Unlike in other contexts, functions that occur in types cannot be considered only in terms of their input/output behavior. Programs that use these types will find themselves forced to mirror the algorithm used in the type-level function so that their structure matches the pattern-matching and recursive behavior of the type. A big part of the skill of programming with dependent types is the selection of appropriate type-level functions with the right computational behavior.
Column Pointers
Some queries only make sense if a schema contains a particular column. For example, a query that returns mountains with an elevation greater than 1000 meters only makes sense in the context of a schema with an "elevation" column that contains integers. One way to indicate that a column is contained in a schema is to provide a pointer directly to it, and defining the pointer as an indexed family makes it possible to rule out invalid pointers. There are two ways that a column can be present in a schema: either it is at the beginning of the schema, or it is somewhere later in the schema. If it is later in the schema, then it will eventually be at the beginning of some tail of the schema. The indexed family HasCol is a translation of this specification into Lean code:
inductive HasCol : Schema → String → DBType → Type where
| here : HasCol (⟨name, t⟩ :: _) name t
| there : HasCol s name t → HasCol (_ :: s) name t
The family's three arguments are the schema, the column name, and its type. All three are indices, but re-ordering the arguments to place the schema after the column name and type would allow the name and type to be parameters. The constructor here can be used when the schema begins with the column ⟨name, t⟩; it is thus a pointer to the first column in the schema that can only be used when the first column has the desired name and type. The constructor there transforms a pointer into a smaller schema into a pointer into a schema with one more column on it. Because "elevation" is the third column in peak, it can be found by looking past the first two columns with there, after which it is the first column. In other words, to satisfy the type HasCol peak "elevation" .int, use the expression .there (.there .here). One way to think about HasCol is as a kind of decorated Nat: zero corresponds to here, and succ corresponds to there. The extra type information makes it impossible to have off-by-one errors.
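Spelled out as a standalone check (an illustrative addition), this pointer looks like the following:
example : HasCol peak "elevation" .int := .there (.there .here)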
A pointer to a particular column in a schema can be used to extract that column's value from a row:
def Row.get (row : Row s) (col : HasCol s n t) : t.asType :=
match s, col, row with
| [_], .here, v => v
| _::_::_, .here, (v, _) => v
| _::_::_, .there next, (_, r) => get r next
The first step is to pattern match on the schema, because this determines whether the row is a tuple or a single value.
No case is needed for the empty schema, because a HasCol value is available and both constructors of HasCol specify non-empty schemas. If the schema has just a single column, then the pointer must point to it, so only the here constructor of HasCol need be matched. If the schema has two or more columns, then there must be a case for here, in which the value is the first one in the row, and one for there, in which a recursive call is used. Because the HasCol type guarantees that the column exists in the row, Row.get does not need to return an Option.
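For example, the elevation of a row that follows the peak schema can be extracted by writing the pointer by hand. This check is an illustrative addition:
#eval Row.get (("Mount Nebo", "USA", 3637, 2013) : Row peak) (.there (.there .here))
3637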
HasCol plays two roles:
- It serves as evidence that a column with a particular name and type exists in a schema.
- It serves as data that can be used to find the value associated with the column in a row.
The first role, that of evidence, is similar to the way that propositions are used. The definition of the indexed family HasCol can be read as a specification of what counts as evidence that a given column exists. Unlike propositions, however, it matters which constructor of HasCol was used. In the second role, the constructors are used like Nats to find data in a collection. Programming with indexed families often requires the ability to switch fluently between both perspectives.
Subschemas
One important operation in relational algebra is to project a table or row into a smaller schema.
Every column not present in the smaller schema is forgotten.
In order for projection to make sense, the smaller schema must be a subschema of the larger schema, which means that every column in the smaller schema must be present in the larger schema.
Just as HasCol makes it possible to write a single-column lookup in a row that cannot fail, a representation of the subschema relationship as an indexed family makes it possible to write a projection function that cannot fail.
The ways in which one schema can be a subschema of another can be defined as an indexed family.
The basic idea is that a smaller schema is a subschema of a bigger schema if every column in the smaller schema occurs in the bigger schema.
If the smaller schema is empty, then it's certainly a subschema of the bigger schema, represented by the constructor nil. If the smaller schema has a column, then that column must be in the bigger schema, and all the rest of the columns in the subschema must also be a subschema of the bigger schema. This is represented by the constructor cons.
inductive Subschema : Schema → Schema → Type where
| nil : Subschema [] bigger
| cons :
HasCol bigger n t →
Subschema smaller bigger →
Subschema (⟨n, t⟩ :: smaller) bigger
In other words, Subschema assigns each column of the smaller schema a HasCol that points to its location in the larger schema. The schema travelDiary represents the fields that are common to both peak and waterfall:
abbrev travelDiary : Schema :=
[⟨"name", .string⟩, ⟨"location", .string⟩, ⟨"lastVisited", .int⟩]
It is certainly a subschema of peak, as shown by this example:
example : Subschema travelDiary peak :=
.cons .here
(.cons (.there .here)
(.cons (.there (.there (.there .here))) .nil))
However, code like this is difficult to read and difficult to maintain. One way to improve it is to instruct Lean to write the Subschema and HasCol constructors automatically. This can be done using the tactic feature that was introduced in the Interlude on propositions and proofs. That interlude uses by simp to provide evidence of various propositions. In this context, two tactics are useful:
- The constructor tactic instructs Lean to solve the problem using the constructor of a datatype.
- The repeat tactic instructs Lean to repeat a tactic over and over until it either fails or the proof is finished.
In the next example, by constructor has the same effect as just writing .nil would have:
example : Subschema [] peak := by constructor
However, attempting that same tactic with a slightly more complicated type fails:
example : Subschema [⟨"location", .string⟩] peak := by constructor
unsolved goals
case a
⊢ HasCol peak "location" DBType.string
case a
⊢ Subschema [] peak
Errors that begin with unsolved goals describe tactics that failed to completely build the expressions that they were supposed to. In Lean's tactic language, a goal is a type that a tactic is to fulfill by constructing an appropriate expression behind the scenes. In this case, constructor caused Subschema.cons to be applied, and the two goals represent the two arguments expected by cons. Adding another instance of constructor causes the first goal (HasCol peak "location" DBType.string) to be addressed with HasCol.there, because peak's first column is not "location":
example : Subschema [⟨"location", .string⟩] peak := by
constructor
constructor
unsolved goals
case a.a
⊢ HasCol
[{ name := "location", contains := DBType.string }, { name := "elevation", contains := DBType.int },
{ name := "lastVisited", contains := DBType.int }]
"location" DBType.string
case a
⊢ Subschema [] peak
However, adding a third constructor results in the first goal being solved, because HasCol.here is applicable:
example : Subschema [⟨"location", .string⟩] peak := by
constructor
constructor
constructor
unsolved goals
case a
⊢ Subschema [] peak
A fourth instance of constructor solves the Subschema [] peak goal:
example : Subschema [⟨"location", .string⟩] peak := by
constructor
constructor
constructor
constructor
Indeed, a version written without the use of tactics has four constructors:
example : Subschema [⟨"location", .string⟩] peak :=
.cons (.there .here) .nil
Instead of experimenting to find the right number of times to write constructor, the repeat tactic can be used to ask Lean to just keep trying constructor as long as it keeps making progress:
example : Subschema [⟨"location", .string⟩] peak := by repeat constructor
This more flexible version also works for more interesting Subschema problems:
example : Subschema travelDiary peak := by repeat constructor
example : Subschema travelDiary waterfall := by repeat constructor
The approach of blindly trying constructors until something works is not very useful for types like Nat or List Bool. Just because an expression has type Nat doesn't mean that it's the correct Nat, after all. But types like HasCol and Subschema are sufficiently constrained by their indices that only one constructor will ever be applicable, which means that the contents of the program itself are less interesting, and a computer can pick the correct one.
If one schema is a subschema of another, then it is also a subschema of the larger schema extended with an additional column. This fact can be captured as a function definition. Subschema.addColumn takes evidence that smaller is a subschema of bigger, and then returns evidence that smaller is a subschema of c :: bigger, that is, bigger with one additional column:
def Subschema.addColumn (sub : Subschema smaller bigger) : Subschema smaller (c :: bigger) :=
match sub with
| .nil => .nil
| .cons col sub' => .cons (.there col) sub'.addColumn
A subschema describes where to find each column from the smaller schema in the larger schema. Subschema.addColumn must translate these descriptions from the original larger schema into the extended larger schema. In the nil case, the smaller schema is [], and nil is also evidence that [] is a subschema of c :: bigger. In the cons case, which describes how to place one column from smaller into bigger, the placement of the column needs to be adjusted with there to account for the new column c, and a recursive call adjusts the rest of the columns.
Another way to think about Subschema is that it defines a relation between two schemas: the existence of an expression with type Subschema smaller bigger means that (smaller, bigger) is in the relation. This relation is reflexive, meaning that every schema is a subschema of itself:
def Subschema.reflexive : (s : Schema) → Subschema s s
| [] => .nil
| _ :: cs => .cons .here (reflexive cs).addColumn
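For instance, applying Subschema.reflexive to a concrete schema yields evidence that it is a subschema of itself. This small check is an illustrative addition:
example : Subschema peak peak := Subschema.reflexive peak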
Projecting Rows
Given evidence that s' is a subschema of s, a row in s can be projected into a row in s'. This is done using the evidence that s' is a subschema of s, which explains where each column of s' is found in s. The new row in s' is built up one column at a time by retrieving the value from the appropriate place in the old row. The function that performs this projection, Row.project, has three cases, one for each case of Row itself. It uses Row.get together with each HasCol in the Subschema argument to construct the projected row:
def Row.project (row : Row s) : (s' : Schema) → Subschema s' s → Row s'
| [], .nil => ()
| [_], .cons c .nil => row.get c
| _::_::_, .cons c cs => (row.get c, row.project _ cs)
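For example, projecting a row that follows the peak schema into travelDiary drops the elevation column. This check is an illustrative addition that reuses by repeat constructor to build the subschema evidence:
#eval Row.project (("Mount Nebo", "USA", 3637, 2013) : Row peak) travelDiary (by repeat constructor)
("Mount Nebo", "USA", 2013)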
Conditions and Selection
Projection removes unwanted columns from a table, but queries must also be able to remove unwanted rows. This operation is called selection. Selection relies on having a means of expressing which rows are desired.
The example query language contains expressions, which are analogous to what can be written in a WHERE clause in SQL. Expressions are represented by the indexed family DBExpr. Because expressions can refer to columns from the database, but different sub-expressions all have the same schema, DBExpr takes the database schema as a parameter. Additionally, each expression has a type, and these vary, making it an index:
inductive DBExpr (s : Schema) : DBType → Type where
| col (n : String) (loc : HasCol s n t) : DBExpr s t
| eq (e1 e2 : DBExpr s t) : DBExpr s .bool
| lt (e1 e2 : DBExpr s .int) : DBExpr s .bool
| and (e1 e2 : DBExpr s .bool) : DBExpr s .bool
| const : t.asType → DBExpr s t
The col constructor represents a reference to a column in the database. The eq constructor compares two expressions for equality, the lt constructor checks whether one is less than the other, the and constructor is Boolean conjunction, and const is a constant value of some type. For example, an expression in peak that checks whether the elevation column is greater than 1000 and the location is "Denmark" can be written:
def tallInDenmark : DBExpr peak .bool :=
.and (.lt (.const 1000) (.col "elevation" (by repeat constructor)))
(.eq (.col "location" (by repeat constructor)) (.const "Denmark"))
This is somewhat noisy. In particular, references to columns contain boilerplate calls to by repeat constructor. A Lean feature called macros can help make expressions easier to read by eliminating this boilerplate:
macro "c!" n:term : term => `(DBExpr.col $n (by repeat constructor))
This declaration adds the c! keyword to Lean, and it instructs Lean to replace any instance of c! followed by an expression with the corresponding DBExpr.col construction. Here, term stands for Lean expressions, rather than commands, tactics, or some other part of the language. Lean macros are a bit like C preprocessor macros, except they are better integrated into the language and they automatically avoid some of the pitfalls of CPP. In fact, they are very closely related to macros in Scheme and Racket.
With this macro, the expression can be much easier to read:
def tallInDenmark : DBExpr peak .bool :=
.and (.lt (.const 1000) (c! "elevation"))
(.eq (c! "location") (.const "Denmark"))
Finding the value of an expression with respect to a given row uses Row.get to extract column references, and it delegates to Lean's operations on values for every other expression:
def DBExpr.evaluate (row : Row s) : DBExpr s t → t.asType
| .col _ loc => row.get loc
| .eq e1 e2 => evaluate row e1 == evaluate row e2
| .lt e1 e2 => evaluate row e1 < evaluate row e2
| .and e1 e2 => evaluate row e1 && evaluate row e2
| .const v => v
Evaluating the expression for Valby Bakke, the tallest hill in the Copenhagen area, yields false because Valby Bakke is much less than 1 km above sea level:
#eval tallInDenmark.evaluate ("Valby Bakke", "Denmark", 31, 2023)
false
Evaluating it for a fictional mountain with an elevation of 1230 m yields true:
#eval tallInDenmark.evaluate ("Fictional mountain", "Denmark", 1230, 2023)
true
Evaluating it for the highest peak in the US state of Idaho yields false, as Idaho is not part of Denmark:
#eval tallInDenmark.evaluate ("Mount Borah", "USA", 3859, 1996)
false
Queries
The query language is based on relational algebra. In addition to tables, it includes the following operators:
- The union of two expressions that have the same schema combines the rows that result from two queries
- The difference of two expressions that have the same schema removes rows found in the second result from the rows in the first result
- Selection by some criterion filters the result of a query according to an expression
- Projection into a subschema, removing columns from the result of a query
- Cartesian product, combining every row from one query with every row from another
- Renaming a column in the result of a query, which modifies its schema
- Prefixing all columns in a query with a name
The last operator is not strictly necessary, but it makes the language more convenient to use.
Once again, queries are represented by an indexed family:
inductive Query : Schema → Type where
| table : Table s → Query s
| union : Query s → Query s → Query s
| diff : Query s → Query s → Query s
| select : Query s → DBExpr s .bool → Query s
| project : Query s → (s' : Schema) → Subschema s' s → Query s'
| product :
Query s1 → Query s2 →
disjoint (s1.map Column.name) (s2.map Column.name) →
Query (s1 ++ s2)
| renameColumn :
Query s → (c : HasCol s n t) → (n' : String) → !((s.map Column.name).contains n') →
Query (s.renameColumn c n')
| prefixWith :
(n : String) → Query s →
Query (s.map fun c => {c with name := n ++ "." ++ c.name})
The select constructor requires that the expression used for selection return a Boolean. The product constructor's type contains a call to disjoint, which ensures that the two schemas don't share any names:
def disjoint [BEq α] (xs ys : List α) : Bool :=
not (xs.any ys.contains || ys.any xs.contains)
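For instance, the column names of peak and waterfall are not disjoint, because both schemas contain "name", "location", and "lastVisited". This check is an illustrative addition:
#eval disjoint (peak.map Column.name) (waterfall.map Column.name)
false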
The use of an expression of type Bool where a type is expected triggers a coercion from Bool to Prop. Just as decidable propositions can be considered to be Booleans, where evidence for the proposition is coerced to true and refutations of the proposition are coerced to false, Booleans are coerced into the proposition that states that the expression is equal to true. Because all uses of the library are expected to occur in contexts where the schemas are known ahead of time, this proposition can be proved with by simp.
Similarly, the renameColumn constructor checks that the new name does not already exist in the schema. It uses the helper Schema.renameColumn to change the name of the column pointed to by HasCol:
def Schema.renameColumn : (s : Schema) → HasCol s n t → String → Schema
| c :: cs, .here, n' => {c with name := n'} :: cs
| c :: cs, .there next, n' => c :: renameColumn cs next n'
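As a small illustration (the new name "altitude" is an arbitrary choice for this example), renaming the elevation column of peak leaves the other columns untouched:
example : peak.renameColumn (.there (.there .here)) "altitude" =
    [⟨"name", .string⟩, ⟨"location", .string⟩, ⟨"altitude", .int⟩, ⟨"lastVisited", .int⟩] := rfl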
Executing Queries
Executing queries requires a number of helper functions. The result of a query is a table; this means that each operation in the query language requires a corresponding implementation that works with tables.
Cartesian Product
Taking the Cartesian product of two tables is done by appending each row from the first table to each row from the second.
First off, due to the structure of Row, adding a single column to a row requires pattern matching on its schema in order to determine whether the result will be a bare value or a tuple. Because this is a common operation, factoring the pattern matching out into a helper is convenient:
def addVal (v : c.contains.asType) (row : Row s) : Row (c :: s) :=
match s, row with
| [], () => v
| c' :: cs, v' => (v, v')
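For example, prepending an elevation to a single-column row yields a pair. This check is an illustrative addition:
example : Row [⟨"elevation", .int⟩, ⟨"lastVisited", .int⟩] :=
  addVal 3637 (2013 : Row [⟨"lastVisited", .int⟩])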
Appending two rows is recursive on the structure of both the first schema and the first row, because the structure of the row proceeds in lock-step with the structure of the schema. When the first row is empty, appending returns the second row. When the first row is a singleton, the value is added to the second row. When the first row contains multiple columns, the first column's value is added to the result of recursion on the remainder of the row.
def Row.append (r1 : Row s1) (r2 : Row s2) : Row (s1 ++ s2) :=
match s1, r1 with
| [], () => r2
| [_], v => addVal v r2
| _::_::_, (v, r') => (v, r'.append r2)
List.flatMap applies a function that itself returns a list to every entry in an input list, returning the result of appending the resulting lists in order:
def List.flatMap (f : α → List β) : (xs : List α) → List β
| [] => []
| x :: xs => f x ++ xs.flatMap f
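For instance (an illustrative check), applying a function that returns a two-element list to each entry of a three-element list yields six elements, in order:
#eval [1, 2, 3].flatMap fun x => [x, x * 10]
[1, 10, 2, 20, 3, 30]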
The type signature suggests that List.flatMap could be used to implement a Monad List instance. Indeed, together with pure x := [x], List.flatMap does implement a monad. However, it's not a very useful Monad instance. The List monad is basically a version of Many that explores every possible path through the search space in advance, before users have the chance to request some number of values. Because of this performance trap, it's usually not a good idea to define a Monad instance for List. Here, however, the query language has no operator for restricting the number of results to be returned, so combining all possibilities is exactly what is desired:
def Table.cartesianProduct (table1 : Table s1) (table2 : Table s2) : Table (s1 ++ s2) :=
table1.flatMap fun r1 => table2.map r1.append
Just as with List.product, a loop with mutation in the identity monad can be used as an alternative implementation technique:
def Table.cartesianProduct (table1 : Table s1) (table2 : Table s2) : Table (s1 ++ s2) := Id.run do
let mut out : Table (s1 ++ s2) := []
for r1 in table1 do
for r2 in table2 do
out := (r1.append r2) :: out
pure out.reverse
Difference
Removing undesired rows from a table can be done using List.filter, which takes a list and a function that returns a Bool. A new list is returned that contains only the entries for which the function returns true. For instance,
["Willamette", "Columbia", "Sandy", "Deschutes"].filter (·.length > 8)
evaluates to
["Willamette", "Deschutes"]
because "Columbia" and "Sandy" have lengths less than or equal to 8. Removing the rows of one table from another can be done using the helper List.without:
def List.without [BEq α] (source banned : List α) : List α :=
source.filter fun r => !(banned.contains r)
This will be used with the BEq instance for Row when interpreting queries.
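For example (an illustrative check), removing two of the names from the earlier list leaves the remaining entries in their original order:
#eval ["Willamette", "Columbia", "Sandy", "Deschutes"].without ["Columbia", "Sandy"]
["Willamette", "Deschutes"]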
Renaming Columns
Renaming a column in a row is done with a recursive function that traverses the row until the column in question is found, at which point the column with the new name gets the same value as the column with the old name:
def Row.rename (c : HasCol s n t) (row : Row s) : Row (s.renameColumn c n') :=
match s, row, c with
| [_], v, .here => v
| _::_::_, (v, r), .here => (v, r)
| _::_::_, (v, r), .there next => addVal v (r.rename next)
While this function changes the type of its argument, the actual return value contains precisely the same data as the original argument.
From a run-time perspective, Row.rename is nothing but a slow identity function.
One difficulty in programming with indexed families is that when performance matters, this kind of operation can get in the way.
It takes a very careful, often brittle, design to eliminate these kinds of "re-indexing" functions.
Prefixing Column Names
Adding a prefix to column names is very similar to renaming a column.
Instead of proceeding to a desired column and then returning, prefixRow must process all columns:
def prefixRow (row : Row s) : Row (s.map fun c => {c with name := n ++ "." ++ c.name}) :=
match s, row with
| [], _ => ()
| [_], v => v
| _::_::_, (v, r) => (v, prefixRow r)
This can be used with List.map in order to add a prefix to all rows in a table.
Once again, this function only exists to change the type of a value.
Putting the Pieces Together
With all of these helpers defined, executing a query requires only a short recursive function:
def Query.exec : Query s → Table s
| .table t => t
| .union q1 q2 => exec q1 ++ exec q2
| .diff q1 q2 => exec q1 |>.without (exec q2)
| .select q e => exec q |>.filter e.evaluate
| .project q _ sub => exec q |>.map (·.project _ sub)
| .product q1 q2 _ => exec q1 |>.cartesianProduct (exec q2)
| .renameColumn q c _ _ => exec q |>.map (·.rename c)
| .prefixWith _ q => exec q |>.map prefixRow
Some arguments to the constructors are not used during execution. In particular, both the constructor project and the function Row.project take the smaller schema as an explicit argument, but the type of the evidence that this schema is a subschema of the larger schema contains enough information for Lean to fill out the argument automatically. Similarly, the evidence of disjoint column names that is required by the product constructor is not needed by Table.cartesianProduct.
Generally speaking, dependent types provide many opportunities to have Lean fill out arguments on behalf of the programmer.
Dot notation is used with the results of queries to call functions defined in both the Table and List namespaces, such as List.map, List.filter, and Table.cartesianProduct. This works because Table is defined using abbrev. Just like type class search, dot notation can see through definitions created with abbrev.
The implementation of select is also quite concise. After executing the query q, List.filter is used to remove the rows that do not satisfy the expression. Filter expects a function from Row s to Bool, but DBExpr.evaluate has type Row s → DBExpr s t → t.asType. Because the type of the select constructor requires that the expression have type DBExpr s .bool, t.asType is actually Bool in this context.
A query that finds the heights of all mountain peaks with an elevation greater than 500 meters can be written:
open Query in
def example1 :=
table mountainDiary |>.select
(.lt (.const 500) (c! "elevation")) |>.project
[⟨"elevation", .int⟩] (by repeat constructor)
Executing it returns the expected list of integers:
#eval example1.exec
[3637, 1519, 2549]
To plan a sightseeing tour, it may be relevant to match all pairs of mountains and waterfalls that share a location. This can be done by taking the Cartesian product of both tables, selecting only the rows in which the locations are equal, and then projecting out the names:
open Query in
def example2 :=
let mountain := table mountainDiary |>.prefixWith "mountain"
let waterfall := table waterfallDiary |>.prefixWith "waterfall"
mountain.product waterfall (by simp)
|>.select (.eq (c! "mountain.location") (c! "waterfall.location"))
|>.project [⟨"mountain.name", .string⟩, ⟨"waterfall.name", .string⟩] (by repeat constructor)
Because the example data includes only waterfalls in the USA, executing the query returns pairs of mountains and waterfalls in the US:
#eval example2.exec
[("Mount Nebo", "Multnomah Falls"),
("Mount Nebo", "Shoshone Falls"),
("Moscow Mountain", "Multnomah Falls"),
("Moscow Mountain", "Shoshone Falls"),
("Mount St. Helens", "Multnomah Falls"),
("Mount St. Helens", "Shoshone Falls")]
Errors You May Meet
Many potential errors are ruled out by the definition of Query. For instance, forgetting the added qualifier in "mountain.location" yields a compile-time error that highlights the column reference c! "location":
open Query in
def example2 :=
let mountains := table mountainDiary |>.prefixWith "mountain"
let waterfalls := table waterfallDiary |>.prefixWith "waterfall"
mountains.product waterfalls (by simp)
|>.select (.eq (c! "location") (c! "waterfall.location"))
|>.project [⟨"mountain.name", .string⟩, ⟨"waterfall.name", .string⟩] (by repeat constructor)
This is excellent feedback! On the other hand, the text of the error message is quite difficult to act on:
unsolved goals
case a.a.a.a.a.a.a
mountains : Query (List.map (fun c => { name := "mountain" ++ "." ++ c.name, contains := c.contains }) peak) :=
prefixWith "mountain" (table mountainDiary)
waterfalls : Query (List.map (fun c => { name := "waterfall" ++ "." ++ c.name, contains := c.contains }) waterfall) :=
prefixWith "waterfall" (table waterfallDiary)
⊢ HasCol (List.map (fun c => { name := "waterfall" ++ "." ++ c.name, contains := c.contains }) []) "location" ?m.109970
Similarly, forgetting to add prefixes to the names of the two tables results in an error on by simp, which should provide evidence that the schemas are in fact disjoint:
open Query in
def example2 :=
let mountains := table mountainDiary
let waterfalls := table waterfallDiary
mountains.product waterfalls (by simp)
|>.select (.eq (c! "mountain.location") (c! "waterfall.location"))
|>.project [⟨"mountain.name", .string⟩, ⟨"waterfall.name", .string⟩] (by repeat constructor)
However, the error message is similarly unhelpful:
unsolved goals
mountains : Query peak := table mountainDiary
waterfalls : Query waterfall := table waterfallDiary
⊢ False
Lean's macro system contains everything needed not only to provide a convenient syntax for queries, but also to arrange for the error messages to be helpful. Unfortunately, it is beyond the scope of this book to provide a description of implementing languages with Lean macros. An indexed family such as Query is probably best as the core of a typed database interaction library, rather than its user interface.
Exercises
Dates
Define a structure to represent dates. Add it to the DBType universe, and update the rest of the code accordingly. Provide the extra DBExpr constructors that seem to be necessary.
Nullable Types
Add support for nullable columns to the query language by representing database types with the following structure:
structure NDBType where
underlying : DBType
nullable : Bool
abbrev NDBType.asType (t : NDBType) : Type :=
if t.nullable then
Option t.underlying.asType
else
t.underlying.asType
Use this type in place of DBType in Column and DBExpr, and look up SQL's rules for NULL and comparison operators to determine the types of DBExpr's constructors.
Experimenting with Tactics
What is the result of asking Lean to find values of the following types using by repeat constructor? Explain why each gives the result that it does.
- Nat
- List Nat
- Vect Nat 4
- Row []
- Row [⟨"price", .int⟩]
- Row peak
- HasCol [⟨"price", .int⟩, ⟨"price", .int⟩] "price" .int