2. Lexical Structure

This section describes the detailed lexical structure of the Lean language. Many readers will want to skip this section on a first reading.

Lean input is processed into a stream of tokens by its scanner, using the UTF-8 encoding. The next token is the longest matching prefix of the remaining input.

token ::=  symbol | command | ident | string | char | numeral |
           decimal | quoted_symbol | doc_comment | mod_doc_comment |
           field_notation

Tokens can be separated by the whitespace characters space, tab, line feed, and carriage return, as well as comments. Single-line comments start with --, whereas multi-line comments are enclosed by /- and -/ and can be nested.

2.1. Symbols and Commands

Symbols are static tokens that are used in term notations and commands. They can be both keyword-like (e.g. the have keyword) or use arbitrary Unicode characters.

Command tokens are static tokens that prefix any top-level declaration or action. They are usually keyword-like, with transitory commands like #print prefixed by an additional #. The set of built-in commands is listed in the Chapter 5 section.

Users can dynamically extend the sets of both symbols (via the commands listed in Section 2.6) and command tokens (via the [user_command] attribute).

2.2. Identifiers

An atomic identifier, or atomic name, is (roughly) an alphanumeric string that does not begin with a numeral. A (hierarchical) identifier, or name, consists of one or more atomic names separated by periods.

Parts of atomic names can be escaped by enclosing them in pairs of French double quotes «».

def foo.«bar.baz» := 0  -- name parts ["foo", "bar.baz"]
ident              ::=  atomic_ident | ident "." atomic_ident
atomic_ident       ::=  atomic_ident_start atomic_ident_rest*
atomic_ident_start ::=  letterlike | "_" | escaped_ident_part
letterlike         ::=  [a-zA-Z] | greek | coptic | letterlike_symbols
greek              ::=  <[α-ωΑ-Ωἀ-῾] except for [λΠΣ]>
coptic             ::=  [ϊ-ϻ]
letterlike_symbols ::=  [℀-⅏]
escaped_ident_part ::=  "«" [^«»\r\n\t]* "»"
atomic_ident_rest  ::=  atomic_ident_start | [0-9'ⁿ] | subscript
subscript          ::=  [₀-₉ₐ-ₜᵢ-ᵪ]

2.3. String Literals

String literals are enclosed by double quotes ("). They may contain line breaks, which are conserved in the string value.

string        ::=  '"' string_item '"'
string_item   ::=  string_char | string_escape
string_char   ::=  [^\\]
string_escape ::=  "\" ("\" | '"' | "'" | "n" | "t" | "x" hex_char hex_char)
hex_char      ::=  [0-9a-fA-F]

2.4. Char Literals

Char literals are enclosed by single quotes (').

char ::=  "'" string_item "'"

2.5. Numeric Literals

Numeric literals can be specified in various bases.

numeral   ::=  numeral10 | numeral2 | numeral8 | numeral16
numeral10 ::=  [0-9]+
numeral2  ::=  "0" [bB] [0-1]+
numeral8  ::=  "0" [oO] [0-7]+
numeral16 ::=  "0" [xX] hex_char+

Decimal literals are currently only being used for some set_option values.

decimal ::=  [0-9]+ "." [0-9]+

2.6. Quoted Symbols

In a fixed set of commands (notation, local notation, and reserve), symbols (known or unknown) can be quoted by enclosing them in backticks (`). Quoted symbols are used by these commands for registering new notations and symbols.

quoted_symbol       ::=  "`" " "* quoted_symbol_start quoted_symbol_rest* " "* "`"
quoted_symbol_start ::=  [^0-9"\n\t `]
quoted_symbol_rest  ::=  [^"\n\t `]

A quoted symbol may contain surrounding whitespace, which is customarily used for pretty printing the symbol and ignored while scanning.

While backticks are not allowed in a user-defined symbol, they are used in some built-in symbols (see Quotations), which are accessible outside of the set of commands noted above.

2.7. Doc Comments

A special form of comments, doc comments are used to document modules and declarations.

doc_comment     ::=  "/--" ([^-] | "-" [^/])* "-/"
mod_doc_comment ::=  "/-!" ([^-] | "-" [^/])* "-/"

2.8. Field Notation

Trailing field notation tokens are used in expressions such as (1+1).to_string. Note that a.to_string is a single identifier, but may be interpreted as a field notation expression by the parser.

field_notation ::=  "." ([0-9]+ | atomic_ident)