2. Lexical Structure¶
This section describes the detailed lexical structure of the Lean language. Many readers will want to skip this section on a first reading.
Lean input is processed into a stream of tokens by its scanner, using the UTF-8 encoding. The next token is the longest matching prefix of the remaining input.
token ::=symbol
|command
|ident
|string
|char
|numeral
|decimal
|quoted_symbol
|doc_comment
|mod_doc_comment
|field_notation
Tokens can be separated by the whitespace characters space, tab, line feed, and carriage return, as well as comments. Single-line comments start with --
, whereas multi-line comments are enclosed by /-
and -/
and can be nested.
2.1. Symbols and Commands¶
Symbols are static tokens that are used in term notations and commands. They can be both keyword-like (e.g. the have
keyword) or use arbitrary Unicode characters.
Command tokens are static tokens that prefix any top-level declaration or action. They are usually keyword-like, with transitory commands like #print
prefixed by an additional #
. The set of built-in commands is listed in the Chapter 5 section.
Users can dynamically extend the sets of both symbols (via the commands listed in Section 2.6) and command tokens (via the [user_command]
attribute).
2.2. Identifiers¶
An atomic identifier, or atomic name, is (roughly) an alphanumeric string that does not begin with a numeral. A (hierarchical) identifier, or name, consists of one or more atomic names separated by periods.
Parts of atomic names can be escaped by enclosing them in pairs of French double quotes «»
.
def foo.«bar.baz» := 0 -- name parts ["foo", "bar.baz"]
ident ::=atomic_ident
|ident
"."atomic_ident
atomic_ident ::=atomic_ident_start
atomic_ident_rest
* atomic_ident_start ::=letterlike
| "_" |escaped_ident_part
letterlike ::= [a-zA-Z] |greek
|coptic
|letterlike_symbols
greek ::= <[α-ωΑ-Ωἀ-῾] except for [λΠΣ]> coptic ::= [ϊ-ϻ] letterlike_symbols ::= [℀-⅏] escaped_ident_part ::= "«" [^«»\r\n\t]* "»" atomic_ident_rest ::=atomic_ident_start
| [0-9'ⁿ] |subscript
subscript ::= [₀-₉ₐ-ₜᵢ-ᵪ]
2.3. String Literals¶
String literals are enclosed by double quotes ("
). They may contain line breaks, which are conserved in the string value.
string ::= '"'string_item
'"' string_item ::=string_char
|string_escape
string_char ::= [^\\] string_escape ::= "\" ("\" | '"' | "'" | "n" | "t" | "x"hex_char
hex_char
) hex_char ::= [0-9a-fA-F]
2.5. Numeric Literals¶
Numeric literals can be specified in various bases.
numeral ::=numeral10
|numeral2
|numeral8
|numeral16
numeral10 ::= [0-9]+ numeral2 ::= "0" [bB] [0-1]+ numeral8 ::= "0" [oO] [0-7]+ numeral16 ::= "0" [xX]hex_char
+
Decimal literals are currently only being used for some set_option
values.
decimal ::= [0-9]+ "." [0-9]+
2.6. Quoted Symbols¶
In a fixed set of commands (notation
, local notation
, and reserve
), symbols (known or unknown) can be quoted by enclosing them in backticks (`
). Quoted symbols are used by these commands for registering new notations and symbols.
quoted_symbol ::= "`" " "*quoted_symbol_start
quoted_symbol_rest
* " "* "`" quoted_symbol_start ::= [^0-9"\n\t `] quoted_symbol_rest ::= [^"\n\t `]
A quoted symbol may contain surrounding whitespace, which is customarily used for pretty printing the symbol and ignored while scanning.
While backticks are not allowed in a user-defined symbol, they are used in some built-in symbols (see Quotations), which are accessible outside of the set of commands noted above.
2.7. Doc Comments¶
A special form of comments, doc comments are used to document modules and declarations.
doc_comment ::= "/--" ([^-] | "-" [^/])* "-/" mod_doc_comment ::= "/-!" ([^-] | "-" [^/])* "-/"
2.8. Field Notation¶
Trailing field notation tokens are used in expressions such as (1+1).to_string
. Note that a.to_string
is a single identifier, but may be interpreted as a field notation expression by the parser.
field_notation ::= "." ([0-9]+ | atomic_ident
)