Mercury modules consist of a sequence of tokens separated by any amount of whitespace, comments, and line number directives. These separators are mostly ignored by the parser, but in some cases whitespace may be required to separate tokens that would otherwise be ambiguous. In other cases whitespace is not allowed, e.g. before the open-ct token, or after a ‘.’ operator that would otherwise be interpreted as an end token.
Whitespace is defined to be the following characters:
Unicode name | Unicode code point | Notes |
---|---|---|
SPACE | U+0020 | |
CHARACTER TABULATION | U+0009 | Horizontal-tab |
LINE FEED | U+000A | |
LINE TABULATION | U+000B | Vertical-tab |
FORM FEED | U+000C | |
CARRIAGE RETURN | U+000D | |
The ‘%’ character starts a comment that continues to the end of the line. The ‘/*’ character sequence starts a comment that continues until the next occurrence of ‘*/’.
A line number directive consists of the character ‘#’, a positive integer specifying the line number, and then a newline. Line number directives specify a current line number; they are used in conjunction with the ‘pragma source_file’ declaration (see Source file name) to indicate that errors in the subsequent Mercury code should be reported as coming from a different location. This is useful if the code in question was generated by another tool, in which case the line number can be set to the corresponding location in the original source file from which the Mercury code was derived. The Mercury compiler can thereby issue more informative error messages using locations in the original source file.
A line number directive specifies the line number for the immediately following line. Line numbers for lines after that are incremented as usual, so the second line after a ‘#100’ directive would be considered to be line number 101.
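As a minimal sketch of the separators described above, the following module uses both comment forms and a line number directive. The module name `separators_demo` (assumed to be saved as `separators_demo.m`) and the file name `original.src` are placeholders chosen for illustration only.

```mercury
% A '%' comment runs to the end of the line.
:- module separators_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.

/* A block comment continues, possibly over
   several lines, until the closing delimiter. */

% Report errors below as coming from original.src (a placeholder name).
:- pragma source_file("original.src").
#100
main(!IO) :-                            % treated as line 100
    io.write_string("hello\n", !IO).    % treated as line 101
```

A code generator would normally emit the directive itself, immediately before each block of Mercury code it derives from its own input.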
The different tokens in Mercury are as follows.
A string is a sequence of characters enclosed in double quotes (").
Within a string, two adjacent double quotes stand for a single double quote. For example, the string ‘ """" ’ is a string of length one, containing a single double quote: the outermost pair of double quotes encloses the string, and the innermost pair stand for a single double quote.
Strings may also contain backslash escapes. ‘\a’ stands for “alert” (a beep character), ‘\b’ for backspace, ‘\r’ for carriage-return, ‘\f’ for form-feed, ‘\t’ for tab, ‘\n’ for newline, ‘\v’ for vertical-tab. An escaped backslash, single-quote, or double-quote stands for itself.
The sequence ‘\x’ introduces a hexadecimal escape; it must be followed by a sequence of hexadecimal digits and then a closing backslash. It is replaced with the character whose character code is identified by the hexadecimal number. Similarly, a backslash followed by an octal digit is the beginning of an octal escape; as with hexadecimal escapes, the sequence of octal digits must be terminated with a closing backslash.
The sequence ‘\u’ or ‘\U’ can be used to escape Unicode characters. ‘\u’ must be followed by the Unicode character code expressed as four hexadecimal digits. ‘\U’ must be followed by the Unicode character code expressed as eight hexadecimal digits. The highest allowed value is ‘\U0010FFFF’.
A backslash followed immediately by a newline is deleted; thus an escaped newline can be used to continue a string over more than one source line. (String literals may also contain embedded newlines.)
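The string rules above can be tried out with a short program. This is only a sketch: the module name `string_demo` is hypothetical, and the characters chosen for the escapes are arbitrary.

```mercury
:- module string_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module string.

main(!IO) :-
    % Two adjacent double quotes inside a string stand for one.
    io.write_string("She said ""hi"".\n", !IO),
    % The literal """" is a string of length one.
    io.write_int(string.length(""""), !IO),
    io.nl(!IO),
    % Hexadecimal and octal escapes both need a closing backslash.
    % Each of the four escapes below denotes the letter A (code 65).
    io.write_string("\x41\ \101\ \u0041 \U00000041\n", !IO),
    % A backslash immediately before a newline is deleted, so this
    % literal contains no line break before "two".
    io.write_string("one \
two\n", !IO).
```

Note that the space after each closing backslash in the third string is an ordinary character of the string, not part of the escape.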
A name is either an unquoted name, a quoted name, a graphic name, or a single semicolon character. An unquoted name is a lowercase letter followed by zero or more letters, underscores, and digits. A quoted name is any sequence of zero or more characters enclosed in single quotes ('). Within a quoted name, two adjacent single quotes stand for a single single quote. Quoted names can also contain backslash escapes of the same form as for strings. A graphic name is a sequence of one or more of the following characters
! & * + - : < = > ? @ ^ ~ \ # $ . /
where the first character is not ‘#’.
An unquoted name, graphic name, or semicolon is treated as equivalent to a quoted name containing the same sequence of characters.
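The equivalence between unquoted and quoted names can be illustrated with predicate names. The sketch below assumes a hypothetical module `name_demo`; the predicate names are invented for the example.

```mercury
:- module name_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.

% An unquoted name: a lowercase letter, then letters, digits, underscores.
:- pred greet_1(io::di, io::uo) is det.
greet_1(!IO) :-
    io.write_string("unquoted\n", !IO).

% A quoted name may contain characters that an unquoted name cannot.
:- pred 'say hello'(io::di, io::uo) is det.
'say hello'(!IO) :-
    io.write_string("quoted\n", !IO).

main(!IO) :-
    greet_1(!IO),
    % 'greet_1' denotes the same name as greet_1.
    'greet_1'(!IO),
    'say hello'(!IO).
```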
An operator is one of the builtin operators (see Builtin operators) or a user-defined operator. A user-defined operator is a name, module qualified name (see The module system), or variable, enclosed in grave accents (backquotes). User-defined operators are left-associative infix operators that bind more strongly than most other operators; see the builtin operator table for their relative binding strength.
The builtin operators, with the exception of comma, are all names, and as such they can be used without arguments supplied. For example, ‘f(+)’ is syntactically valid. In some cases parentheses may be required to limit the scope of an operator without arguments, e.g. if it appears as an argument to another operator. The comma operator is not a name and therefore requires single quotes in order to be used without arguments.
Note that an operator in single quotes is still an operator, so any requirement for parentheses will remain unchanged.
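A user-defined operator can be sketched as follows. The function `sub` and the module name `operator_demo` are hypothetical; the expected groupings in the comments follow from the left-associativity and binding strength described above, assuming the standard builtin operator table.

```mercury
:- module operator_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module int.

% An ordinary two-argument function, defined only for this example.
:- func sub(int, int) = int.
sub(X, Y) = X - Y.

main(!IO) :-
    % Left-associative: parsed as (10 `sub` 3) `sub` 2, which is 5.
    A = 10 `sub` 3 `sub` 2,
    io.write_int(A, !IO),
    io.nl(!IO),
    % Binds more tightly than '*': parsed as 2 * (3 `sub` 1), which is 4.
    B = 2 * 3 `sub` 1,
    io.write_int(B, !IO),
    io.nl(!IO).
```

When the intended grouping differs from these defaults, explicit parentheses are the simplest way to make it clear.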
A variable is an uppercase letter or underscore followed by zero or more letters, underscores, and digits. A variable token consisting of a single underscore is treated specially: each instance of ‘_’ denotes a distinct variable. (In addition, variables starting with an underscore are presumed to be “don’t-care” variables; the compiler will issue a warning if a variable that does not start with an underscore occurs only once, or if a variable starting with an underscore occurs more than once in the same scope.)
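The following sketch shows the two underscore conventions in use. The module name `variable_demo` and both predicates are hypothetical.

```mercury
:- module variable_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module list.

% Each '_' is a distinct variable, so the two ignored fields
% are not required to be equal.
:- pred middle({int, int, int}::in, int::out) is det.
middle({_, M, _}, M).

% _Tail starts with an underscore and occurs only once,
% so no singleton-variable warning is expected for it.
:- pred head_or_zero(list(int)::in, int::out) is det.
head_or_zero([], 0).
head_or_zero([Head | _Tail], Head).

main(!IO) :-
    middle({1, 2, 3}, M),
    head_or_zero([7, 8, 9], H),
    io.write_int(M, !IO), io.nl(!IO),    % writes 2
    io.write_int(H, !IO), io.nl(!IO).    % writes 7
```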
An integer is either a decimal, binary, octal, hexadecimal, or character-code literal. A decimal literal is any sequence of decimal digits. A binary literal is ‘0b’ followed by any sequence of binary digits. An octal literal is ‘0o’ followed by any sequence of octal digits. A hexadecimal literal is ‘0x’ followed by any sequence of hexadecimal digits. A character-code literal is ‘0'’ followed by any single character.
Decimal, binary, octal and hexadecimal literals may be optionally terminated by a suffix that indicates whether the literal represents a signed or unsigned integer and what the size of that integer is. These suffixes are:
Suffix | Signedness | Size |
---|---|---|
i or no suffix | Signed | Implementation-defined |
i8 | Signed | 8-bit |
i16 | Signed | 16-bit |
i32 | Signed | 32-bit |
i64 | Signed | 64-bit |
u | Unsigned | Implementation-defined |
u8 | Unsigned | 8-bit |
u16 | Unsigned | 16-bit |
u32 | Unsigned | 32-bit |
u64 | Unsigned | 64-bit |
For decimal, binary, octal and hexadecimal literals, an arbitrary number of underscores (‘_’) may be inserted between the digits. An arbitrary number of underscores may also be inserted between the radix prefix (i.e. ‘0b’, ‘0o’ and ‘0x’) and the initial digit. Similarly, an arbitrary number of underscores may be inserted between the final digit and the signedness suffix. The purpose of the underscores is to improve readability; they do not affect the numeric value of the literal.
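A few integer literals written according to the rules above, as a sketch. The module name `integer_demo` is hypothetical, and the sized literals assume a compiler version that supports the suffixes listed in the table.

```mercury
:- module integer_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module list.

main(!IO) :-
    % Five spellings of 255; the last has an underscore
    % between the radix prefix and the first digit.
    Same = [255, 0b1111_1111, 0o377, 0xff, 0x_ff],
    io.write_line(Same, !IO),
    % Underscores between digits, and the character code of 'A' (65).
    Others = [1_000_000, 0'A],
    io.write_line(Others, !IO),
    % Sized literals; an underscore may precede the suffix.
    Byte = 0xff_u8,
    Wide = 9_000_000_000_i64,
    io.write_line(Byte, !IO),
    io.write_line(Wide, !IO).
```

The suffix determines the type of the literal, so `Byte` and `Wide` here have different types and could not be placed in the same list.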
A floating point literal consists of a sequence of decimal digits, a decimal point (‘.’) and a sequence of digits (the fraction part), and the letter ‘E’ (or ‘e’), an optional sign (‘+’ or ‘-’), and then another sequence of decimal digits (the exponent). The fraction part or the exponent (but not both) may be omitted.
An arbitrary number of underscores (‘_’) may be inserted between the digits in a floating point literal. Underscores may not occur adjacent to any non-digit characters (i.e. ‘.’, ‘e’, ‘E’, ‘+’ or ‘-’) in a floating point literal. The purpose of the underscores is to improve readability; they do not affect the numeric value of the literal.
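The permitted shapes of a floating point literal can be sketched as follows; the module name `float_demo` is hypothetical and the values are arbitrary.

```mercury
:- module float_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module list.

main(!IO) :-
    % Fraction part and exponent both present.
    A = 6.022e23,
    % Exponent omitted.
    B = 3.14159,
    % Fraction part omitted.
    C = 1e-3,
    % Underscores only between digits, never next to '.', 'e' or a sign.
    D = 1_000_000.000_001,
    io.write_line([A, B, C, D], !IO).
```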
An implementation-defined literal consists of a dollar sign (‘$’) followed by an unquoted name.
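As an illustration, the sketch below uses `$file`, `$line`, `$module` and `$pred`. These particular literals are not listed in this section; they are assumed here from the Mercury compiler, and the set available may depend on the compiler version. The module name `literal_demo` is hypothetical.

```mercury
:- module literal_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module list.
:- import_module string.

main(!IO) :-
    % $file is a string and $line is an int.
    io.format("%s:%d\n", [s($file), i($line)], !IO),
    % $module and $pred are strings describing the enclosing
    % module and predicate.
    io.format("%s / %s\n", [s($module), s($pred)], !IO).
```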
A left parenthesis, ‘(’, that is not preceded by whitespace (the open-ct token).
A left parenthesis, ‘(’, that is preceded by whitespace.
A right parenthesis, ‘)’.
A left square bracket, ‘[’.
A right square bracket, ‘]’.
A left curly bracket, ‘{’.
A right curly bracket, ‘}’.
A “head-tail separator”, i.e. a vertical bar, ‘|’.
A comma, ‘,’.
A full stop (period), ‘.’ (the end token).