Study Guide: BNF (Backus-Naur Form)
Instructions
This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.
Assignments
For solutions to these assignments once they have been released, find the links in the front page calendar.
Lectures
Readings
BNF is not covered in the 61A textbook, but here's a chapter from a UCI textbook: EBNF: A Notation to Describe Syntax. That chapter goes into more detail than strictly needed for our class, but may be helpful for those of you who like to learn from textbook readings.
Guides
BNF
Backus-Naur Form (BNF) is a syntax for describing a context-free grammar. It was invented for describing the syntax of programming languages, and is still commonly used in documentation and language parsers. EBNF is a dialect of BNF which contains some convenient shorthands.
An EBNF grammar contains symbols and a set of recursive production rules. In 61A, we are using the Python Lark library to write EBNF grammars, which has a few specific rules for grammar writing.
There are two types of symbols: Non-terminal symbols can expand into non-terminals (including themselves) or terminals. In the Python Lark library, non-terminal symbols are always lowercase. Terminal symbols can be strings or regular expressions. In Lark, terminals are always uppercase.
Consider these two production rules:
numbers: INTEGER | numbers "," INTEGER
INTEGER: /-?\d+/
The symbol numbers
is a non-terminal with a recursive production rule. It corresponds to either an INTEGER
terminal or to the numbers
symbol (itself) plus a comma plus an INTEGER
terminal. The INTEGER
terminal is defined using a regular expression which matches any number of digits with an optional - sign in front.
This grammar can describe strings like:
10
10,-11
10,-11,12
And so on, with any number of integers in front.
A grammar should also specify a start symbol, which corresponds to the whole expression being parsed (or the whole sentence, for a spoken language).
For the simple example of comma-separated numbers, the start symbol could just be the numbers terminal itself:
?start: numbers
numbers: numbers "," INTEGER | INTEGER
INTEGER: /-?\d+/
EBNF grammars can use these shorthand notations for specifying how many symbols to match:
EBNF Notation | Meaning | Pure BNF Equivalent |
---|---|---|
item* | Zero or more items | items: | items item |
item+ | One or more items | items: item | items item |
[item] item? |
Optional item | optitem: | item |
Lark also includes a few handy features:
- You can specify tokens to complete ignore by using the ignore directive at the bottom of a grammar. For example,
%ignore /\s+/
ignores all whitespace (tabs/spaces/new lines). - You can import pre-defined terminals for common types of data to match. For example, %import common.NUMBER imports a terminal that matches any integer or decimal number.
Using all of that, here's an EBNF grammar that corresponds to the Calculator language:
start: calc_expr?
calc_expr: NUMBER | calc_op
calc_op: "(" OPERATOR calc_expr* ")"
OPERATOR: "+" | "-" | "*" | "/"
%ignore /\s+/
%import common.NUMBER
You can paste that into code.cs61a.org and then input Calculator expressions in the interpreter to see their parse trees. Try it!
Practice Problems
Q1: Toddler-ese
Consider this EBNF grammar for "Toddler-ese", a language for communicating wants and needs with toddlers.
?start: sentence
sentence: describe_wants | describe_feeling
describe_wants: TODDLER " wants " noun_phrase "!"
noun_phrase: (ARTICLE " ")? NOUN
describe_feeling: TODDLER " is " EMOTION "!"
TODDLER: "beverly" | "baggy" | "baby"
ARTICLE: "the" | "a" | "an" | "un" | "una"
NOUN: "ball" | "elmo" | "chalk" | "gusano"
EMOTION: "sad" | "mad" | "tired"
Which of these phrases cannot be parsed by this grammar?
baby wants ball!
baggy is mad!
beverly is elmo!
baggy wants un elmo!
beverly is elmo!
, cannot be parsed by this grammar. The "is" will be matched by the describe_feeling
non-terminal, but "elmo" isn't matched by the EMOTION
terminal, so there is no rule that matches the whole phrase.
Remember that you can paste the grammar in code.cs61a.org and then enter expressions in the terminal below to see the parse trees or errors for each expression!
Which of these phrases can be parsed by this grammar?
baggy is mad!!
beverly is un gusano!
baggy wants a sad elmo!
baby wants the ball!
baby wants the ball!
, can be parsed by this grammar. Phrase #1 has too many exclamation points,
phrase #2 is using a NOUN
where an EMOTION
is expected. Phrase #3 adds an adjective where none is allowed.
Which BNF line would we modify to add support for emotions in front of the NOUN
s, like "mad elmo"?
sentence: describe_wants | describe_feeling
noun_phrase: (ARTICLE " ")? NOUN
ARTICLE: "the" | "an" | "a" | "una" | "un"
NOUN: "ball" | "elmo" | "chalk" | "gusano"
noun_phrase
non-terminal, modifying it like so:
noun_phrase: (ARTICLE " ")? (EMOTION " ")? NOUN
Q2: Python integers
Consider this EBNF grammar for allowed integer formats in the Python language, adopted from the Python documentation.
?start: integer
integer: decinteger | bininteger | octinteger | hexinteger
decinteger: NONZERODIGIT DIGIT*
bininteger: "0" ("b" | "B") BINDIGIT+
octinteger: "0" ("o" | "O") OCTDIGIT+
hexinteger: "0" ("x" | "X") hexdigit+
NONZERODIGIT: /[1-9]/
DIGIT: /[0-9]/
BINDIGIT: /[01]/
OCTDIGIT: /[0-7]/
hexdigit: DIGIT | /[a-f]/ | /[A-F]/
Which of these are valid Python integers, according to the grammar? Select all that apply
- 0xF9
- 0x55
- 0xf9
- 0xG9
- 0xFFF
hexinteger
rule, except for 0xG9
. That one is not valid
because it contains a "G" and hexdigit
only matches [0-9]
, [a-F]
, [A-F]
.
Remember that you can paste the grammar in code.cs61a.org and then enter expressions in the terminal below to see the parse trees or errors for each expression!
Which of these is not a valid Python integer, according to the grammar? Select one
- 07543
- 0b0101
- 0o7543
- 0x1011
Which of these can be parsed by the grammar? Select all that apply
- 0
- 0b
- 0x
- 0o
decinteger
must start with a non-zero digit
and the other integer types require an additional letter.
The other options can't be parsed, since the bininteger
, hexinteger
, and octinteger
rules
require at least 1 digit after the "b"/"x"/"o".
Which of these symbols are considered a non-terminal symbol? Select one
NONZERODIGIT
DIGIT
BINDIGIT
OCTDIGIT
hexdigit
hexdigit
symbol is a non-terminal symbol, since its rule can expand to another symbol, DIGIT
.
The other symbols in the list are all terminal symbols, since they are defined by regular expressions,
not other symbols.
Q3: Scheme expressions
Consider this EBNF grammar for a subset of Scheme expressions, adopted from the Scheme documentation.
?start: expression
expression: constant | variable | "(if " expression expression expression? ")" | application
constant: BOOLEAN | NUMBER
variable: identifier
application: "(" expression expression* ")"
identifier: initial subsequent* | "+" | "-" | "..."
initial: LETTER | "!" | "$" | "%" | "&" | "*" | "/" | ":" | "<" | "=" | ">" | "?" | "~" | "_" | "^"
subsequent: initial | DIGIT | "." | "+" | "-"
LETTER: /[a-zA-z]/
DIGIT: /[0-9]/
BOOLEAN: "#t" | "#f"
%import common.NUMBER
%ignore /\s+/
Which of these Scheme expressions can be parsed by that grammar? Select all that apply
#t
- 193
x
-xy
-xy
cannot be parsed by the grammar - it doesn't match variable since it starts with a "-" and doesn't match the other symbols since it doesn't start with a parentheses.
What would be equivalent to the current regular expression for DIGIT
? Select all that apply
DIGIT: /\d+/
DIGIT: /\d*/
DIGIT: /\d/
DIGIT: /\d?/
/\d/
would be equivalent. The other options have quantifiers, so they would match more or less than a single digit as well.
Which of these Scheme expressions can be parsed by that grammar? Select all that apply
(define x 5)
(if (= x 5) x y)
(begin (print 1) (print 2) (print 3))
(print (+ 1 2))
define
and begin
expressions
would be parsed as an application
, even though they are special forms.
In the full Scheme EBNF grammar, there is a separate symbol for the special forms,
so that the parser knows they aren't just standard procedure calls.
For example, to parse the define
form, we could modify the grammar like so:
expression: constant | variable | "(if " expression expression expression? ")" | define | application
define: "(define" variable expression ")"
Which of these statements is true? Select all that apply
- If an expression parses with this grammar, then it is possible to run that code without error in a Scheme interpreter.
- This grammar contains all the information needed to build a working interpreter for this subset of Scheme.
- This grammar contains the information needed for the phase of the interpreter that parses an input string and turns it into an expression tree.
- If an expression parses with this grammar, it is considered syntactically correct Scheme code.
(1 2)
. Option #2 is not correct since a grammar does not contain information on how to evaluate the code.