BNF (Backus-Naur Form) was invented in 1960 to describe the ALGOL language and is now used to describe many programming languages.
An example BNF grammar from the Python docs:
    dict_display: "{" [key_list | dict_comprehension] "}"
    key_list: key_datum ("," key_datum)* [","]
    key_datum: expression ":" expression
    dict_comprehension: expression ":" expression comp_for
A BNF grammar can be used as a form of documentation, or even as a way to automatically create a parser for a language.
BNF is more powerful than regular expressions. For example, regular expressions cannot accurately match a language (like Scheme) in which parentheses balance and can be arbitrarily nested.
In formal language theory, BNF can describe "context-free languages" whereas regular expressions can only describe "regular languages".
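To make the contrast concrete, here is a minimal plain-Python sketch (the function name `is_balanced` is hypothetical, not part of any library). Checking arbitrarily nested parentheses requires tracking the nesting depth, which is the kind of unbounded counting that a classical regular expression cannot express.

```python
def is_balanced(text: str) -> bool:
    """Return True if every "(" in text is closed by a later ")"."""
    depth = 0  # number of currently open parentheses
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its matching "("
                return False
    return depth == 0  # nothing may be left open at the end


print(is_balanced("(define (f x) (* x x))"))  # True
print(is_balanced("(+ 1 (* 2 3)"))            # False
```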
A BNF grammar consists of a set of grammar rules. We will specifically use the rule syntax supported by the Lark Python package.
The basic form of a grammar rule:
    symbol₀: symbol₁ symbol₂ ... symbolₙ
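Symbols represent sets of strings and come in 2 flavors:
- Non-terminal symbols, which are defined by rules and can expand into other symbols (non-terminals, including themselves, or terminals)
- Terminal symbols, which are strings or regular expressions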
To give multiple alternative rules for a non-terminal, use the | symbol:
    symbol₀: symbol₁ | symbol₂
A simple grammar with three rules:
    ?start: numbers
    numbers: INTEGER | numbers "," INTEGER
    INTEGER: /-?\d+/
For the Lark library, parsing begins at the rule named start, which acts as the start symbol.
What strings are described by that grammar?
    10
    10,-11
    10,-11,12
You can paste a BNF grammar in code.cs61a.org, and it will be automatically recognized and processed by Lark as long as the first line starts with ?start:.
If the grammar is parsed successfully, then you can type strings from the language in the prompt.
    lark> 10,-11
If the string can be parsed according to the grammar, a parse tree appears! 🥳 🎉 🤯
Terminals are the base cases of the grammar (like the tokens from the Scheme project).
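In Lark grammars, they can be written as:
- Quoted strings, which match themselves (e.g. "*" or "define")
- Regular expressions with / on both sides (e.g. /\d+/)
- Uppercase names defined by their own rules (e.g. NUMBER: /\d+(\.\d+)?/)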
It's common to want to always ignore some terminals before matching. You can do that in Lark by adding an %ignore directive at the end of the grammar.
    %ignore /\s+/  // Ignores all whitespace
    ?start: sentence
    sentence: noun_phrase verb
    noun: NOUN
    noun_phrase: article noun
    article : | ARTICLE // The first option matches ""
    verb: VERB
    NOUN: "horse" | "dog" | "hamster"
    ARTICLE: "a" | "the"
    VERB: "stands" | "walks" | "jumps"
    %ignore /\s+/
What strings can this grammar parse?
    the horse jumps
    a dog walks
    hamster stands
EBNF is an extension to BNF that supports some shorthand notations for specifying how many of a particular symbol to match.
EBNF | Meaning | BNF equivalent |
item* | Zero or more items | items: | items item |
item+ | One or more items | items: item | items item |
item? | Optional item | optitem: | item |
All of our grammars for Lark can use EBNF shorthands.
Parentheses can be used for grouping.
    ?start: list
    list: ( NAME | NUM )+
    NAME: /[a-zA-Z]+/
    NUM: /\d+/
    %ignore /\s/
Square brackets indicate an optional group.
    numbered_list: ( NAME [ ":" NUM ] )+
Exercise: Describe a comma-separated list of zero or more names (no comma at the end).
    comma_separated_list: [ NAME ("," NAME)* ]
Lark also provides pre-defined terminals for common types of data to match.
    %import common.NUMBER
    %import common.SIGNED_NUMBER
    %import common.DIGIT
    %import common.HEXDIGIT
A BNF for the Calculator language:
    ?start: calc_expr
    ?calc_expr: NUMBER | calc_op
    calc_op: "(" OPERATOR calc_expr* ")"
    OPERATOR: "+" | "-" | "*" | "/"
    %ignore /\s+/
    %import common.NUMBER
Lark simplifies the parse tree that it returns:
- It does not show unnamed literal terminals (like "(") but does show the values of named terminals (like OPERATOR) or unnamed regular expressions.
- It removes any nodes whose rules start with ? and have only one child, replacing them with that child (like calc_expr).
Because the tree is simplified, we call it an abstract syntax tree.
Write a BNF that can parse simple Python comparisons between numbers: 5 > 2, 3 < 5, 32 == 33, etc.
The comparison 5 > 2 should result in this parse tree:
    ?start: comparison
    comparison: ______________________
    __________: ______________________
    %ignore /\s+/
    %import common.NUMBER
One possible solution:
    ?start: comparison
    comparison: NUMBER COMPARATOR NUMBER
    COMPARATOR: "==" | ">" | "<"
    %ignore /\s+/
    %import common.NUMBER
Write a BNF that can parse simple comparisons or Python "and" expressions with those simple comparisons: 5 > 2, 5 > 2 and 3 < 5, 5 > 2 and 3 < 5 and 2 < 4.
5 > 2 and 2 < 3 should result in this parse tree:
Note: An "and" expression may itself contain nested "and"s.
Start from the previous solution.
One possible solution:
    ?start: expression
    ?expression: and_expression | comparison
    and_expression: expression "and" expression
    comparison: NUMBER COMPARATOR NUMBER
    COMPARATOR: ">" | "<" | "=="
    %ignore /\s+/
    %import common.NUMBER
Add support for "or" expressions to the previous BNF. Examples: 5 > 2, 5 > 2 or 3 < 5, 5 > 2 and 3 < 5 or 2 < 4.
5 > 2 and 2 < 3 or 3 > 4 should result in this tree:
One possible solution:
    ?start: expression
    ?expression: or_expression | and_expression | comparison
    or_expression: expression "or" expression
    and_expression: expression "and" expression
    comparison: NUMBER COMPARATOR NUMBER
    COMPARATOR: ">" | "<" | "=="
    %ignore /\s+/
    %import common.NUMBER
Ambiguity arises when a grammar supports multiple possible parses of the same string.
Python infix expression grammar:
    ?start: expr
    ?expr: NUMBER | expr OPERATOR expr
    OPERATOR: "+" | "-" | "*" | "/"
    %import common.NUMBER
What tree should we get for 3+7*2?
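To see why the question matters, the plain-Python sketch below enumerates every way the rule expr: NUMBER | expr OPERATOR expr can group a token list and evaluates each grouping. The names `all_values` and `OPS` are hypothetical, and the input is assumed to be already split into numbers and operator strings.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def all_values(tokens):
    """Return the value of every parse of tokens under the ambiguous rule."""
    if len(tokens) == 1:  # expr: NUMBER
        return [tokens[0]]
    results = []
    # expr: expr OPERATOR expr, trying each operator as the root of the tree
    for i in range(1, len(tokens), 2):
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                results.append(OPS[tokens[i]](left, right))
    return results


print(all_values([3, "+", 7, "*", 2]))  # [17, 20]
```

The two results correspond to the two possible trees: 3+(7*2) with + at the root, and (3+7)*2 with * at the root.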
One way to resolve this ambiguity:
    ?start: expr
    ?expr: add_expr
    ?add_expr: mul_expr | add_expr ADDOP mul_expr
    ?mul_expr: NUMBER | mul_expr MULOP NUMBER
    ADDOP: "+" | "-"
    MULOP: "*" | "/"
    %import common.NUMBER
That grammar can only produce this parse tree: