Study Guide: SQL
Instructions
This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.
Assignments
Important: For solutions to these assignments once they have been released, see the main website
Handouts
Lectures
Readings
Guides
SQL
SQL is a declarative programming language. Unlike Python or Scheme where we write programs which provide the exact sequence of steps needed to solve a problem, SQL accepts instructions which express the desired result of the computation.
The challenge with writing SQL statements then is in determining how to compose the desired result! SQL has a strict syntax and a structured method of computation, so even though we write statements which express the desired result, we must still keep in mind the steps that SQL will follow to compute the result.
SQL operates on tables of data, each of which contains a fixed number of columns. Each row of a table represents an individual data point, with a value for each column. SQL statements operate on these tables by iterating over each row, determining whether it should be included in the output relation (filtering), and then computing the resulting values that should appear in the output table.
We can also describe SQL's implementation using the following code as an example. Imagine the SELECT, FROM, WHERE, and ORDER BY clauses are implemented as functions which act on rows. Here's a simplified view of how SQL might work, if implemented in simple Python.
output_table = []
for row in FROM(*input_tables):
    if WHERE(row):
        output_table += [SELECT(row)]
if ORDER_BY:
    output_table = ORDER_BY(output_table)
if LIMIT:
    output_table = output_table[:LIMIT]
Note that the ORDER BY and LIMIT clauses are applied only at the end, after all the rows in the output table have been determined.
One of the important things to remember about SQL is that we always return to this very simple model of computation: looping, filtering, applying a function, and then ordering and limiting the final output.
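To make the model concrete, here is a minimal runnable sketch of the same loop in plain Python. The function name run_query, the sample dogs rows, and the lambda arguments are all hypothetical and chosen only for illustration. As in the simplified model, ordering is applied to the output rows, which is a simplification of what a real SQL engine does.

```python
# Hypothetical sample data: each row is a dict mapping column name to value.
dogs = [
    {"name": "ace", "size": 30},
    {"name": "bella", "size": 18},
    {"name": "coco", "size": 25},
    {"name": "duke", "size": 40},
]

def run_query(input_table, where, select, order_by=None, limit=None):
    """Loop, filter, apply a function, then order and limit the output."""
    output_table = []
    for row in input_table:                    # FROM
        if where(row):                         # WHERE
            output_table.append(select(row))   # SELECT
    if order_by is not None:                   # ORDER BY, applied to output rows
        output_table = sorted(output_table, key=order_by)
    if limit is not None:                      # LIMIT
        output_table = output_table[:limit]
    return output_table

# Roughly: SELECT name, size FROM dogs WHERE size > 20 ORDER BY size LIMIT 2;
result = run_query(
    dogs,
    where=lambda row: row["size"] > 20,
    select=lambda row: (row["name"], row["size"]),
    order_by=lambda out_row: out_row[1],
    limit=2,
)
print(result)  # [('coco', 25), ('ace', 30)]
```

Three rows pass the filter, but only the two smallest survive the final ordering and limiting step.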
The simple Python example above helps expose a limitation of SQL: we currently can't create output tables with more rows than in the input! There are a few methods for creating novel combinations of existing data: joins and SQL recursion. Aggregation allows us to find patterns and consider multiple rows together as a single unit, or group.
Joins
Joins create novel combinations of data by combining data from more than one source. Given multiple input tables, we can combine them in a join. Following the Python metaphor, the join is like creating nested for loops.
def FROM(table_1, table_2):
    for row_1 in table_1:
        for row_2 in table_2:
            yield row_1 + row_2
Given each row in table_1 and each row in table_2, the join iterates over each possible combination of rows and treats them as the input table. The same idea extends to more than two tables as well.
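The nested-loop picture can be checked against a real database. The sketch below assumes Python's built-in sqlite3 module and two small hypothetical tables, letters and numbers. It compares the rows produced by a nested-loop generator with the rows SQLite returns for a comma join.

```python
import sqlite3

def cross_join(table_1, table_2):
    """Yield every combination of a row from table_1 and a row from table_2."""
    for row_1 in table_1:
        for row_2 in table_2:
            yield row_1 + row_2  # tuples concatenate into one wider row

# Hypothetical sample tables, stored as lists of tuples.
letters = [("a",), ("b",)]
numbers = [(1,), (2,), (3,)]
python_pairs = list(cross_join(letters, numbers))

# Load the same rows into an in-memory SQLite database and join them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE letters(letter)")
conn.execute("CREATE TABLE numbers(num)")
conn.executemany("INSERT INTO letters VALUES (?)", letters)
conn.executemany("INSERT INTO numbers VALUES (?)", numbers)
sql_pairs = conn.execute(
    "SELECT letter, num FROM letters, numbers ORDER BY letter, num"
).fetchall()

print(len(sql_pairs))             # 6, i.e. 2 rows * 3 rows
print(python_pairs == sql_pairs)  # True
```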
The SQL lab has a great visual demonstrating this exact result.
Joins are particularly useful when we want to combine data on a single column. For example, say we have a table, dogs, containing the name and size of each dog, and a different table, parents, containing the name and parent of each dog. We might want to ask, "What's the difference in size between each dog and their parent?" by joining together the tables in a SQL statement.
The first question we should ask ourselves is, "Which data tables do we need to reference to assemble all the data we need?" We'll definitely need the table of parents to determine the name of each dog and their parent. From their names, we still need a way to get the size of each dog. That information is provided by the dogs table.
SELECT d.name, d.size, p.parent FROM dogs as d, parents as p WHERE d.name = p.name;
But referencing the dogs table only once will leave us in a tricky situation. Each output row matches a single row of dogs, so we can find the size of either the dog or their parent, but not both!
SELECT d1.name, d1.size, d2.name, d2.size
FROM dogs as d1, dogs as d2, parents as p
WHERE d1.name = p.name AND p.parent = d2.name;
Joining the dogs table twice provides the necessary information to solve the problem.
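As a rough check, the double join can be run with Python's built-in sqlite3 module. The dog names, sizes, and parent relationships below are made-up sample data, and the subtraction in the SELECT clause is just one possible way to answer the original "difference in size" question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs(name, size)")
conn.execute("CREATE TABLE parents(name, parent)")

# Hypothetical sample data: ace is bella's parent, bella is coco's parent.
conn.executemany(
    "INSERT INTO dogs VALUES (?, ?)",
    [("ace", 30), ("bella", 18), ("coco", 25)],
)
conn.executemany(
    "INSERT INTO parents VALUES (?, ?)",
    [("bella", "ace"), ("coco", "bella")],
)

# d1 is the dog, d2 is its parent; p links the two by name.
size_gaps = conn.execute("""
    SELECT d1.name, d2.name, d1.size - d2.size
    FROM dogs AS d1, dogs AS d2, parents AS p
    WHERE d1.name = p.name AND p.parent = d2.name
    ORDER BY d1.name
""").fetchall()

print(size_gaps)  # [('bella', 'ace', -12), ('coco', 'bella', 7)]
```

Note that ace has no row in parents, so ace never appears as d1 in the output.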
Aggregation
We saw joins as a method for creating novel combinations of data, and recursion as an extension of joins. These methods combine data by extending the number of columns we have available to us and help us identify the patterns in data.
Aggregation functions allow us to operate on data in a different way by combining results across multiple rows. Common aggregation functions to be familiar with include COUNT, MIN, MAX, SUM, and AVG.
Applying an aggregation function to an input relation results in a single row containing the aggregate result.
> SELECT AVG(n) FROM n5;
3.0
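The single-row result can be reproduced with Python's built-in sqlite3 module. This sketch assumes that n5 has one column n holding the integers 1 through 5, which is consistent with the average of 3.0 above but is not stated in this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: n5 holds the integers 1..5 in a single column n.
conn.execute("CREATE TABLE n5(n)")
conn.executemany("INSERT INTO n5 VALUES (?)", [(i,) for i in range(1, 6)])

# Five input rows collapse into one output row of aggregate values.
summary = conn.execute(
    "SELECT COUNT(n), MIN(n), MAX(n), SUM(n), AVG(n) FROM n5"
).fetchall()

print(summary)  # [(5, 1, 5, 15, 3.0)]
```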
But oftentimes, we'd like to split the input relation into groups and compute aggregate results for each smaller portion. We can use GROUP BY and HAVING to split the rows into groups and select only a subset of the groups.
output_table = []
for input_group in GROUP_BY(FROM(*input_tables)):
    output_group = []
    for row in input_group:
        if WHERE(row):
            output_group += [row]
    if HAVING(output_group):
        output_table += [SELECT(output_group)]
if ORDER_BY:
    output_table = ORDER_BY(output_table)
if LIMIT:
    output_table = output_table[:LIMIT]
We take the results from the input tables, whether it's just a single table or a join, and then apply the same row-by-row processing within a group. Before adding the result of the group to the output table, we check whether the group satisfies the condition in the HAVING clause, which serves as a filter on the groups, much like how WHERE is a filter on the rows.
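Here is a small sketch of WHERE and HAVING working together, run with Python's built-in sqlite3 module. The table small_ints and the column aliases are hypothetical. WHERE removes individual rows, and HAVING then removes whole groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table holding the integers 1..10.
conn.execute("CREATE TABLE small_ints(n)")
conn.executemany("INSERT INTO small_ints VALUES (?)", [(i,) for i in range(1, 11)])

groups = conn.execute("""
    SELECT n % 3 AS remainder, COUNT(*) AS members, MAX(n) AS largest
    FROM small_ints
    WHERE n > 2             -- row filter: drops 1 and 2
    GROUP BY n % 3          -- groups: {3, 6, 9}, {4, 7, 10}, {5, 8}
    HAVING COUNT(*) >= 3    -- group filter: drops {5, 8}
    ORDER BY remainder
""").fetchall()

print(groups)  # [(0, 3, 9), (1, 3, 10)]
```

Each surviving group contributes exactly one row to the output, no matter how many input rows it contains.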
Practice Problems
Medium
Suppose that we have a table of positive integers up to 100, as in lecture:
CREATE TABLE ints AS
  WITH i(n) AS (
    SELECT 1 UNION
    SELECT n+1 FROM i LIMIT 100
  )
  SELECT n FROM i;
Q1: Divisors
Define a table divisors in which each row describes the number of unique divisors for an integer up to 100. For example, the number 16 has 5 unique divisors: 1, 2, 4, 8, and 16.
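CREATE TABLE divisors AS
  SELECT "REPLACE THIS LINE WITH YOUR SOLUTION";
Solution: replace the placeholder line with the query below. Every pair (a, b) whose product is n contributes one divisor a of n, so counting the pairs in each group counts the divisors.
CREATE TABLE divisors AS
  SELECT a.n * b.n AS n, COUNT(*) AS divisors
    FROM ints AS a, ints AS b
    WHERE a.n * b.n <= 100
    GROUP BY a.n * b.n;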
Here's an (incomplete) example of what the divisors table should look like.
-- Example:
SELECT * FROM divisors LIMIT 20;
-- Expected output:
-- 1|1
-- 2|2
-- 3|2
-- 4|3
-- 5|2
-- 6|4
-- 7|2
-- 8|4
-- 9|3
-- 10|4
-- 11|2
-- 12|6
-- 13|2
-- 14|4
-- 15|4
-- 16|5
-- 17|2
-- 18|6
-- 19|2
-- 20|6
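One way to sanity-check a divisors table is to compare it against a brute-force count. The sketch below uses Python's built-in sqlite3 module and, as a simplifying assumption, fills ints directly from Python rather than with the recursive statement above. It is a cross-check under those assumptions, not the only way to test a solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: build ints (1..100) from Python instead of the recursive query.
conn.execute("CREATE TABLE ints(n)")
conn.executemany("INSERT INTO ints VALUES (?)", [(i,) for i in range(1, 101)])

# The self-join solution: one (a, b) pair per divisor a of the product.
conn.execute("""
    CREATE TABLE divisors AS
      SELECT a.n * b.n AS n, COUNT(*) AS divisors
      FROM ints AS a, ints AS b
      WHERE a.n * b.n <= 100
      GROUP BY a.n * b.n
""")
from_sql = dict(conn.execute("SELECT n, divisors FROM divisors"))

# Brute force: try every candidate divisor d of n.
brute_force = {
    n: sum(1 for d in range(1, n + 1) if n % d == 0)
    for n in range(1, 101)
}

print(from_sql == brute_force)  # True
print(from_sql[16])             # 5
```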
Q2: Primes
Define a table primes that has a single column containing all prime numbers up to 100.
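CREATE TABLE primes AS
  SELECT "REPLACE THIS LINE WITH YOUR SOLUTION";
Solution: replace the placeholder line with the query below. A prime number has exactly two divisors, 1 and itself.
CREATE TABLE primes AS
  SELECT n FROM divisors WHERE divisors = 2;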
Here's what your output should look like.
-- Example:
SELECT * FROM primes;
-- Expected output:
-- 2
-- 3
-- 5
-- 7
-- 11
-- 13
-- 17
-- 19
-- 23
-- 29
-- 31
-- 37
-- 41
-- 43
-- 47
-- 53
-- 59
-- 61
-- 67
-- 71
-- 73
-- 79
-- 83
-- 89
-- 97
Hint: You may want to use your divisors table to solve this problem.
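Following the hint, the primes table can be built on top of divisors and compared with simple trial division. This sketch uses Python's built-in sqlite3 module and again assumes ints is filled from Python for brevity. The trial-division list is only a cross-check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: build ints (1..100) from Python instead of the recursive query.
conn.execute("CREATE TABLE ints(n)")
conn.executemany("INSERT INTO ints VALUES (?)", [(i,) for i in range(1, 101)])
conn.execute("""
    CREATE TABLE divisors AS
      SELECT a.n * b.n AS n, COUNT(*) AS divisors
      FROM ints AS a, ints AS b
      WHERE a.n * b.n <= 100
      GROUP BY a.n * b.n
""")

# A prime is a number with exactly two divisors.
conn.execute("CREATE TABLE primes AS SELECT n FROM divisors WHERE divisors = 2")
primes = [n for (n,) in conn.execute("SELECT n FROM primes ORDER BY n")]

# Cross-check with trial division: no d in 2..n-1 divides n evenly.
trial_division = [n for n in range(2, 101) if all(n % d for d in range(2, n))]

print(primes == trial_division)  # True
print(len(primes))               # 25
```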