```r
df %>% # FROM df
  filter(var_1 > 0) %>% # WHERE var_1 > 0
  mutate(chr_var = as.character(num_var)) %>% # SELECT CAST(num_var AS CHAR)
  group_by(chr_var) %>% # GROUP BY chr_var
  summarise( # SELECT COUNT(*) AS n, AVG(sum_var) AS mean
    n = n(),
    mean = mean(sum_var)
  ) %>%
  select(chr_var, n, mean) # SELECT ...
```
Introduction to SQL for R Users w/Examples
SQL - Structured Query Language
First and foremost, unlike R, SQL is a procedural language in that it is quite literally an ordered list of instructions or procedures for a computer to follow. It’s also a language that operates on tabular data, which in a not-so-technical way can be described as any data you can feasibly put into an Excel spreadsheet. Being procedural does make the language less flexible in many ways, but the trade-off in increased speed, efficiency, and the ability to wrangle data without bringing it all into active memory is truly astounding – hence why it’s still around today!
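As a first taste, here is a minimal sketch of such an "ordered list of instructions" – the table `patients` and its columns are hypothetical, purely for illustration:

```sql
#Hypothetical table: patients(id, age, clinic)
SELECT id, age      #Instruction 3: keep these columns
FROM patients       #Instruction 1: start from this table
WHERE age >= 18     #Instruction 2: keep only the rows of interest
ORDER BY age DESC   #Instruction 4: lay the rows out oldest-first
```

Notice the numbering in the comments – the order the clauses are *written* is not the order they are *run*, which is exactly the topic of \(\S 1.1\).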
Resources for Today’s Lecture
This lecture is by no means comprehensive and is just meant to serve as an introduction for those with an intermediate level of R and tidyverse knowledge. To that end, here is a list of resources from today’s lecture for more in-depth explanations and examples by people who are much smarter than me, as well as resources for the core R packages implemented in the examples.
SQL Resources
- 10 Steps to Understanding SQL (Highly Recommended)
- SQL Explained (Highly Recommended)
- Code Academy SQL
- Indexes & Planning Queries with SQLite
- SQL Wikipedia
- More on the origins of SQL, relational databases, etc. (actually a decent read)
R Package Vignettes
Note – Personally, I find SQLite is easier to work with, RSQLite is better maintained, and RStudio developers seem to prefer both as well. However, coding-assessment websites may not support SQLite, but they seem to support MySQL pretty much across the board – hence why it’s included in the examples.
Practice for Job Interviews
- Free practice and preparation with Hackerrank.com
- If you have access to DataCamp through PUBH 7461/7462, there is lots of SQL practice
1. Intro to SQL
Let’s start with an analogy and a simple example:
A query is to SQL as a pipe is to dplyr.
Recall that a pipe operator %>% or |> takes the output from the l.h.s. of the pipe and passes it as an input to the r.h.s. of the pipe. So a general dplyr pipeline flow, such as the data wrangling pipe from df %>% $\cdots$ %>% select(...) shown above, is equivalent to the MySQL query given by
```sql
SELECT chr_var, COUNT(*) AS n, AVG(sum_var) AS mean
FROM (
  SELECT *, CAST(num_var AS CHAR) AS chr_var #* means all, i.e. `everything()`
  FROM df
  WHERE var_1 > 0
) q01 #MySQL requires derived tables to be named
GROUP BY chr_var
```
While the output and individual data wrangling steps are the same, the order of operations (\(\S 1.1\)) and the need for derived tables (\(\S 1.2\)) are super different – but have no fear, once you get the hang of it, translating dplyr back to SQL isn’t too bad! Fundamentally, the goal of the R pipe and the SQL query is the same: to wrangle data. Below, we discuss the differences between syntax order and execution order, Q & A about SQL and SQL within the R ecosystem, followed by a variety of examples of how to translate data wrangling in dplyr to SQL.
1.1 Order of Execution \(\neq\) Syntax Order of SQL Procedures
Syntax Order of Operations
- SELECT [ DISTINCT ] (`select() %>% distinct()` / `mutate` / `summarise`)
  - Selects columns from the table AND performs `mutate`/`summarise`
  - AS for renaming, ex. `SELECT var_1 * 100 AS new_var` \(\iff\) `mutate(new_var = var_1 * 100)`
  - Cannot access a new ‘mutated’ variable in the same SELECT where it was created
    - ex. `mutate(a = b + 1, c = a - 1)` \(\neq\) `SELECT b + 1 AS a, a - 1 AS c` (won’t run) – see \(\S 1.2\) Derived Tables below for more detail
- FROM (`df %>% .`)
  - Which table are we working with
- WHERE (`filter()`)
  - Remove obs. that are not of interest
- GROUP BY (`group_by()`)
  - Set up for group-wise summary stats
  - SQL ex. `SELECT AVG(var) AS sum_var FROM table GROUP BY category` \(\iff\) dplyr ex. `table %>% group_by(category) %>% summarise(sum_var = mean(var))`
  - SELECT is used for both `mutate` and `summarise` in SQL
- HAVING (another `filter()`)
  - Same as WHERE (`filter`) but for post-grouping summaries (or window summaries via `OVER()`)
  - However, SQLite accepts WHERE for both types of filtering
- UNION / JOIN (`*_join()`)
  - Merging data from multiple sources by column/key
  - ex. `FROM data_a AS a JOIN data_b AS b ON a.key = b.key`
- ORDER BY [ DESC/ASC ] (`arrange(desc())`)
  - No factor structure, just ordering the table layout
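To make the WHERE vs. HAVING distinction concrete, here is a small sketch – the `sales` table and its columns are hypothetical, purely for illustration:

```sql
#Hypothetical table: sales(region, amount)
SELECT region, AVG(amount) AS avg_amt #summarise(avg_amt = mean(amount))
FROM sales
WHERE amount > 0                      #filter() on raw rows, before grouping
GROUP BY region                       #group_by(region)
HAVING AVG(amount) > 100              #filter() on the grouped summaries
```

In dplyr both lines are just `filter()` – it’s the position in the pipe (before or after `summarise()`) that decides whether they translate to WHERE or HAVING.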
Order of Execution
- FROM (`df %>% .`)
- WHERE (`filter()`)
- GROUP BY (`group_by()`)
- HAVING (another `filter()`)
- SELECT (`select()`)
- DISTINCT (`distinct()`, retain only unique obs.)
- UNION (`*_join()`)
- ORDER BY [ DESC/ASC ] (`arrange(desc())`)
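One practical consequence of this execution order: WHERE runs *before* SELECT while ORDER BY runs *after* it, so a SELECT alias is visible to ORDER BY but not to WHERE. A minimal sketch (df and its columns are hypothetical):

```sql
SELECT var_1 * 100 AS new_var
FROM df
WHERE var_1 > 0   #OK – var_1 already exists when WHERE runs
ORDER BY new_var  #OK – ORDER BY executes after SELECT, so the alias exists
#WHERE new_var > 0 would NOT run: WHERE executes before SELECT creates new_var
```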
Summary
You may have noticed that the order SQL procedures are actually run in matches the general workflow for data wrangling in R with dplyr, and I don’t think that’s a coincidence! SQL, born in the 70’s, is hyper-efficient at what it was designed to do.
1.2 Derived Tables
Derived tables are going to feel unnatural compared to the way we can manipulate data with mutate in dplyr, but they are necessary for more complex queries. What is a derived table, you ask? Similar to piping in R, where f() %>% g(.) means take the output of f() and pass it as an input to g(), when performing a series of wrangling functions like filter, mutate, *_join, etc., that pipeline is producing at least one derived table in the SQL translation.
A derived table is a subquery in a FROM statement which produces an intermediate table and, in most SQL versions, must be named with AS. For example, this is a simple derived table/subquery –
```sql
SELECT * FROM (
  SELECT var_1 + var_2 AS sum_1
  FROM df
) AS tbl_1
WHERE sum_1 > 0
```
Now why do we need a subquery? Why can’t we just simply say

```sql
SELECT * FROM df WHERE (var_1 + var_2) > 0
```

In this simple case we could – but the moment we want to keep the new named column sum_1 in the output, or filter on an aggregate like a maximum, we need derived tables and sub-queries (below)!
1.3 Sub-queries
Very similar in syntax and flavour to derived tables, sub-queries are just derived tables which contain only one column and are generally used within filter/WHERE/HAVING. For example, in R we can pass data manipulations, functions, and other variables within the filter without thinking much of it –
```r
df %>%
  filter(var1 < max(var2))
```
However, SQL procedures/queries don’t operate like pipes, and in order to mutate a variable AND have access to it in the same procedure, we need sub-queries. The MySQL sub-query equivalent to the above is given by

```sql
SELECT * FROM df
WHERE var1 < (SELECT MAX(var2) OVER() AS max_var2 FROM df LIMIT 1)
```
Here, the sub-query is itself a stand alone query but it must have only 1 column and generally will only contain a single value/character. Example use of sub-queries in WHERE/HAVING statements is found near the end of the examples in \(\S 3\).
```sql
#Sub-query is a valid query (1 col, 1 val)
SELECT MAX(var2) OVER() AS max_var2 #Equiv. to df %>% mutate(max_var2 = max(var2))
FROM df
LIMIT 1 #Only want the first occurrence (it's a column of the same number)
```
2. But what is a Structured Query Language?
How does it work, are there alternatives, and when is dbplyr sufficient? (Q & A)
Q: Why procedural instead of functional or dynamic?
Answer – because under the hood, SQL is actually a collection of hyper-efficient algorithms which seek to optimally complete the query you’ve requested. Since it doesn’t pull data into memory until the query is complete, this necessitates a pre-defined procedural set of instructions to be completed. This makes it less flexible but very fast!
Q: SQL is super efficient, but what can’t it do?
Answer – It can’t handle non-tabular data and/or more complex data structures like lists, arrays, etc.
Q: Are there alternatives to SQL?
Answer – Yes and no. SQL has been around for so long that there are many different versions and flavours of it, and to be honest, if you are starting from square one it can be super confusing because the fundamental syntax differs slightly between them. Kind of like how base R is self-consistent syntactically but individual packages, like those in the tidyverse, can be utterly unrecognizable by comparison. There are also other database languages/data structures, like Apache Arrow Parquet files (in R these interface directly with dplyr just like dbplyr below); Apache Spark (sparklyr) is another way to handle data larger than memory; and then there is a whole suite of proprietary and open-source languages for interacting with cloud databases and cloud computing (like AWS).
Q: If I know dplyr well, do I really need to know SQL?
Answer – If you want to be a well-rounded data scientist in a production environment, or even just want to speed up your own data wrangling workflow with SQL databases, yes, you need to know SQL (to the point that you can use the internet to create a more complex query if you need to for work). As mentioned above, there are tons of ways to handle data bigger than memory without leaving the R ecosystem, but SQL is standard in almost all industrial environments.
Q: When is dbplyr sufficient and when do I need to use SQL?
Answer – While dplyr was surely built with SQL procedures in mind (the flow/verbs are very similar) and dbplyr will work for most basic to intermediate SQL queries, there’s no such thing as a free lunch. One of the biggest downsides to relying on a translator is that there are lots of data wrangling functions, workflows, and user-defined functions that I use every day in R but can’t use with dbplyr because there is no SQL equivalent. However, though it may not be able to generate the exact SQL query you need in a given setting, you can certainly build a pretty comprehensive skeleton for a query with show_query (the translated syntax is also generally sub-optimal for both readability and reproducibility, so copying and pasting queries is inadvisable).
3. How to use DBI to connect to a database and dbplyr to generate queries
In general, you don’t really use SQL locally, but for practice and learning it’s best to start there. However, creating and maintaining databases is its own distinct field, and trying to create one locally, depending on the SQL version, can be complicated and time-consuming. At work, experts will most likely handle that aspect of the data pipeline; you’ll just connect and query the data you need to model, visualize, report, etc.
```r
#SQLite
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

#MySQL
con_mysql <- pool::dbPool(
  RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

dplyr::copy_to(con_sqlite, nycflights13::flights, "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "dest"
  )
)

dplyr::copy_to(con_sqlite, nycflights13::airports, "airports")

#Pull table (FROM flights or FROM City)
flights_db <- dplyr::tbl(con_sqlite, "flights")
airport_db <- dplyr::tbl(con_sqlite, "airports")
city_db    <- dplyr::tbl(con_mysql, "City")
```
4. Let’s Wrangle Some Data! (Examples w/Explanation)
4.1 NYC Flights ‘basic’ grouped summary stats (SQLite)
```r
flights_db %>%
  select(year:day, contains("time"), origin) %>%
  filter(year == 2013, day < 30, origin %in% c("LGA", "JFK", "EWR")) %>%
  mutate(
    dep_dif = dep_time - sched_dep_time,
    arr_dif = arr_time - sched_arr_time
  ) %>%
  group_by(origin) %>%
  summarise(
    mean_dep = mean(dep_dif, na.rm = TRUE),
    mean_arr = mean(arr_dif, na.rm = TRUE)
  ) %>%
  filter(mean_dep > 0) %>%
  arrange(desc(mean_dep)) %>%
  show_query()
```
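For intuition, the translation show_query() prints will look roughly like the sketch below (hand-written, not verbatim dbplyr output – the real query quotes identifiers and names its derived table differently):

```sql
SELECT origin, AVG(dep_dif) AS mean_dep, AVG(arr_dif) AS mean_arr
FROM (
  SELECT *,
    dep_time - sched_dep_time AS dep_dif,
    arr_time - sched_arr_time AS arr_dif
  FROM flights
  WHERE year = 2013 AND day < 30 AND origin IN ('LGA', 'JFK', 'EWR')
) q01
GROUP BY origin
HAVING mean_dep > 0
ORDER BY mean_dep DESC
```

Note how the mutate step forces a derived table (\(\S 1.2\)) and the post-summarise filter becomes HAVING (\(\S 1.1\)).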
4.2 Filtering by the max and min city name, by string length (MySQL)
Consider the question “Return the min and max length city name and its length. Return only the first alphabetical occurrence if there are multiple.”
```r
city_db %>%
  mutate(city_len = length(Name)) %>%
  filter(city_len == max(city_len) | city_len == min(city_len)) %>%
  select(Name, city_len) %>%
  show_query()
```
However, this query is a bit messy and not all versions of SQL support WINDOW – OVER() does seem to work fine though. Here was my solution to the question that worked –
```sql
SELECT CITY, city_len
FROM (
  SELECT *,
    MAX(city_len) OVER() AS max_len,
    MIN(city_len) OVER() AS min_len
  FROM (
    SELECT *, LENGTH(CITY) AS city_len
    FROM STATION
  ) q01
) q02
WHERE (city_len = max_len OR city_len = min_len)
ORDER BY city_len DESC, CITY ASC
LIMIT 2 #Top 2
```
Here you can’t just call max(city_len) or min(city_len) directly in the filter/WHERE statement because aggregation and filtering are two distinct procedures in SQL. Instead, we need to use a WINDOW function via OVER() to obtain the aggregate max and min and have access to them later (mutate), without changing the current structure of the data (mutate vs. summarise).
Key Takeaways
- Derived tables are necessary to access new columns (mutate) or summary/WINDOW columns
- Continuously pass derived tables in a ‘reverse nesting-doll’ fashion via nested SELECTs
- Generally I don’t like using arrange/ORDER BY to ensure ordering of things, as it can break fairly easily if the data isn’t immaculately clean (I trust factor structure/ordinal encoding more in general), but it seems to work here
- No user-defined functions in SQL queries like `mutate(new_var = my_func(old_var))`, unfortunately
4.3 More complex group_by + summarise (MySQL)

```r
city_db %>%
  filter(CountryCode %in% c("AFG", "NLD"), Population > 1) %>%
  mutate(population_k = Population / 1000) %>%
  group_by(CountryCode, District) %>%
  summarise(
    max = max(population_k),
    avg = mean(population_k, na.rm = TRUE),
    var = sd(population_k, na.rm = TRUE)^2
  ) %>%
  filter(max > 0) %>%
  rename(Country = CountryCode) %>%
  arrange(desc(avg)) %>%
  show_query()
```
```r
city_db %>%
  summarise(
    n = count(CountryCode),
    n_dist = count(distinct(CountryCode))
  ) %>%
  mutate(
    n_diff = n - n_dist
  ) %>%
  dplyr::select(n_diff) %>%
  show_query()

# SELECT n - n_dist AS n_diff
# FROM (
#   SELECT count(CountryCode) AS n, count(distinct(CountryCode)) AS n_dist
#   FROM City
# ) n_dist
```
4.4 Simple regex queries, makes me miss stringr! (MySQL)
Select based on starts and/or ends with –
```sql
SELECT DISTINCT CITY
FROM STATION
WHERE CITY RLIKE '^[aeiouAEIOU]'

#IF starts AND ends with
'^[regex].*[regex]$'
#IF ends with
'.*[regex]$'
#DOES NOT start with
'^(?![aeiouAEIOU])'
#DOES NOT end with
'.*(?![aeiouAEIOU])$'
```

```sql
#DOES NOT start or DOES NOT end with a vowel
SELECT DISTINCT city
FROM station
WHERE city NOT REGEXP '^[aeiou]' OR city NOT REGEXP '[aeiou]$'
```

```sql
SELECT NAME
FROM STUDENTS
WHERE MARKS > 75
ORDER BY RIGHT(NAME, 3), ID ASC #Order by last 3 chars
```
4.6 Table Joins on Table Joins (MySQL)
Joining multiple tables in a pipe/query
Here we need to use explicit JOINs, and while I almost always use left_join in R, as long as you’ve placed the JOINs in the right order to match the relationship between the data and the question of interest, a plain JOIN seems to be fine –
```sql
SELECT h.hacker_id, h.name #LAST select
FROM Submissions AS s #FIRST, data to be joined
JOIN Hackers AS h #Join with hacker data by hacker id
  ON s.hacker_id = h.hacker_id
JOIN Challenges AS c #Join with challenge data on challenge id
  ON s.challenge_id = c.challenge_id
JOIN Difficulty AS d #Join with difficulty on Difficulty_level
  ON c.Difficulty_level = d.Difficulty_level
WHERE s.score = d.score #Max score
GROUP BY h.hacker_id, h.name #group_by id and name
HAVING count(*) > 1 #WHERE but for summarise
ORDER BY count(*) DESC, h.hacker_id #arrange
```
And here’s the dplyr pipe translation of that SQL query –

```r
submission.df %>%
  left_join(., hacker.df, by = "hacker_id") %>%
  left_join(., challenge.df, by = "challenge_id") %>%
  left_join(., difficulty.df, by = "Difficulty_level") %>%
  filter(score.x == score.y) %>%
  group_by(hacker_id, name) %>%
  summarise(
    n = n()
  ) %>%
  filter(n > 1) %>%
  arrange(desc(n), hacker_id)
```
More complicated table joins and queries (and one sub-query!)
Here we’re joining a few tables to answer a more complicated question, which will require us to employ a sub-query to get the reproducible filter/WHERE statement we need at the end. In R, we don’t need sub-queries because we can simply perform a 2-step process: (1) run the sub-query or prior pipe and store the result as a variable in our global/local environment, and (2) pass that variable through the filter/WHERE procedure in the main query/pipe.
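A sketch of that 2-step process in R, reusing the hypothetical df/var1/var2 example from \(\S 1.3\):

```r
#(1) Run the "sub-query" first and store the result
max_var2 <- df %>%
  summarise(max_var2 = max(var2)) %>%
  pull(max_var2)

#(2) Pass the stored value into the main pipe's filter()
df %>%
  filter(var1 < max_var2)
```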
```sql
SELECT w.id, p.age, w.coins_needed, w.power
FROM Wands AS w #Join wand data with wand properties
JOIN Wands_Property AS p
  ON w.code = p.code
WHERE w.coins_needed = (SELECT min(coins_needed) #Filter by sub-query
                        FROM Wands w2 #From wand data
                        INNER JOIN Wands_Property p2 #Inner join by wand property (join then filter missing)
                          ON w2.code = p2.code
                        #Conditions below pertain to the specific question
                        WHERE p2.is_evil = 0 AND p.age = p2.age AND w.power = w2.power)
ORDER BY w.power DESC, p.age DESC;
```
A join and two sub-queries
```sql
SELECT c.hacker_id, h.name, count(c.challenge_id) AS cnt
FROM Hackers AS h
JOIN Challenges AS c
  ON h.hacker_id = c.hacker_id
GROUP BY c.hacker_id, h.name
HAVING cnt = (SELECT count(c1.challenge_id)
              FROM Challenges AS c1
              GROUP BY c1.hacker_id
              ORDER BY count(*) DESC LIMIT 1)
   OR cnt NOT IN (SELECT count(c2.challenge_id)
                  FROM Challenges AS c2
                  GROUP BY c2.hacker_id
                  HAVING c2.hacker_id <> c.hacker_id) #not equal to (<>)
ORDER BY cnt DESC, c.hacker_id;
```