This tutorial is for you if you want to access or manipulate data in a database that may be on your machine or on a different computer on the internet, and you have found libraries that use a higher level of abstraction, such as dbplyr, are not suitable for your purpose. Depending on what you want to achieve, you may find it useful to have an understanding of SQL before using DBI.
The DBI (DataBase
Interface) package provides a simple, consistent
interface between R and database management systems (DBMS). Each
supported DBMS is supported by its own R package that implements the DBI
specification in vignette("spec", package = "DBI")
.
DBI currently supports about 30 DBMS, including:
For a more complete list of supported DBMS visit https://github.com/r-dbi/backends. You may need to install the package specific to your DBMS.
The functionality currently supported for each of these DBMS’s includes:
For more advanced features, such as parameterized queries,
transactions, and more see
vignette("DBI-advanced", package = "DBI")
.
The following code establishes a connection to the Sakila database
hosted by the Relational Dataset Repository at
https://relational-data.org/dataset/Sakila
, lists all
tables on the database, and closes the connection. The database
represents a fictional movie rental business and includes tables
describing films, actors, customers, stores, etc.:
library(DBI)
con <- dbConnect(
RMariaDB::MariaDB(),
host = "db.relational-data.org",
port = 3306,
username = "guest",
password = "relational",
dbname = "sakila"
)
dbListTables(con)
## [1] "actor" "address" "category" "city"
## [5] "country" "customer" "film" "film_actor"
## [9] "film_category" "film_text" "inventory" "language"
## [13] "payment" "rental" "staff" "store"
Connections to databases are created using the
dbConnect()
function. The first argument to the function is
the driver for the DBMS you are connecting to. In the example above we
are connecting to a MariaDB instance, so we use the
RMariaDB::MariaDB()
driver. The other arguments depend on
the authentication required by the DBMS. In the example host, port,
username, password, and dbname are required. See the documentation for
the DBMS driver package that you are using for specifics.
The function dbListTables()
takes a database connection
as its only argument and returns a character vector with all table and
view names in the database.
After completing a session with a DBMS, always release the connection
with a call to dbDisconnect()
.
The above example contains the password in the code, which should be avoided for databases with secured access. One way to use the credentials securely is to store it in your system’s credential store and then query it with the keyring package. The code to connect to the database could then look like this:
We can list the column names for a table with the function
dbListFields()
. It takes as arguments a database connection
and a table name and returns a character vector of the column names in
order.
con <- dbConnect(RMariaDB::MariaDB(), username = "guest", password = "relational", host = "db.relational-data.org", port = 3306, dbname = "sakila")
dbListFields(con, "film")
## [1] "film_id" "title" "description"
## [4] "release_year" "language_id" "original_language_id"
## [7] "rental_duration" "rental_rate" "length"
## [10] "replacement_cost" "rating" "special_features"
## [13] "last_update"
The function dbReadTable()
reads an entire table and
returns it as a data frame. It is equivalent to the SQL query
SELECT * FROM <name>
. The columns of the returned
data frame share the same names as the columns in the table. DBI and the
database backends do their best to coerce data to equivalent R data
types.
## film_id title
## 1 1 ACADEMY DINOSAUR
## 2 2 ACE GOLDFINGER
## 3 3 ADAPTATION HOLES
## description
## 1 A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
## 2 A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
## 3 A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory
## release_year language_id original_language_id rental_duration rental_rate
## 1 2006 1 NA 6 0.99
## 2 2006 1 NA 3 4.99
## 3 2006 1 NA 7 2.99
## length replacement_cost rating special_features
## 1 86 20.99 PG Deleted Scenes,Behind the Scenes
## 2 48 12.99 G Trailers,Deleted Scenes
## 3 50 18.99 NC-17 Trailers,Deleted Scenes
## last_update
## 1 2006-02-15 04:03:42
## 2 2006-02-15 04:03:42
## 3 2006-02-15 04:03:42
To read a subset of the data in a table into a data frame, DBI
provides functions to run custom SQL queries and manage the results. For
small datasets where you do not need to manage the number of results
being returned, the function dbGetQuery()
takes a SQL
SELECT
query to execute and returns a data frame. Below is
a basic query that specifies the columns we require
(film_id
, title
and description
)
and which rows (records) we are interested in. Here we retrieve films
released in the year 2006.
df <- dbGetQuery(con, "SELECT film_id, title, description FROM film WHERE release_year = 2006")
head(df, 3)
## film_id title
## 1 1 ACADEMY DINOSAUR
## 2 2 ACE GOLDFINGER
## 3 3 ADAPTATION HOLES
## description
## 1 A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
## 2 A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
## 3 A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory
We could also retrieve movies released in 2006 that are rated “G”.
Note that character strings must be quoted. As the query itself is
contained within double quotes, we use single quotes around the rating.
See dbQuoteLiteral()
for programmatically converting
arbitrary R values to SQL. This is covered in more detail in
vignette("DBI-advanced", package = "DBI")
.
df <- dbGetQuery(con, "SELECT film_id, title, description FROM film WHERE release_year = 2006 AND rating = 'G'")
head(df, 3)
## film_id title
## 1 2 ACE GOLDFINGER
## 2 4 AFFAIR PREJUDICE
## 3 5 AFRICAN EGG
## description
## 1 A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
## 2 A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank
## 3 A Fast-Paced Documentary of a Pastry Chef And a Dentist who must Pursue a Forensic Psychologist in The Gulf of Mexico
The equivalent operation using dplyr
reconstructs the
SQL query using three functions to specify the table
(tbl()
), the subset of the rows (filter()
),
and the columns we require (select()
). Note that dplyr
takes care of the quoting.
library(dplyr)
lazy_df <-
tbl(con, "film") %>%
filter(release_year == 2006 & rating == "G") %>%
select(film_id, title, description)
head(lazy_df, 3)
## # Source: SQL [3 x 3]
## # Database: mysql [[email protected]:3306/sakila]
## film_id title description
## <int> <chr> <chr>
## 1 2 ACE GOLDFINGER A Astounding Epistle of a Database Administrator And…
## 2 4 AFFAIR PREJUDICE A Fanciful Documentary of a Frisbee And a Lumberjack…
## 3 5 AFRICAN EGG A Fast-Paced Documentary of a Pastry Chef And a Dent…
If you want to perform other data manipulation queries such as
UPDATE
s and DELETE
s, see
dbSendStatement()
in
vignette("DBI-advanced", package = "DBI")
.
When finished accessing the DBMS, always close the connection using
dbDisconnect()
.
This tutorial has given you the basic techniques for accessing data
in any supported DBMS. If you need to work with databases that will not
fit in memory, or want to run more complex queries, including
parameterized queries, please see
vignette("DBI-advanced", package = "DBI")
.
vignette("spec", package = "DBI")