Introduction to DBI

Who this tutorial is for

This tutorial is for you if you want to access or manipulate data in a database that may be on your machine or on a different computer on the internet, and you have found libraries that use a higher level of abstraction, such as dbplyr, are not suitable for your purpose. Depending on what you want to achieve, you may find it useful to have an understanding of SQL before using DBI.

The DBI (DataBase Interface) package provides a simple, consistent interface between R and database management systems (DBMS). Each supported DBMS is supported by its own R package that implements the DBI specification in vignette("spec", package = "DBI").

DBI currently supports about 30 DBMS, including:

For a more complete list of supported DBMS visit https://github.com/r-dbi/backends. You may need to install the package specific to your DBMS.

The functionality currently supported for each of these DBMS’s includes:

  • manage a connection to a database
  • list the tables in a database
  • list the column names in a table
  • read a table into a data frame

For more advanced features, such as parameterized queries, transactions, and more see vignette("DBI-advanced", package = "DBI").

How to connect to a database using DBI

The following code establishes a connection to the Sakila database hosted by the Relational Dataset Repository at https://relational-data.org/dataset/Sakila, lists all tables on the database, and closes the connection. The database represents a fictional movie rental business and includes tables describing films, actors, customers, stores, etc.:

library(DBI)

con <- dbConnect(
  RMariaDB::MariaDB(),
  host = "db.relational-data.org",
  port = 3306,
  username = "guest",
  password = "relational",
  dbname = "sakila"
)

dbListTables(con)
##  [1] "actor"         "address"       "category"      "city"         
##  [5] "country"       "customer"      "film"          "film_actor"   
##  [9] "film_category" "film_text"     "inventory"     "language"     
## [13] "payment"       "rental"        "staff"         "store"
dbDisconnect(con)

Connections to databases are created using the dbConnect() function. The first argument to the function is the driver for the DBMS you are connecting to. In the example above we are connecting to a MariaDB instance, so we use the RMariaDB::MariaDB() driver. The other arguments depend on the authentication required by the DBMS. In the example host, port, username, password, and dbname are required. See the documentation for the DBMS driver package that you are using for specifics.

The function dbListTables() takes a database connection as its only argument and returns a character vector with all table and view names in the database.

After completing a session with a DBMS, always release the connection with a call to dbDisconnect().

Secure password storage

The above example contains the password in the code, which should be avoided for databases with secured access. One way to use the credentials securely is to store it in your system’s credential store and then query it with the keyring package. The code to connect to the database could then look like this:

con <- dbConnect(
  RMariaDB::MariaDB(),
  host = "db.relational-data.org",
  port = 3306,
  username = "guest",
  password = keyring::key_get("db.relational-data.org", "guest"),
  dbname = "sakila"
)

How to retrieve column names for a table

We can list the column names for a table with the function dbListFields(). It takes as arguments a database connection and a table name and returns a character vector of the column names in order.

con <- dbConnect(RMariaDB::MariaDB(), username = "guest", password = "relational", host = "db.relational-data.org", port = 3306, dbname = "sakila")
dbListFields(con, "film")
##  [1] "film_id"              "title"                "description"         
##  [4] "release_year"         "language_id"          "original_language_id"
##  [7] "rental_duration"      "rental_rate"          "length"              
## [10] "replacement_cost"     "rating"               "special_features"    
## [13] "last_update"

Read a table into a data frame

The function dbReadTable() reads an entire table and returns it as a data frame. It is equivalent to the SQL query SELECT * FROM <name>. The columns of the returned data frame share the same names as the columns in the table. DBI and the database backends do their best to coerce data to equivalent R data types.

df <- dbReadTable(con, "film")
head(df, 3)
##   film_id            title
## 1       1 ACADEMY DINOSAUR
## 2       2   ACE GOLDFINGER
## 3       3 ADAPTATION HOLES
##                                                                                            description
## 1     A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
## 2 A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
## 3     A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory
##   release_year language_id original_language_id rental_duration rental_rate
## 1         2006           1                   NA               6        0.99
## 2         2006           1                   NA               3        4.99
## 3         2006           1                   NA               7        2.99
##   length replacement_cost rating                 special_features
## 1     86            20.99     PG Deleted Scenes,Behind the Scenes
## 2     48            12.99      G          Trailers,Deleted Scenes
## 3     50            18.99  NC-17          Trailers,Deleted Scenes
##           last_update
## 1 2006-02-15 04:03:42
## 2 2006-02-15 04:03:42
## 3 2006-02-15 04:03:42

Read only selected rows and columns into a data frame

To read a subset of the data in a table into a data frame, DBI provides functions to run custom SQL queries and manage the results. For small datasets where you do not need to manage the number of results being returned, the function dbGetQuery() takes a SQL SELECT query to execute and returns a data frame. Below is a basic query that specifies the columns we require (film_id, title and description) and which rows (records) we are interested in. Here we retrieve films released in the year 2006.

df <- dbGetQuery(con, "SELECT film_id, title, description FROM film WHERE release_year = 2006")
head(df, 3)
##   film_id            title
## 1       1 ACADEMY DINOSAUR
## 2       2   ACE GOLDFINGER
## 3       3 ADAPTATION HOLES
##                                                                                            description
## 1     A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
## 2 A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
## 3     A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory

We could also retrieve movies released in 2006 that are rated “G”. Note that character strings must be quoted. As the query itself is contained within double quotes, we use single quotes around the rating. See dbQuoteLiteral() for programmatically converting arbitrary R values to SQL. This is covered in more detail in vignette("DBI-advanced", package = "DBI").

df <- dbGetQuery(con, "SELECT film_id, title, description FROM film WHERE release_year = 2006 AND rating = 'G'")
head(df, 3)
##   film_id            title
## 1       2   ACE GOLDFINGER
## 2       4 AFFAIR PREJUDICE
## 3       5      AFRICAN EGG
##                                                                                                             description
## 1                  A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
## 2                          A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank
## 3 A Fast-Paced Documentary of a Pastry Chef And a Dentist who must Pursue a Forensic Psychologist in The Gulf of Mexico

The equivalent operation using dplyr reconstructs the SQL query using three functions to specify the table (tbl()), the subset of the rows (filter()), and the columns we require (select()). Note that dplyr takes care of the quoting.

library(dplyr)

lazy_df <-
  tbl(con, "film") %>%
  filter(release_year == 2006 & rating == "G") %>%
  select(film_id, title, description)
head(lazy_df, 3)
## # Source:   SQL [3 x 3]
## # Database: mysql  [[email protected]:3306/sakila]
##   film_id title            description                                          
##     <int> <chr>            <chr>                                                
## 1       2 ACE GOLDFINGER   A Astounding Epistle of a Database Administrator And…
## 2       4 AFFAIR PREJUDICE A Fanciful Documentary of a Frisbee And a Lumberjack…
## 3       5 AFRICAN EGG      A Fast-Paced Documentary of a Pastry Chef And a Dent…

If you want to perform other data manipulation queries such as UPDATEs and DELETEs, see dbSendStatement() in vignette("DBI-advanced", package = "DBI").

How to end a DBMS session

When finished accessing the DBMS, always close the connection using dbDisconnect().

dbDisconnect(con)

Conclusion

This tutorial has given you the basic techniques for accessing data in any supported DBMS. If you need to work with databases that will not fit in memory, or want to run more complex queries, including parameterized queries, please see vignette("DBI-advanced", package = "DBI").

Further Reading