Computing

Wondering how to get your data into :R? Here are some basic functions that will often be sufficient for importing simple data structures, e.g. a single table in which rows correspond to sampling locations and columns correspond to variables and other site attributes. You’ll also find some simple data export functions. For a more thorough exposition of the basics of data input/output, see the R Data Import/Export Manual (or as PDF).

This wiki page also contains some useful functions from the RODBC package for interacting with data stored externally in an relational database system.

Basic data import into R dataframes

Often, you will find yourself starting out with data stored in some proprietary format. For example, your data may be contained in an Excel worksheet. It is usually easiest to clean up your datafile in it’s native application. For typical ecological datasets, this means making sure your data are in a simple tabular format, with each row corresponding to an observation or record, and each column corresponding to some variable or data descriptor; note that the first row can optionally contain column names, and the first column can contain row names. Then save the table to a file in raw text format, and use one of the R commands below (or a relative thereof) to import your data into a data frame object.

Import common-separated text: read.csv()

Read file named “mydata.csv”:

:::rsplus
 "siteID","siteName","date","elev.m","temp.C","hab.rank","sp.richness","comments"
 "101","valley1","12-Feb","105","22.4","A","14"
 "102","valley2","12-Feb","93","25.2","B","7"
 "103","stream1","09-Feb","200","23.1","A","9","only completed 2 samples"
 "105","trail3","11-Feb","255","29.5","C","5"
 > mydatatable `<- read.csv("mydata.csv")</code>`

####  Import tab-delimited text: read.delim()
Read file named "mydata.txt":
`<code rsplus>`
 siteID     siteName     date     elev.m     temp.C      hab.rank     sp.richness     comments
 101        valley1      12-Feb   105        22.4        A            14
 102        valley2      12-Feb   93         25.2        B            7
 103        stream1      09-Feb   200        23.1        A            9               only completed 2 samples
 105        trail3       11-Feb   255        29.5        C            5
 > mydatatable `<- read.delim("mydata.txt")</code>`


##  Basic data export
Exporting a dataframe to text is even easier than importing it! 

####  Export to common-separated text: write.csv()
Export to a file named "myouputdata.csv":
`<code rsplus>`
 > write.csv("myoutputdata.csv", mydataframe)`</code>`


##  Database interfacing
Several packages provide connectivity to external databases management systems (DMBS). This is helpful in cases where (a) you have a large amount of data stored in the database, but at each step of your analysis in R you only need to work with a small subset of the data, (b) your data structures are very complex and can be more efficiently manipulated within the DBMS, (c) the database is changing over time, and you want to be able to re-run your R procedures without having to manually export the data each time, and/or (d) you just don't feel like manually managing the exchange of data between R and your database. Note that use of these packages requires familiarity with SQL, the de facto database querying language, hence may require some extra investment up front if you haven't already encountered SQL.  Arguably the most versatile package is RODBC, which provide an R interface to (external) Open Database Connectivity drivers. RODBC can be used to retrieve, modify, and insert data into common database systems such as MySQL, PostgreSQL, Oracle. In Windows, RODBC also provides connections to Access databases and read-only access to Excel workbooks. The Rdbi and DBI packages (and their descendents) also provide methods for interacting with databases.


####  Using RODBC to interact with an external database
Once you've connected to your database, there are two approaches to manipulating database contents using R:

  -  Pass in SQL statements; manipulate data in the DBMS itself; retrieve desired result.
  -  Retrieve complete tables from DBMS; and manipulate as desired in R.  

Description of CRUD operations forthcoming. For setting up PostgreSQL connections in Linux, see below.


####  Using RODBC to connect to a PostgreSQL database in Ubuntu Dapper
RODBC is probably the most robust database connectivity package in terms of portability across both operating systems and database applications. On Windows machines the ODBC drivers should be fully installed and configured by default. However, on non-Windows platforms it takes some legwork to set up ODBC itself. Once you've done so, connecting from R is pretty easy.

1. First install the necessary ODBC drivers for *nix. Here we use unixODBC and odbc-postgresql. Either get these using Synaptic, or open a terminal window and submit the following:
`<code bash>`
 $ sudo apt-get install unixODBC odbc-postgresql`</code>`

2. Now manually edit the ODBC config files. Add the following lines to /etc/odbcinst.ini. (Note: You may want to confirm that the *.so files are in fact located in the directories indicated).
`<code bash>`
 [[PostgreSQL]]
 Description     = PostgreSQL driver for Linux & Windows
 Driver          = /usr/lib/postgresql/lib/psqlodbc.so
 Setup           = /usr/lib/odbc/libodbcpsqlS.so`</code>`

Note for Ubuntu 6.10 (Edgy) users: The Edgy repository has a newer version of the psqlODBC driver (08.01.0200). In this version, psqlodbc.so has been replaced by two variants: psqlodbcw.so (for Unicode support) and psqlodbca.so (for ANSI support). By default, these are located in /usr/lib/odbc/. In the config files, be sure to identify one of these as the driver. If you're not sure which to use, the Unicode driver should suffice (see developer comments [[http://archives.postgresql.org/pgsql-odbc/2006-02/msg00114.php|here]]).

Next add the following lines either to ~/.odbc.ini (as a user-level configuration) or /etc/odbc.ini (system-level). In the 'ODBC Data Sources' section you specify a data source name (dsn) for your database, and provide a quick label. You can include as many databases as you like, each with a unique dsn. Then for each dsn, there must be a section that gives at least a link to the PostgreSQL ODBC driver, the name of the database, the server ('localhost' if on your local machine), and the PostgreSQL version number. If your database requires authentication, the username and password can be specified here as well. A final section should give the installation directory for ODBC.
`<code bash>`
 [[ODBC|Data Sources]]
 mydb1 = My Awesome Database
 mydb2 = Some Other Database
 [[mydb1]]
 Driver = /usr/lib/postgresql/lib/psqlodbc.so
 Database = your_dbname
 Servername = localhost
 Username = your_username
 Password = your_password
 Protocol = 8.0.7
 ReadOnly = 0
 [[mydb2]]
 ...info for some other database here...
 [[ODBC]]
 InstallDir = /usr/lib`</code>`

In the text above, replace yourdb_name, your_username, and your_password with their actual values. Also, the driver should match a driver entry in the odbcinst.ini file discussed above. Lastly, the protocol is simply your version of PostgreSQL, which can be retrieved for example by using the psql client in a terminal window:
`<code postgresql>`
 $ psql -V
 psql (PostgreSQL) 8.0.7`</code>`

3. Add the appropriate user authentication line to your pg_hba.conf file; if you are simply working on your own machine, the example below should be fine. Depending on your version of PostgreSQL, the file should be located somewhere like: /etc/postgresql/8.0/main/pg_hba.conf. For version 8.0, read more about client authentication [[http://www.postgresql.org/docs/8.0/interactive/client-authentication.html|here]].
`<code postgresql>`
 local   your_db_name    your_username  trust`</code>`

4. If you haven't already, install RODBC in R (sudo R to run as root), then exit:
`<code rsplus>`
 $ sudo R
 > install.packages("RODBC")
 > q()`</code>`

5. Start up R again as a normal user, load the RODBC package, then connect to the your_dbname database (which we've given the DNS "mydb1"). Give command to fetch the some_table table, just as an example. Note "your_password" is whatever authentication password is associated with the PostgreSQL user "your_username"; this is a PostgreSQL setting, not an R setting. Also note the argument case="postgresql" addresses lower/uppercase compatibility issues. Finally, the believeNRows=FALSE argument seems to be critical for PostgreSQL connections -- without it, select/fetch statements don't return any rows of data!
`<code rsplus>`
 > library(RODBC)
 > chan <- odbcConnect("mydb1", case="postgresql", believeNRows=FALSE)
 > sqlTables(chan)  #List all tables in the DB
 > mydata <- sqlFetch(chan, "some_table") #Return a table as a dataframe
 > odbcClose(chan)`</code>`

NOTE: Instead of putting the your_username and your_password into the .odbc.ini file, we could have left it out, and then included the username and password directly in the connection call from R:
`<code rsplus>`
 > chan `<- odbcConnect("mydb1", "your_username", "your_password", case="postgresql")</code>`


####  Using Rdbi (with RdbiPgSQL) to connect to a PostgreSQL database in Ubuntu
Depending on your setup, this package might self-configure and run beautifully, or it might require considerable intervention to track down and configure paths, linker options, and the like. It has worked well for us on typical installations of Ubuntu 6.06 LTS (Dapper) and 7.10 (Gutsy). If you are using Ubuntu and have installed PostgreSQL from the binary .deb package (e.g. via Synaptic or Aptitude), give this a shot:

1. Install the Rdbi and RdbiPgSQL packages in R. Note that these packages are in the Bioconductor repository, not CRAN.
`<code rsplus>`
 $ sudo R
 > install.packages(c("Rdbi", "RdbiPgSQL"), repos="http://www.bioconductor.org")
 > q()`</code>`

2. Start R as a normal user, load RdbiPgSQL (this also loads Rdbi as a dependency), connect to database, then test the connection by listing tables in the database:
`<code rsplus>`
 > library(RdbiPgSQL)
 > conn <- dbConnect(PgSQL(), dbname="your_dbname")
 > dbListTables(conn) #List all tables in the DB
 > mydf <- dbReadTable(conn, "some_table") #Return a table as a dataframe
 > dbDisconnect(conn)`</code>`

Note: The RdbiPgSQL package provides a much faster connection than does RODBC, and is thus advantageous in that sense. In a recent test, a ~5000x100 table of primarily character data took ~4 seconds to load with RdbiPgSQL, but ~13 seconds with RODBC. However, RODBC is widely used, and the RdbiPgSQL interface is only implemented in Linux (as far as I know). Thus, if cross-platform compatibility of your R scripts is important, go with the RODBC approach (bearing in mind that it may require monkeying around with ODBC configurations!).


####  Using DBI to connect to a PostgreSQL database in Ubuntu Dapper
Actually, as of Sept 2006 there is no PostgreSQL connection available for DBI! However, this may change in the future?