RNASeek: 2013

I discovered RStudio last year, and I've been hard-pressed to find a better development environment for R. I have also used the Rvim plugin with Tmux, as very adequately explained here. Both have advantages: RStudio offers a really nice working environment for scripting, debugging, reading vignettes, and the like, while Rvim lets a coder make use of all the goodies in vim.

Recently, as part of an effort to code something for Bioconductor, I had to switch from R 2.14 to the latest development version of R. Getting the code wasn't too bad: I went to the CRAN website and, under the header in the picture below, downloaded the code to a folder under my Documents. But from there, opening RStudio produced an error saying "Unable to locate R binary by scanning standard locations". At the time, I still had Rvim, so I just kept working with Rvim and did without RStudio.

But as time went on, I really began to miss the ease of use of the RStudio workspace. So I looked up the relevant RStudio support page and found that the problem was two-fold:

I had forgotten to move the R-devel folder to "/usr/bin/R", which is where RStudio looks for R.
I had neglected to add the R-devel/bin directory (which holds the R binary) to my PATH variable.

Fixing those two errors now gives me a functional RStudio workspace with the latest R version.
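
For anyone hitting the same error, the lookup described above can be sketched in Python as a quick check of which R binary, if any, is visible. This is only an illustration and not RStudio's actual logic: the function name `find_r_binary` and the list of fallback locations are my own assumptions.

```python
import os
import shutil

# Assumed fallback locations; adjust them for your own system.
STANDARD_LOCATIONS = ("/usr/bin/R", "/usr/local/bin/R")


def find_r_binary(search_path=None, fallbacks=STANDARD_LOCATIONS):
    """Return the path of the first executable named R, or None.

    search_path is a PATH-style string; None means the current PATH.
    """
    # First look through the PATH, as a shell would.
    found = shutil.which("R", path=search_path)
    if found:
        return found
    # Then try the assumed standard locations one by one.
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_r_binary() or "no R binary found")
```

If this prints "no R binary found", the PATH and the install location are the first things I would look at.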

For the first post, I thought it best to describe the aspirations of this blog. The first goal is to share the technical tidbits needed to do RNA-Seq and exon microarray bioinformatic analyses. The second is to offer some simple outlines of probability and statistics as needed to understand statistical genomics. And last but not least, I'm new to bioinformatics and find that a blog may be a good way to document both what works and what does not. I've only been working on RNA-Seq analysis for 5 months and on exon microarrays for 1 month, so this blog will not contain high-level analyses, only the basics that a beginner might need in order to make some progress.

At the moment, I'm trying to produce graphics of RNA-Seq data in the UCSC Genome Browser. I happen to enjoy using IGV (the Integrative Genomics Viewer, produced by the Broad Institute), but the UCSC Genome Browser is often needed for the production of publication-level graphics.

UCSC's approach to uploading RNA-Seq data requires that a user supply a URL that links directly to the data. This URL will typically point to a local server used by an academic lab, for example. These URLs can link to files in a variety of formats (BED, BAM, bedGraph, bigBed, WIG, bigWig ... to name those that I'm most familiar with).
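
To make the idea of "a URL that links directly to the data" concrete, here is a small Python sketch that builds the one-line custom track definition pasted into the browser. It assumes the `bigDataUrl` setting that UCSC uses for indexed binary formats such as bigWig, bigBed, and BAM. The helper name `track_line` and the example.org address are hypothetical.

```python
def track_line(track_type, name, data_url):
    """Build a one-line custom track definition pointing at a remote file."""
    # The name is wrapped in double quotes, so it cannot contain one itself.
    if '"' in name:
        raise ValueError("track name must not contain double quotes")
    return f'track type={track_type} name="{name}" bigDataUrl={data_url}'


if __name__ == "__main__":
    # Hypothetical URL; replace it with the public address of your own file.
    print(track_line("bigWig", "RNA-Seq coverage",
                     "http://example.org/data/sample1.bw"))
```

The URL in that line is what UCSC has to be able to reach from the outside.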

But the difficulty emerges when the server that hosts your data and the associated URL sits behind a firewall. If UCSC cannot access the URL because of the firewall, your upload will fail.

To get around this issue, some have used external public data-sharing sites such as Dropbox.com or Box.com. Using a public folder makes the URL public and available to UCSC and the firewall does not get in the way. However, these sites typically will not allow for more than a certain amount of "back-and-forth" querying of data -- according to personal acquaintances, you will get logged out of the site.

A more permanent solution, so it seems to me, is to host a local mirror of the UCSC Genome Browser. Explanations for how to do this are available at the UCSC website. However, it should be said that this process is long, complex, and storage-intensive. If no one within your academic center has yet set up a local installation, it may be worth trying, but be sure to prepare enough disk space in advance. The mm9 mouse genome is, according to my sources, in excess of 2 terabytes of data. The whole installation can take up to 2 weeks (most of which is download time).
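
Before starting a download of that size, a quick check of free disk space can save some grief. The sketch below uses Python's standard `shutil.disk_usage`. The helper name `has_free_space` is my own, and the 2-terabyte threshold is just the second-hand figure quoted above, not an official requirement.

```python
import shutil

TERABYTE = 10**12  # decimal terabyte, in bytes


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding path has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # Assumed requirement, based on the rough figure mentioned in the post.
    needed = 2 * TERABYTE
    if has_free_space(".", needed):
        print("Enough room for the download.")
    else:
        print("Not enough free space; free up or add disk first.")
```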

I'm fortunate in that a lab in a neighboring building has successfully installed a local mirror site and can create an account for new users who would like to upload data for graphics-related purposes.

Another approach, taken from a source I highly recommend, is to launch an Amazon EC2 instance and use an Apache server connection to upload data from my computer to the EC2 instance. Amazon has a 1-year Free Tier contract, for which I am presently signed up, and which is, as you will have guessed, free. As for the nitty-gritty of getting it to work, I have no reason to suspect that this approach would not function, but I'm a newbie and there are some proxy permissions I still can't seem to manage (using the wrong ports seems the likely cause).

Future posts will most certainly deal with other aspects pertaining to the visualization of RNA-Seq in UCSC's genome browser.

Saturday, March 23, 2013

RStudio with R-devel

Thursday, March 21, 2013

Quaerens