Learning to use HPC clusters, R Markdown & GitHub pages

I had just spent the last 3 weeks or so trying to figure out how to use R on the university’s HPC clusters, and thought it would be helpful to produce a guide to help other beginners as well. I wrote it using R Markdown, since I thought I might as well learn how to write in R Markdown better at the same time.

To turn it from a R Markdown (rmd) file to something others can see, I had several options, as pdf or as html. I could convert it to a pdf, and either share it with the lab or put it up somewhere online. The conversion to pdf produced problems with image placement (the images would be placed somewhere else from where I wanted them to be, so that things would fit nicely on the pages). I didn’t want to learn the hack (incorporating some LaTeX formatting etc.) so I thought I could just leave it as a local html which opens on my browser and then use ‘Save as PDF’ to turn it to a pdf. That removed some formatting and didn’t look very nice either.

I found out I could use the RWordpress package to publish the html file rendered to my WordPress site, which I tried (and actually published briefly for about 20 hours before I took it down from this site). Apart from broken image links which I had to fix manually by uploading my images to the media gallery here, it worked quite alright. The formatting wasn’t as nice perhaps, especially the bits with the code, but it worked.

Then I thought about it a bit more and decided perhaps I should learn how to publish it on my GitHub, because then it’ll be easier to make changes to the code and have the website all synced to changes. So I did! I’m quite pleased to have learned this now, it’s been quite fun writing in markdown language and mastering git a bit more. Here’s the guide if you’re interested. 

Using R to search column in data frame for list of values

I made the jump from using Excel to R for data manipulation when I started on my Master’s project in 2015. I was pretty much forced to – I was looking at datasets of deforestation in the world, with 806,400 pixels. That was already aggregated by a factor of 1000 using Python (i.e. I had 805.4 billion pixels to start with). And I had about 14 such files. For me to load the file in Excel (apart from the fact that it was in raster GeoTIFF format, though it can be converted to a csv file) would already take a while, let alone trying to manipulate it (e.g. add/subtract/find mean etc.). Data manipulation can also be done in Python of course, with the pandas library, but I’m still trying to get my head around Python – maybe once I’ve somewhat mastered data processing in Python I’ll update this post. Here though, I just want to introduce using R for simple data processing, instead of using Excel.

There are many reasons to use R:

  1. It would probably be very useful for your career, if you’re after the kind of jobs that I am. (bit of a circular argument, but really, being proficient in R is very useful especially when many jobs these days require some level of data processing and manipulation). I.e. Looks good on your CV
  2. Coding sounds difficult but once you get a grasp of the syntax, it can and will make life easier (especially when it comes to large amounts of data). I.e. It’s pretty cool in a geeky way, in my opinion, and will save you time

I’m obviously no R expert, nor am I here to create tutorials for R beginners, I just wanted to write about bits of R programming that makes me go ‘that’s awesome’, or #:D 😀 😀 in my R script. As well as maybe help people trying to make the transition to using R as their main programme of choice for data analysis.

I wanted to find out if a list of countries are in this dataset that I have. What one could do in Excel would be to Ctrl+F the dataset for each of the countries on the list (or maybe there’s a more efficient way about it, I’ve stopped using Excel a while ago to learn the new techniques, but please comment if you know of something!). I only have 10 (the member states of the Association of South East Asian Nations), so that wouldn’t be too painful. But imagine you have a list of 100 countries – while the digital age and Microsoft Excel have made life easier for us, saving us from having to pore over pieces of paper – that quickly becomes tedious work.

[code language=”r”]
# I’m not attempting to teach basics of R here, if you are unfamiliar with how to load files, please do a Google search. Here, I am using a GIS shapefile, though the same can be applied for a normal csv file
# Obtain a shapefile of world boundaries
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS-0.3.zip", destfile="TM_WORLD_BORDERS-0.3.zip")
unzip("TM_WORLD_BORDERS-0.3.zip", exdir=".")

# I am using rgdal library to read in this shapefile

# read in the GIS file
world_border <- readOGR(dsn=".", "TM_WORLD_BORDERS-03")
# you can look at the data contents (i.e. attribute table)
world_border@data #prints everything, I just want to look at the first few entries

# at this point, I can ‘manually’ search the GIS files for the countries I want
grep(‘Singapore’, worldbound@data$CNTRY_NAME)
# and do the same for all the countries I’m interested in

# alternatively, I can create a list of the SEA countries
SEA <- c(‘Vietnam’, ‘Laos’, ‘Cambodia’, ‘Thailand’, ‘Myanmar’, ‘Malaysia’,
‘Indonesia’, ‘Singapore’, ‘Philippines’, ‘Brunei’)
# and create a loop to search the world boundaries file for these countries, and only report an error if the country can’t be found
for (i in 1:length(SEA)){
if (length(grep(SEA[i], worldbound@data$CNTRY_NAME))==0)
stop(paste(‘unable to find’, SEA[i], sep=" "))
# I had initially included ‘East Timor’ in the list of SEA countries, and my loop returned ‘unable to find East Timor’, so I know it works
# this would also work with a csv file

I did this search not so much to check if these countries exist on the world map, but more to check on the spelling of the countries. There are some databases which spell ‘Viet Nam’ instead of ‘Vietnam’, for example, and if I don’t check how it’s spelt, I might run into errors later on. I don’t presume to be an expert in R, and this code might be considered clunky by more proficient R coders, but this is the best I could come up with!

Apart from using Excel to Ctrl+F the countries I want, I could also have used QGIS to load the file and click on each country for information, or use a polygon to highlight all the countries in the region and obtain the information at once. Or used Python, or any other programme that allows data management. I just like R, and though I could easily and perhaps more quickly have done it in QGIS (with its GUI, without having to Google for the right R code to do what I want, because I’m still not all that familiar with R), I enjoy these little challenges, and it’ll eventually make me more proficient.