Using R to search column in data frame for list of values

I made the jump from using Excel to R for data manipulation when I started on my Master’s project in 2015. I was pretty much forced to – I was looking at datasets of deforestation in the world, with 806,400 pixels. That was already aggregated by a factor of 1000 using Python (i.e. I had 805.4 billion pixels to start with). And I had about 14 such files. For me to load the file in Excel (apart from the fact that it was in raster GeoTIFF format, though it can be converted to a csv file) would already take a while, let alone trying to manipulate it (e.g. add/subtract/find mean etc.). Data manipulation can also be done in Python of course, with the pandas library, but I’m still trying to get my head around Python – maybe once I’ve somewhat mastered data processing in Python I’ll update this post. Here though, I just want to introduce using R for simple data processing, instead of using Excel.

There are many reasons to use R:

  1. It would probably be very useful for your career, if you’re after the kind of jobs that I am. (bit of a circular argument, but really, being proficient in R is very useful especially when many jobs these days require some level of data processing and manipulation). I.e. Looks good on your CV
  2. Coding sounds difficult but once you get a grasp of the syntax, it can and will make life easier (especially when it comes to large amounts of data). I.e. It’s pretty cool in a geeky way, in my opinion, and will save you time

I’m obviously no R expert, nor am I here to create tutorials for R beginners, I just wanted to write about bits of R programming that makes me go ‘that’s awesome’, or #:D 😀 😀 in my R script. As well as maybe help people trying to make the transition to using R as their main programme of choice for data analysis.


I wanted to find out if a list of countries are in this dataset that I have. What one could do in Excel would be to Ctrl+F the dataset for each of the countries on the list (or maybe there’s a more efficient way about it, I’ve stopped using Excel a while ago to learn the new techniques, but please comment if you know of something!). I only have 10 (the member states of the Association of South East Asian Nations), so that wouldn’t be too painful. But imagine you have a list of 100 countries – while the digital age and Microsoft Excel have made life easier for us, saving us from having to pore over pieces of paper – that quickly becomes tedious work.

[code language=”r”]
# I’m not attempting to teach basics of R here, if you are unfamiliar with how to load files, please do a Google search. Here, I am using a GIS shapefile, though the same can be applied for a normal csv file
# Obtain a shapefile of world boundaries
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS-0.3.zip", destfile="TM_WORLD_BORDERS-0.3.zip")
unzip("TM_WORLD_BORDERS-0.3.zip", exdir=".")

# I am using rgdal library to read in this shapefile
library("rgdal")

# read in the GIS file
world_border <- readOGR(dsn=".", "TM_WORLD_BORDERS-03")
# you can look at the data contents (i.e. attribute table)
world_border@data #prints everything, I just want to look at the first few entries
head(world_border@data)

# at this point, I can ‘manually’ search the GIS files for the countries I want
grep(‘Singapore’, worldbound@data$CNTRY_NAME)
# and do the same for all the countries I’m interested in

# alternatively, I can create a list of the SEA countries
SEA <- c(‘Vietnam’, ‘Laos’, ‘Cambodia’, ‘Thailand’, ‘Myanmar’, ‘Malaysia’,
‘Indonesia’, ‘Singapore’, ‘Philippines’, ‘Brunei’)
# and create a loop to search the world boundaries file for these countries, and only report an error if the country can’t be found
for (i in 1:length(SEA)){
if (length(grep(SEA[i], worldbound@data$CNTRY_NAME))==0)
stop(paste(‘unable to find’, SEA[i], sep=" "))
}
# I had initially included ‘East Timor’ in the list of SEA countries, and my loop returned ‘unable to find East Timor’, so I know it works
# this would also work with a csv file
[/code]

I did this search not so much to check if these countries exist on the world map, but more to check on the spelling of the countries. There are some databases which spell ‘Viet Nam’ instead of ‘Vietnam’, for example, and if I don’t check how it’s spelt, I might run into errors later on. I don’t presume to be an expert in R, and this code might be considered clunky by more proficient R coders, but this is the best I could come up with!

Apart from using Excel to Ctrl+F the countries I want, I could also have used QGIS to load the file and click on each country for information, or use a polygon to highlight all the countries in the region and obtain the information at once. Or used Python, or any other programme that allows data management. I just like R, and though I could easily and perhaps more quickly have done it in QGIS (with its GUI, without having to Google for the right R code to do what I want, because I’m still not all that familiar with R), I enjoy these little challenges, and it’ll eventually make me more proficient.

7 thoughts on “Using R to search column in data frame for list of values”

  1. Based on

    > I did this search not so much to check if these countries exist on the world map, but more to check on the spelling of the countries. There are some databases which spell ‘Viet Nam’ instead of ‘Vietnam’, for example, and if I don’t check how it’s spelt, I might run into errors later on.

    You might find “countrynames = unique(world_border@data$NAME)” handy. You can then run “c(‘Vietnam’, ‘Laos’, ‘Cambodia’, ‘Thailand’, ‘Myanmar’, ‘Malaysia’, ‘Indonesia’, ‘Singapore’, ‘Philippines’, ‘Brunei’) %in% countrynames”.

    1. Or if you’re interested in a pythonic approach, I enjoy working with https://github.com/geopandas/geopandas.

      It’ll look like

      “In [1]: import geopandas as gpd

      In [2]: countrynames = gpd.read_file(“TM_WORLD_BORDERS-0.3.shp”).NAME.unique()

      In [3]: [sea for sea in (‘Vietnam’, ‘Laos’, ‘Cambodia’, ‘Thailand’, ‘Myanmar’, ‘Malaysia’,
      ‘Indonesia’, ‘Singapore’, ‘Philippines’, ‘Brunei’) if sea in countrynames]
      Out[3]: [‘Cambodia’, ‘Thailand’, ‘Malaysia’, ‘Indonesia’, ‘Singapore’, ‘Philippines’]”

      For more, I recommend the series of blogposts on modern pandas here: https://tomaugspurger.github.io/modern-1.html.

      1. You’re welcome! I didn’t realize Prototaxites had also given essentially the same comment (:

        It’s really cool to have fellow odacians who’re interested in programming/GIS too! If you’d like to learn more about processing data in R, there’s a good series of notes on it at http://r4ds.had.co.nz/. I also follow the blogposts over at http://r-spatial.org/.

        I recommend joining http://thespatialcommunity.org/ too — it has a diverse mix of developers/professionals from ESRI/Mapbox/Mapzen/Boundless/CARTO/etc and is welcoming to people of all backgrounds.

Leave a Reply

Your e-mail address will not be published. Required fields are marked *