STAT 29000: Project 14 — Fall 2020
Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
Context: We’ve just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
Scope: awk, UNIX utilities, bash scripts, fread
Dataset
The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar:
/anvil/projects/tdm/data/forest/
To read more about ENTIRE_PLOTSNAP.csv that you will be working with:
Questions
Question 1
Take a look at at ENTIRE_PLOTSNAP.csv. Write a line of awk code that displays the STATECD followed by the number of rows with that STATECD.
-
Code used to solve the problem.
-
Count of the following `STATECD`s: 1, 2, 4, 5, 6
Question 2
Unfortunately, there isn’t a very accessible list available that shows which state each STATECD represents. This is no problem for us though, the dataset has LAT and LON! Write some bash that prints just the STATECD, LAT, and LON.
|
There are 92 columns in our dataset: |
-
Code used to solve the problem.
-
The output of your code piped to
head.
Question 3
fread is a "Fast and Friendly File Finagler". It is part of the very popular data.table package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd argument in conjunction with your bash code from (2) to read the data of STATECD, LAT, and LON into a data.table in your R environment.
-
Code used to solve the problem.
-
The
headof the resultingdata.table.
Question 4
We are going to further understand the data from question (3) by finding the actual locations based on the LAT and LON columns. We can use the library revgeo to get a location given a pair of longitude and latitude values. revgeo uses a free API hosted by photon in order to do so.
For example:
library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
The code above will give you the address information in six columns, from the most-granular housenumber to the least-granular country. Depending on the coordinates, revgeo may or may not give you results for each column. For this question, we are going to keep only the state column.
There are over 4 million rows in our dataset — we do not want to hit photon’s API that many times. Instead, we are going to do the following:
-
Unless you feel comfortable using
data.table, convert yourdata.tableto adata.frame:
my_dataframe <- data.frame(my_datatable)
-
Calculate the average
LATandLONfor eachSTATECD, and call the newdata.frame,dat. This should result in 57 rows of lat/long pairs. -
For each row in
dat, run a reverse geocode and append thestateto a new column calledSTATE.
|
To calculate the average |
|
|
|
Here is some extra help:
|
|
It is okay to get "Not Found" for some of the addresses. |
-
Code used to solve the problem.
-
The
headof the resultingdata.frame.