STAT 19000: Project 5 — Spring 2022
Motivation: We will pause in our series of pandas and numpy projects to learn one of the most important parts of writing programs — functions! Functions allow us to reuse snippets of code effectively. Functions are a great way to reduce the repetition of code and also keep the code organized and readable.
Context: We are focusing on learning about writing functions in Python.
Scope: python, functions, pandas, matplotlib
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/whin/190/stations.csv -
/depot/datamine/data/whin/190/observations.csv
Questions
|
We are very lucky to have great partners in the Wabash Heartland Innovation Network (WHIN)! They generously provide us with access to their API (here) for educational purposes. You’ve most likely either used their API in a previous project, or you’ve worked with a sample of their data to solve some sort of data-driven problem. In this project, we will be using a slightly modified sample of their dataset to learn more about how to write functions. |
Question 1
First, read both datasets into variables named stations and obs.
Secondly, take a look at the head of both dataframes. You will notice, the station_id in the obs dataframe appears to correlate with the id column in the stations dataframe. This is a fairly common occurence when data has been normalized for a database.
For our current project we will work with a single dataset.
pandas has a merge method that can be used to join two dataframes based on a common column. Here the id column from the stations dataframe matches the station_id column in the obs dataframe. Here is the explanation on merge.
|
Use |
Once merged, you will notice in the new dataframe, dat. That the id column from the obs dataframe is now labeled id_x, and the id column from the stations dataframe is now labeled id_y.
Use the pandas drop method to remove the id_y column.
Use the pandas rename method to rename id_x to id, and name to station_name.
Great! We have cleaned up our dataframe so it is easier to work with, while learning a variety of useful pandas methods.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
When looking at the new dataset, you may have noticed a mix of letters and numbers in the id column. Below are a few samples of the contents of that column.
obs_1NnyYGMtAHBFDYWOBlsDlqppzVI obs_1No0NHuqV4VjOK8p8FguPT02T5B obs_1NqnftCklLZHBCHyykvcuc8QvE9 obs_1NqpV058q10hGNBNvYOBzzwpqOx obs_1NqrK3mraUzaj2j7hg6VcB23RjJ
The use of numbers and letters in this column are a variation on ksuid — a K-sortable globally unique id. The reason they are beneficial is that Ksuids are sortable by time and unique identifers (there is a minimal chance that any two id’s would be the same). If you are interested you can read more here.
Next, write a function called get_datetime that accepts a ksuid (as a string) and returns the datetime.
|
You can use the
Don’t forget to remove the "obs_" from the beginning of the ksuid. |
The following code should result in the following output.
for k in ksuids:
print(get_datetime(k))
2019-07-10 04:00:00 2019-07-10 04:15:00 2019-07-11 04:00:00 2019-07-11 04:15:00 2019-07-11 04:30:00
To verify that the ordering claim is true, (for example,the sorting of ksuids resulted in obervations are in chronological order).
We must first, use the sample method to get 10 random id values from the dat dataframe.
Secondly sort the values, then loop through the sorted list of values, and use your get_datetime function to print the datetime.
Can you confirm that sorting the ksuids automatically sorts the observations by datetime?
-
Code used to solve this problem.
-
Output from running the code.
Question 3
In this dataset we are given latitude and longitude values in degrees. We want to convert the degrees to radians. Write a function called degrees_to_radians that accepts a latitude or longitude value in degrees, and returns the same value in radians.
The formula to do this is.
$degrees*arctan2(0, -1)/180$
|
|
|
Make sure to convert your result from a |
To test out your function you can use:
degrees_to_radians(88.0)
1.53588974175501
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Write a function that accepts two pandas Series containing a latitude and longitude value. Also needs to be able to return the distance between two points in Kilometers. Call this function get_distance.
You can do this by using the Haversine formula.
$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$
Where:
-
$r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers
-
$\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
-
$\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points
|
In the formula above, the latitude and longitudes need to be converted from degrees to radians. Your function from the Question 3 will be perfect for this! You can even put your It is common practice in the Python world to add an underscore as a prefix to helper functions. It is a sign that this function is just for "internal" use and should largly be ignored by the user. Follow this practice and prefix your |
|
|
Test your function on the 2 rows with the following id values.
obs_1amnn4xst3O9VOawmUHFiqBVnCK obs_1fwlznMZXXS8WBkmyTHRgWnHYYf
37.896692299010574
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Great! Make sure to note these solutions for future use…
Next, write a function called plot_stations. plot_stations should accept a dataset as an argument and produce a plot with the station locations plotted on a map.
For consistancy we will use plotly to produce the plot. This stackoverflow post will show some samples. For further understanding here is the explanation for the function.
We want to be careful we don’t plot the same point over and over. To avoid that we want to make sure we reduce the dataset (inside the function), this will plot each pair of latitude and longitude values only once.
Set hover_name to "station_id" so that hovering over a point will displays the station id.
Set scope to "usa" to reduce the map to the USA. Be sure to zoom in on the map so you can see the the stations within Indiana!
-
Code used to solve this problem.
-
Output from running the code.
|
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |