TDM 20200: Project 11 — 2023
Motivation: Data wrangling is the process of gathering, cleaning, structuring, and transforming data. It is a big part of any data-driven project and can sometimes take a great deal of time. The tidyverse is a great, but opinionated, suite of integrated packages for wrangling, tidying, and visualizing data. It is useful to gain some familiarity with this collection of packages in case you run into a situation where they are needed. You may even find that you enjoy using them!
Context: We have covered a few topics on the tidyverse packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks. This is the second in a series of 5 projects focused around using tidyverse packages to solve data-driven problems.
Scope: R, tidyverse, ggplot
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/beer/beers.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
Questions
Question 1
Let’s pick up where we left off in the previous project. Copy and paste your commands from questions 1 to 3 that result in our beers_reviews dataset.
Using pipes (remember, the %>% operator), combine the necessary parts of questions 2 and 3, removing the need for an intermediate reviews_summary dataset. This is a great way to practice and get a better understanding of the tidyverse.
Your code should read the datasets, summarize the reviews data similarly to what you did in question 2, and combine the summarized dataset with the beers dataset. This should all be accomplished from a single chunk of "piped-together" code.
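As a rough illustration of the "summarize, then join" pattern described above, here is a minimal sketch on toy data. The columns `beer_id`, `name`, and `score`, the summary columns, and the choice of `inner_join` are all assumptions made for the example; they are not taken from the real files or from the previous project, so adapt them to your own question 2 and 3 code.

```r
library(dplyr)

# Toy stand-ins for beers.csv and reviews_sample.csv.
# `beer_id`, `name`, and `score` are hypothetical column names.
beers <- data.frame(
  beer_id = c(1, 2, 3),
  name    = c("Amber", "Stout", "Pils")
)
reviews <- data.frame(
  beer_id = c(1, 1, 2, 2, 2),
  score   = c(4.0, 3.0, 4.5, 3.5, 4.0)
)

# Summarize the reviews and join the result onto beers in one pipeline,
# with no intermediate reviews_summary object.
beers_reviews <- reviews %>%
  group_by(beer_id) %>%
  summarize(mean_score = mean(score), n_reviews = n()) %>%
  inner_join(beers, by = "beer_id")

print(beers_reviews)
```

With the real data, the two `data.frame()` calls would be replaced by reads of the CSV files listed under Dataset(s).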
Feel free to remove the
If you want to update how you calculated your
- Code used to solve this problem.
- Output from running the code.
Question 2
Are there any differences in terms of abv between beers that are available in specific seasons?
ABV refers to the alcohol by volume of a beer. The higher the ABV, the more alcohol is in the beer.
- Filter the beers_reviews dataset to contain beers available only in a specific season (Fall, Winter, Spring, Summer).
- Make a side-by-side boxplot comparing abv for each season availability.
- Make sure to use the labs function to give the x-axis and y-axis clear labels.
Use pipelines, resulting in a single chunk of "piped-together" code.
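The steps above (filter, boxplot, axis labels) can be sketched in one pipeline as follows. This is only an illustration on toy data: the non-seasonal values "Year-round" and "Rotating" are made up, and the exact spelling of the availability values in the real dataset is an assumption you should check before filtering.

```r
library(dplyr)
library(ggplot2)

# Toy data; the real beers_reviews comes from Question 1.
# "Year-round" and "Rotating" are hypothetical non-seasonal values.
beers_reviews <- data.frame(
  availability = c("Fall", "Fall", "Winter", "Winter", "Spring",
                   "Summer", "Summer", "Year-round", "Rotating"),
  abv = c(6.1, 5.8, 8.2, 7.5, 5.0, 4.6, 4.9, 5.5, 6.0)
)

seasons <- c("Fall", "Winter", "Spring", "Summer")

# Keep only the seasonal beers, then draw one box per season.
p <- beers_reviews %>%
  filter(availability %in% seasons) %>%
  ggplot(aes(x = availability, y = abv)) +
  geom_boxplot() +
  labs(x = "Seasonal availability", y = "ABV")

print(p)
```

Note that the data steps are chained with %>%, while ggplot layers are added with +.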
Write 1-2 sentences comparing the beers in terms of abv between the specific seasons. Are the results surprising or did you expect them?
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences comparing the beers in terms of abv between the specific seasons. Are the results surprising or did you expect them?
Question 3
Modify your code from question 2 to:
- Create a new variable is_good that is 1 or TRUE if beer_goodness_indicator is greater than 3.5, and 0 or FALSE otherwise.
- Facet your boxplot based on is_good. The resulting graphic should make it easy to compare the "good" vs "bad" beers for each season. facet_grid and facet_wrap are two other functions that can be a bit confusing at first. That said, they are incredibly powerful and make creating really impressive graphics very straightforward.
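A minimal sketch of the two modifications, again on made-up numbers, might look like this. It uses `facet_wrap` with `label_both` so each panel is titled with the variable name and value; `facet_grid` would work as well, and the choice here is just for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data with the columns named in this question.
beers_reviews <- data.frame(
  availability = rep(c("Fall", "Winter", "Spring", "Summer"), each = 4),
  abv = c(6.1, 5.8, 6.5, 5.2, 8.2, 7.5, 9.0, 6.8,
          5.0, 5.4, 4.8, 6.0, 4.6, 4.9, 5.1, 4.2),
  beer_goodness_indicator = c(3.9, 3.2, 4.1, 3.4, 4.3, 3.6, 4.0, 3.1,
                              3.7, 3.3, 3.0, 3.8, 3.5, 3.6, 2.9, 4.0)
)

# is_good is TRUE only when the indicator is strictly greater than 3.5.
p <- beers_reviews %>%
  mutate(is_good = beer_goodness_indicator > 3.5) %>%
  ggplot(aes(x = availability, y = abv)) +
  geom_boxplot() +
  facet_wrap(~ is_good, labeller = label_both) +
  labs(x = "Seasonal availability", y = "ABV")

print(p)
```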
Make sure to use piping
How do beers differ in terms of ABV and being considered good or not (based on our definition) for the different seasons? Write 1-2 sentences describing what you see based on the plots.
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences answering the question.
Question 4
Modify your code from question 3 to answer the question based on summary statistics instead of graphical displays.
Make sure you compare the ABV per season availability and is_good using mean, median and sd. Your final dataframe should have 8 rows and the following columns: is_good, availability, mean_abv, median_abv, std_abv.
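One way to get a table of this shape is `group_by` followed by `summarize`. The sketch below uses toy data with only two seasons, so it yields 4 rows rather than the 8 expected from the real data; the numbers are invented for the example.

```r
library(dplyr)

# Toy data: two seasons only, so the result has 2 x 2 = 4 rows.
beers_reviews <- data.frame(
  availability = rep(c("Fall", "Winter"), each = 4),
  abv = c(6, 5, 7, 6, 8, 9, 7, 10),
  beer_goodness_indicator = c(4.0, 3.0, 3.8, 3.2, 4.2, 3.9, 3.1, 3.4)
)

# One row per (is_good, availability) combination.
abv_summary <- beers_reviews %>%
  mutate(is_good = beer_goodness_indicator > 3.5) %>%
  group_by(is_good, availability) %>%
  summarize(
    mean_abv   = mean(abv),
    median_abv = median(abv),
    std_abv    = sd(abv),
    .groups    = "drop"
  )

print(abv_summary)
```

Setting `.groups = "drop"` returns an ungrouped result, which avoids surprises if you pipe the summary into further steps.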
- Code used to solve this problem.
- Output from running the code.
Question 5
In this question, we want to compare US states in terms of ABV and beer_goodness_indicator.
Feel free to use whichever data-driven method you desire to answer this question! You can take summary statistics, make a variety of plots, and even filter to compare specific US states — you can even create new columns combining states (based on region, political affiliation, etc).
Write a question related to US states, ABV and our "beer_goodness_indicator". Use your data-driven method(s) to answer it (if only anecdotally).
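If you decide to combine states into groups, `case_when` inside `mutate` is one option. The sketch below is only an example of the mechanics: the column name `state`, the two-letter codes, and the region assignments are all assumptions made for illustration, not a description of the real dataset.

```r
library(dplyr)

# Toy data; `state` is a hypothetical column name.
beers_reviews <- data.frame(
  state = c("IN", "IL", "CA", "OR", "IN", "CA"),
  abv = c(5.5, 6.0, 7.2, 6.8, 4.9, 8.0),
  beer_goodness_indicator = c(3.6, 3.9, 4.1, 4.0, 3.2, 3.8)
)

# Map states to illustrative regions, then summarize per region.
region_summary <- beers_reviews %>%
  mutate(region = case_when(
    state %in% c("IN", "IL") ~ "Midwest",
    state %in% c("CA", "OR") ~ "West",
    TRUE ~ "Other"
  )) %>%
  group_by(region) %>%
  summarize(
    mean_abv      = mean(abv),
    mean_goodness = mean(beer_goodness_indicator),
    n_beers       = n()
  )

print(region_summary)
```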
- Code used to solve this problem.
- Output from running the code.
- Write 1-2 sentences explaining your question and data-driven method(s).
- Write 1-2 sentences answering your question.
Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.