Think Summer: Introduction — 2024
Submission
Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.
-
A Jupyter notebook (a
.ipynbfile).
We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.
|
When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click . Next, to export your notebook (your |
Project
Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.
Context: This project is specially designed for Purdue University’s Think Summer program, and is coordinated by The Data Mine.
Scope: SQL, SQL in R
Dataset
The following questions will use the imdb database found in Anvil, our computing cluster.
This database has 6 tables, namely:
akas, crew, episodes, people, ratings, and titles.
You have a variety of options to connect with, and run queries on our database:
-
Run SQL queries directly within a Jupyter Lab cell.
First, you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.
-
Open a browser and navigate to ondemand.anvil.rcac.purdue.edu, and login using your ACCESS credentials.
-
Click on "My Interactive Sessions".
-
Click on Jupyter Notebook in the left-hand menu under "The Data Mine" section (near the bottom of the screen). Select the following settings:
-
Allocation: cis220051
-
Queue: shared
-
Time in Hours: 3
-
Cores: 1
-
Use Jupyter Lab instead of Jupyter Notebook: Checked
-
-
When satisfied, click Launch, and wait for a minute. In a few moments, you should get a note indicating that your session is ready to run.
-
When you are ready, click Connect to Jupyter. A new browser tab will launch.
-
Under the "Notebook" menu, please select the seminar (look for the big "S"; we do not want seminar-r).
You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Anvil cluster. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations!
To run queries directly in a Jupyter Lab cell, please do the following.
-
In the first cell, run the following code. This code establishes a connection to the
imdb.dbdatabase, which allows you to directly run SQL queries in a cell as long as that cell has%%sqlat the top of the cell.%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db -
After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with
%%sqlin the first line. For example.%%sql SELECT * FROM titles LIMIT 5;