TDM 20100: Project 2 — 2022
Motivation: The ability to navigate a shell, like bash, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook.
Context: At this point in time, our Jupyter Lab system, using ondemand.anvil.rcac.purdue.edu, is new to some of you, and maybe familiar to others. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab.
Scope: bash, Jupyter Lab
Questions
If you are not a
While it is not super common for us to push a lot of external reading at you (other than the occasional blog post or article), this is an excellent and very short resource to get you started using a UNIX-like system. We strongly recommend reading chapters 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8.
Question 1
Let’s ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from ondemand.anvil.rcac.purdue.edu.
Open your settings by navigating to .
Explore the settings, and make at least 2 modifications to your environment, and list what you’ve changed.
Here are some settings Kevin likes:
Dr. Ward does not like to customize his own environment, but he does use the Emacs key bindings.
Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc.
- A list (in a markdown cell) of the modifications you made to your environment.
Question 2
In the previous project, we used a tool called awk to parse through a dataset. This was an example of running bash code using the f2022-s2023 kernel. Aside from using the %%bash magic from the previous project, there are 2 more straightforward ways to run bash code from within Jupyter Lab.
The first method allows you to run a bash command from within the same cell as Python code. For example:
!ls
import pandas as pd
myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
myDF.head()
This does not require you to have other Python code in the cell. The following is perfectly valid.
With that being said, using this method, each line must start with an exclamation point.
The second method is to open up a new terminal session. To do this, go to . This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, man.
# man is short for manual, to quit, press "q"
# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down.
man man
Great! Now that you’ve learned 2 new ways to run bash code from within Jupyter Lab, please answer the following question. What is the absolute path of the default directory of your bash shell? When we say "default directory" we mean the folder that you are "in" when you first run bash code in a Jupyter cell or when you first open a Terminal. This is also referred to as the home directory.
Relevant topics: pwd
- The full filepath of the default directory (home directory). Ex: Kevin’s is /home/x-kamstut and Dr. Ward’s is /home/x-mdw.
- The bash code used to show your home directory or current directory (also known as the working directory) when the bash shell is first launched.
Question 3
It is critical to be able to navigate a UNIX-like operating system. It is likely that you will need to use UNIX or Linux (or a similar system) at some point in your career. Perform the following actions, in order, using the bash shell.
For the sake of consistency, please run your
- Write a command to navigate to the directory containing the datasets used in this course: /anvil/projects/tdm/data.
- Print the current working directory. Is the result what you expected? Output the $PWD variable, using the echo command.
- List the files within the current working directory (excluding subfiles).
- Without navigating out of /anvil/projects/tdm/data, list all of the files within the movies_and_tv directory, including hidden files.
- Return to your home directory.
- Write a command to confirm that you are back in the appropriate directory.
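The sequence of steps above can be rehearsed in a throwaway directory first. The sketch below is only an illustration under stated assumptions: the `sandbox` directory, the `movies_and_tv` folder inside it, and the file names are hypothetical stand-ins created with `mktemp`, not the real contents of /anvil/projects/tdm/data.

```shell
# Hypothetical stand-in for a data directory (not the real course data).
sandbox="$(mktemp -d)"
mkdir "$sandbox/movies_and_tv"
touch "$sandbox/readme.txt" "$sandbox/movies_and_tv/imdb.db" "$sandbox/movies_and_tv/.hidden"

start_dir="$PWD"          # remember where we started

cd "$sandbox"             # navigate to the directory
pwd                       # print the current working directory
echo "$PWD"               # the shell keeps the same value in $PWD

ls                        # list entries directly inside the current directory
ls -a movies_and_tv       # list a subdirectory, hidden entries included, without leaving

cd "$start_dir"           # return to where we started
pwd                       # confirm the location
```

Note that `ls -a` is what reveals names beginning with a dot; a plain `ls` leaves them out.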
- Code used to solve this problem.
- Output from running the code.
Question 4
When running the ls command (specifically the ls command that showed hidden files and folders), you may have noticed two oddities that appeared in the output: "." and "..". . represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the /anvil/projects/tdm/data directory, the . refers to the /anvil/projects/tdm/data directory. If you are running the following bash command, the . is redundant and refers to the /anvil/projects/tdm/data/yelp directory.
ls -la /anvil/projects/tdm/data/yelp/.
.. represents the parent directory, relative to the rest of the path. For example, if you are in the /anvil/projects/tdm/data directory, the .. refers to the parent directory, /anvil/projects/tdm.
A path that starts from the root directory, /, is called an absolute path. Any other path is called a relative path, because it is interpreted relative to the directory you are currently in. Relative paths often begin with . or .. (for example, ./yelp).
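To make the difference concrete, here is a minimal sketch that walks a small directory tree using both kinds of path. The tree is a hypothetical one built under a temporary directory; it only mirrors the shape of /anvil/projects/tdm/data and does not touch the real course directories.

```shell
# Hypothetical tree built under a temporary directory, for illustration only.
root="$(cd "$(mktemp -d)" && pwd -P)"   # canonical absolute path of the temp dir
mkdir -p "$root/projects/tdm/data/yelp"

cd "$root/projects/tdm/data"   # absolute path: starts at /
cd ./yelp                      # relative path: "." is the current directory
in_yelp="$PWD"
cd ..                          # relative path: ".." is the parent directory
in_data="$PWD"
cd ../..                       # ".." can be chained to climb two levels
in_projects="$PWD"

echo "$in_yelp"
echo "$in_data"
echo "$in_projects"
```

The three echoed paths end in data/yelp, data, and projects respectively, which shows how each relative step is resolved against the directory the shell is in at that moment.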
- Write a single command to navigate to our modulefiles directory: /anvil/projects/tdm/opt/lmod.
- Confirm that you are in the correct directory using the echo command.
- Write a single command to navigate back to your home directory. However, rather than using cd, cd ~, or cd $HOME without the path argument, use cd and a relative path.
- Confirm that you are in the correct directory using the echo command.
If you don’t fully understand the text above, please take the time to understand it. It will be incredibly helpful to you, not only in this class, but in your career.
Relevant topics: pwd, cd, . & .. & ~
- Code used to solve this problem.
- Output from running the code.
Question 5
Your $HOME directory is your default directory. You can navigate to your $HOME directory using any of the following commands.
cd
cd ~
cd $HOME
cd /home/$USER
This is typically where you will work, and where you will store your work (for instance, your completed projects).
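The first three commands listed above can be checked against each other with a small sketch. As an assumption made purely for illustration, `HOME` is pointed at a temporary directory so the example does not depend on a real account; `cd /home/$USER` is left out because it only works where home directories actually live under /home.

```shell
# Assumption: HOME is overridden with a temporary directory for this demo only.
HOME="$(cd "$(mktemp -d)" && pwd -P)"
export HOME

cd /            # start somewhere else
cd              # no argument: go to $HOME
via_plain="$PWD"

cd /
cd ~            # "~" expands to the value of $HOME
via_tilde="$PWD"

cd /
cd "$HOME"      # the variable spelled out explicitly
via_var="$PWD"

echo "$via_plain"
echo "$via_tilde"
echo "$via_var"
```

All three echoed paths are the same directory, since each form resolves through the `HOME` variable.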
The /anvil/projects/tdm space is a directory created for The Data Mine. It holds our datasets (in the data directory), as well as data for many of our corporate partners’ projects.
There exists 1 more important location on each cluster, scratch. Your scratch directory is located at /anvil/scratch/$USER, or, even shorter, $SCRATCH. scratch is meant for use with really large chunks of data. The quota on Anvil is currently 100TB and 1 million files. You can see your quota and usage on Anvil by running the following command.
myquota
In a bash cell, please perform the following operations.
- Navigate to your scratch directory.
- Confirm that you are in the correct location using a command.
- Execute the /anvil/projects/tdm/bin/tokei command, with input /home/x-kamstut/bin.
  Doug Crabill is the compute wizard for the Statistics department here at Purdue. ~dgc/bin is a directory (on a different cluster) he has made publicly available with a variety of useful scripts. I’ve copied over those files to ~x-kamstut/bin.
- Output the first 5 lines and last 5 lines of ~x-kamstut/bin/union.
- Count the number of lines in the bash script ~x-kamstut/bin/union (using a UNIX command).
- How many bytes is the script?
  Be careful. We want the size of the script, not the disk usage.
- Find the location of the python3 command.
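Several of the operations above can be tried out on a small file of your own before pointing them at real scripts. The following is a sketch under stated assumptions: `demo.sh` is a hypothetical 12-line file generated on the spot, and `sh` is looked up instead of `python3` only because `sh` is present on practically every system.

```shell
# Hypothetical 12-line script standing in for a real file.
work="$(mktemp -d)"
script="$work/demo.sh"
i=1
while [ "$i" -le 12 ]; do
  echo "echo line $i" >> "$script"
  i=$((i + 1))
done

head -n 5 "$script"        # first 5 lines
tail -n 5 "$script"        # last 5 lines

wc -c < "$script"          # size of the file contents, in bytes
du -k "$script"            # disk usage in 1024-byte units, usually a different number

command -v sh              # path the shell would use for the "sh" command
```

The contrast between `wc -c` and `du` is the point of the warning above: the first counts the bytes in the file, while the second reports how much space the filesystem set aside for it.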
Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages.
You can see -m, -l, and -w are all options for
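As a rough illustration of how options change what one program reports, the sketch below runs `wc` with the three options named above. The choice of `wc` and the two-line text file are assumptions made for this example; check `man wc` on your own system for the authoritative list of options.

```shell
# Hypothetical two-line text file used only for this demonstration.
notes="$(mktemp)"
printf 'one two three\nfour five\n' > "$notes"

wc "$notes"       # no options: line, word, and byte counts together
wc -l "$notes"    # -l: only the line count
wc -w "$notes"    # -w: only the word count
wc -m "$notes"    # -m: only the character count
```

The same program answers four different questions here, and the only thing that changes between the calls is the option.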
- Code used to solve this problem.
- Output from running the code.
Question 6
Perform the following operations.
- Navigate to your scratch directory.
- Copy the following file to your current working directory: /anvil/projects/tdm/data/movies_and_tv/imdb.db.
- Create a new directory called movies_and_tv in your current working directory.
- Move the file, imdb.db, from your scratch directory to the newly created movies_and_tv directory (inside of scratch).
- Use touch to create a new, empty file called im_empty.txt in your scratch directory.
- Remove the directory, movies_and_tv, from your scratch directory, including all of the contents.
- Remove the file, im_empty.txt, from your scratch directory.
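The same family of commands can be practiced in a temporary directory where nothing valuable can be lost. This is a sketch under stated assumptions: `scratch` and `source_dir` are hypothetical temporary directories, and `source.db`, `archive`, and `empty.txt` are made-up names rather than the files named in the steps above.

```shell
# Hypothetical stand-ins: "scratch" plays the role of a scratch directory and
# "source.db" the role of a file to copy; neither is a real course path.
scratch="$(mktemp -d)"
source_dir="$(mktemp -d)"
echo "placeholder" > "$source_dir/source.db"

cd "$scratch"
cp "$source_dir/source.db" .      # copy a file into the current directory
mkdir archive                     # create a new directory
mv source.db archive/             # move the file into it
touch empty.txt                   # create a new, empty file

ls -R                             # show the tree before cleaning up

rm -r archive                     # remove the directory and everything in it
rm empty.txt                      # remove a single file
```

Because `rm -r` deletes without asking and there is no undo, it is worth running `pwd` and `ls` first to be sure of what is about to be removed.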
Question 7
This question should be performed by opening a terminal window. Enter the result/content in a markdown cell in your notebook.
Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a really useful feature, that you may not know is there unless you are told!
Here is the way it works, in the most common case — using cd. Have a destination in mind, for example /anvil/projects/tdm/data/flights/. Type cd /anvil/, and press tab. You should be presented with a small list of options — the folders in the anvil directory. Type p, then press tab, and it will complete the word for you. Type t, then press tab. Finally, press tab, but this time, press tab repeatedly until you’ve selected data. You can then continue to type and press tab as needed.
Below is an image of the absolute path of a file in Anvil. Use cat and tab completion to print the contents of that file.
- The content of the file, hello_there.txt, in a markdown cell in your notebook.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project.