Sorting Data
- We continue with the Chicago Movie Dataset.
- Please open this.
- Notice the about worksheet.
- The reference for this section starts on page 459 of the book.
- Move to the Sorting worksheet.
- We want to explore this data set.
- This will help us understand what data we have
- and to look for problems or errors in the data
- We may have to clean the data
- Or we may decide to discard data fields.
- It also helps us start to develop a plan for answering our question.
- As I said before, it is important to label everything.
- In this case, the labels for the columns came with the data.
- If we add any columns, we will need to label these as well.
- The tools I plan on exploring today REQUIRE column labels.
- I would like to try to understand the last five columns.
- So I am going to hide some columns
- Column D is probably just another way to state the park name so let's hide it.
- Right click on the column heading labeled D
- Select Hide
-
- This will hide the column.
-
- To show the column again
- Click near it and select Unhide
- I don't think that the last five fields have anything to do with the move so
- I also think that in this case columns D, I, J are equivalent to column C
- Another useful technique for larger data sets is to freeze the top row.
- On the View tab in the Window command group dropdown the Freeze Panes menu.
-
- Select Freeze Top Row
- This will allow you to scroll down while keeping row 1 visible.
- You can unfreeze the same way.
- By the way, you can freeze the first column, or even an arbitrary set of first rows and columns.
- Sorting the Data
- I suspect that the Park name is related to the last five fields.
- If I sorted the data by the park name it might help show this relationship.
- Click somewhere in the table.
- On the Data tab in the Sort & Filter command group select Sort
-
- Notice that this selected the entire table for you.
- This will work well if your tables
- Are contiguous or connected
- This means they have no blank rows or columns
- Are isolated
- This means that there are no extra rows or columns beside, above or below.
- Select the Park as the Sort by field.
-
- And click Ok.
- Notice, our suspicion is confirmed.
- All the fields for each park match.
- Except the Chicago History Museum
- There seems to be a missing line of data here.
- It would seem safe to copy the data from the previous entry down.
- BUT You should make a note of this in your methods document in the Data Cleaning section.
- Let's do this.
- Highlight the cells containing the data you wish to copy.
-
- Grab the little box in the lower left hand corner.
- Drag the data down one cell.
-
- Or you could copy and paste.
- Continuing down it looks like two other parks are missing this data
- Humboldt Park
- Osterman Beach
- You should note that this data is missing in the methods document as well.
- What else could you do about this missing data?
- If we knew the data contained in the fields, we could attempt to "find" the missing values.
- We could just let the values stay as blanks.
- Or we could fill in with a clear marker that the data is missing.
- NA is a standard value for most fields.
- Other tools like R and python sort of expect this.
- Let's do this, but in a moment.
- In any case, you need to document what you did.
- Since I want to "fix" the blank fields, let's sort so they are all in one place.
- Filtering might be more appropriate, but we are sorting right now.
- Click anywhere in the Wards column
- Click either of the Quick Sort buttons
-
- Move to the bottom of the column by pressing <ctrl>-down arrow
- Fill in the empty fields with NA
- Type NA in one cell.
- Drag fill each direction.
- Sorting by other criteria
- Sort by the date field, what happens?
- Sort by the day field, but
- Drop the Order list down and select Custom List
-
- Select Sun, Mon, Tue, ...
-
- A multi-criterion sort
- If we need to, we can sort on more than one field.
- The second field is used as a tie breaker.
- Sort by date low to high then by Ward low to high.
- Look at ward 41, there are multiple parks in that ward
-
- Notice all the parks in the same ward are still sorted by date.
- Sometimes you want to do this, but it is hard to remember the order to sort in.
- By adding multiple Levels to the sort, you can accomplish the same thing.
- Select the Sort box
- Sort by Wards
- Select Add Level
- Then by Park
-
- Look at ward 41 now.
- The data is sorted by ward.
- It is then sorted by park name.
- Then it is sorted by date.
-
- Sorting data is an extremely valuable skill.
- You should practice this.