Sorting

I would like to use the Chicago Movies in the Park dataset for today.
- Download a copy.
- Notice the about worksheet and the RawData worksheet.
We have multiple purposes today.
- We would like to begin to understand this dataset.
- We would like to look for errors and inconsistencies in the dataset.
- We would like to understand the filtering and sorting capabilities of excel.
As we look thought the data is is most likely that we will find problems.
- We will most likely have to make assumptions about the data to try and fix those problems.
- What happens if we guess wrong?
- Since it would be nice to be able to go back, we should make a copy of the raw data.
- Since we might need to reproduce our work, we should document what we do.
  - I assume you have a "methods document" started for the project,
  - So as you work through the exercise note the things you do to the data.
Making a copy of the RawData
- Click on the Raw Data worksheet.
- Select Move or Copy and make sure that Create a copy is checked.
- Rename this to be Sorting
If you want, you can start a new workbook.
- right click on raw data, select Move of Copy, but this time select (new book) in the To book: area.
- Create a copy of the about sheet in the New Book worksheet.
Deleting worksheets can not be undone
- most actions in excel can be undone
- But when you delete a worksheet, this can not.
- Not a major problem, but you need to be aware of this.
- So duplicate a sheet then delete it.
- I think it is best to practice everything just so you know.
We will discuss tabs more later.
I would like to understand what if the last five fields are related.
Arranging our data.
- For many of the functions we will be looking at, Excel likes tables.
- Most wants a record to be in a row
  - A record is a set of related data.
  - A single data point.
  - In this dataset, what is stored in a row or record?
- Records are made up of fields or the components of a record.
  - Fields are the columns of data.
  - The data in a field should all be of the same type.
  - A bad dataset may have multiple different kinds of information in a field.
- For excel, it is best if our data
  - Has column headings.
    - If these are missing, add them or you will be sorry later.
  - Has no blank rows
  - Has not blank columns.
    - Remove any of these.
    - This is called contiguous
  - Has no data right next to it (on the side or top)
    - As you do computations, be aware of this.
    - This is called isolated
On page 305 he gives a chart of quick movement keys.
- Because we will be dealing with large data sets, just scrolling is not the best.
- CTRL Arrow moves that direction until
  - The last in a set of data
  - The edge of the worksheet.
- For this reason it is best not to have blank cells in a workbook.
- CTRL home, end move to the extremes of a dataset.
- Adding shift to these allows you to select multiple cells.
Replacing blanks
- There are blanks in this dataset.
- It would be easier if there weren't.
- What should we replace them with?
  - Excel doesn't really care.
  - But other software you will use does.
  - NA is a good choice.
    - Unless that is a valid field.
- Let's replace all of the blanks in the dataset
  - Select Ctrl-Home to move to the top left of the dataset
  - Select Shift-Ctrl-End to select the entire dataset.
  - On the Home tab in the Editing command group select the Find & Select dropdown.
  - Choose Replace
    - Just leave the Find what: cell empty
    - Put NA in the Replace With field.
    - Select Replace All
  - Note that 171 fields were changed to be NA.
    - What should we do about this?
  - While we are here, look at other replace options.
  - Let's search for those na's
    - Perhaps click on Match Entire cell contests
  - Find all, notice we have some in Columns K, L, M that are NAs.
This is a little difficult to deal with.
- I don't remember what those columns are.
- Go to the View tab in the window commandgroup
  - Notice that we can Freeze Pains.
  - Freezing the top row might be helpful here.
  - This doesn't play well with some other options, but it is ok here.
Currently, I don't think looking at the phone number will help
- This is really probably the same as looking at the park name.
- So right click on column D and select Hide
- Notice that the column is shown as collapsed at the top.
- You can
  - Double click on this to unhide it.
  - Or you can select C-E, right click and select Unhide
  - Or you can just click near it and select Unhide
- I suspect that the last five fields are geographic, so let's hide columns E,F and G.
- I suspect that I and J are just other ways to represent the park name as well, so hide those.
Let's see if there is really a relationship between the last five fields and the park name.
- Click anywhere in column C
- On the Datatab in the Sort & Filter area click on the Sort A to Z or Sort Z to A tab.
  - Notice this sorts the entire table. (Because it is contiguous and isolated).
- Try sorting by Wards
  - Scroll to the bottom.
    - Notice that there are 7 parks here that have NA's
    - What can we do about that?
  - Let's try that Filter button.
    - Drop down the new bar on Park
    - Select Chicago History Museum
    - Note the other two entries are the same.
      - Should we copy the values down to the NAs?
      - If so what else should we do?
  - Can you fix Osterman Beach and Humboldt Park?
    - Doucment this.
Let's Sort by day of the week.
- Use A to Z
- Not What we really wanted
  - Try clicking on the Sort box.
  - Select Custom List and sort by week day.
- Create a custom list, sort Mon, Tue, Wed ... Sun
- Sort on the Date field.
Sort on the wards.
- Note the parks are now out of order.
- Sort by ward then by park.
Notice in the sort
- My data has headers
- We can sort left to right but that is not for our data.
- We can do case sensitive.
  - Excel normally ignores case.
Try filtering on
- Numbers
- Dates