Sorting
- I would like to use the Chicago Movies in the Park dataset for today.
- Download a copy.
- Notice the about worksheet and the RawData worksheet.
- We have multiple purposes today.
- We would like to begin to understand this dataset.
- We would like to look for errors and inconsistencies in the dataset.
- We would like to understand the filtering and sorting capabilities of excel.
- As we look thought the data is is most likely that we will find problems.
- We will most likely have to make assumptions about the data to try and fix those problems.
- What happens if we guess wrong?
- Since it would be nice to be able to go back, we should make a copy of the raw data.
- Since we might need to reproduce our work, we should document what we do.
- I assume you have a "methods document" started for the project,
- So as you work through the exercise note the things you do to the data.
- Making a copy of the RawData
- Click on the Raw Data worksheet.
- Select Move or Copy and make sure that Create a copy is checked.
- Rename this to be Sorting
- If you want, you can start a new workbook.
- right click on raw data, select Move of Copy, but this time select (new book) in the To book: area.
- Create a copy of the about sheet in the New Book worksheet.
- Deleting worksheets can not be undone
- most actions in excel can be undone
- But when you delete a worksheet, this can not.
- Not a major problem, but you need to be aware of this.
- So duplicate a sheet then delete it.
- I think it is best to practice everything just so you know.
- We will discuss tabs more later.
- I would like to understand what if the last five fields are related.
- Arranging our data.
- For many of the functions we will be looking at, Excel likes tables.
- Most wants a record to be in a row
- A record is a set of related data.
- A single data point.
- In this dataset, what is stored in a row or record?
- Records are made up of fields or the components of a record.
- Fields are the columns of data.
- The data in a field should all be of the same type.
- A bad dataset may have multiple different kinds of information in a field.
- For excel, it is best if our data
- Has column headings.
- If these are missing, add them or you will be sorry later.
- Has no blank rows
- Has not blank columns.
- Remove any of these.
- This is called contiguous
- Has no data right next to it (on the side or top)
- As you do computations, be aware of this.
- This is called isolated
- On page 305 he gives a chart of quick movement keys.
- Because we will be dealing with large data sets, just scrolling is not the best.
- CTRL Arrow moves that direction until
- The last in a set of data
- The edge of the worksheet.
- For this reason it is best not to have blank cells in a workbook.
- CTRL home, end move to the extremes of a dataset.
- Adding shift to these allows you to select multiple cells.
- Replacing blanks
- There are blanks in this dataset.
- It would be easier if there weren't.
- What should we replace them with?
- Excel doesn't really care.
- But other software you will use does.
- NA is a good choice.
- Unless that is a valid field.
- Let's replace all of the blanks in the dataset
- Select Ctrl-Home to move to the top left of the dataset
- Select Shift-Ctrl-End to select the entire dataset.
- On the Home tab in the Editing command group select the Find & Select dropdown.
- Choose Replace
- Just leave the Find what: cell empty
- Put NA in the Replace With field.
- Select Replace All
- Note that 171 fields were changed to be NA.
- What should we do about this?
- While we are here, look at other replace options.
- Let's search for those na's
- Perhaps click on Match Entire cell contests
- Find all, notice we have some in Columns K, L, M that are NAs.
- This is a little difficult to deal with.
- I don't remember what those columns are.
- Go to the View tab in the window commandgroup
- Notice that we can Freeze Pains.
- Freezing the top row might be helpful here.
- This doesn't play well with some other options, but it is ok here.
- Currently, I don't think looking at the phone number will help
- This is really probably the same as looking at the park name.
- So right click on column D and select Hide
- Notice that the column is shown as collapsed at the top.
- You can
- Double click on this to unhide it.
- Or you can select C-E, right click and select Unhide
- Or you can just click near it and select Unhide
- I suspect that the last five fields are geographic, so let's hide columns E,F and G.
- I suspect that I and J are just other ways to represent the park name as well, so hide those.
- Let's see if there is really a relationship between the last five fields and the park name.
- Click anywhere in column C
- On the Datatab in the Sort & Filter area click on the Sort A to Z or Sort Z to A tab.
- Notice this sorts the entire table. (Because it is contiguous and isolated).
- Try sorting by Wards
- Scroll to the bottom.
- Notice that there are 7 parks here that have NA's
- What can we do about that?
- Let's try that Filter button.
- Drop down the new bar on Park
- Select Chicago History Museum
- Note the other two entries are the same.
- Should we copy the values down to the NAs?
- If so what else should we do?
- Can you fix Osterman Beach and Humboldt Park?
- Let's Sort by day of the week.
- Use A to Z
- Not What we really wanted
- Try clicking on the Sort box.
- Select Custom List and sort by week day.
- Create a custom list, sort Mon, Tue, Wed ... Sun
- Sort on the Date field.
- Sort on the wards.
- Note the parks are now out of order.
- Sort by ward then by park.
- Notice in the sort
- My data has headers
- We can sort left to right but that is not for our data.
- We can do case sensitive.
- Excel normally ignores case.
- Try filtering on