Tables In Excel, Cleaning Data
- This information is from the first part of Chapter 15.
- This is something very different, so please follow along and ask questions.
- The data
- This data
- I downloaded this data from kaggle.com
- This is a source of data for machine learning and data science.
- In this case, the data is in the public domain.
- I believe that this data is synthetic.
- The data was originally in a csv format
- Comma Separated Value
- Excel reads these files in very well.
- I have added a second tab with information we might need.
- It represents sales data for a small import/export business.
- Look at the data table.
- Understanding your data is important before you can begin any explorations.
- In this case, it is reasonably straight forward.
- A little formatting and data cleaning might help...
- Click in the column C header.
- Format this column as money.
- Repeat this for columns E and L
- Expanding columns is sometimes useful,
- Just click to make them wide enough to see the data.
- I really don't care about columns O:T
- Select these column headings
- Right click and select hide.
- You can unhide these the same way.
- I don't like the month code in column i
- Let's start buy completing the Month table in the tables worksheet.
- In A9-A11 type 1,2,3
- In B9-11 type January, February, March
- Highlight A9:B11
- Drag this down to B20 using the fill handle.
- Back on the raw data
- Insert a new column between I and J
- Click on J and select insert
- Rename column I to be Month Code
- Name the new column Month
- Use a lookup to fill in the correct month names in the new column.
- Copy this down with the fill handle.
- Hide column I
- There is a note with the data that says the territory contains bad data.
- Japan is in some fields.
- We should replace that with APAC
- Select column w
- Find Japan and replace it with APAC
- This should result in 121 changes.
- We don't want to change Japan in the Country Column.
- Before we finish, let's create a new copy of this.
- Right click in the worksheet tab.
- Select Move or copy
- Select the correct worksheet to copy
- Check Create a copy
- Rename both sheets. (working data and raw data)
- From now on, work on the working data tab.
- Exploring the data
- I am not sure that I trust the prices.
- Let's move the MSRP column beside the price column
- Click on the MSPR column
- Select Cut
- right lick on the column left of price
- Click on Insert Cut cells
- Repeat this
- move Month, next to the order date
- Move Order Line number next to Order Number.
- Let's check the price vs MSRP
- Insert a column after MSRP
- Compute price -msrp
- Call this column Markup
- I don't like negative numbers in () so change the format to currency and select negative numbers.
- Insert another column to the right.
- Compute (price-msrp)/price
- Format this as a percent
- Call this Percent Markup
- A quick way to get a read on data in a single filed is conditional formatting
- Select column G, percent markup
- On the home tab, select conditional formatting
- Select icon sets and select any three color icon set (green, yellow, red)
- It didn't color things the way I wanted so go back and edit the rules
- Change the dropdown percent to number.
- Color it green if the number is bigger than 0
- Color it yellow if the number is bigger than -0.33
- Color it red otherwise.
- Look it over, we could use a countif to decide how we are doing on markups, but that is for later.
- I don't trust the Extended cost field.
- Insert a new column after extended cost.
- Label this Cost Check
- Compute a new extended cost by multiplying Quantity by price.
- Add another new column
- Call it cost difference
- Compute Extended cost - cost difference.
- In both cases there is something strange. We should probably talk to the source of the data about this.