What is Data Science?
- What is it?
- Wikipedia: "An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured"
- There are some who say that data science is just statistics.
- Define algorithm: A set of steps to solve a problem in a finite amount of time.
- What did you get out of the articles?
- techtarget:
- Study of where data comes from? (I don't like this as much.)
- how it can be turned into a valuable resource. (Full stop, not for business and IT strategies)
- Identify patterns: yes.
- Mathematics, Statistics, Computer Science, as well as Domain Knowledge.
- Johari
- Note, he really doesn't care about the name.
- "multiple mirrors in different forms and shapes, but reflecting the same image"
- He specifically points out the importance of communications.
- I like Hal Varian: "The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades."
- He does a mini data mining project to produce:
- Organization of data: Gathering data from relevant sources and organizing them in structured models that allow efficient data analysis.
- Analysis of data: Application of statistical and mathematical analysis methods towards data to derive patterns that can be applied to add value.
- Presentation of data: Communication of the result of data analysis using perceptual methods and delivered effectively to generate actions.
- Warden:
- What is science? Look at merriam-webster.com
- "Data and the tools to process it are suddenly abundant and cheap. Thousands of people are exploiting this change, making things that would have been impossible or impractical before now, using a whole new set of techniques."
- A venn diagram from 2010
- By Drew Conway
- How do we read this?
- There are many of these; this one amuses me.
- I personally have no doubt that data science
- Involves electronic manipulation of larger datasets.
- Involves mathematical/statistical knowledge to validate the results.
- Involves domain knowledge to apply the results to answer questions and make recommendations.
- Involves clearly communicating the results and recommendations.
- But it is not about:
- The computations
- The size of the dataset.
- The statistics
- A last thought here
- While you may never be a "Data Scientist", you may be a data scientist.
- The ability to work with a larger dataset will provide a powerful tool
- You will be able to turn data into information
- Data
- Data is a set of values.
- Merriam-Webster says it is:
- factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
- information in digital form that can be transmitted or processed
- information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful
- I would probably go with more of the third definition.
- The first doesn't work well for me.
- Hong says: Any computerized file that uses columns and rows (a tabular structure) to organize information that's represented as text, numbers and data.
- I don't like this at all, it is too restrictive.
- Data does not have to be in a digital format
- Take a look at "Ancient data, modern math and the hunt for 11 lost cities of the Bronze Age".
- The paper is here
- Apparently they analyzed 12,000 cuneiform texts.
- There were 26 cities mentioned.
- The location of 15 were known.
- 11 were unknown.
- They developed an algorithm which predicted the location of cities based on frequency of trade.
- For test cases (cities with known locations), they are right on two out of three cities.
- For the unknown cities, they compared the results to two historians' suspected locations of the cities.
- Four are at or very near (within 50 km of) the suspected location.
- Two match one historian's predictions but not the second.
- Four are "very far (more than 100Km) from the historians conjecture"
- The historians disagree on the location of Durhumit
- The model favors Barjamovic's prediction.
- All three agree on the location of Sinahuttum.
- I would love to read more about this, but time ...
- Nor does it have to be structured (See below)
- I think I will go with a set of values.
- Data is not information
- Information is data that has been processed, organized, structured or presented to make them meaningful or useful.
- So data is values in a raw form.
- Information is organized data which provides insight or meaning.
- But you will see these used interchangeably.
- Herzog frequently uses "raw data" to describe data.
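To make the data/information distinction concrete, here is a minimal sketch in Python. The scores are made-up values and `summarize` is a hypothetical helper, not something from the readings. The raw list is the data; the summary is one way of processing it into information.

```python
from statistics import mean

# Hypothetical raw data: just a set of values with no context attached.
raw_scores = [72, 88, 95, 61, 88, 79]


def summarize(values):
    """Process raw values into a small summary (one kind of information)."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 1),
    }


summary = summarize(raw_scores)
print(summary)
```

The same values could be organized in other ways (sorted, grouped, plotted); the point is only that some processing step sits between the data and the information.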
Structured vs Unstructured Data
- Structured data is generally considered data that is broken into a number of regular fields for each record.
- The student data we looked at the first day was structured data.
- The superhero database is structured data.
- Unstructured data is data that cannot be broken into fields easily.
- Tweets, pictures, blog entries, ....
- But you can still do something with it.
- Currently, the news media has been paying attention to President Trump's tweets.
- In the article "11 months, 1 president, 2,417 tweets", the Boston Globe presents an analysis of his tweets for 2017.
- In this analysis, we could consider the data semi-structured.
- The dates and times are regular fields.
- But they are extracting information from the body of the tweet.
- The use of graphics is good for communicating the information.
- Here is a newer post, focused on who actually tweets for the president.
- Trump's tweets tend to come from either an iPhone or an Android device.
- The author believes that the President uses the Android device.
- Some R code, but also some nice graphs.
- Data can also be classified as primary or secondary
- Primary data is data collected by the person doing the analysis.
- Secondary data is not.
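As a rough illustration of semi-structured data, the sketch below treats a few invented tweet-like records in Python. The records, field names, and helper functions are all hypothetical and are not taken from either article. The timestamp and source are regular fields, while the hashtags have to be extracted from the free-text body.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical records, invented for illustration only.
tweets = [
    {"created_at": "2017-03-01 08:15", "source": "android",
     "text": "Great meeting today! #jobs #economy"},
    {"created_at": "2017-03-01 14:40", "source": "iphone",
     "text": "Thank you to everyone who came out."},
    {"created_at": "2017-03-02 07:05", "source": "android",
     "text": "More work to do on the #Economy"},
]


def hour_of(tweet):
    """Structured part: parse the regular timestamp field."""
    return datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M").hour


def hashtags(text):
    """Unstructured part: pull hashtags out of the free-text body."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


# Counting a regular field is straightforward.
by_source = Counter(t["source"] for t in tweets)

# Counting hashtags requires extracting them from the body first.
tag_counts = Counter(tag for t in tweets for tag in hashtags(t["text"]))

print(by_source)
print(tag_counts)
```

A real analysis would need far more careful text processing; this only shows where the structured and unstructured parts of a record differ.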
Electronic Data is measured in bits and bytes
- A bit stands for BInary digiT and is either a 0 or a 1
- A byte is 8 bits.
- We frequently use a metric prefix with bits and bytes to discuss data.
- Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
- You will deal with these in your homework.
- Sometimes you will hear the terms Kibi, Mebi, Gibi, ...
- Just because computer engineers and computer scientists like to use powers of 2, not 10.
- These are the binary prefixes
- Kibi means 1024 (2^10), not 1000 (10^3).
- But I am not really worried about those right now.
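The difference between the metric and binary prefixes can be sketched in a few lines of Python. The function names here are hypothetical helpers chosen for illustration; the binary prefix names beyond gibi (tebi, pebi, exbi, zebi, yobi) are the standard IEC names and are not listed in the notes above.

```python
# Metric prefixes are powers of 1000; binary prefixes are powers of 1024.
METRIC_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
BINARY_PREFIXES = ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]

BITS_PER_BYTE = 8


def metric_multiplier(prefix):
    """Return 1000**n, where n is the 1-based position of the metric prefix."""
    return 1000 ** (METRIC_PREFIXES.index(prefix) + 1)


def binary_multiplier(prefix):
    """Return 1024**n, where n is the 1-based position of the binary prefix."""
    return 1024 ** (BINARY_PREFIXES.index(prefix) + 1)


def bytes_to_bits(n_bytes):
    """Convert a byte count to a bit count."""
    return n_bytes * BITS_PER_BYTE


print(metric_multiplier("kilo"), binary_multiplier("kibi"))
print(metric_multiplier("giga"), binary_multiplier("gibi"))
print(bytes_to_bits(metric_multiplier("mega")))
```

The gap between the two systems grows with each prefix: a kibibyte is 2.4% larger than a kilobyte, while a gibibyte is about 7.4% larger than a gigabyte.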
Just to conclude with data
We are awash in data.