What is Data Science?
- What is it?
- Wikipedia: "An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured"
- There are some who say that data science is just statistics.
- What did you get out of the articles you read?
- A Venn diagram from 2010
- By Drew Conway
- How do we read this?
- There are many of these; this one amuses me.
- I personally have no doubt that data science
- Involves electronic manipulation of larger datasets.
- Involves mathematical/statistical knowledge to validate the results.
- Involves domain knowledge to apply the results to answer questions and make recommendations.
- Involves clearly communicating the results and recommendations.
- But it is not about:
- The computations
- The size of the dataset.
- The statistics
- A last thought here
- While you may never be a "Data Scientist", you may be a data scientist.
- The ability to work with a larger dataset will provide a powerful tool
- You will be able to turn data into information
- Data
- Data is a set of values.
- Merriam-Webster defines it as
- factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
- information in digital form that can be transmitted or processed
- information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful
- I would probably go with more of the third definition.
- The first doesn't work well for me.
- Herzog says: Any computerized file that uses columns and rows (a tabular structure) to organize information that's represented as text, numbers and data.
- I don't like this at all, it is too restrictive.
- Data does not have to be in a digital format
- Take a look at Ancient data, modern math and the hunt for 11 lost cities of the Bronze Age
- A draft of the paper is online.
- Apparently they analyzed 12,000 cuneiform texts.
- There were 26 cities mentioned.
- The location of 15 were known.
- 11 were unknown.
- They developed an algorithm which predicted the location of cities based on frequency of trade.
- For test cases (cities with known locations), they are right on two out of three cities.
- For the unknown cities, they compared the results to two historians' suspected locations of the cities.
- Four are at or very near (50 km) the suspected location.
- Two match one historian's predictions but not the second.
- Four are "very far (more than 100Km) from the historians conjecture"
- The historians disagree on the location of Durhumit
- The model favors Barjamovic's prediction.
- All three agree on the location of Sinahuttum.
- I would love to read more about this, but time ...
- Nor does it have to be structured (See below)
- I think I will go with a set of values.
- Data is not information
- Information is data that has been processed, organized, structured or presented to make them meaningful or useful.
- So data is values in a raw form.
- Information is organized data which provides insight or meaning.
- But you will see these used interchangeably.
- Herzog frequently uses "raw data" to describe data.
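To make the data versus information distinction concrete, here is a minimal sketch in Python. The scores are made-up values, and `summarize` is a hypothetical helper written for this illustration; it is not taken from Herzog or the readings.

```python
from statistics import mean

# Raw data: a set of values with no context (made-up numbers for illustration).
raw_scores = [72, 88, 95, 61, 79]


def summarize(values):
    """Organize raw values into a small summary that carries meaning."""
    return {
        "count": len(values),
        "lowest": min(values),
        "highest": max(values),
        "average": mean(values),
    }


# Information: the same values, processed so that they answer a question.
info = summarize(raw_scores)
print(info)
```

The list on its own is data. The summary is closer to information, because it has been processed and organized to say something about the values.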
- Structured vs Unstructured Data
- Structured data is generally considered data that is broken into a number of regular fields for each record.
- The student data we looked at the first day was structured data.
- The superhero database is structured data.
- Unstructured data is data that cannot be broken into fields easily.
- Tweets, pictures, blog entries, ...
- But you can still do something with it.
- Currently, the news media has been paying attention to President Trump's tweets.
- In the article 11 months, 1 president, 2,417 tweets the Boston Globe presents an analysis of his tweets for 2017.
- In this analysis, we could consider the data semi-structured.
- The dates and times are regular fields.
- But they are extracting information from the body of the tweet.
- The use of graphics is good for communicating the information.
- On Saturday, NPR published President Trump's Description of What's 'Fake' Is Expanding
- They looked for "fake news", "fake" and "phony" in the body of the tweets.
- They discuss how this has changed over the President's term.
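The keyword search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tweet records are invented, the field names `created` and `text` are hypothetical, and simple substring matching is a guess at the approach, not NPR's documented method.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: the timestamp is a regular field, the text is free-form.
tweets = [
    {"created": "2017-01-11 09:15", "text": "Example text that mentions fake news."},
    {"created": "2017-06-02 18:40", "text": "Nothing of interest in this one."},
    {"created": "2018-02-20 07:05", "text": "A PHONY story, totally fake!"},
]

KEYWORDS = ("fake news", "fake", "phony")


def keywords_by_year(records, keywords=KEYWORDS):
    """Count, per year, how many tweets contain each keyword (case-insensitive)."""
    counts = Counter()
    for record in records:
        # The structured part: parse the date field to get the year.
        year = datetime.strptime(record["created"], "%Y-%m-%d %H:%M").year
        # The unstructured part: search the free-form body of the tweet.
        body = record["text"].lower()
        for keyword in keywords:
            if keyword in body:
                counts[(year, keyword)] += 1
    return counts


counts = keywords_by_year(tweets)
for (year, keyword), n in sorted(counts.items()):
    print(year, keyword, n)
```

Note that a tweet containing "fake news" is also counted under "fake", because the match is on substrings. A real analysis would need to decide how to handle that overlap.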
- Data can also be classified as primary or secondary
- Primary data is data collected by the person doing the analysis.
- Secondary data is data collected by someone else.
- Electronic Data is measured in bits and bytes
- A bit stands for BInary digiT and is either a 0 or a 1
- A byte is 8 bits.
- We frequently use a metric prefix with bits and bytes to discuss data.
- Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
- You will deal with these in your homework.
- Sometimes you will hear the terms Kibi, Mebi, Gibi, ...
- These exist because computer engineers and computer scientists like to use powers of 2, not 10.
- These are the binary prefixes
- Kibi means 1024 (2^10), not 1000 (10^3).
- But I am not really worried about those right now.
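The arithmetic behind the prefixes can be checked with a short Python sketch. The helper names `prefix_value` and `to_bits` are hypothetical and exist only for this illustration. The metric prefixes are powers of 1000 and the binary prefixes are powers of 1024.

```python
# Metric (decimal) prefixes and their binary counterparts, smallest first.
DECIMAL = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
BINARY = ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]


def prefix_value(prefix):
    """Return the multiplier for a decimal (power of 1000) or binary (power of 1024) prefix."""
    name = prefix.lower()
    if name in DECIMAL:
        return 1000 ** (DECIMAL.index(name) + 1)
    if name in BINARY:
        return 1024 ** (BINARY.index(name) + 1)
    raise ValueError(f"unknown prefix: {prefix}")


def to_bits(n_bytes):
    """A byte is 8 bits."""
    return n_bytes * 8


print(prefix_value("kilo"))           # 1000
print(prefix_value("kibi"))           # 1024
print(to_bits(prefix_value("mega")))  # 8000000 bits in a megabyte
```

The gap between the two systems grows with each step: a gibibyte is about 7% larger than a gigabyte.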
- Just to conclude with data
- We are awash in data.
Another Example from Twitter
- Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate
- You should assume this entire discussion is quoted from this paper.
- How have unauthorized Twitter users influenced the online discourse concerning vaccination?
- Several terms
- A bot is a program that automates content promotion
- A troll is an individual who misrepresents their identity with the intent of promoting discord.
- A cyborg is a human-bot combination
- They studied a large set of (1,793,690) tweets.
- They looked for tweets containing "vax" or "vacc".
- And identified the source as bot or human.
- In previous work, researchers have used data science techniques to identify humans, bots and cyborgs.
- It was found that overall, bots/cyborgs related to accounts linked to Russia
- Tweeted at a significantly higher rate than other users.
- This appears to be aimed at spreading discord.
- They also found that "The highest proportion of anti vaccine content is generated by accounts with unknown or intermediate bot scores."
- And suspect that this is cyborg generated.
- This study
- Used a very large dataset
- Used knowledge of vaccination to classify types of tweets
- Used political/communications/social knowledge to pose the questions and classify various results.
- Used sophisticated algorithms to identify data of interest
- Used statistical techniques to validate the discoveries
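The filtering and rate comparison steps from this study can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the records are invented, `account_type` stands in for a label produced by some earlier bot-detection step, and the paper's actual classification and statistical tests are far more involved than this.

```python
from collections import defaultdict

# Made-up records; "account_type" is a hypothetical label from a prior classification step.
tweets = [
    {"user": "a", "account_type": "human", "text": "Got my flu vaccine today"},
    {"user": "b", "account_type": "bot", "text": "VAXXED or not? Join the debate"},
    {"user": "b", "account_type": "bot", "text": "More on vaccination policy"},
    {"user": "c", "account_type": "human", "text": "Nice weather this morning"},
]


def mentions_vaccines(text):
    """True if the text contains 'vax' or 'vacc', ignoring case."""
    lowered = text.lower()
    return "vax" in lowered or "vacc" in lowered


def tweets_per_account(records):
    """Average number of vaccine-related tweets per distinct account, by account type."""
    matches = defaultdict(int)
    accounts = defaultdict(set)
    for record in records:
        accounts[record["account_type"]].add(record["user"])
        if mentions_vaccines(record["text"]):
            matches[record["account_type"]] += 1
    return {kind: matches[kind] / len(users) for kind, users in accounts.items()}


rates = tweets_per_account(tweets)
print(rates)
```

With only four invented tweets the rates mean nothing. The sketch shows the shape of the computation, and a statistical test on a large dataset would still be needed to say whether a difference is significant.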