Ethics and Social Issues in Data Science
- I am basing this on Sara Baase's A Gift of Fire
- Ethics is a branch of Philosophy
- It involves the concepts of right and wrong
- And how to conduct yourself in a moral way.
- Baase: "What it means to do the right thing"
- This is tough; there are many different ethical systems.
- A few of the systems Baase discusses:
- Utilitarianism
- An action can be judged by how it increases the utility of the people involved.
- Do what is best for society.
- Judge actions by the results they produce.
- Mill
- Deontology
- There is a set of fundamental moral laws.
- You must follow these laws, even if it leads to harm.
- Judge actions by how they follow the rules.
- Kant
- The golden rule (Biblical and Confucian)
- Treat others as we wish them to treat us.
- There are many others.
- Since ethical systems are somewhat abstract, professional codes of ethics have been developed.
- These tend to be specialized for a discipline.
- Discuss responsibilities when interacting with clients, customers, peers, employers, colleagues, employees and society in general.
- They are designed to help guide you in decision making as it relates to work.
- A look at the ACM Code of Ethics and Professional Conduct
- The ACM is a well-established computing society.
- Three parts:
- General Principles
- Professional Responsibilities
- Leadership Principles
- Under each part is a set of statements
- Each is followed by a brief discussion.
- General principles
- Contribute to society and human well-being
- Avoid harm
- Be Honest
- Be fair and do not discriminate
- Respect intellectual property
- Respect privacy
- Honor confidentiality
- Professional responsibilities
- Strive for high quality in process and results.
- Maintain high standards of professional competence, conduct and ethical practices.
- Know and respect rules (laws) pertaining to professional work.
- Accept and provide appropriate professional review.
- Give comprehensive and thorough evaluations of computer systems and their impacts, including analysis of possible risks.
- Perform work only in areas of competence.
- Foster awareness and understanding of computing, technologies and their consequences.
- Access computing and communication resources only when authorized or when compelled by the public good.
- Design and implement systems that are robustly and usably secure.
- There is no professional data science society right now.
- The field is too young.
- There are several drafts of codes of ethics out there.
- And several groups working on them.
- A data science code of ethics will probably be close to the ACM code.
- But it will probably include
- Stronger language about privacy of data.
- Stronger language related to biases built into data.
- Language about transparency and reproducibility of research and methods.
- Language about accuracy.
- Language about making decisions based upon models.
- Some Illustrative Cases.
- The Filter Bubble
- This was "discovered" by Eli Pariser.
- See his TED Talk, or read his book.
- Story 1
- Two of his friends received radically different results for the same searches.
- One was conservative and the other liberal.
- When searching for BP
- One received information about stock, investment, ...
- The other received information about the Deepwater Horizon oil spill.
- Story 2
- He noticed on Facebook that his feed had dropped his conservative friends.
- He was only seeing information from his liberal friends.
- The filter bubble is the personal ecosystem of information that's been created by the algorithms that customize searches for individual users.
- A step back
- You know that most search engines maintain a history of the searches you performed and where you went.
- They use this information to decide "what you want to see"
- He claims that "there is no standard google any longer".
- You will receive the results "you are interested in".
- And this is not just google
- Facebook, amazon, netflix ...
- News services (google news, but also for pay news sites)
- The internet is showing what we want to see, not what we need to see.
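- As a rough illustration of the mechanism (a toy sketch only, not any real search engine's actual algorithm), personalization can be thought of as re-ranking the same results against each user's click history:

```python
# Toy sketch of personalized re-ranking (illustrative only, not a real search engine).
# Results are re-ordered by overlap with topics the user clicked on before,
# so two users issuing the same query can see very different orderings.

user_history = {"oil", "environment", "spill"}  # hypothetical click history for one user

results = [
    {"title": "BP stock hits 52-week high", "topics": {"stock", "investment"}},
    {"title": "BP oil spill cleanup continues", "topics": {"oil", "spill", "environment"}},
    {"title": "BP announces quarterly dividend", "topics": {"stock", "dividend"}},
]

def personal_score(result, history):
    """Score a result by how many of its topics the user has clicked on before."""
    return len(result["topics"] & history)

# Sort so the results matching this user's past interests come first.
ranked = sorted(results, key=lambda r: personal_score(r, user_history), reverse=True)
for r in ranked:
    print(r["title"])
```

- With a different click history (say, {"stock", "investment"}), the same query would put the financial stories first, which is exactly the BP example above.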
- He claims that we are moving from an era where "human editors" controlled the news we watched (most of you are too young for this)
- To an era where algorithms (machine learning, big data) control what we see.
- As a result, we are only receiving the news "we want", and not the news "we need."
- And we are building digital tribes.
- What to do
- There are ways to turn filters off.
- There are ways to defeat the filters
- Seek out and click on news stories that don't match your views.
- Seek out sites that don't filter. (allsides.com).
- There is apparently some correction going on
- In some spheres, companies are attempting to reduce this filtering in news feeds.
- But not in marketing.
- COMPAS
- Correctional Offender Management Profiling for Alternative Sanctions (COMPAS)
- Software designed to predict the likelihood that an offender will commit another crime (recidivism).
- Based on an algorithm that uses 137 features for each person.
- Race is not a feature.
- But apparently there are other features that can be correlated with race.
- In multiple studies this software has been found to mispredict at different rates for different racial groups.
- Black defendants were overpredicted to recommit a crime.
- And therefore were more likely to be denied parole or sentenced more harshly.
- White defendants were underpredicted to recommit a crime.
- A study by ProPublica of about 10,000 defendants in Broward County, FL (look here for details).
- Black defendants who did not go on to reoffend had been predicted to recidivate at a rate of 45%.
- For white defendants, the corresponding rate was 23%.
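- As a small illustration, the kind of group-wise error rate reported above (a false positive rate) can be computed like this; the records below are made-up toy data, not the ProPublica dataset:

```python
# Sketch of computing a group-wise false positive rate, the kind of figure
# reported above (45% vs. 23%).  These records are made-up toy data.

records = [
    # (group, predicted_high_risk, actually_reoffended)
    ("black", True,  False),
    ("black", True,  True),
    ("black", False, False),
    ("white", True,  False),
    ("white", False, False),
    ("white", False, False),
    ("white", False, True),
]

def false_positive_rate(records, group):
    """Among people in `group` who did NOT reoffend, the share predicted high risk."""
    did_not_reoffend = [r for r in records if r[0] == group and not r[2]]
    false_positives = [r for r in did_not_reoffend if r[1]]
    return len(false_positives) / len(did_not_reoffend)

for g in ("black", "white"):
    print(g, round(false_positive_rate(records, g), 2))
```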
- The manufacturer, as well as at least one study ("The Age of Secrecy and Unfairness in Recidivism Prediction"), does not agree with ProPublica's findings.
- The study finds that they are incorrectly using age as a predictor.
- For the study data, African-Americans were more likely to commit a crime at a younger age.
- And they believe that this causes the problem in COMPAS
- What is wrong?
- Errors in data collection and entry cause people to be misclassified.
- The algorithms used are proprietary, and therefore hidden.
- For the most part, we have no idea how this software is making its predictions.
- There is no way for a person to face this accuser.
- We may not even know how the algorithms are making these predictions
- Many machine learning techniques do not provide any justification for their decisions.
- In fact, in some ways we don't know how they make the decisions.
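- To make the transparency point concrete, here is a minimal sketch (it assumes scikit-learn is installed and uses random toy data, not any real criminal-justice data) contrasting a model whose reasoning can be inspected with one that only returns a score:

```python
# Toy contrast between a transparent model and a black-box model.
# The data is random and has nothing to do with COMPAS.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three made-up features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy outcome

# Transparent model: the learned coefficients can be read, published, and debated.
transparent = LogisticRegression().fit(X, y)
print("coefficients:", transparent.coef_)

# Black-box model: it returns a risk score, but gives no per-person
# explanation of why the score came out the way it did.
black_box = RandomForestClassifier(n_estimators=100).fit(X, y)
print("risk score for first person:", black_box.predict_proba(X[:1])[0, 1])
```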
- College Ratings
- Weapons of Math Destruction by Cathy O'Neil
- U.S. News &amp; World Report College Rankings.
- Use a model to predict the "quality" of a college or university.
- Includes 15 measurements of quality.
- Currently listed here
- But has changed over time.
- Currently 20% is based upon Undergraduate Academic Reputation.
- This is based upon a survey of deans, provosts and presidents at peer institutions.
- Other factors include
- Alumni giving. (5%)
- Financial resources (10%)
- Class size, faculty salary, faculty with terminal degrees, ... (20%)
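- At its core the ranking is a weighted sum; here is a minimal sketch (the weights echo the ones listed above, with everything else lumped into "other", and the school's metric values are invented):

```python
# Minimal sketch of a weighted-sum ranking score like the one described above.
# The real US News methodology is more detailed and has changed over time;
# these weights and the school's metric values are illustrative only.

weights = {
    "reputation":          0.20,  # peer-survey "popularity contest"
    "alumni_giving":       0.05,
    "financial_resources": 0.10,
    "faculty_resources":   0.20,  # class size, faculty salary, terminal degrees, ...
    "other":               0.45,  # everything else, lumped together here
}

# Assume each metric has been normalized to a 0-100 scale.
school = {
    "reputation": 72,
    "alumni_giving": 40,
    "financial_resources": 65,
    "faculty_resources": 80,
    "other": 60,
}

score = sum(weights[k] * school[k] for k in weights)
print(round(score, 1))  # the single number schools are ranked by
```

- Anything a school does that moves one of these inputs raises its score, whether or not it actually helps students.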
- This seems OK but
- The 20% reputation vote is equivalent to a "popularity contest"
- There have been multiple studies that point out that most of those surveyed don't know enough about the peer institutions to correctly answer the survey.
- There have been "campaigns" to raise the ratings in the past. (Advertisement to Deans, Provosts and Presidents of other schools)
- Cheating is a problem
- at least in the past, a major portion of the data was self-reported.
- And schools misreported the data.
- Or tried to fake the data.
- At one point SAT scores were part of the ranking.
- And schools paid students to retake the SAT after they were admitted to drive up this score.
- In the past the acceptance rate was a factor.
- And schools would inflate applications to decrease acceptance rate.
- Some schools would even reject the most qualified applicants.
- The reasoning went that these students used the school as a "back up" school.
- They weren't going to attend anyway
- So rejecting them would increase the rejection rate at no cost to the school.
- And there is really no direct measure of student success
- So spending money to improve facilities would help in the ratings.
- But does this help with student success?
- It helps with the 10% for financial resources.
- O'Neil points out that this actually led to huge tuition increases at many schools.
- Note that even now, cost is not a factor in the rating.
- But faculty salary (7%)
- Class size (8%)
- And other measures relating to the student-to-faculty ratio (5%)
- All contribute.
- The good:
- The model is transparent.
- But the data is a true problem.
- So what is the point?
- There are many examples of data being used badly
- This is no major change from the past, but we now have ways to do it at a much larger scale.
- Bad data is a major problem.
- Filtering data to create a bias is also a problem.
- Not questioning the data is also a problem.
- So
- Be careful with your data; make sure it is accurate.
- Be honest with your data and procedures
- Be clear and transparent, at least as much as possible.
- Being accused by your peers is one thing; being accused by a hidden algorithm is another.
- In the recidivism-prediction paper mentioned above, the authors say: "Neglecting to use transparent models has consequences. We provide two arguments for why transparency should be prioritized over other forms of fairness. First, no matter which technical definition of fairness one chooses, it is easier to debate the fairness of a transparent model than a proprietary model. Transparent algorithms provide defendants and the public with imperative information about tools used for safety and justice, allowing a wider audience to participate in the discussion of fairness. Second, transparency constitutes its own type of procedural fairness that should be seriously considered (see (27) for a discussion). We argue that it is not fair that life-changing decisions are made with an error-prone system, without entitlement to a clear, verifiable, explanation."
- Remember the golden rule here.