rigélblu

tom.hosiawa
Mar 04, 2024

How a data scientist thinks

Tom Hosiawa • 2 min read

Part two of a four-part series on how a Product Manager thinks about using machine learning in products. Before we get to the Product Manager part, we need to understand how a Data Scientist thinks.

Steps involved

  1. Understand your situation, challenge, and outcome if you solve it
  2. Prepare the information (aka data) related to it
  3. Understand your information — how it’s spread out, where it’s concentrated, issues
  4. Iteratively fix the bad and vague information
  5. Answer your questions by analyzing and modelling your information
  6. Share what you found by telling a story and turning it into a picture

Here’s how you do each step

  1. Understand your situation, challenge, and outcome if you solve it

    • Define your situation and the challenge: make clear the problem you are trying to solve and its context (who’s involved, what/where/when/how is it happening)
    • Define desired outcome: what does success look like and what metrics will measure it
    • Identify: information you have available, gaps and opportunities, and options
    • Decide what type of information is right for the question you’re trying to answer
    • Gather the information that is relevant to the problem, and check that it’s accurate, complete, and consistent
    • Identify all scenarios (i.e. factors and conditions) under which one piece of information is influenced by, dependent on, or changed by other information or events
    • Arrange and structure your information so that it is easier to analyze and understand
  2. Understand your information — how it’s spread out, where it’s concentrated, issues

    • Characteristics: e.g. numerical, categorical, text, etc.
    • Distributions: mean, median, mode, and percentiles help you grasp where most of your information is concentrated
    • Variability: variance and standard deviation reveal how spread out your information is
    • Potential issues: imbalances, duplicates, noise, and missing values
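
The checks in step 2 can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a definitive recipe: the `order_values` list and the `summarize` helper are made up for illustration, and a real dataset would need the same treatment for every column.

```python
import statistics
from collections import Counter

# Hypothetical column: order values with one repeat, one gap (None), and one extreme value.
order_values = [20, 22, 22, 25, 27, 30, 31, None, 35, 400]


def summarize(values):
    """Return distribution, variability, and issue counts for one numeric column."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the 25th, 50th, and 75th percentiles.
    p25, _, p75 = statistics.quantiles(present, n=4)
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "duplicates": sum(c - 1 for c in counts.values() if c > 1),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "p25": p25,
        "p75": p75,
        "variance": statistics.variance(present),
        "stdev": statistics.stdev(present),
    }


summary = summarize(order_values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up column the mean (68) sits far above the median (27), which is the kind of gap that tells you a few extreme values are pulling the average away from where most of the information is concentrated.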
  3. Iteratively fix the bad and vague information

    • Remove duplicates
    • Handle missing values
    • Fix or remove outliers
    • Transform invalid or inconsistent values
    • Convert values that people write or refer to in multiple ways into one standard form, e.g.
      • Terminology: “DOB”, “Birthdate”
      • Scale: meters, miles
      • Format: “May 23rd, 2024”, “2024-05-23”
      • Mapping: “Yes”, “Y”, “1”
      • Encoding: blue as hex (#0000FF), rgb(0,0,255)
    • Assess how accurate, complete, and consistent the information is, and how much you trust it, from the start to the end of your situation
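
Here is one way step 3 could look in Python, again with only the standard library. Treat it as a sketch under assumptions: the records, field names, and helper functions are all hypothetical, missing values are filled with the median, and outliers are dropped with the common 1.5 × IQR rule. Neither choice comes from the steps above; they are just simple defaults.

```python
import re
import statistics
from datetime import datetime

# Hypothetical raw records; every field name and value is made up for illustration.
raw = [
    {"id": 1, "DOB": "May 23rd, 2024", "subscribed": "Yes", "distance": 1.0, "unit": "miles"},
    {"id": 1, "Birthdate": "2024-05-23", "subscribed": "Y", "distance": 1609.344, "unit": "meters"},
    {"id": 2, "Birthdate": "2023-01-02", "subscribed": "1", "distance": 1500.0, "unit": "meters"},
    {"id": 3, "DOB": "March 3rd, 2022", "subscribed": "No", "distance": None, "unit": "meters"},
    {"id": 4, "DOB": "2021-07-09", "subscribed": "N", "distance": 1700.0, "unit": "meters"},
    {"id": 5, "DOB": "2020-11-30", "subscribed": "0", "distance": 90000.0, "unit": "meters"},
]

TRUTHY = {"yes", "y", "1"}
METERS_PER_MILE = 1609.344


def to_iso_date(text):
    """Format: turn 'May 23rd, 2024' or '2024-05-23' into '2024-05-23'."""
    if text is None:
        return None
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text.strip())
    for pattern in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(cleaned, pattern).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable dates stay missing rather than being guessed


def normalize(record):
    """Convert one record to a single terminology, scale, format, and mapping."""
    birth = record.get("DOB") or record.get("Birthdate")           # terminology
    distance = record["distance"]
    if distance is not None and record["unit"] == "miles":         # scale
        distance *= METERS_PER_MILE
    return {
        "id": record["id"],
        "birthdate": to_iso_date(birth),                           # format
        "subscribed": record["subscribed"].strip().lower() in TRUTHY,  # mapping
        "distance_m": None if distance is None else round(distance, 1),
    }


def drop_duplicates(records):
    """Keep the first copy of each identical record."""
    seen, unique = set(), []
    for record in records:
        key = tuple(record.items())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def fill_missing(records, field):
    """Replace missing values with the median of the known ones."""
    fallback = statistics.median(r[field] for r in records if r[field] is not None)
    return [{**r, field: fallback if r[field] is None else r[field]} for r in records]


def drop_outliers(records, field):
    """Remove records outside 1.5 x IQR of the middle half of the values."""
    q1, _, q3 = statistics.quantiles([r[field] for r in records], n=4, method="inclusive")
    spread = q3 - q1
    return [r for r in records if q1 - 1.5 * spread <= r[field] <= q3 + 1.5 * spread]


normalized = [normalize(r) for r in raw]
unique = drop_duplicates(normalized)
filled = fill_missing(unique, "distance_m")
cleaned = drop_outliers(filled, "distance_m")
for record in cleaned:
    print(record)
```

The sketch normalizes before it removes duplicates because the two records with id 1 only look identical once “DOB” and “Birthdate”, miles and meters, and “Yes” and “Y” have each been converted to one form.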
  4. Answer your questions by analyzing and modelling your information

    With data analytics techniques

    • I’ll skip this

    With machine learning techniques

    • Tomorrow’s post
  5. Share what you found by telling a story and turning it into a picture

    • Start with the context
      • Every story has people, a date and time (in the past, present, or future)
      • Where and when is it happening
    • Don’t recite facts and information (leave that for the appendix). Instead, tell it the way you’d talk to your friends.
    • Don’t use jargon, technical terms, or complex words unless your audience expects them
    • If it’s in the future, tell it the same way you’d tell a story from the past
      • instead of “last Friday…”
      • say “imagine it’s Monday, 9am in 2029”, then tell us what you see

Credits

  • My learning from: Google Data Analytics, Situation, Complication, Question, Answer (SCQA framework)
  • My editors fixing gaps in my understanding: ChatGPT, Gemini