How a data scientist thinks

Tom Hosiawa • 2 min read

Part two of a four-part series on how a Product Manager thinks about using machine learning in products. Before we get to the Product Manager part, we need to understand how a Data Scientist thinks.

Steps involved

Understand your situation, challenge, and outcome if you solve it
Prepare the information (aka data) related to it
Understand your information — how it’s spread out, where it’s concentrated, issues
Iteratively fix the bad and vague information
Answer your Questions through analyzing and modelling your information
Share what you found by telling a story and turning it into a picture

Here’s how you do each step

Understand your situation, challenge, and outcome if you solve it
- Define your situation and the challenge: make clear the problem you are trying to solve and its context (who’s involved; what, where, when, and how it is happening)
- Define the desired outcome: what success looks like and which metrics will measure it
- Identify: information you have available, gaps and opportunities, and options
Prepare the information related to it
- Decide what type of information is right for the question you’re trying to answer
- Gather the information that is relevant to the problem, and check that it’s accurate, complete, and consistent
- Identify all scenarios (i.e. factors and conditions) under which one piece of information is influenced by, dependent on, or changed by other information or events
- Arrange and structure your information so that it is easier to analyze and understand
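
As a rough illustration of the "complete" check above, here is a minimal Python sketch. The records, field names, and the `REQUIRED_FIELDS` list are all made up for the example; real data would come from your own sources and have its own required fields.

```python
# Hypothetical raw records, as they might arrive from different sources.
raw_records = [
    {"customer": "Ana", "signup": "2024-05-23", "plan": "pro"},
    {"customer": "Ben", "signup": "", "plan": "free"},
    {"customer": "Cy", "plan": "pro"},
]

# Assumed list of fields every record should carry.
REQUIRED_FIELDS = ("customer", "signup", "plan")


def missing_fields(record):
    """Return the required fields that are absent or empty in one record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def completeness_report(records):
    """Map each required field to the fraction of records that fill it in."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in REQUIRED_FIELDS}
    return {
        field: sum(1 for r in records if r.get(field)) / total
        for field in REQUIRED_FIELDS
    }


report = completeness_report(raw_records)
for field, share in report.items():
    print(f"{field}: {share:.0%} filled in")
```

A low percentage for a field is a hint to go back and gather more before analyzing, not something the code can fix on its own.
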
Understand your information — how it’s spread out, where it’s concentrated, issues
- Characteristics: e.g. numerical, categorical, text, etc.
- Distributions: mean, median, mode, and percentiles help you grasp where most of your information is concentrated
- Variability: variance and standard deviation reveal how spread out your information is
- Potential issues: imbalances, duplicates, noise, and missing values
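
The measures above can be sketched with Python's built-in `statistics` module. The `order_values` list is an invented sample (order values in dollars, with one missing entry and one unusually large order), used only to show the idea.

```python
import statistics
from collections import Counter

# Hypothetical sample: order values in dollars, one missing, one very large.
order_values = [12, 15, 15, 18, 20, 22, 15, None, 95]

# Set missing values aside before computing any statistics.
present = [v for v in order_values if v is not None]

summary = {
    "count": len(present),
    "missing": len(order_values) - len(present),
    # Where the information is concentrated
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "mode": statistics.mode(present),
    "quartiles": statistics.quantiles(present, n=4),
    # How spread out it is
    "variance": statistics.variance(present),
    "stdev": statistics.stdev(present),
}

# Values that appear more than once (possible duplicates worth a look).
repeated = {value: n for value, n in Counter(present).items() if n > 1}

for name, value in summary.items():
    print(name, value)
print("repeated values", repeated)
```

In this sample the mean (26.5) sits well above the median (16.5), which is the kind of gap a single large value like 95 tends to cause, and a prompt to look for outliers.
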
Iteratively fix the bad and vague information
- Remove duplicates
- Handle missing values
- Fix or remove outliers
- Transform invalid or inconsistent values
- Convert values that people represent/refer to in multiple ways to one way, e.g.
  - Terminology: “DOB”, “Birthdate”
  - Scale: meters, miles
  - Format: “May 23rd, 2024”, “2024-05-23”
  - Mapping: “Yes”, “Y”, “1”
  - Encoding: blue as hex (#0000FF), rgb(0,0,255)
- Assess how accurate, complete, and consistent the information is, and how much you trust it, from the start to the end of your situation
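
A few of the fixes above (terminology, format, mapping, duplicates) can be sketched in Python like this. The records, the alias table, and the accepted date formats are assumptions for the example, and the date parsing assumes English month names.

```python
import re
from datetime import datetime

# Assumed lookup tables; extend them to match your own data.
FIELD_ALIASES = {"dob": "birthdate", "birthdate": "birthdate"}
TRUE_VALUES = {"yes", "y", "1", "true"}
FALSE_VALUES = {"no", "n", "0", "false"}
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def normalize_date(text):
    """Return an ISO date string, or None if no known format matches."""
    # Drop ordinal suffixes so "May 23rd, 2024" becomes "May 23, 2024".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_flag(text):
    """Map the many spellings of yes/no to True/False, else None."""
    value = str(text).strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    return None


def clean(records):
    """Unify field names and values, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        row = {}
        for key, value in record.items():
            row[FIELD_ALIASES.get(key.lower(), key.lower())] = value
        row["birthdate"] = normalize_date(row.get("birthdate") or "")
        row["subscribed"] = normalize_flag(row.get("subscribed", ""))
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(row)
    return cleaned


# Hypothetical messy records: the first two describe the same person.
rows = [
    {"name": "Ana", "DOB": "May 23rd, 2024", "subscribed": "Yes"},
    {"name": "Ana", "Birthdate": "2024-05-23", "subscribed": "Y"},
    {"name": "Ben", "DOB": "", "subscribed": "1"},
]

result = clean(rows)
for row in result:
    print(row)
```

Note that the two "Ana" rows only show up as duplicates after the values are converted to one form, which is why the order of the fixes matters. Values that cannot be converted are left as `None` so they can be handled as missing rather than guessed.
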
Answer your Questions through analyzing and modelling your information

With data analytics techniques
- I’ll skip this
With machine learning techniques
- Tomorrow’s post
Share what you found by telling a story and turning it into a picture
- Start with the context
  - Every story has people, a date and time (in the past, present, or future)
  - Where and when is it happening
- Don’t recite facts and figures (leave those for the appendix). Instead, tell it the way you would talk to your friends.
- Don’t use jargon, technical terms, or complex words unless your audience expects them
- If it’s in the future, tell it the same way you would tell a story from the past
  - instead of, last Friday…
  - imagine it’s Monday, 9am in 2029; tell us what you see

Credits

My learning from: Google Data Analytics, Situation Challenge Questions Answers (SCQA framework)
My editors fixing gaps in my understanding: ChatGPT, Gemini
