How a data scientist thinks
Tom Hosiawa • 2 min read
Part two of a four-part series on how a Product Manager thinks about using machine learning in products. Before we get to the Product Manager part, we need to understand how a Data Scientist thinks.
Steps involved
 Understand your situation, challenge, and outcome if you solve it
 Prepare the information (aka data) related to it
 Understand your information — how it’s spread out, where it’s concentrated, issues
 Iteratively fix the bad and vague information
 Answer your Questions through analyzing and modelling your information
 Share what you found by telling a story and making it into a picture
Here’s how you do each step

Understand your situation, challenge, and outcome if you solve it
 Define your situation and the challenge: make clear the problem you are trying to solve and its context (who’s involved, and what, where, when, and how it is happening)
 Define desired outcome: what does success look like and what metrics will measure it
 Identify the information you have available, the gaps and opportunities, and your options

Prepare the information related to it
 Decide what type of information is right for the question you’re trying to answer
 Gather information that is relevant to the problem, and check that it’s accurate, complete, and consistent
 Identify all scenarios (i.e. factors and conditions) under which one piece of information is influenced by, dependent on, or changed by other information or events
 Arrange and structure your information so that it is easier to analyze and understand
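
To make "arrange and structure" and "complete" a bit more concrete, here is a minimal Python sketch. The rows, the field names (customer, signup, plan), and the helper missing_fields are all invented for illustration; real data will look different.

```python
# Hypothetical raw rows, e.g. exported from a spreadsheet (invented for illustration).
raw_rows = [
    {"customer": "Ana", "signup": "2024-05-23", "plan": "pro"},
    {"customer": "Ben", "signup": "", "plan": "free"},
    {"customer": "Cy", "plan": "pro"},
]

# Assumption: these are the fields the question needs answered for every row.
REQUIRED_FIELDS = ("customer", "signup", "plan")


def missing_fields(row):
    """Return the required fields that are absent or empty in a row."""
    return [field for field in REQUIRED_FIELDS if not row.get(field)]


# Arrange: key the rows by customer so each one is easy to look up.
by_customer = {row["customer"]: row for row in raw_rows}

# Check completeness: keep only the customers with something missing.
incomplete = {}
for name, row in by_customer.items():
    gaps = missing_fields(row)
    if gaps:
        incomplete[name] = gaps

print(incomplete)
```

Keying by customer is only one way to structure the rows; the point is to pick a shape that makes the gaps easy to see before any analysis starts.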

Understand your information — how it’s spread out, where it’s concentrated, issues
 Characteristics: e.g. numerical, categorical, text, etc.
 Distributions: mean, median, mode, and percentiles help you grasp where most of your information is concentrated
 Variability: variance and standard deviation reveal how spread out your information is
 Potential issues: imbalances, duplicates, noise, and missing values
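
As a rough sketch of what these checks can look like, the snippet below uses Python's standard statistics module on a small made-up list of readings and labels. The numbers are invented; only the technique matters.

```python
import statistics
from collections import Counter

# Made-up numeric readings; None stands for a missing value.
readings = [2, 3, 3, 4, 5, 5, 5, 8, 40, None]
# Made-up categorical labels, to look for imbalance.
labels = ["no", "no", "no", "no", "no", "no", "no", "no", "yes"]

# Potential issue: missing values.
present = [r for r in readings if r is not None]
missing_count = len(readings) - len(present)

summary = {
    # Where the information is concentrated
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "mode": statistics.mode(present),
    "quartiles": statistics.quantiles(present, n=4),
    # How spread out it is (sample variance and standard deviation)
    "variance": statistics.variance(present),
    "stdev": statistics.stdev(present),
}

# Potential issue: imbalance between categories.
label_counts = Counter(labels)

print(summary)
print("missing:", missing_count, "labels:", dict(label_counts))
```

In this made-up data the mean sits well above the median because of the single reading of 40, which is the kind of hint that an outlier deserves a closer look.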

Iteratively fix the bad and vague information
 Remove duplicates
 Handle missing values
 Fix or remove outliers
 Transform invalid or inconsistent values
 Convert values that people represent or refer to in multiple ways into one consistent form, e.g.
  Terminology: “DOB”, “Birthdate”
  Scale: meters, miles
  Format: “May 23rd, 2024”, “20240523”
  Mapping: “Yes”, “Y”, “1”
  Encoding: blue as hex (#0000FF), rgb(0,0,255)
 Assess how accurate, complete, and consistent the information is, and how much you trust it, from the start to the end of your situation
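
Here is one possible sketch of a few of these fixes in Python, using only the standard library. The field names (birthdate, subscribed), the alias table, and the helper functions are assumptions made up for this example, not a standard recipe.

```python
import re
from datetime import datetime

# Assumed lookup tables for this example only.
KEY_ALIASES = {"dob": "birthdate", "birthdate": "birthdate"}  # Terminology
TRUE_VALUES = {"yes", "y", "1"}                               # Mapping
FALSE_VALUES = {"no", "n", "0"}
METERS_PER_MILE = 1609.344                                    # Scale


def to_iso_date(text):
    """Format: turn 'May 23rd, 2024' or '20240523' into '2024-05-23'."""
    # Drop ordinal suffixes such as the 'rd' in '23rd'.
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in ("%B %d, %Y", "%Y%m%d"):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unknown format: leave it for a person to review


def to_bool(text):
    """Mapping: turn 'Yes', 'Y', '1' into True (and 'No', 'N', '0' into False)."""
    value = str(text).strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    return None


def to_meters(amount, unit):
    """Scale: express a distance in meters."""
    if unit == "meters":
        return amount
    if unit == "miles":
        return amount * METERS_PER_MILE
    raise ValueError(f"unknown unit: {unit}")


def clean_record(record):
    """Apply the terminology, format, and mapping fixes to one record."""
    cleaned = {}
    for key, value in record.items():
        name = key.strip().lower()
        cleaned[KEY_ALIASES.get(name, name)] = value
    if "birthdate" in cleaned:
        cleaned["birthdate"] = to_iso_date(cleaned["birthdate"])
    if "subscribed" in cleaned:
        cleaned["subscribed"] = to_bool(cleaned["subscribed"])
    return cleaned


# Made-up records that say the same things in different ways.
raw = [
    {"DOB": "May 23rd, 2024", "Subscribed": "Yes"},
    {"Birthdate": "20240523", "subscribed": "Y"},
    {"dob": "20240524", "subscribed": "1"},
]

cleaned = [clean_record(r) for r in raw]

# Remove duplicates, keeping the first occurrence of each record.
unique, seen = [], set()
for rec in cleaned:
    fingerprint = tuple(sorted(rec.items()))
    if fingerprint not in seen:
        seen.add(fingerprint)
        unique.append(rec)

print(unique)
```

Notice that the first two records only turn out to be duplicates after the values are converted to one form, which is why this step tends to be iterative.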

Answer your Questions through analyzing and modelling your information
With data analytics techniques
 I’ll skip this
With machine learning techniques
 Tomorrow’s post

Share what you found by telling a story and making it into a picture
 Start with the context
  Every story has people, a date and time (in the past, present, or future)
  Where and when is it happening
 Don’t list facts and information (leave those for the appendix). Instead, tell it the way you would talk to your friends.
 Don’t use jargon, technical terms, or complex words unless your audience expects them
 If it’s in the future, tell it the same way you would tell a story from the past
  instead of “last Friday…”
  imagine it’s Monday, 9am in 2029; tell us what you see
Credits
 My learning from: Google Data Analytics, Situation Challenge Questions Answers (SCQA framework)
 My editors fixing gaps in my understanding: ChatGPT, Gemini