How a data scientist thinks
Tom Hosiawa • 2 min read
Part two of a four-part series on how a Product Manager thinks about using machine learning in products. Before we get to the Product Manager part, we need to understand how a Data Scientist thinks.
Steps involved
- Understand your situation, challenge, and outcome if you solve it
- Prepare the information (aka data) related to it
- Understand your information: how it’s spread out, where it’s concentrated, and what issues it has
- Iteratively fix the bad and vague information
- Answer your Questions through analyzing and modelling your information
- Share what you found by telling a story and turning it into a picture
Here’s how you do each step
Understand your situation, challenge, and outcome if you solve it
- Define your situation and the challenge: make clear the problem you are trying to solve and its context (who’s involved; what, where, when, and how it’s happening)
- Define the desired outcome: what success looks like and which metrics will measure it
- Identify what you have to work with: available information, gaps and opportunities, and options
Prepare the information related to it
- Decide what type of information is right for the question you’re trying to answer
- Gather the information that is relevant to the problem, and make sure it is accurate, complete, and consistent
- Identify all scenarios (i.e. factors and conditions) under which one piece of information is influenced by, dependent on, or changed by other information or events
- Arrange and structure your information so that it is easier to analyze and understand
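As a rough sketch of the “complete and consistent” check above, here is one way to measure it once the information is arranged as rows. The field names (`user_id`, `signup_date`, `plan`) and the sample rows are made up for illustration; they are not from any real dataset.

```python
from collections import defaultdict

# Hypothetical rows gathered from two sources (all names and values are assumptions)
rows = [
    {"user_id": 1, "signup_date": "2024-05-01", "plan": "free"},
    {"user_id": 2, "signup_date": "2024-05-03", "plan": None},
    {"user_id": "3", "signup_date": None, "plan": "pro"},
]
REQUIRED_FIELDS = ("user_id", "signup_date", "plan")


def completeness(rows, fields):
    """Share of rows (0.0 to 1.0) that have a non-missing value for each field."""
    return {f: sum(r.get(f) is not None for r in rows) / len(rows) for f in fields}


def type_mix(rows, fields):
    """Python type names seen per field; more than one hints at inconsistency."""
    seen = defaultdict(set)
    for r in rows:
        for f in fields:
            if r.get(f) is not None:
                seen[f].add(type(r[f]).__name__)
    return dict(seen)


report = completeness(rows, REQUIRED_FIELDS)
types_seen = type_mix(rows, REQUIRED_FIELDS)
print(report)
print(types_seen)
```

A field that is well below 1.0 in the first report is incomplete, and a field with more than one type in the second report (here `user_id` shows up as both a number and text) is a consistency problem to fix before analysis.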
Understand your information: how it’s spread out, where it’s concentrated, and what issues it has
- Characteristics: e.g. numerical, categorical, text, etc.
- Distributions: mean, median, mode, and percentiles help you grasp where most of your information is concentrated
- Variability: variance and standard deviation reveal how spread out your information is
- Potential issues: imbalances, duplicates, noise, and missing values
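To make the measures above concrete, here is a minimal sketch using only Python’s built-in `statistics` module. The `order_values` list is invented sample data, and `summarize` is a helper name I made up for this example.

```python
import statistics

# Hypothetical sample: order values in dollars, with one missing entry (None)
order_values = [12.0, 15.5, 15.5, 18.0, 22.0, None, 19.5, 250.0, 15.5]


def summarize(values):
    """Basic distribution, variability, and missing-value counts for one numeric column."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns three cut points: the 25th, 50th, and 75th percentiles
    p25, _p50, p75 = statistics.quantiles(present, n=4)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "p25": p25,
        "p75": p75,
        "variance": statistics.variance(present),  # sample variance
        "stdev": statistics.stdev(present),        # sample standard deviation
    }


summary = summarize(order_values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample the mean (46.0) sits far above the median (16.75). That gap is a quick hint that a single large value (250.0) is pulling the average up, which is the kind of issue the next step deals with.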
Iteratively fix the bad and vague information
- Remove duplicates
- Handle missing values
- Fix or remove outliers
- Transform invalid or inconsistent values
- Convert values that people represent/refer to in multiple ways to one way, e.g.
- Terminology: “DOB”, “Birthdate”
- Scale: meters, miles
- Format: “May 23rd, 2024”, “2024-05-23”
- Mapping: “Yes”, “Y”, “1”
- Encoding: blue as hex (#0000FF), rgb(0,0,255)
- Assess how accurate, complete, and consistent the information is, and how much you trust it, from the start to the end of your situation
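Here is a small sketch of a few of the fixes above: renaming fields to one term, converting dates, yes/no values, and distances to one representation, and dropping exact duplicates. Everything in it (the records, the field names, the helper functions) is hypothetical. It keeps missing values as `None` rather than guessing, and it leaves outlier handling out.

```python
import re
from datetime import date, datetime

# Hypothetical raw records showing the kinds of inconsistency listed above
raw_records = [
    {"id": 1, "DOB": "2024-05-23", "subscribed": "Yes", "distance": "5 km"},
    {"id": 1, "DOB": "2024-05-23", "subscribed": "Yes", "distance": "5 km"},  # duplicate
    {"id": 2, "Birthdate": "May 23rd, 2024", "subscribed": "Y", "distance": "2 miles"},
    {"id": 3, "DOB": None, "subscribed": "0", "distance": "1200 m"},
]

FIELD_ALIASES = {"DOB": "birth_date", "Birthdate": "birth_date"}  # terminology
TRUTHY, FALSY = {"yes", "y", "1"}, {"no", "n", "0"}               # mapping
METERS_PER_UNIT = {"m": 1.0, "km": 1000.0, "miles": 1609.344}     # scale
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")                          # format


def parse_date(text):
    """Return a date, or None when the value is missing or unparseable."""
    if text is None:
        return None
    # Drop ordinal suffixes so "May 23rd, 2024" becomes "May 23, 2024"
    cleaned = re.sub(r"(\d)(st|nd|rd|th)", r"\1", text)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def parse_flag(text):
    """Map the many spellings of yes/no to True/False; None if unrecognized."""
    value = str(text).strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    return None


def parse_meters(text):
    """Convert a string like '2 miles' to meters."""
    amount, unit = text.split()
    return float(amount) * METERS_PER_UNIT[unit]


def clean(records):
    """Standardize each record, then drop exact duplicates (first one wins)."""
    seen, result = set(), []
    for record in records:
        row = {FIELD_ALIASES.get(key, key): value for key, value in record.items()}
        row["birth_date"] = parse_date(row.get("birth_date"))
        row["subscribed"] = parse_flag(row["subscribed"])
        row["distance_m"] = parse_meters(row.pop("distance"))
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(row)
    return result


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Standardizing before removing duplicates is a deliberate choice in this sketch: two records that differ only in how a value is written become identical once converted, so they can be caught as duplicates.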
Answer your Questions through analyzing and modelling your information
With data analytics techniques
- I’ll skip this
With machine learning techniques
- Tomorrow’s post
Share what you found by telling a story and turning it into a picture
- Start with the context
- Every story has people, a date and time (in the past, present, or future)
- Where and when is it happening
- Don’t give facts and information (leave that for the appendix). Instead, tell it like you talk to your friends.
- Don’t use jargon, technical terms, or complex words unless your audience expects them
- If it’s set in the future, tell it the same way you would tell a story from the past
- Instead of “last Friday…”, try “imagine it’s Monday, 9am in 2029” and tell us what you see
Credits
- My learning from: Google Data Analytics, Situation Challenge Questions Answers (SCQA framework)
- My editors fixing gaps in my understanding: ChatGPT, Gemini