Live: AI Evals: The Next Discovery Habit w/ Teresa Torres
Oct 30, 2025, 10:30 PM - 11:15 PM GMT+5:30
Audio Transcript
Teresa Torres
Hey, Hamel!00:00:18
Hamel Husain
Hey, how's it going?00:00:20
Teresa Torres
Pretty good, how about you?00:00:21
Hamel Husain
Pretty good! Nice to see you. Did you move locations?00:00:23
Teresa Torres
No, I just rearranged furniture.00:00:26
Oh, cool, okay. We're selling our Portland place, so we had to move a bunch of stuff from Portland to Bend and…00:00:29
I meant rearranging some things.00:00:35
Hamel Husain
Okay.00:00:38
Teresa Torres
Yeah.00:00:39
Hamel Husain
Yeah, it'd be good to meet sometime,00:00:41
It's sad I didn't catch you when you were in Portland.00:00:44
Teresa Torres
Yeah, we're actually coming this weekend!00:00:47
Hamel Husain
Oh, okay.00:00:49
Teresa Torres
Yeah, I don't… I… we have plans… we're going to Winterhawks games.00:00:50
But I think, like, weekend daytime is pretty open. I don't know how flexible your schedule is, I know it's last minute, but if you want to meet up for.00:00:57
Hamel Husain
Okay, I'll check, like, what's going on with the kids, and I'll try to see if I can get something going.00:01:03
Teresa Torres
But we'll be… we'll be back. I mean, even… we're selling our place, but, we're gonna… I mean, we have a ton of friends in Portland, and…00:01:09
Hamel Husain
Okay.00:01:18
Teresa Torres
Like, we'll be back on a regular basis. I'm terrible, like, we know enough people in Portland that, like, we start planning a trip and I get overwhelmed by, like, who should I tell, what plans should I make? Everything's last minute. I text friends that are like, we're on the road to Portland right now, what are you doing this evening? Which I realize doesn't work for people with kids.00:01:18
Hamel Husain
No, no, it makes sense. I would probably be doing the same thing. There's so much going on, I would probably text at the last second.00:01:38
Cool, this… I'm really excited about this talk, by the way. I actually prepared a short… Intro.00:01:45
Teresa Torres
Okay? I'm gonna tell you it's gonna be pretty rough.00:01:53
I meant what I said about… I'm gonna just riff a little bit. My plan is to, like… here's my plan. I'm gonna give an overview of the discovery habits.00:01:57
Just to give context, I'm gonna set it up as, like, a little bit like how you guys did early on, of, like, this is just the scientific method.00:02:05
Because I want to talk about the overlap between, like, why evals just fit directly in.00:02:13
And then most of the talk is gonna be about how, like, a lot of what you learn in the class, if you don't know your customer well, doesn't really work.00:02:18
So, like, I'm gonna get into, like, how do you choose what use cases your AI00:02:28
should address, and when you're creating synthetic data, are you just making up dimensions, or is it informed by customers? Like, and what you know about customers? And so I'm gonna, like, kinda overlay00:02:33
Course concepts into, like, here's where knowing your customer well will make these steps better.00:02:44
Hamel Husain
Okay, I'm really excited. It's gonna be… it's gonna be great.00:02:51
Okay, we can… what time's… okay, we can go ahead and get started. Let me… let everybody in and all that good stuff.00:02:55
Okay, it says you're a co-host, right? Just wanna make you co-host, just in case. Did you get a notification?00:03:08
Teresa Torres
I didn't get a notification.00:03:13
Hamel Husain
Let me try again.00:03:15
No, it says you're co, so… Alright. That'll be fine. Okay, so, let me see… one second…00:03:22
Teresa Torres
Looks like we got a good amount of people.00:03:29
Hamel Husain
Yeah. Fun.00:03:31
Teresa Torres
Open up chat…00:03:34
Hamel Husain
Okay, we can go ahead and get started. Welcome, everybody. So, today we have a really special guest, Teresa Torres. So, many of you already know her as a product discovery coach, and the author of the book, Continuous Discovery Habits.00:03:49
Other product managers have described her as a national treasure, and I can see why. You know, her work has been really influential,00:04:04
helping teams move from, like, feature factories to, like, building products that create value. And it's one of the things that she really dives into in her teaching.00:04:13
So, Teresa was a student in our first cohort.00:04:22
And in her last session, that she did with us, Teresa demonstrated her end-to-end evals workflow for an AI discovery coach that she's building, all from within a notebook.00:04:26
And she used AI assistants to do that, and she found, like, important errors and used evals to systematically improve her app.00:04:38
I'll put the link to that video in the chat shortly. So this course has focused heavily on the how of evals.00:04:46
So you've learned about the analyze, measure, improve lifecycle.00:04:54
Today, Teresa is here to connect this technical process to the question of, like, what we should be evaluating, and she'll be explaining how her discovery habits can inform evals, but more importantly, she's gonna show how, like, the eval process can improve discovery.00:04:58
Enabling, like, faster assumption testing, things like that. So this is the session that bridges the gap from being an engineer who measures AI to being a product leader who can drive it towards value. So, really excited about this. Please give a warm welcome to Teresa.00:05:14
Teresa Torres
Thank you, Hamel, that was a very nice intro. I'm curious what… I know Hamel said that this cohort is a lot of product managers, thanks to Lenny. I'm just curious if people can put your role in the chat, just so I can get a sense. Are we mostly product managers? Do we have a mix?00:05:32
Well, we got a pretty good mix so far. I love it. Okay. And I saw from your reactions, I know some of you are already familiar with my work. If you're not, I'm gonna start with, like, a really high-level overview of the discovery habits. I'm gonna try to connect it to what you're learning in class.00:05:50
And then what I'm gonna do is I'm gonna go through some of the ideas that you've learned in this class, and show you where your discovery work can help make it even better.00:06:06
And this is pretty… this is gonna be pretty informal. This is not, like, a polished talk that I've given a million times. You're just gonna see my slides are pretty rough. I want you to think about this as…00:06:17
We are just having an initial conversation about what this could look like. I want you to push back on some of these ideas. I want you to ask questions at the end. So think about this. I always tell the teams that I work with, like, you're always iterating on crummy first drafts.00:06:28
Hopefully it's not crummy, but consider this kind of my rough first draft, and then my goal is hopefully this starts a conversation about how do we… what does discovery look like with AI products?00:06:44
How does discovery inform evals? You're gonna see I'm gonna frame evals as the next discovery habit, because I do think it is a really important piece.00:06:56
And we'll just go from there. So, I'm gonna share my slides, and we're gonna dive right in.00:07:05
Alright, let's do it. So, if you're not familiar with my work, I've spent the last 14 years as a product discovery coach. I'm not gonna spend a lot of time on my bio. Some of the ideas that we'll talk through, like, I always share this slide at the beginning of my talks, because…00:07:12
product people tend to hear ideas, and they think that'll never work in my environment. And so this is my slide of, like, actually it's worked in a lot of countries, in a lot of organizations, big and small, regulated, consumer, B2B. It really is battle-tested, and we're gonna talk about why it's… why this works in so many broad contexts.00:07:26
Because I think this is going to be relevant to evals as well. So let's dig in! Like…00:07:45
I always start with, like, what in the world is product discovery, and, like, what is this jargon that we use? And so in the product world, discovery is just the work that we're doing when we decide what to build.00:07:49
And we contrast that with delivery, which is the engineering work that we're doing to write code, to ship production-quality products, to maintain those products over time.00:08:00
And the reason why this split is important to talk about is a lot of companies overemphasize delivery and underemphasize discovery, and we're seeing that companies are finally starting to realize these are equally important domains. We could build the best version of the wrong product, and we're not going to succeed.00:08:10
I did write a book, Continuous Discovery Habits. I hate it when people, like, define terms, because I feel like we get into these, like, opinionated wars about what words mean.00:08:29
But in my coaching, I met a lot of teams that were like, Teresa, we already do continuous discovery. And really, they did discovery activities on a periodic basis, but not from a continuous mindset. And so I did define continuous discovery in my book as weekly touchpoints with customers00:08:39
by the team building the product, where they conduct small research activities00:08:57
in pursuit of a desired product outcome, and this is a mouthful. So to help folks understand it, I visualize it. And so this visual, I'm gonna spend, like, 3 minutes walking through this, just so you have the context, and then we're gonna dive right in to how does this relate to evals.00:09:01
So the whole world that I created around continuous discovery habits was really meant for empowered product teams, so teams being asked to deliver an outcome.00:09:18
So, just a little bit of history, especially for the younger folks in the group, and maybe even some of you work in these environments, because it's still very common. A lot of product teams were being asked to deliver just a feature list. Here's a roadmap, your job is to go deliver this feature list. Go build.00:09:27
Right? And there wasn't a lot of discovery in that process.00:09:43
A lot more companies, thanks to things like COVID, actually generative AI, these huge external trends that are disrupting roadmaps.00:09:46
Teams are shifting to be outcome-focused, right? They're saying… companies are saying, look, we don't know what you should build, that's your job to figure out, but we know we need you to have this impact on the business.00:09:55
And the challenge with starting with an outcome is this is a wide-open, unstructured problem. How do we know how we're gonna reach that outcome?00:10:05
And so with the continuous discovery habits, we're looking at small research activities. The first is we're interviewing customers, the second is we're assumption testing. And this is the piece that I want to talk through, because this is really just the scientific method. And so I want to highlight this in contrast with evals.00:10:14
So when you start with an outcome, we need to reduce customer churn, we need to increase retention. How do we know what's going to impact that number?00:10:30
We don't, right? And so we have to go interview customers. As we interview customers, we're not asking them, what should we build? We're interviewing them to learn about their world, their goals, their context, the environment in which they're trying to reach those goals, the tools that they're using.00:10:38
And as we do that, we map the opportunity space, and I want you to think about this as we're developing a hypothesis. This is literally just the scientific method. We're developing a hypothesis of how our customers work.00:10:54
So we're developing a mental model of who they are, what their needs are, what they're doing, and then when we start to design solutions, I want you to think about your solutions as an experiment of.00:11:07
Does it match our understanding of our customer? If it does, it should work for them, right? And so as we develop a solution, and we assumption test to evaluate that solution, what we're really testing is our understanding of the opportunity space.00:11:19
And so, if you're familiar with the discovery habits, you might not have heard me talk about it this way, but it really, like, the reason why these tactics work in so many environments is because it's really grounded00:11:34
in good scientific thinking, and the scientific method, and this idea of inductive and deductive thinking. So if you're not familiar with those terms,00:11:45
Inductive thinking is this idea of, like, generating a theory of the world, and then deductive reasoning is like, okay, now let's look at a particular and test it. And there's this back-and-forth movement between induction and deduction.00:11:55
And so, if you're familiar with the opportunity solution trees, that's the visual in the middle, the opportunity space is our inductive theory of our customer's mental model and what they need and how they work.00:12:09
And our solutions are deductive tests of our understanding. And when we find problems with our solutions, we have to revise our understanding of opportunity space. We have to revise that inductive theory.00:12:19
Okay, so the reason why I'm framing it that way is because in this class, you learned…00:12:32
that error analysis and evals and analyzing and measuring, this is just another flavor of the scientific method, right? There's, like, an underlying pattern here that's consistent across both methods. It's really how do we be good critical thinkers.00:12:38
How do we really make sure that what we're building matters to our customer?00:12:57
And with, like, traditional discovery, we're looking at a really close match between our understanding of our customers and what we're building for them.00:13:01
And the whole point of continuous discovery is how do we make sure we're building the right solution that matches that target opportunity, that customer need, in a way that's going to create business value by driving that outcome.00:13:11
And so a lot of the discovery habits are, how do we align these things? And that's why we have this visual that's a decision tree, right? Where, like, as we do our discovery work, we're trying to keep these things really aligned.00:13:24
Well, this is really similar to eval work.00:13:36
So I really think about evals, when done well, as really just a discovery, part of the discovery habits, right? So this is a slide I shared in my original talk. This is a grid of my first eval results.00:13:39
We're looking at 1, 2, 3, 4, my first 5 initial evals. The column down the left is traces, and this is just whether or not they passed or failed.00:13:51
And by having this visibility, I have a fast feedback loop, right? Every time I make a model change, a prompt change, a temperature change, a change to my chunking strategy, or really any other change.00:14:00
I can now get a fast feedback loop and make sure my changes are working for my customers, and that's no different than, like, the traditional assumption testing we've been doing to evaluate, does my solution address this customer need in a way that's gonna drive my outcome for the business?00:14:12
Right? And so with evals, we're asking, is my solution any good? Is it actually meeting the customer's need?00:14:29
So for me, evals are just another way of evaluating our assumptions, evaluating our solutions, evaluating if they're a good fit for our customer.00:14:36
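A grid like the one described here, traces down the side and evals across the top with pass or fail in each cell, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook shown in the talk: the trace IDs, eval names, and results are invented placeholders, and `pass_rates` and `regressions` are hypothetical helper names.

```python
# Hypothetical pass/fail grid: rows are traces, columns are evals.
# Trace IDs, eval names, and results are made-up placeholders.
before = {
    "trace_01": {"eval_a": True, "eval_b": True, "eval_c": False},
    "trace_02": {"eval_a": True, "eval_b": False, "eval_c": False},
    "trace_03": {"eval_a": True, "eval_b": True, "eval_c": True},
}

# The same traces re-scored after some change (a new prompt, model, etc.).
after = {
    "trace_01": {"eval_a": True, "eval_b": True, "eval_c": True},
    "trace_02": {"eval_a": False, "eval_b": False, "eval_c": False},
    "trace_03": {"eval_a": True, "eval_b": True, "eval_c": True},
}


def pass_rates(grid):
    """Return the fraction of traces that pass each eval."""
    totals = {}
    for evals in grid.values():
        for name, passed in evals.items():
            hits, count = totals.get(name, (0, 0))
            totals[name] = (hits + int(passed), count + 1)
    return {name: hits / count for name, (hits, count) in totals.items()}


def regressions(old, new):
    """List (trace, eval) pairs that passed in `old` and fail in `new`.

    A result missing from `new` is treated as a failure.
    """
    return [
        (trace, name)
        for trace, evals in old.items()
        for name, passed in evals.items()
        if passed and not new.get(trace, {}).get(name, False)
    ]


if __name__ == "__main__":
    print("pass rates before:", pass_rates(before))
    print("pass rates after: ", pass_rates(after))
    print("regressions:", regressions(before, after))
```

Re-running the grid after every change is what makes the feedback loop fast: the pass rates show the overall direction, and the regression list points at the specific traces a change broke.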
The challenge is,00:14:47
they only do this if they're designed well, right? So this, in parentheses, when done well, is really critical. And so that's what I want to talk about, and this is where I'm going to connect the discovery habits to evals. So now we're going to get into some of the things you learned in class.00:14:50
But before I do that, I want to talk about the biggest mistakes I see AI teams do.00:15:06
And I'm now seeing this a lot, because I have a new podcast called Just Now Possible, where I'm interviewing product teams about the AI products that they're building. And I'm learning, there are some teams that are really good at choosing the right customer problem to solve.00:15:11
And are building pretty successful, amazing AI products.00:15:27
There's a lot of teams that are just sort of sprinkling AI across their whole product platform.00:15:32
And it's just sort of table stakes, like, think about all these summarization and translation and sort of the low-hanging fruit of AI.00:15:37
And then, unfortunately, I've had some false positives, where I've spent 90 minutes interviewing teams.00:15:46
And what came out of it, like, I didn't publish these episodes.00:15:52
Because they didn't really choose the right customer problem.00:15:56
And they're building something that just doesn't make any sense.00:15:59
And so, one thing I want to highlight00:16:02
is, it doesn't matter how good your evals are if you picked the wrong customer problem to solve.00:16:05
Right? And so, I actually think AI products, because it's so trendy, and there's so much FOMO, and we're being all asked to add AI features to our roadmaps.00:16:11
It's really easy to just cut corners.00:16:20
But building AI products is really hard. I hope this class is exposing how hard that is.00:16:23
Don't spend all this time on evals if you're solving the wrong problem. So this is just your reminder to just go back to the basics, make sure you really have an understanding of the opportunity space, you really understand what your customers need, and you're using that to inform what AI products to build.00:16:28
And one trend that has come out in my interviews, and I know Hamel will, this will probably resonate, because I've heard him say similar things.00:16:45
When you're building a chatbot, and people can type in anything,00:16:53
One of the most important decisions you need to make is what use cases are you going to cover.00:16:58
And how are you gonna handle when someone puts something in that isn't a use case you can handle? And literally every team I've talked to, the very first step in their AI workflow is, should my agent even respond to this input?00:17:04
And this is where the only way you're gonna get this right is if you have an understanding of the opportunity space, and what use cases your customers need you to address. And I saw Hamel came on video, so let me let him interject here.00:17:19
Hamel Husain
No, I just came on my video because I thought you might be saying something else, but no, I agree with you. The thing that gives me nightmares most when working with teams building AI products is when the product says, ask me anything.00:17:33
Teresa Torres
Yeah.00:17:48
Hamel Husain
It gives me the feeling that, oh, maybe they don't know…00:17:48
What they really want to build here.00:17:53
Teresa Torres
Yeah.00:17:56
Hamel Husain
And.00:17:56
Teresa Torres
I need to refer them to you, I think.00:17:57
Hamel Husain
At that point.00:18:00
Teresa Torres
One of the things we talk a lot about on Just Now Possible, like, this is part of my list of questions.00:18:01
is like, okay, you have a big vision for your AI product. Where did you start? Like, you can't, on day one, address every use case. So, like, how did you know where to start, and how did you…00:18:06
iterate your way through a big… all these use cases, and in my head, the way that I visualize this is, what was your understanding of this really big opportunity space?00:18:17
How did you choose where to start? What small use case did you start with? And how did you iterate your way through the opportunity space? Which, by the way, good product teams have always been doing this. This is what good iterative, continuous development looks like.00:18:27
Right? So I just want to call this out, because AI is so hot right now, we're forgetting some of these fundamentals.00:18:42
Okay, so now let's get into your course content. So I stole some images from the course reader. This is not a criticism of this course. I cannot tell you how much of a fan I am of this course, and the work that Hamel and Shreya are doing.00:18:50
I just see lots of opportunities where understanding your customer makes this way better. So I want to make sure that you're not missing this connection.00:19:03
So one of the first things you learn before you can do any error analysis is you have to generate some initial traces. And I noticed in the course… this version of the course reader, they hammer home, if you have real data, that is the best case scenario. Real evidence of what your customers are doing.00:19:13
But if you don't have that, like, you're a zero-to-one product, nothing's in production, you need somewhere to start, and there's this idea of generating your initial traces with synthetic data.00:19:30
And when I learned this, I was really excited, because I was like, I don't have an initial set of traces, and I need to generate some.00:19:40
And I loved this idea of identifying dimensions.00:19:47
Coming up with values for each of those dimensions, and then creating your tuples. Like, amazing. I loved how structured it was. But in the back of my head was, yeah, but we can't just guess what these dimensions should be. Like, what features should we support? Are we just guessing?00:19:51
And, like, what client personas should we support? How do we know who our clients are? And that we have first-time buyers and experienced investors? And, like.00:20:07
what scenarios should we support? And especially, like.00:20:15
how clearly the user expresses intent? Like, how do we know that they're gonna type in something as lazy as, show me homes, right? Like, we know this by talking to our customers. So, like.00:20:19
If you're building an AI product, and especially a 0 to 1 product, don't just make up dimensions. Like, you have to rely on your discovery foundation of understanding, these are the opportunities we're gonna go after. By the way, those are your use cases. Your AI is gonna address those opportunities. Those are your features.00:20:32
And then you have to be interviewing people that have those needs so that you understand the variation in customer segments. And you could even run assumption tests about, like.00:20:50
You can show someone a chat window, or whatever the interface of your AI is gonna be, and ask them, like, hey, you're looking for a new apartment, what would you type in here? And start to get a sense for what's that variation in how people express their intent and their interest.00:21:01
So, it's… here's the deal, like, your initial synthetic data is gonna generate your initial traces, and this is the foundation for your error analysis, and how you're gonna evaluate quality.00:21:18
Don't guess here. Like, ground it in something you actually know about your customers.00:21:30
So.00:21:35
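The dimensions-to-tuples step discussed above can be sketched as follows. This is a minimal illustration, assuming a real-estate assistant like the course reader example; apart from the two personas named in the talk, the dimension values are invented, and the `source` labels ("interview", "assumption test", "guess") are a hypothetical way of recording whether a value is grounded in customer evidence.

```python
from itertools import product

# Each dimension maps a value to where that value came from.
# Values and sources are illustrative placeholders, not real research data.
dimensions = {
    "feature": {
        "property search": "interview",
        "schedule showing": "guess",
    },
    "persona": {
        "first-time buyer": "interview",
        "experienced investor": "interview",
    },
    "intent_clarity": {
        "vague": "assumption test",
        "specific": "guess",
    },
}


def build_tuples(dims):
    """Return every combination of one value per dimension."""
    names = list(dims)
    combos = product(*(dims[name] for name in names))  # iterates dict keys
    return [dict(zip(names, combo)) for combo in combos]


def ungrounded(dims):
    """List (dimension, value) pairs that are still only guesses."""
    return [
        (name, value)
        for name, values in dims.items()
        for value, source in values.items()
        if source == "guess"
    ]


if __name__ == "__main__":
    tuples = build_tuples(dimensions)
    print(len(tuples), "tuples, for example:", tuples[0])
    print("still guesses:", ungrounded(dimensions))
```

Keeping the evidence source next to each value makes the hypothesis explicit: anything still marked as a guess is a candidate for an interview or assumption test before traces are generated from it.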
Hamel Husain
I really love this, by the way. One thing that we say in our course is, like, you know, you need to come to synthetic generation with a hypothesis.00:21:36
But I kind of leave it at that. Yeah. And I think it's a little bit vague to some people, and I really like the way that you are couching it, and, you know, that you're going to be explaining right now. So I think you're gonna give a lot more meat to this, like, what is… what do you mean, hypothesis?00:21:44
Teresa Torres
Yeah, like…00:21:59
Okay, let's say you follow Hamel and Shreya's process perfectly, and you've built all these evals, but you started with, like.00:22:01
a guess on your synthetic data. Like, what's the risk of that? You're gonna end up with an initial dataset of traces that doesn't reflect what happens in production.00:22:09
Which means, like, yeah, you evaluated quality, but you evaluated quality on the wrong thing.00:22:21
So, like, this process that you're learning in this course is only as good00:22:27
as the data set you're using to drive your evals. Like, period. Right? And so, if you just do a little bit of discovery to, like, start… and, like, maybe you do start with some guesses, like, go ahead and write down what you think00:22:32
the right feature set is, and the right client persona is, and how you think they're gonna express intent. Like, go ahead and be explicit about what that hypothesis is.00:22:46
But now, go out and talk to customers, and run some assumption tests, and learn about, are those the use cases they care about? Is that how they express their intent?00:22:55
Are you seeing those segments in the people that you're talking to? And I would do that before you even start generating your initial traces. Like, why pay for that LLM activity? And don't start doing error analysis. Error analysis is a lot of work.00:23:03
First, make sure you have a really rich understanding00:23:18
Of these… like, what this represents, what these dimensions represent, is how well do you understand your customer, and how they might want to use your product.00:23:23
So, I would spend some upfront time making sure that I really understand this. And what's hard about this is, like.00:23:32
If everybody on this call, like, works on their product and has worked on their product for a while, like, you already know something about your customer. It's gonna feel really easy, like, I can just make a good guess and it's probably pretty good.00:23:39
But I'm gonna remind you, you build a lot of things that your customers don't use.00:23:51
And that's not a criticism, we all build a lot of things our customers don't use. That's just a reminder that, like, your first guess is probably wrong in some way. So, like, let's just do some iterations on that guess with a customer feedback loop.00:23:55
Before we invest all this time and energy in the evals, because evals are hard, they take a lot of work.00:24:09
So, this is really, like.00:24:14
This is the foundation of your eval, so let's start and make sure that foundation is really strong.00:24:17
Alright.00:24:24
Hamel Husain
What do you suggest to people who are building chatbots?00:24:25
And this really is the case, that there's so many chatbots out there, where the input box has a ghost text in it that says, ask me anything.00:24:29
Teresa Torres
Yeah.00:24:37
Hamel Husain
Do you… what do you suggest, like, people take a hard look at their scope and say.00:24:38
Teresa Torres
Yeah!00:24:43
Hamel Husain
What do you do.00:24:44
Teresa Torres
I think the first thing, this has come up a lot in my podcast, where I'm… again, so many teams went through this learning curve of, like.00:24:45
We all default to a chat box because we're used to using ChatGPT.00:24:52
And, like, that might be the right interface, but a lot of people, they see a chat00:24:56
box, and they don't know what to type in.00:25:01
So, like, you have to give people… you have to remove the blank page problem, right? Like, if you just say, ask me anything, like, you're making me, the user, do all the work.00:25:03
If you understand your customer, and you have a really good hypothesis of this feature dimension.00:25:14
You can have a better default text there. You can say, ask me about any of the following things, right? And now you're steering them towards an area where you might be successful.00:25:21
And you see, even ChatGPT over time has started doing this. When you go to the ChatGPT interface, right below the open chat box, where you literally can type in anything, there's suggestions on what ChatGPT can help you with.00:25:31
But these suggestions have to be grounded in what your customer needs.00:25:44
Because if you suggest things they don't care about, you're not really solving the problem.00:25:48
And then I'll tell you, every team I've interviewed, literally their first step, like, they do this, like, scrubbing step before they even send it to the agent, of, like, is the intent behind what the user entered something we can adequately respond to?00:25:52
And only if it is do they pass it on to the agent. If it's not, then they respond with something generic, like, here's… let's connect you with a human, or go visit our knowledge base, or whatever, because they don't want the agent hallucinating, making something up. So I think this is a theme of, like, if you want to build a good00:26:08
general agent, you have to be really clear about what you can adequately address, and then you have to steer people towards those use cases.00:26:27
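The gating step described above, checking whether an input maps to a supported use case before it ever reaches the agent, can be sketched like this. It is a rough illustration under stated assumptions: the use cases and keywords are invented, `run_agent` is a placeholder rather than a real agent call, and the keyword matching is only a stand-in for whatever intent check a team actually uses, which the talk does not specify.

```python
# Hypothetical supported use cases with illustrative trigger keywords.
SUPPORTED_USE_CASES = {
    "property_search": ("home", "apartment", "listing"),
    "schedule_showing": ("showing", "tour", "visit"),
}

# Generic response for anything outside the supported use cases.
FALLBACK = "I can't help with that here. Let's connect you with a human."


def detect_use_case(message):
    """Return the first supported use case matching the message, or None.

    Keyword matching is a stand-in for a real intent check.
    """
    text = message.lower()
    for use_case, keywords in SUPPORTED_USE_CASES.items():
        if any(keyword in text for keyword in keywords):
            return use_case
    return None


def run_agent(use_case, message):
    """Placeholder for the real agent; it only echoes what it would handle."""
    return f"[agent:{use_case}] handling: {message}"


def respond(message):
    """Pass supported requests to the agent; answer the rest generically."""
    use_case = detect_use_case(message)
    if use_case is None:
        return FALLBACK
    return run_agent(use_case, message)


if __name__ == "__main__":
    print(respond("Show me homes"))
    print(respond("What's the weather like?"))
```

The same list of supported use cases can also drive the suggestion text shown beside the input box, so the prompts users see and the requests the agent accepts stay in sync.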
Hamel Husain
I like that.00:26:37
Teresa Torres
Okay, let's look at the next step. So, let's assume we've gotten to the point where we're confident in our dimensions, we've gotten some feedback from our customers, we're generating our initial traces based on our tuples.00:26:39
It's now time to do error analysis.00:26:50
Now, I also took this from the course reader, and it is true some errors are obvious. You don't need to know very much about your customer to do error analysis. In this first situation, like, the user put in00:26:53
They have a pet, and the response didn't include the pet. Like, I don't need to have any domain knowledge, I don't need to have any customer knowledge, that's clearly wrong to me. And, like, we're all gonna agree that's wrong.00:27:06
Right? Same with this second one, like, there's no availability on Saturday, and it's suggesting Saturday, obviously wrong.00:27:17
But you're gonna have error analysis cases where if you don't know your customer and what they need, it's not gonna look wrong, and you're not gonna catch the error. And this is why, like, a lot of this work is being moved from engineers to product managers, because engineers don't always have the domain knowledge00:27:25
to, like, identify the error in the first place. And I think this third example is a good example of this.00:27:42
Send a property list to my investor client in San Mateo, and the error analysis is there's a persona mismatch, and there's a tone and property mismatch.00:27:48
How do we know that?00:27:59
How do we know what investors want? How do we know the right tone to use with them? How do we know they don't want starter homes? Right? This is the domain knowledge that's required to do good error analysis.00:28:00
And, like, what's hard is we can go find a domain expert, like, maybe a realtor.00:28:12
But does that realtor, like, they have good, intuitive, expert knowledge?00:28:17
But, like.00:28:22
Do they really know who your customers are and your customer segments? And, like, no two real estate firms are going after the exact same customer segments. Like, this is really unique to your company and your company's strategy of which customers we're going after.00:28:23
And so it's not just domain knowledge about real estate, it's domain knowledge plus company strategy knowledge plus knowledge of your specific customers.00:28:39
And so this is where, like.00:28:50
Discovery has to play a role. Like, we have to know who our customer is, not just people who buy houses.00:28:52
The people who buy houses that our company is trying to reach.00:28:59
Right? So…00:29:03
When we talk about domain knowledge, like domain expertise, how do we know what good looks like? And I love, like, Hamel's term of, like, the benevolent dictator.00:29:06
But, cause, like, I've been in those meetings where people quibble over wording, and we can take forever, and it's really nice to have a benevolent dictator just say, no, this is the decision. This is not good. We're gonna mark this an error.00:29:16
The challenge is, is who should be that benevolent dictator? And, like, this is where I really want to push everybody on the call to think about, maybe it should be your customers.00:29:29
And what does that look like? Like, why are we letting somebody arbitrarily inside our building walls be the benevolent dictator of what good looks like, when ultimately we're trying to serve a customer?00:29:38
Now, this is gonna sound a little ideal, I realize when we're going from 0 to 1, like, we may not have a ton of time to, like, have a customer look at every trace, and there probably is a need for an internal00:29:50
dictator. But here's the thing, like, in discovery, we have super-fast feedback loops now. So what if, like, we showed some traces in a really customer-friendly form, not all the tool calls and whatnot, but, like.00:30:01
So, imagine you were trying to do this, and you got this response. What would you think?00:30:16
Right? And we let a customer annotate it.00:30:20
We can do that with our assumption testing tools. We can use unmoderated testing platforms to get, like, hundreds of traces annotated by customers in a single day.00:30:24
Right? And so for our really tricky, ambiguous cases.00:30:34
Why don't we let the customer be the benevolent dictator? What does that look like? And I'll tell you, like, this is rough for me, like, I haven't experimented with a lot of these tactics yet.00:30:37
My nature is I want to push as much… I want to get my feedback loop as close to the customer as possible. And this seems like an error… an area where doing this with customers makes a lot of sense to me.00:30:46
Hamel Husain
I really love this idea, and I've been talking with friends about even making some tools that, when you onboard your customer to your product.00:31:00
you kind of… Go through… you tell them, like, help us train.00:31:12
Teresa Torres
Yeah. The AI.00:31:16
Hamel Husain
To you, and, like, have them annotate some things deliberately.00:31:18
Teresa Torres
Yeah.00:31:22
Hamel Husain
with an incentive, like, we're gonna… we're gonna make it better for you, and we're gonna align the AI to you, and, like, kind of show them a progress bar, and then you can, like, keep surfacing things to them.00:31:22
Like, as they discover errors. So I think this is an excellent idea. I love that you're bringing this up.00:31:33
Teresa Torres
Yeah, and you know, when we have a production product, this is actually really easy to do, right? Almost all of us are already integrating these, like, thumbs up, thumbs down feedback.00:31:40
I think the key is, we want them to also annotate it. So, like, in my interview coach, when you get feedback, you can give a thumbs up and thumbs down, but you can also tell me why. And I think that's really key. You're actually getting a customer annotation on your trace.00:31:48
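[Editor's illustration] The distinction between a bare thumbs up/down and a customer annotation can be sketched as a small data structure. This is a minimal sketch only: the `TraceFeedback` class, its field names, and the sample records are hypothetical and are not taken from the interview coach product described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceFeedback:
    """One piece of customer feedback attached to a trace (hypothetical schema)."""
    trace_id: str
    thumbs_up: bool
    reason: Optional[str] = None  # the customer's free-text "why", if they gave one


def has_reason(item: TraceFeedback) -> bool:
    """True when the customer wrote a non-empty explanation."""
    return bool(item.reason and item.reason.strip())


def annotated_failures(feedback: List[TraceFeedback]) -> List[TraceFeedback]:
    """Thumbs-down items that also carry a written reason: the most useful inputs for error analysis."""
    return [f for f in feedback if not f.thumbs_up and has_reason(f)]


def annotation_rate(feedback: List[TraceFeedback]) -> float:
    """Share of all feedback items that include a written reason."""
    if not feedback:
        return 0.0
    return sum(1 for f in feedback if has_reason(f)) / len(feedback)


# Made-up sample data for illustration.
feedback = [
    TraceFeedback("t1", True),
    TraceFeedback("t2", False, "Suggested starter homes, but I'm an investor."),
    TraceFeedback("t3", False),
    TraceFeedback("t4", True, "The tone was right."),
]

failures = annotated_failures(feedback)
rate = annotation_rate(feedback)
```

Tracking the annotation rate separately from the thumbs ratio shows how much of the feedback actually explains itself, which is the part that can be fed into error analysis.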
Alright.00:32:06
When we're defining the right metrics, it's the same exact thing. I'm not gonna belabor this point, because it's very similar to here. Like, how do we know what good looks like?00:32:08
When you get to the point where you're defining the right metrics and you're deciding what to measure, it's really easy to rely on these, like, industry benchmarks.00:32:18
One of the hard things with metrics is, are you measuring what you think you're measuring? Like, in research, there's this idea of validity. Like, does our instrument actually measure the concept we're trying to measure?00:32:27
And this… it's… it's a hard concept to wrap your head around, because, like, it's so easy to make jumps as a human thinker, like, as we're reasoning.00:32:39
We think about… we don't always think about the ideal thing to measure. We're really influenced by what we think we can measure, and what we think we can measure isn't always the right thing to measure. And so there's this, like.00:32:48
I'll give an example of this. Like, I used to work at recruiting companies, like job boards, and almost every job board, their measure of success is, did you apply for a job? That's a really crappy measure of success, because it leads to a lot of bad applications to jobs people aren't a fit for. The job seeker doesn't care if they applied, they care if they got the job.00:33:01
The employer doesn't care if you applied, they care if they want to hire you. And so, like, a better measure of success is00:33:20
Did we help somebody find a job? Did they get hired? But that measure is hard to measure. That metric is hard to measure, and so people are afraid of it. But it is the right metric of value.00:33:27
And so I think the evals have the same exact problem. Like, if we just limit our measurements to, like, these benchmark metrics, or these traditional ML metrics, there are gonna be cases where those are the right metrics.00:33:39
But the better you understand your customer, the better you're gonna be able to come up with what is the right metric.00:33:52
And there are gonna be times when you're designing an eval that's, like, for the core part of your product, the core part of the value that it delivers, that you have to come up with a custom metric.00:33:58
Like, yeah, maybe there's benchmarks on, like, a concise tone or a helpful tone, but is that for your customer? Like, what's the right persona tone match for your specific customer? And that's probably a custom metric.00:34:10
And so that really requires that you understand your customers and what they need.00:34:26
And this is why, like, all these off-the-shelf eval tools are tough. Like, these out-of-the-box evals might get you started, but you're gonna outgrow them pretty quickly. And we see this with product metrics, right? Like, when we're measuring engagement.00:34:31
We start with, like, monthly active users, and then we get a little more sophisticated, and we do DAU over MAU, so daily active users divided by monthly active users. Then we get a little more sophisticated, and we say, look, not all engagement is the same.00:34:43
What are the behaviors in our product that are actually really important? And we start to get to an activation metric, and that's very unique to our product.00:34:57
I think evals are the same. Like, you might start with a generic eval, but you better iterate your way to something that's really tightly tied to the value your product delivers.00:35:05
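[Editor's illustration] The engagement ratio mentioned above, daily active users divided by monthly active users, can be computed from a simple event log. This is a sketch under assumptions: the `(user_id, date)` event shape, the 30-day trailing window, and the sample users are all made up for illustration.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple

Event = Tuple[str, date]  # (user_id, day the user was active); hypothetical shape


def dau_over_mau(events: Iterable[Event], day: date, window_days: int = 30) -> float:
    """Distinct users active on `day` divided by distinct users active
    in the trailing `window_days` window that ends on `day` (inclusive)."""
    events = list(events)
    window_start = day - timedelta(days=window_days - 1)
    daily = {user for user, d in events if d == day}
    monthly = {user for user, d in events if window_start <= d <= day}
    return len(daily) / len(monthly) if monthly else 0.0


# Made-up sample data.
events = [
    ("ana", date(2025, 10, 30)),
    ("ana", date(2025, 10, 12)),
    ("ben", date(2025, 10, 30)),
    ("cy", date(2025, 10, 5)),
    ("dee", date(2025, 10, 1)),
    ("eve", date(2025, 9, 15)),  # falls outside the 30-day window
]

ratio = dau_over_mau(events, date(2025, 10, 30))
```

A generic ratio like this treats every active user the same. An activation metric would instead filter the events down to the specific behaviors that matter for a given product, which is the step that cannot be taken off the shelf.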
Okay, so this is my point. Evals are the next discovery habit. They really fall into this category of assumption testing. How do I evaluate if my solution is actually meeting my customer's needs?00:35:18
But if your evals aren't grounded in your understanding of customer needs, it's garbage in, garbage out. Like, we can measure a lot of the wrong stuff, and it looks like our product is doing great.00:35:29
But it really only is as strong as the inputs. And so this is where I just want to encourage you, and especially as you launch.00:35:41
Build in customer feedback loops, like, make sure they can rate your product thumbs up, thumbs down. Like, Claude Code is doing this really well right now, I constantly get asked, how is Claude doing? 1, 2, 3, 4, 5, it's like, fine, good, I don't remember what else.00:35:50
Claude Code never asked me to annotate it, though, so I feel like there's a missed opportunity there. So definitely be thinking about, even once you go into production, where is that feedback loop coming from?00:36:05
Alright, and then, if you are interested in diving deeper on this, I will share… I didn't create a slide for this, but I do blog about discovery, and now more AI products at producttalk.org.00:36:18
And then I do have the new podcast, Just Now Possible, where I'm interviewing product teams, and we go deep. Like, we get into the architecture maps, how people are orchestrating, what's an agent, how they're using RAG steps, like, we really get into the nitty-gritty detail, including evals. We go deep on evals.00:36:30
Super fun, so definitely check that out. And then we've got some time for questions.00:36:50
Hamel Husain
I just want to also say, you should check out Teresa's newsletter. I read every single…00:36:56
a post she does very carefully, and it's excellent. You should really pay attention to everything that Teresa puts out there. If this talk has, you know, you could just tell from this talk, like, this talk should convince you in itself. Yeah, I couldn't be a bigger fan of Teresa. I was just saying in the chat.00:37:02
This is my favorite… this is now my favorite evals talk.00:37:21
And probably my second favorite evals talk is Teresa's last talk. So, you know, if you really want to kind of learn more about evals as well, like, please, you know, check out everything she's doing.00:37:25
Teresa Torres
I'll share on that front, I just, yesterday, I kicked off this new series that I'm really excited about, about Claude Code, and it doesn't have to be Claude Code specific. Literally everything I'm writing applies to a command line interface tool, so if you prefer Gemini's or,00:37:39
Codex, it's fine, all the same things are gonna apply.00:37:56
It's specifically for product managers, and I'll share the story behind this.00:37:59
Over the last, like, 4 months, I have gone deep on command line interface tools, and not just for coding. I am using them for coding, but, like, I now do my task management out of Claude Code. Like, I built my own little roll-your-own00:38:04
task management tool. I have this post coming up about, like, how to set up context files so you can write really lazy prompts, and Claude still knows everything it needs to know to be a good… to do a good job.00:38:18
I'm gonna get into, like, safety and how to run it safely on your computer, and, like, how to evaluate what code to let it run and what packages are safe, and how to evaluate if it's okay to install packages. I really think that, like, engineers are pushing the envelope on, like, how we collaborate with AI.00:38:30
And it's time for this to, like, spill over into the non-technical world.00:38:47
So I'm kicking off this series, it's probably gonna be, like, 4 to 6 posts.00:38:51
And the very first one went live yesterday, and I'll tell you how… what my goal with the post was.00:38:55
You could be a total beginner, never even heard of a terminal.00:39:00
And I'm gonna walk you through why you should care about Claude Code, how to install it, and by the end of the blog post, you're gonna have00:39:03
a competitive analysis of all of your competitors, where you get a pricing comparison table and a feature comparison table. It's gonna happen in, like, 2 minutes. Claude Code is gonna do all of the work, and it's gonna be set up in a way that if you want to add a competitor tomorrow, you're just literally adding it to a text file, and then rerunning a slash command.00:39:11
And if some of that makes no sense to you, the article explains all of it.00:39:31
And my goal with that blog post was to go from beginner to magic.00:39:35
In one blog post. Like, it was very ambitious, but I think I pulled it off, and I think you should check it out, and then there's gonna be much more coming in the next few weeks. I'm gonna talk about, like.00:39:39
I now get a research report every morning of every academic article that's related to anything that I do. When I save a PDF, it automatically gets… of one of those articles, it automatically gets summarized, and the summary gets added to my to-do list the next day. Like…00:39:49
I've just built out… I'm framing this as I've built out my own personal operating system.00:40:05
And I now want to teach other people how to do it. So check that out.00:40:10
Hamel Husain
And what's the best way for people to find this? Is it to go to Product Talk?00:40:14
Teresa Torres
Producttalk.org.00:40:17
Yeah. Sign up to your newsletter and they'll get all this information, or how do you find… what's it, like, how do you… Yeah, the articles are just on producttalk.org, like, you don't have to sign up to get them. For the Claude Code series.00:40:19
There is a paywall at some point in the article. Here's my goal. My goal is to give the concept away for free, and then if you become a supporting member, what you get is literally step-by-step instructions on how to put it into practice. So if you're one of those learners where, like, just understanding the concept is enough and you're gonna run with it, that is completely free.00:40:31
If you want, like, a little more hand-holding, it's on par with, like, most Substack subscriptions. It's not crazy expensive.00:40:51
Hamel Husain
That's great.00:41:01
Teresa Torres
Yeah.00:41:02
Hamel Husain
I'm gonna sign up right after this if I haven't already. I think I may already be a Substack subscriber, but we'll see. I'm gonna make sure.00:41:04
Okay, we can take questions now.00:41:13
If you'd like, you could put some questions in the chat, and I can… sort of…00:41:17
Ask. Oh, by the way, have you started using skills, the new…00:41:22
Teresa Torres
Kind of thing from Claude? Yeah, so you'll see in the blog post I released yesterday, I announced, like, as I was writing, the skills came out, and it might change things a little bit.00:41:27
Skill… I'm excited about skills, I'm also a little worried about skills. So, what skills do is they allow you to package, sort of, context files plus slash commands.00:41:36
plus scripts, so deterministic code, together, and then you can share it with other people.00:41:48
One of my concerns with Claude Code is that, like, if you don't have a technical background, there are some security things you need to be aware of, like.00:41:56
you're giving Claude access to your entire computer.00:42:04
And if you're letting it run code, and especially install packages, there is this, like, package hacking thing that happens, where people are, like, overloading package names, and you end up not getting the package you think you get. And you could end up installing malicious code on your computer.00:42:08
And so this is particularly… could be a problem with skills, and Anthropic even acknowledges this. So, like, what I would recommend, if you don't have a coding background.00:42:26
Don't run code on your computer that you don't understand. Like, that's just rule number one.00:42:36
With skills in particular, you may not even know there's code in there. Like, you need to know a skill could have code in there. And you're not really…00:42:41
deciding when it gets executed. So, like, as an engineer, like, someone with an engineering background, I like the idea of skills, because, like, I want to package things and create things, and I plan to release skills.00:42:48
But I do think this is the Wild West. Like, this is a frontier, you have to understand the dangers.00:42:59
And even Anthropic acknowledges this, like, Anthropic says, don't install skills from people you don't trust. And, like, there's a little bit of danger here. People are putting out some pretty cool skills.00:43:06
But there are gonna be malicious actors, and when it's running on your local machine, you gotta be careful about that. And I am gonna… I'm gonna write a blog post about security, too. And I… and I am excited about skills, just…00:43:16
Be cautious, is my right-now answer.00:43:27
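[Editor's illustration] One way to read the package-name concern raised above is lookalike names: a package whose name is one typo away from a well-known one. The sketch below flags such names by string similarity against an allowlist. It is a rough heuristic under stated assumptions, not a security control and not something Anthropic or any package manager provides; the `TRUSTED` set, the `check_package` helper, and the 0.8 cutoff are all hypothetical.

```python
import difflib

# Hypothetical allowlist of packages you have already vetted.
TRUSTED = {"requests", "numpy", "pandas"}


def check_package(name: str, trusted=TRUSTED, cutoff: float = 0.8) -> str:
    """Classify a package name before installing it.

    - "trusted": exact match with the allowlist
    - "suspicious: ...": not on the list but very similar to a trusted name
    - "unknown": neither; needs a manual look before installing
    """
    if name in trusted:
        return "trusted"
    # get_close_matches ranks candidates by difflib's similarity ratio.
    close = difflib.get_close_matches(name, sorted(trusted), n=1, cutoff=cutoff)
    if close:
        return f"suspicious: looks like '{close[0]}'"
    return "unknown"


exact = check_package("requests")
lookalike = check_package("reqeusts")  # two letters transposed
other = check_package("leftpad")
```

A check like this only catches near-miss spellings. It says nothing about whether an unfamiliar package, or the code bundled inside a skill, is safe to run, so the rule about not running code you don't understand still applies.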
Hamel Husain
That makes sense. Okay, let me take a look at the chat.00:43:31
Okay, so one question from Eileen is, for teams you've seen doing the nightmare of trying something AI without doing enough product discovery.00:43:41
How… what have you found to, like.00:43:51
How do you get them back on track? What are, like, some first steps you can kind of take the team? Like, yeah, how do you… what do you do?00:43:53
Teresa Torres
Yeah, I sighed because this is a hard question. I… this is not specific to AI products. Like, the sad reality is most product teams don't do enough discovery, or they don't do discovery in reliable ways, right? They're not getting reliable feedback.00:44:01
I think what's hard… like, we already waste a ton… this is gonna sound really depressing, but I'm gonna try to bring it back to something optimistic.00:44:16
What's hard is, like, in today's reality, product teams already waste a ton of engineering time on the wrong product to build. And actually, AI making it faster and easier to build is gonna make this worse.00:44:25
Right? Like, I'm reading about companies where, like, the marketing team is pushing code into production.00:44:40
Okay, that's kind of awesome, and it's also kind of a nightmare, right? Like, we don't want anybody and everybody to be able to build and release their idea. Like, this is what leads to terrible products.00:44:46
And so, for me, like, the easier delivery gets, I think the more important discovery gets. We should not build every idea. Our AI products should not cover every use case.00:44:57
Like, and I think a really emerging skill is, like, understanding not just what use cases we should address, but, like, where do we need AI, and where do we need deterministic code? And we always need some of both, and what does that interplay look like?00:45:10
And I think, really understanding, like, the end goal is creating customer value. Like, we need to create something for our customer.00:45:25
And we're not…00:45:34
Like, to be honest, we're not great at this. Now, we weren't great at this 20 years ago, and we're getting better, but we're not great at this. And I think with AI products, because of FOMO and all the hype, and are we in a bubble, and like…00:45:35
companies over-investing in, like, in really shallow ways, where they're like, release something next quarter, even though we've never done this before. Like, this is gonna get worse before it gets better.00:45:47
But I also… here's where it's gonna get positive.00:45:59
I firmly believe that organizations change and get better at discovery when individuals change, and individuals get better at discovery. So, like, I'm a big fan of high agency product people, because we can influence change.00:46:02
And so I think the, like, real answer to your question is not, how do I get a company back on track, it's how do I get myself back on track? How do I get my team back on track? And then, we influence by showing what good looks like. And I think that's, for me, that's really empowering, because I can change the way I work.00:46:18
So, especially for product…00:46:37
Hamel Husain
How do you show what good looks like with discovery? Like, how do you kind of say, hey, like, this is good? Or this, like, I'm, you know, listen to me.00:46:40
Teresa Torres
Yeah! What is it?00:46:48
Hamel Husain
How does that manifest?00:46:49
Teresa Torres
I think we increase our hit rate.00:46:50
So… There's a few things that have to be in place.00:46:53
If you're not instrumenting your product, and you have no idea if it's working or not.00:46:56
Start instrumenting your product. Like, that's step one. Right? And that doesn't mean stop everything and instrument every part of your product. It could be as simple as, for the feature you're building right now, instrument it.00:47:01
What do you… what do you expect to happen when it releases? How do you measure that? Are people using it? Are they finding it? Are they using it the way that you expect? Are they using it all the way through to the value creation moment?00:47:12
The other thing I'd recommend, if you're gonna start, if you're new to instrumenting your product.00:47:24
Before you release that feature, as a team, write down what impact you expect it to have.00:47:29
Write it down. How many customers are gonna use it?00:47:36
How many are gonna get all the way through to the value creation moment? This is gonna feel really uncomfortable. You're gonna be like, how the hell do I know?00:47:39
You're making assumptions. Your assumptions are really idealistic. Everybody's gonna use it, they're gonna use it exactly the way that we wanted them to use it, they're all gonna have the value creation moment, and as a result, our retention is gonna go up by 10%, and we're all gonna get rewarded, and it's gonna be amazing. Write it down.00:47:45
Then, a month, 2 months, three months, whatever the right cadence is, take a measurement.00:48:02
Compare it to what you wrote down.00:48:07
This is where it's hard, because there's gonna be a giant gap between what you thought and what actually happened. It doesn't mean you did something wrong. Every single product team, 100% of us, there's a gap between what we think happened and what actually happened.00:48:09
But by building this habit, what you're doing is you're training your brain. You're improving your intuition. So the next time you build a feature, if you are way too optimistic, what you expect to happen is gonna come down a little bit.00:48:22
It's also a feedback loop on your discovery. When our expectations fall short, it means there's an assumption that our idea depended upon that wasn't quite true. So it'll help you uncover that assumption, which will make you better at your next round of assumption testing.00:48:36
Which, by the way, just the scientific method, right? This is what scientists do. We come up with a hypothesis, we run an experiment, the experiment tells us.00:48:51
Alex Strick van Linschoten
Dinner's almost ready.00:48:58
Teresa Torres
Whether we're on the right track.00:49:00
Alex Strick van Linschoten
You'll need to.00:49:04
Hamel Husain
to mute somebody, hold on a second.00:49:05
Alex Strick van Linschoten
6 brilliant fried onion.00:49:06
Teresa Torres
Somebody's dinner is almost ready, that's amazing.00:49:08
Right? This really is just the scientific method, and it's just… but what's hard is that, like, we're not all trained scientists, and we have to build this muscle of, like, take a measurement, adjust, take a measurement, adjust. Your best feedback loop on how good your discovery is, is, is that gap closing?00:49:11
Are you getting closer to understanding the impact of what you're building? And then as you get closer, you build the right stuff more often.00:49:29
You're never always gonna have a 100% batting average. You're never always gonna build the right thing. There's always gonna be some gap, but we should be able to close that gap.00:49:37
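[Editor's illustration] The habit described above (write down the expected impact before release, measure later, compare, and watch whether the gap closes) can be kept in a very small prediction log. This is a minimal sketch: the `Prediction` class, the relative-gap formula, and the feature names and numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    """What the team wrote down before release, plus the later measurement."""
    feature: str
    metric: str
    expected: float
    actual: Optional[float] = None  # filled in at the measurement date


def gap(p: Prediction) -> float:
    """Relative overestimate: how far the expectation exceeded reality."""
    if p.actual is None:
        raise ValueError(f"{p.feature}: no measurement recorded yet")
    return (p.expected - p.actual) / p.expected


def is_gap_closing(log: List[Prediction]) -> bool:
    """True if each release's absolute gap is smaller than the previous one's."""
    gaps = [abs(gap(p)) for p in log]
    return all(later < earlier for earlier, later in zip(gaps, gaps[1:]))


# Made-up log, in release order.
log = [
    Prediction("saved searches", "adoption rate", expected=0.60, actual=0.15),
    Prediction("email digest", "adoption rate", expected=0.40, actual=0.20),
    Prediction("bulk export", "adoption rate", expected=0.25, actual=0.20),
]

gaps = [gap(p) for p in log]
closing = is_gap_closing(log)
```

In this made-up log the team overestimates by roughly 75%, then 50%, then 20%, so the gap is closing even though no single prediction was right.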
Hamel Husain
That's really interesting. That kind of mirrors the answer to the question, how do we get our teams to get excited about evals? And what I say is.00:49:46
don't talk about evals, talk about the results of all the things you fixed, and all the bugs you squashed, and whatnot, you know, and, like, kind of show the evidence. So that's… it's very interesting how it mirrors each other.00:49:55
Another genre of question we're getting is, okay, so this class has a mixture of people, Product managers, engineers.00:50:08
Ever since I got to know you, Teresa, I've actually become a lot more interested myself in product management.00:50:18
I think it's, like, you got me excited about product management in a way that no one else has. That's awesome, because of the way…00:50:25
Teresa Torres
excited about data science.00:50:33
Hamel Husain
Yes, yeah. I think it's because, like, I see the connection, and I see the power of threading the needle all the way through, and00:50:35
A lot of questions are… so, okay, like, for example, is… or in evals, like, a really good entry point is error analysis, so we picked that on purpose as an entry point, so people can, like, sort of get into evals.00:50:43
For all the engineers and other people.00:50:58
You know, do you have any intuition on, like, good entry points? Where to start if you want to branch into, you know, this, like, doing better discovery and all the things you talk about? Where… what… where should you go? What should you do?00:51:01
Teresa Torres
Yeah, so I think error analysis is a good entry because it exposes what's not working.00:51:16
And so I think the same is true in product management. You gotta… you gotta expose what's not working. And I think there's two ways to do this. We talked about, one, you can instrument your product.00:51:21
Here's the deal, I know that 50% of you do not have an instrumented product. The way I know is because we run a survey every two years, and you tell me only 50% of you have access to behavioral analytics in your product. That is insane to me.00:51:31
Hamel Husain
That number seems high to me, actually.00:51:46
Teresa Torres
Yeah, well, any, any instrumentation.00:51:47
Hamel Husain
Okay.00:51:50
Teresa Torres
Google Analytics, any instrumentation, right? So, like, 50% of us are flying completely blind.00:51:51
Okay, if you can get any instrumentation in place, do that.00:51:57
But I also know this is hard. Like, a lot of us work in organizations that are not quantitative, they don't care about measurement. What they care about is stakeholders telling you what to build, and they want their pet feature built. So I'm gonna give you another entry that's probably easier, which is talk to your customers.00:52:02
So, interview customers. You can… I literally… I think you can do this in as little as one customer conversation every week.00:52:19
And a customer conversation can be 20 to 30 minutes.00:52:27
The key here is you can't just go talk to a customer, like, I'm gonna call up Hamel and have a human-to-human conversation. Like, it is true that customer interviews are just human-to-human conversations. We don't need to be afraid of them.00:52:30
But if you're gonna use them as a feedback loop, you have to learn a little bit about how to get effective feedback from another human. And so this is where, like, we have to have a basic understanding of cognitive biases and what questions to ask.00:52:43
And I'm gonna tell you this one secret that you need to remember when you're interviewing a customer. Don't ask them what they think, don't ask them what they like, don't ask them what they do. These are all really unreliable questions. They're all very speculative. The human will give you a response, the response will not match what they do in reality.00:52:57
What you want to do instead is you want to ask me, tell me about a specific time when you did a thing.00:53:15
And that could be, tell me about the last time you used my product. It could be, tell me about the last time you had a problem my product was designed to solve.00:53:22
It could be, tell me about the last time you used my product on the go, if you're on the mobile team, right? But you want to keep your interview grounded in a specific story about past behavior.00:53:29
The key is you want to invoke their memory. When we invoke their memory, if you're familiar with Daniel Kahneman and Amos Tversky's work, we're now engaging System 2. System 2 is our slow, deliberate brain.00:53:39
Far more reliable responses. All those other questions, System 1 responses, rife with cognitive biases.00:53:53
So, if you can't instrument your product, a different feedback loop, and actually, you should do both of these, ideally, just talk to a customer on a regular basis. But ask for past stories.00:54:00
Hamel Husain
Wow, that's really amazing. Like, there's so many parallels with evals, because in evals, you want to look at very specific data, and you don't want these, like, generic kind of analysis. And that's really great that, okay.00:54:13
ask for specificity and specific experience. Experience is kind of… it's kind of like… A trace, in a way.00:54:27
Teresa Torres
I mean.00:54:34
Hamel Husain
This is what happened.00:54:34
Teresa Torres
Your "look at the data" mantra is exactly equivalent to my "talk to your customers" mantra. Like, it… it all starts with, like, what's really happening.00:54:35
Right? Like, you can't… don't just make a guess, like, what is actually happening in reality? So, like, looking at a trace, this is telling you what's actually happening in reality. Talking to a customer and collecting a specific story about past behavior.00:54:45
This is just empiricism. Like, we're observing… still, it's all grounded in the scientific method, we're observing what happened.00:54:58
And we're using that as a finding to then inform our next hypothesis.00:55:06
There's so many analogies here. This is why I loved this class.00:55:11
I was like, oh, somebody else that cares about rigorous scientific thinking in the product world. Yay!00:55:15
Hamel Husain
That's really great.00:55:21
So, yeah, thank you so much for coming, Teresa. This is very… this is, you know, always a pleasure to have you.00:55:25
we should team up at some point, somehow, because I really love your thinking, and everybody loves your thinking, and yeah, I really recommend everyone to…00:55:33
Check you out, and learn more about all the things that you do.00:55:45
Teresa Torres
Yeah, thank you, everybody, and I really am super excited about my new podcast, so if you're familiar with my discovery work, but you don't know that I have a podcast.00:55:52
We geek out on all things AI, and I am… it is the one thing that brings me so much joy to, like, nerd out with builders. So, I feel like this whole AI wave has helped a lot of people reconnect with being makers and builders. If that's you, check out Just Now Possible.00:56:00
Hamel Husain
That's great.00:56:20
Thanks so much.00:56:22
And… Thank you, everyone, for coming.00:56:24
Teresa Torres
Thanks, everybody!00:56:28
AI Evals allow us to create a fast feedback loop on how well our AI products are performing. In this talk, we'll look at how the traditional discovery habits can inform our evals: by addressing the right use cases, generating better synthetic data, and even collecting trace feedback directly from customers. We'll also look at how evals can improve our discovery habits: by enabling faster assumption testing and speeding up our discovery cycles.