Optional: Live Office Hours 9
Oct 24, 2025, 9:00 AM - 10:00 AM GMT+5:30
Audio Transcript
Shreya Shankar
Cave.00:00:38
Hamel Husain
How's it going?00:00:39
Shreya Shankar
Good. Oh, I see you started early.00:00:41
Hamel Husain
Yeah.00:00:45
I have anxiety about not… making it, so…00:00:46
Shreya Shankar
What if you fall asleep before it starts? No.00:00:49
Hamel Husain
Cause I just, like, get into the…00:00:57
you know, putting the kids to sleep, and then I forget.00:00:59
Shreya Shankar
No, that makes sense. I have an alarm on my phone for 8:25.00:01:03
Hamel Husain
Yeah, that's what I started doing, too.00:01:08
Shreya Shankar
I forgot. But it's not that I forget, it's, like, just to feel safe.00:01:10
Hamel Husain
Oh, like, I would forget.00:01:17
So, I don't know. Yeah, like, you just…00:01:20
get into the zone of whatever else I'm doing.00:01:24
Shreya Shankar
Yeah, that's fair.00:01:27
Okay.00:01:30
Maybe we made too many office hours.00:01:37
That is…00:01:40
Hamel Husain
You're gonna end up with an office hour with, like, one person in it at the end.00:01:41
Shreya Shankar
There are plenty.00:01:48
Hamel Husain
That's totally crazy.00:01:49
Shreya Shankar
do it next week.00:01:50
Hamel Husain
Yeah.00:01:52
Shreya Shankar
Oh, boy.00:01:53
Hamel Husain
All right.00:02:02
We just pick on people now.00:02:03
Happy here. Shan, what's the question?00:02:08
The opportunity.00:02:14
Shan Arif
Brilliant.00:02:16
Now, by the way, I don't have any questions. I haven't caught up to the previous work yet.00:02:19
Hamel Husain
Oh, no worries.00:02:26
Shan Arif
region.00:02:27
Hamel Husain
I'm just giving you a hard time. Steve is here! Steve might have a question.00:02:29
steve.man
I do not have… I'm sorry.00:02:35
Hamel Husain
No worries.00:02:41
steve.man
I mean, my… I mean, my team is working on RAG, but we're still not there in terms of, like…00:02:42
I mean, it's hard to catch up with the lessons.00:02:49
Shreya Shankar
That's true.00:02:53
Actually, like, 2 months ago, Hamel thought we needed more content, so, you know…00:02:56
steve.man
Yeah.00:03:03
It's good that you say this.00:03:06
Shreya Shankar
Continue to remind him that actually everyone's overwhelmed all the time.00:03:07
steve.man
I mean, on our side… because, as I told Emma, we are working on the evaluation, so we try,00:03:12
based on the knowledge that we have, to implement at the same time, but it's actually not feasible,00:03:20
because there are so many things that we want to iterate on again and again.00:03:28
And I think everyone on our side is just learning about evaluation, and what level of quality we want to have, and this we haven't defined, because the top management team really wants it to be00:03:34
perfect in a kind of way, but it's never perfect, and this is something that we need to educate them on. It's basically what you guys are trying to tell us: it cannot be perfect, it's just a matter of weighing the00:03:49
pros and cons of… doing this.00:04:03
Hamel Husain
Yeah, it's true.00:04:11
steve.man
So today, there's no one?00:04:16
Oh, this… okay, this is just some people.00:04:18
Shreya Shankar
There's 14 people!00:04:21
And Leti says that he's… That they are here because they have FOMO. Well… I like that.00:04:25
Hamel Husain
Yeah. We don't think.00:04:43
Shreya Shankar
I have something to do, or, like, talk about whenever people don't have questions.00:04:44
It's never happened before, yeah.00:04:49
Yeah.00:04:51
Let's see…00:04:53
I feel like I'm, like, racking my brain now for, like, recent…00:05:03
things that I can discuss in a recorded Zoom.00:05:08
Hamel Husain
How about evals? How about…00:05:15
Shreya Shankar
Yeah, or just, yeah, like, or projects that I've worked on pretty recently, Well, okay, so…00:05:17
I guess I can…00:05:23
talk about one of the things I've been thinking about, like, I work in AI-powered data processing, obviously, as you know.00:05:25
And a lot of these workloads are… users come in with lots and lots of documents, like PDFs, long text, transcripts, that stuff, and they want, kind of, high-level analyses that require LLMs to read almost every token, so it's not like…00:05:32
point-wise, question-answering RAG system. It's like, extract all the themes out of these things, or like, find all the medications that patients are complaining about, and what are the side effects. Like, as deep research as you can get, I feel like, in this setting.00:05:48
And so we've been trying to create benchmarks for some of these kinds of workloads in different domains, like medicine and law,00:06:03
And one of the things I've been realizing is the benchmark methodology00:06:11
So, in doing this, I'm adopting, like, what the foundation model providers are probably doing, because these aren't, like, evals for, like, my pipeline that I'm building. It's like, I want to create good benchmarks for, like, foundation models to do these hard tasks.00:06:17
And one of the things that's been interesting is if you ask domain experts what tasks that they want automated, they'll give you tasks.00:06:32
But they're just, like, small solutions that are part of their bigger workflow, and often, like, what's more interesting is, like, trying to collect all of the data involved in the bigger workflow.00:06:40
So I don't know how to, like, formalize that.00:06:52
Or, like, design benchmarks around that.00:06:55
Hamel Husain
How did you discover that the more interesting thing was the bigger workflow?00:06:58
Shreya Shankar
Yeah, so with the public defenders that I was working with, they were like, oh, here's all of these documents, like, there's, like, 10,000 documents here, like, the benchmark really is, like, extract these types of factors from… I won't, like, go into detail about it, because it's probably, like, a sensitive workload, because it's used in a case.00:07:01
And so we were like, okay, we'll do it, and then we had a bunch of, like, law school interns over the summer go through this, like, annotation process to create, like, ground truth label data. It's this big effort. And then the public defenders look at the outputs, and then, like, within a glance, they're like, oh, wait, I think there's, like, stuff that's missing, because I'm expecting to see things00:07:21
And I didn't really think about it, because, you know, they've never been presented with this, like, deluge of insights at this volume before. But after reading it, they realized, like, oh, there's some data that's missing that you didn't run your pipeline on.00:07:44
Because, like, it's making these references in the outputs, and those references don't, like, exist in the corpus, so something must be missing.00:07:58
And then I was like, I don't know, man, like, it's all legal j- it's stuff to me that I don't know.00:08:06
And then it turns out that, like, they get their data from, like, the DA, who might, like, omit stuff.00:08:12
Hamel Husain
Hmm.00:08:21
Shreya Shankar
So, the bigger, more interesting problem is, like,00:08:21
is all the data in this?00:08:25
And so that's, like, a harder problem of, like, oh, find all references within this dataset, and then, like, make sure that you can follow all references. Like, you can basically do, like, a self-join on it. And, like, if you can do that00:08:28
and everything feels consistent,00:08:43
then, yeah, go forth and do extractions, or, like, whatever.00:08:46
But otherwise, you know, like, tell us what data's missing so that we can, like…00:08:50
People have been talking about, like, building automated agents to, like, file these requests to the DA automatically to, like, get the data that's missing. So… and it's crazy, it's, like, two things. It's, like, one… one error is, like,00:08:55
the DA said that they've given the data, but the data is not there, and then the other error is nobody even knew that this data should have existed. It was just referenced in some other documents.00:09:09
And so nobody knew to ask for it.00:09:21
So… Yeah, that's a really detailed example of, like.00:09:24
You should just design the AI to automate the big task, and not the little task.00:09:31
Anyways…00:09:36
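A minimal sketch of the "self-join" consistency check described above: collect every document reference in the corpus and flag the ones that point to documents that are not there. The `EX-###` ID scheme, the sample corpus, and the function name are assumptions made for illustration; they are not from the actual case data or pipeline.

```python
import re
from collections import defaultdict

# Hypothetical corpus: document ID -> text. The "EX-###" ID scheme is an
# assumption for illustration only.
corpus = {
    "EX-001": "Incident report. See EX-002 and EX-004 for witness statements.",
    "EX-002": "Witness statement. Refers back to EX-001.",
    "EX-003": "Lab results, as requested in EX-005.",
}

REF_PATTERN = re.compile(r"\bEX-\d{3}\b")


def find_dangling_references(corpus):
    """Return {missing_doc_id: sorted list of documents that cite it}."""
    dangling = defaultdict(set)
    for doc_id, text in corpus.items():
        for ref in REF_PATTERN.findall(text):
            # The "self-join": every cited ID should also be a key in the corpus.
            if ref != doc_id and ref not in corpus:
                dangling[ref].add(doc_id)
    return {ref: sorted(citers) for ref, citers in dangling.items()}


missing = find_dangling_references(corpus)
for ref, citers in sorted(missing.items()):
    print(f"{ref} is cited by {citers} but is not in the corpus")
```

In a real corpus the references would rarely follow a clean pattern, so the extraction step would likely need an LLM or a domain-specific parser; the join against the set of known documents stays the same.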
Hamel Husain
I'm working on a legal thing right now, too, actually. It's my one legal… one client right now.00:09:39
Oh.00:09:45
Talk about it sometime.00:09:46
Abhishek has a question, though.00:09:48
Abhishek Panda
Yeah, I am Abhishek.00:09:51
Good evening to you guys. So, I have a question. Yesterday, I finished watching the videos for RAG. My first question is on the evaluation of generation. From the tutorial, Hamel and Shreya, I couldn't really understand it. So, would the evaluation be very similar to how we did it with the LLM judge itself?00:09:53
Hamel Husain
Yeah, yeah, the G part is the same.00:10:15
Shreya Shankar
Same.00:10:17
Abhishek Panda
Yeah.00:10:18
Shreya Shankar
It's the R part that's different.00:10:19
Abhishek Panda
Okay, got it, got it. So, then, here also we need to do the error analysis first, right? I mean, manually looking and trying to label it out, and then we need to check, okay, how good is our generation, and then, automate the process, right?00:10:21
Shreya Shankar
You always have to do error analysis.00:10:40
Hamel Husain
Yeah, first step is, like, you do error analysis on the whole thing.00:10:42
If you find that retrieval is a problem.00:10:45
Then you go into the R, the retrieval.00:10:48
Abhishek Panda
Okay.00:10:52
Hamel Husain
Once you get to that, like, let's say retrieval is one of the… so when you're doing error analysis, you, like…00:10:53
Remember, you focus on the most upstream error.00:10:59
So if, like, the first error that you're hitting all the time is a retrieval issue,00:11:02
then, you know, okay, it's time to kind of go into retrieval00:11:07
mode, where you're gonna, like, treat retrieval. When you've fixed retrieval, then you go back00:11:12
and do the same thing again. But you've already, like, solved retrieval, so now you just…00:11:19
do the same thing. Then you're back on the normal path.00:11:24
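A minimal sketch of "focus on the most upstream error": each trace is credited only to the earliest pipeline stage that failed, and the tally shows which stage to fix first. The two-stage ordering, the label format, and the sample traces are assumptions for illustration, not the course's tooling.

```python
from collections import Counter

# Pipeline stages in upstream-to-downstream order (assumed for illustration).
STAGES = ["retrieval", "generation"]

# Hypothetical error-analysis labels: the stages a reviewer marked as failing.
labeled_traces = [
    {"id": "t1", "failed_stages": {"retrieval", "generation"}},
    {"id": "t2", "failed_stages": {"generation"}},
    {"id": "t3", "failed_stages": {"retrieval"}},
    {"id": "t4", "failed_stages": set()},
]


def first_upstream_failure(trace):
    """Return the earliest failing stage for a trace, or None if it passed."""
    for stage in STAGES:
        if stage in trace["failed_stages"]:
            return stage
    return None


def tally_upstream_failures(traces):
    """Count traces by their most upstream failing stage."""
    counts = Counter()
    for trace in traces:
        stage = first_upstream_failure(trace)
        if stage is not None:
            counts[stage] += 1
    return counts


counts = tally_upstream_failures(labeled_traces)
stage_to_fix_first = counts.most_common(1)[0][0]
print(counts, stage_to_fix_first)
```

After the top stage is fixed, the same tally is rerun on fresh traces, which is the "go back and do the same thing again" loop.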
Abhishek Panda
Okay,00:11:30
Understood. Like, I was just wondering, like, I mean, the example that you guys took, it was more on the document.00:11:32
So, we have one of the agents which is actually generating an image. Like we discussed00:11:38
last office hours, as you mentioned, the modality could change. So when we are evaluating generation, do we need to use those predefined metrics? I mean, let's say FID score or something, or do we need00:11:46
to define some other metric for the quality of the image? That's what I was wondering, in the case of generating an image with, let's say, a diffusion model kind of thing.00:12:01
Hamel Husain
Oh, yeah, diffusion models. I have to look it up, there's some…00:12:13
Stuff out there for, like, quality of diffusion models.00:12:20
Abhishek Panda
I think it's.00:12:24
Hamel Husain
similar to an LLM judge. It's like some kind of model that helps you, whatever.00:12:24
I don't recall exactly what it is. I would have to…00:12:32
find out, but there's some stuff. I would still do some kind of error analysis, honestly. I would, like…00:12:36
Write down, like, what it is exactly, and see if we can build some themes around, like, why images are good or bad.00:12:43
Abhishek Panda
Okay, okay.00:12:51
I have one last question, Hamel and Shreya, so I'll just explain the situation to you, and maybe you guys can suggest something. So, on the multi-agent system that my team is now focusing on, what I observed, right,00:12:52
I mean, is that error analysis is not done in depth, and the common pitfall, Hamel, as you mentioned in the lecture, right, is that people sometimes jump directly into automated evaluators, and they think that, okay, we can't spend time over there. So, I tried to raise this thing, but,00:13:06
I'm in…00:13:25
So, the setup, I mean, they're still focusing, but now the team focuses more on the automated evaluators, and then manually they are trying to judge it. So I suggested that, you know, first we should actually identify all those failures, and once we have categorized them, then our LLM judge is a reflection of what we are seeing over there.00:13:27
So…00:13:47
As an engineer, I know my question is a bit different. As an engineer, like, how to, I mean, convince the folks over there, or what would you suggest? Because it's not some…00:13:48
you know, like, traditional deep learning or some software engineering, that I will do some test cases and say, hey, see, this is the failure, and that's why I was suggesting this strategy.00:14:02
So… Yeah, if you could suggest something.00:14:11
Hamel Husain
Yeah, I mean, you know,00:14:19
That's why I'm teaching this class, you know? I was like, I mean, I'm hoping that other people will get convinced, you know, that's why Shreya and I are doing it, but…00:14:24
You know, yeah, I mean, there's not…00:14:33
Any magic thing that I can think of, I mean, you know, be as bold as you can.00:14:37
Be confident.00:14:44
Explain, like, why this is.00:14:45
Show them some videos from… one.00:14:49
Abhishek Panda
Yeah, but there's no way escaping, like, I mean, we need to have design partners, because, I mean, the AI team only can help as much as, you know, in terms of, okay, building the pipeline or something, but we need the design partners, because they are the domain experts, right, who can actually help us label the failures.00:14:53
And there is another question also, Hamel: how many such labels do we want? Because, how many is enough? Like, here in the course, yes, you guys gave an example of 2,000 traces.00:15:13
But in real life, how many such traces do we need before we know, okay, this is…00:15:25
a decent number, or that it's trustworthy?00:15:31
Hamel Husain
I mean, you know, it's not about a magic number, you know, we tell you 100 just to…00:15:40
Give you some motivation, but…00:15:45
Really, it's about getting actionable insights. Like, if you go through 30 traces and they're all the same error, it's, like, obvious. You can stop and just fix it, you know? It's totally fine.00:15:48
Abhishek Panda
Oh, God.00:16:01
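A minimal sketch of "stop when you have actionable insight rather than at a magic number": keep reviewing traces until a run of consecutive traces surfaces no new failure mode. The `patience` threshold and the label format are assumptions for illustration; the point made in the session is that the cutoff is a judgment call, not a fixed count.

```python
def traces_until_saturation(labels, patience=20):
    """Return how many traces were reviewed when `patience` consecutive
    traces surfaced no new failure mode, or None if that never happened.

    `labels` holds one failure-mode label per reviewed trace,
    with None meaning the trace passed.
    """
    seen = set()
    since_new = 0
    for reviewed, label in enumerate(labels, start=1):
        if label is not None and label not in seen:
            seen.add(label)
            since_new = 0
        else:
            since_new += 1
        if since_new >= patience:
            return reviewed
    return None


# Thirty traces that all show the same error: the first one is new,
# then twenty in a row add nothing, so review can stop at trace 21.
print(traces_until_saturation(["bad_retrieval"] * 30, patience=20))
```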
Hamel Husain
the whole thing is meant to serve you. It's not like…00:16:01
It's just a sig… it's a suggestion, you know?00:16:05
And then, like, stepping back a little bit, like…00:16:08
I will say, there's a meta thing, like, if you're finding that you're having a hard time convincing people00:16:14
to do error analysis, or you're having a hard time convincing management00:16:21
to involve domain experts, and they just want to outsource it to developers,00:16:26
there are different levels of… things that are often happening. One is… So, like,00:16:32
okay, one is, like, they might… someone might not understand, like,00:16:42
what is the best way to do things? And, as good-natured people, that's what we think the problem is.00:16:46
Oftentimes, when I… because I've done it so many times.00:16:53
You can go one step above that and say, okay, like, you're not able to convince anyone. You also need to be aware of the fact that it could be00:16:57
that no one actually cares.00:17:05
They just want to have AI00:17:07
for the sake of having AI.00:17:10
I know I'm being… I'm just being… I'm being serious, because I've seen that many times. Probably, like.00:17:13
Hmm, a third of the time?00:17:19
that I interact with some organization.00:17:21
And then, you know, then you can't do anything.00:17:25
Because… you're not really building AI, you're just building marketing.00:17:30
So, that happens a lot, that's why I'm mentioning it.00:17:34
So you have to kind of try to see if you can understand00:17:38
sort of what's going on. No one's gonna tell you that, like, yeah, I'm just doing this to check the box, but you can… you'll…00:17:42
You can suss it out.00:17:50
Abhishek Panda
Got it, got it, got it. Cool, you can go ahead with other questions.00:17:53
Hamel Husain
Steve's back.00:18:00
Good question.00:18:02
steve.man
Yeah, I have a question, just on my specific use cases, so…00:18:04
I mean, maybe, Shreya, you don't know, but I'm working in the luxury industry, and I'm working in… mainland China, and in mainland China,00:18:09
compared to the rest of the world, we have a lot of data that we can use, especially on WeChat, which is, like, kind of the super app00:18:17
in China, and basically, we can have all the conversations between our vendors and the client. So there's a lot of unstructured data that is really, really powerful. And right now, we have our agentic system where00:18:27
we generate outputs, for example, to help our sales associates understand the client better,00:18:44
with all the data points that we have.00:18:51
But the thing is…00:18:54
I think I read something on the course, and this is something that our team is, like, and we are trying to solve, is also about imbalanced data sets that we have, because there's so many different scenarios, for our industries, and right now, what we have done to create a dataset is, like.00:18:56
we try, in the business sense, to cover, like, as many scenarios as possible. For example, like, someone that has no purchase transactions.00:19:16
but have an online behavior, something like this. Or someone that only have online behavior, but doesn't have any transaction or whatever. So we try to, in a business sense, to try to cover as many scenarios as possible. But the thing is.00:19:26
If we run the outputs and give them to the annotation team, and we see that, for example, there's 95% pass and 5% fail, how do you interpret that?00:19:42
Is it because we are biased in the way that we have selected these specific test cases? Do we need to add more randomness in terms of the dataset that we're choosing?00:19:56
Or, if we say that, okay, we have 95%00:20:09
pass, does it mean, like, our dataset is really imbalanced in a kind of way, and we need to find more cases like this 5% of fails?00:20:12
Shreya Shankar
How are you getting those initial…00:20:25
trace, or, like, initial data points in the first place. Like, you said you're just… you're coming up with them, like, thinking of these… like, doing synthetic data generation, or is it actually user conversation?00:20:27
steve.man
So… the fields that we use, basically, the data points that we're using, we have a lot. We have around… I mean, for us, we have 4,000, but for each specific use cases.00:20:40
we filter it down. And how do we filter it down? It comes from an iteration between our business sense about what kind of fields are the most interesting, and filtering down with an LLM to remove maybe the duplicate fields that create more noise for the…00:20:53
for the LLM. So for each specific use cases, so for example, right now, I'm just, like, talking about one use case is about understanding more the clients.00:21:13
about the transaction, about their behavior, about the relationship between the vendors and the clients. And once we have these fields, we create scenarios.00:21:22
For example, imagine… I see. Yeah.00:21:33
Shreya Shankar
Yeah, so I think it… because you haven't, like, deployed your application yet, right, there's really no way of knowing how it's gonna do. Like, if you're getting…00:21:36
95% good traces. Like, you could try to look at your 5%, sample more that's similar to that 5%, and then continue doing some error analysis on that, seeing if you uncover, like, new failure modes or new axial codes eventually.00:21:45
But I don't…00:22:02
The goal… you're not, like, ultimately successful if you're able to, like, get perfectly 50-50 or anything. Like, that's not the goal of this, right? It's, like, really to feel like, okay, you've really tried to see as many different failure modes00:22:05
that have come up as possible, and correct them, or fix them in some way, and now you're gonna, like, ship it, and then new things are gonna come up once you've deployed it, and…00:22:21
it is what it is, and you'll continue to do a cycle, right? It's never perfect.00:22:29
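A minimal sketch of "sample more that's similar to that 5%": rank unreviewed scenarios by their similarity to known failing ones and review the closest first. Word-overlap (Jaccard) similarity is a stand-in chosen to keep the example dependency-free; an embedding-based similarity would be a natural swap. The scenario texts are invented for illustration.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_unlabeled(failed, unlabeled, k=2):
    """Return the k unlabeled scenarios most similar to any failed scenario."""
    if not failed:
        return []
    failed_tokens = [tokens(text) for text in failed]
    scored = []
    for text in unlabeled:
        score = max(jaccard(tokens(text), f) for f in failed_tokens)
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical scenarios, invented for illustration.
failed = ["client has online browsing but no purchase history"]
unlabeled = [
    "client has no purchase history but browses online weekly",
    "client bought two rings last month in store",
    "new client with online browsing and no purchase yet",
]
picks = nearest_unlabeled(failed, unlabeled, k=2)
print(picks)
```

The selected scenarios then go back through error analysis to see whether they expose new failure modes.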
steve.man
That's my take on it.00:22:33
Because I feel… I agree with you, Shreya, because I feel like we are… maybe we have the wrong mindset right now. I think everything that we are doing is trying to kind of do overkill.00:22:35
is, like, trying to… because we don't know what is… because it's new for us, right? So we try to be as good as what we can, and because we're also thinking, like, how we can scale, because right now, I'm just talking about one use case, but we have.00:22:46
Shreya Shankar
Yeah.00:23:02
steve.man
20 use cases that are coming, right? And we see, like, oh, the evaluation part plus the manual refinement will take so much time. What is the baseline that we need to have? And this, we haven't defined. And I think it's case by case, in a kind of way.00:23:03
Shreya Shankar
Yeah, I think one thing that might really help is, like, try to pick a first application, or what people call design partner, that you're, like, you tell them, hey, you know, this is the first deployment of this application, and…00:23:20
just establish an open line of communication, and hope that, you know, they don't get mad at you if anything is off with the AI, and use that as a learning experience.00:23:32
steve.man
Okay.00:23:44
Because for us, at the end of the day, compared to other industries, we're kind of, like, lucky in a way, because we have a human in the loop. So for us, the use case is, like, the vendor… no, no, the sales associate, or our vendor, will make his own judgment on whether this information is good or not.00:23:45
Shreya Shankar
Sure. Yeah, so I would say, like, also track all of this, right? Like, have buttons where that sales associate could say, like, this was useful, or, like, I used this, or, like, this was not useful, stuff like that. Like, anything that can help you get the signal there, yeah, but definitely don't just, like, keep endlessly generating synthetic data and trying00:24:04
to… see, get it to 100%, because it's gonna be very hard.00:24:23
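A minimal sketch of tracking the feedback buttons mentioned above, assuming three signals ("useful", "used", "not_useful") and an in-memory list of events. The class name, fields, and signal names are hypothetical; a real deployment would persist events and tie them to stored traces.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed button values; the names are illustrative.
ALLOWED_SIGNALS = {"useful", "used", "not_useful"}


@dataclass(frozen=True)
class FeedbackEvent:
    trace_id: str
    signal: str

    def __post_init__(self):
        if self.signal not in ALLOWED_SIGNALS:
            raise ValueError(f"unknown signal: {self.signal}")


def summarize(events):
    """Share of each signal across all feedback events."""
    counts = Counter(event.signal for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {signal: counts[signal] / total for signal in sorted(ALLOWED_SIGNALS)}


def traces_to_review(events):
    """Trace IDs flagged as not useful: candidates for error analysis."""
    return sorted({e.trace_id for e in events if e.signal == "not_useful"})


events = [
    FeedbackEvent("t1", "useful"),
    FeedbackEvent("t2", "not_useful"),
    FeedbackEvent("t3", "used"),
    FeedbackEvent("t2", "not_useful"),
]
print(summarize(events))
print(traces_to_review(events))
```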
steve.man
Okay. Thank you, guys.00:24:28
Hamel Husain
One instinct that was triggered within me when you were talking is…00:24:30
Okay, so you described a use case where you… that is very… you said it's very broad in nature, there's a lot of different scenarios that can possibly happen, you have lots of data.00:24:35
from WeChat, which is, like, everything that you do in life.00:24:47
Like, everything that you eat, everything, everywhere you go, like, it's like everything, right?00:24:51
And you're saying that you're trying to give the sales associate some kind of contextual awareness of this person to prepare them00:24:56
Whatever, to interact with the customer.00:25:04
And you're saying that it's 95% pass, that just makes me very suspicious. But my job is to be suspicious. Like…00:25:06
And you should be suspicious.00:25:14
And, you should try to be as skeptical as you can.00:25:16
and say, Okay, why is it 95% passing? Are we saying, like, 95% passing is…00:25:20
like, are we being… are we scrutinizing it enough? Can it be better? Can we raise the bar? Like, what is… what do you mean, 95% passing? You know, just be really skeptical of that, because I find it…00:25:28
Surprising?00:25:40
steve.man
That is 95%… like, I just wouldn't expect something to be 95% bad. The 95% came out of the blue. It's just, like, a random number that I threw out. On our side right now, we have done all the dataset test cases or whatever. Now we are on the…00:25:41
we tell the annotation team, or the business team, whatever: annotate anything you want. We give them, like, a plain test.00:26:00
Because we don't want to…00:26:09
give some kind of bias about what kind of hallucination we're looking for. We just let them go, and afterwards we are going to collaborate again to see what the rubric or the criteria should be that we want to fix for the specific use cases00:26:12
to annotate.00:26:29
Hamel Husain
See?00:26:30
Okay.00:26:31
It's really interesting, right? Because I'm imagining, okay, if I'm a sales associate.00:26:33
I kind of won't know if the prep is good until… like…00:26:37
Much later. I have to, like, talk to the person, get to know them…00:26:43
But usually all that stuff. But usually, like, in luxury.00:26:48
steve.man
one vendor manages, like, 100 people, and the issue is, like… because we're selling, like, high jewelry, so it's kind of expensive in a way, the experience that we need to provide to each client needs to be really, really high.00:26:52
But right now, the issue is, like, you cannot, as a human, manage 100 people. It's not possible. 100… 100 clients. So what they're going to do is, like,00:27:09
among these 100 clients, which are the clients that have the most potential to… to buy your new pieces? So they just focus on 5 to 6 interactions.00:27:19
But what we want is, like, oh, how, with an AI system, can we scale this intelligence to 100 people?00:27:30
So, we have a clienteling tool that we integrate with AI, where basically each SA, for each client, can know really deeply00:27:38
who this client is. Like, in terms of who this client really is, what his passion is, what really drives him into buying a new product, how can I00:27:50
improve the relationship with this client, not only in the commercial way, but more… really, like, deeply know about these clients. And the beauty in China compared to the rest of the world is we have everything. Because even on WeChat, we know00:28:00
the browsing behavior on the mini program, which is an app embedded into WeChat. So we know the whole client footprint. It's scary in a kind of way, but as a data point, it's kind of, like, cool to have all this data, and to be able to use it.00:28:18
Hamel Husain
Can you backtest it somehow? Like, meaning…00:28:39
steve.man
Okay, like…00:28:42
Hamel Husain
If you have a sales associate, or you have, like, some retail store, and you have, you know, some sales that have occurred before, and you ask the sales associate, okay, for this last sale that you had, write down everything that you think was important to know about this person.00:28:43
That would've… that, like, that you already knew.00:29:00
that is inherent to, like, the sale or whatever. And you could, like, build this, like, and see if the, like, the AI could help… could, like, come to the same…00:29:04
Kind of conclusion, or, you know…00:29:14
I don't know, something like that, like, sort of backtest.00:29:17
In a way.00:29:20
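A minimal sketch of the backtest idea: for a past sale, the sales associate writes down what mattered about the client, and the AI-generated brief is scored on how many of those points it recovers. Case-insensitive substring matching is a crude stand-in for a human or LLM comparison, and the sample facts and brief are invented for illustration.

```python
def backtest_recall(associate_facts, ai_brief):
    """Return (recall, missed): the share of associate-written facts that
    appear in the AI brief, and the facts that were not found."""
    brief = ai_brief.lower()
    missed = [fact for fact in associate_facts if fact.lower() not in brief]
    if not associate_facts:
        return 0.0, missed
    recall = (len(associate_facts) - len(missed)) / len(associate_facts)
    return recall, missed


# Hypothetical notes for one past sale.
associate_facts = [
    "anniversary in June",
    "prefers yellow gold",
    "collects watches",
]
ai_brief = "Client prefers yellow gold and has an anniversary in June."

recall, missed = backtest_recall(associate_facts, ai_brief)
print(recall, missed)
```

The missed facts are the useful part: each one is a candidate failure to trace back to the data the system had access to.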
steve.man
So, basically, this, we kind of have. It's like the knowledge that we have built over the years, working with business and working with these vendors, to know00:29:22
what the good principles are…00:29:32
and we kind of have some figures or KPIs that are mostly, like, relationship-based.00:29:34
Like, really understanding the client as if he's your friend, in a kind of way, but still having00:29:43
the line between you, but really knowing the clients, and really, like,00:29:51
touching the emotional part with the clients that triggers the sales. So, for example, and I always say this example is a really bad example, it's like, if we know that the parents… his dad is dead, right?00:29:57
And he was a really good client of our brand, and his child never reached out to us. He doesn't know about our brand and so on, so we reach out to this person, and we say, like, oh,00:30:11
we know that your dad was a really good client of ours,00:30:24
this is a gift, or this is something that we give to you, just, like, thanks for everything. And this triggers, like, a potential client for us in the future, because we trigger, like, emotional aspects between the vendors and the…00:30:28
and these clients.00:30:43
Shreya Shankar
Wow, sorry, that's, like, next level.00:30:46
steve.man
Yeah.00:30:51
Shreya Shankar
That may or may not be legal in some countries, I don't know.00:30:52
steve.man
Yeah, yeah, yeah. Yeah, but on our side, we… this is a terrible example, Shreya, it's so bad.00:30:58
We're not going to do this. We also implement a lot of guardrails in terms of what is the line not to cross… because compared to the rest of the world, in China we have so much data, and we cannot be creepy.00:31:07
Because sometimes, like, we tell the…00:31:20
the client, like, oh, we know something, XYZ, that happened, like, two months ago, and the client will say, how do you know that? And so, like, we need to have… and it's really cool, it's like, to have this subtlety.00:31:24
What kind of… what is the line that we…00:31:41
Shreya Shankar
Come on.00:31:43
steve.man
And this is really interesting. So the evaluation is really also on that. It's like, what kind of information can we display, what is the line that we cannot cross, apart from hallucination itself?00:31:43
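A minimal sketch of evaluating the "line that we cannot cross": facts are tagged with a category, blocked categories are filtered out before a brief is generated, and a second check flags any blocked fact that still shows up in the output. The category names, the sample facts, and the substring check are assumptions for illustration, not the team's actual guardrails.

```python
# Hypothetical categories that should never surface in a client brief.
BLOCKED_CATEGORIES = {"bereavement", "health"}

facts = [
    {"text": "prefers yellow gold", "category": "taste"},
    {"text": "father passed away recently", "category": "bereavement"},
    {"text": "attended the spring event", "category": "engagement"},
]


def displayable_facts(facts, blocked=BLOCKED_CATEGORIES):
    """Keep only facts whose category is allowed to reach the brief."""
    return [fact for fact in facts if fact["category"] not in blocked]


def leaked_facts(brief, facts, blocked=BLOCKED_CATEGORIES):
    """Return blocked facts whose text appears verbatim in a generated brief."""
    brief_lower = brief.lower()
    return [
        fact["text"]
        for fact in facts
        if fact["category"] in blocked and fact["text"].lower() in brief_lower
    ]


safe = [fact["text"] for fact in displayable_facts(facts)]
leaks = leaked_facts("Client prefers yellow gold. Father passed away recently.", facts)
print(safe, leaks)
```

A verbatim match only catches the simplest leaks; paraphrased mentions would need a reviewer or a judge model aligned to the same categories.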
Shreya Shankar
It's like the inverse of hallucination. It's like, you want to actively prevent yourself from knowing relevant information.00:31:59
steve.man
Yes.00:32:08
Even though you give the sales associate the last word, they will say, like, oh, but it's AI that told me to say that.00:32:09
Shreya Shankar
Yeah, of course.00:32:16
Pardeep
I think.00:32:18
Hamel Husain
know that someone died. Is that, like, a WeChat API? Crazy. The fire's like, you are dead.00:32:19
steve.man
Because in the conversation, you can see everything.00:32:25
Hamel Husain
You see that?00:32:29
steve.man
I'm not going to say, like, stuff to you guys. We see, like, crazy stuff.00:32:30
And I'm like, I don't want to read that.00:32:35
Hamel Husain
So is WeChat data made available to any…00:32:38
app on the WeChat ecosystem? Like, anybody can see any data? Like, how… what's the setup?00:32:41
steve.man
So, in WeChat, and after, I'll let Pardeep talk, because I think he has something to say. WeChat is basically a super app. Basically, you combine, like, Facebook, Instagram, everything, the payment system, into the same app. It's kind of similar to Line in Japan, I don't know whether you know about this.00:32:48
And on WeChat, basically, you have what we call official accounts and mini-programs. And mini-programs are all the applications embedded in WeChat. So on WeChat, you can pay, you can chat with everyone, and you can also access these00:33:07
applications. And for each application, if you want to know, we have… we have what we call, like, an open ID.00:33:25
And each application has its own OpenID, and you can link it with what we call this unified ID.00:33:33
And once we have this unified ID, basically, you can map, for example, that this client went to this specific, like, mini-program that belongs to us. We cannot track the mini-programs that… that don't belong to us, for example. So we know, for example, that this client went to our mini-program, checked00:33:40
each article for a certain amount of time, each event for a certain amount of time, and we can also link it back with the conversation between this client and these vendors.00:33:58
So there's a lot of digital footprint that we can just, like, map and see what00:34:12
Actually, who is this client?00:34:19
At a really deep level.00:34:22
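A minimal sketch of the ID linking described above, as a plain data-structure example: each first-party application sees its own per-app ID for a client, and a mapping to one unified ID lets events from those applications be grouped per client. All IDs, app names, and events are invented, and nothing here calls or reflects the real WeChat API; events from applications that are not in the mapping are simply dropped.

```python
from collections import defaultdict

# Hypothetical mapping: (app, per-app open ID) -> unified client ID.
# Only first-party apps appear in it.
open_to_unified = {
    ("boutique_app", "o-111"): "u-1",
    ("events_app", "o-222"): "u-1",
    ("boutique_app", "o-333"): "u-2",
}

# Hypothetical events, invented for illustration.
events = [
    {"app": "boutique_app", "open_id": "o-111", "action": "viewed_article"},
    {"app": "events_app", "open_id": "o-222", "action": "registered_event"},
    {"app": "boutique_app", "open_id": "o-333", "action": "viewed_article"},
    {"app": "other_app", "open_id": "o-999", "action": "viewed_article"},
]


def group_by_unified_id(events, mapping):
    """Group event actions per unified ID, skipping unmapped events."""
    grouped = defaultdict(list)
    for event in events:
        unified_id = mapping.get((event["app"], event["open_id"]))
        if unified_id is not None:
            grouped[unified_id].append(event["action"])
    return dict(grouped)


footprint = group_by_unified_id(events, open_to_unified)
print(footprint)
```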
Pardeep
It is like, Cambridge Analytica is possible and legal.00:34:27
Wouldn't, wouldn't, wouldn't care.00:34:33
steve.man
Yeah.00:34:38
Pardeep
I think my question, you know, was actually literally related to this, which is, you know, are there any best practices for data governance? Like, things like here, what he's talking about? I remember when I…00:34:40
When I was working at Meta, we had this, you know, whole DMA requirement that you can't have a certain amount of… certain data training your applications for recommendations, or even for ads.00:34:52
And that is basically how you segregate your data and how do you give permissions for applications and stuff like that, right? I think…00:35:04
we did it in a pretty hacky way, but I'm wondering if, you know, Hamel, you mentioned that you are working on some legal projects, so I'm wondering if there are any, like…00:35:12
industry standard patterns around how you manage this kind of, you know, data when it comes to ingestion, data produced by applications, and how you use those in a better way? I think the WeChat example is crazy in my mind right now.00:35:22
But I'm wondering what is the best way to do it?00:35:38
Hamel Husain
I don't think I'm the expert on that, to be honest. I don't really know what the best way…00:35:44
I mean, I can just say, like, what I've seen, but I don't… wanna make any…00:35:48
claims that it's the best way, because I actually don't know, I haven't thought about data security too much.00:35:52
Pardeep
Okay, okay.00:36:00
Shreya Shankar
Yeah, same, unfortunately. Well, other than, like, just trying to use vendors that…00:36:03
I don't know, are pretty well known, but I don't feel like I have, like, extra expertise.00:36:10
Maybe somebody in Discord does.00:36:22
Hamel Husain
Vignesh has a question.00:36:28
Vignesh Iyer
Yeah, hi, Hamel and Shreya. So maybe this was covered in one of the office hours, or the lectures, but…00:36:31
Like, the evaluations themselves, in terms of their life cycle, like, once they've been created.00:36:39
I guess, so you identify your failure modes, right? Your axial codes, your failure modes, and then you do the alignment and get an evaluator, and I guess the first use of the evaluator,00:36:45
you kind of found out it's not a specification kind of failure, that's why you moved into creating an evaluator for it. It's kind of a generalization thing. So you're using this evaluator for that particular failure mode to quantify00:36:59
how prevalent it is, and what action you're gonna go take on it.00:37:16
In the future,00:37:24
what role does that evaluator play? Is it more when you're adding maybe a new feature onto your app or something, you just run it past the same evaluator to see whether it kind of00:37:26
was on the same… it's still on par with where you left it previously. Maybe you left it at a specific place, and that's where it is. Is that the kind of future role it plays? So, mainly the question is… am I correct that the first use is just to…00:37:40
quantify that failure mode to see what can be done, if anything, and then in the future, it's more that you make changes, and then you see whether those changes00:37:55
moved it in any way?00:38:06
Shreya Shankar
I think you have it pretty accurately. What I've seen sometimes is I might change the model, or make, like, a pretty big change to my pipeline, and then suddenly my automated evaluator fails a lot more traces. And so having that is very helpful, right, because00:38:08
otherwise, I wouldn't know. I would think I had fixed it, because, you know, I had fixed the prompt, or I had brought in some more context or something, but now there's this, like,00:38:26
you know, different models behave differently, so I find that having these automated evaluators is very helpful. Another reason why I think they're helpful sometimes is if you're doing, like, consumer… or you're working directly with end users. Like, for example, I advise, like, an AI fashion startup.00:38:34
Sometimes, like, their preferences for certain things might change, or…00:38:53
An example that I like to give is,00:38:58
we had, like, a criterion for weather-appropriate,00:39:02
and when the season changes, the definition of weather-appropriate00:39:07
also might shift. You know, like, if it's winter, you might want something different. Some people run hot or cold or whatever, and then, you know, just…00:39:12
seeing that automated evaluator's performance on, like, traces generally was very helpful to uncover this00:39:23
problem of, like, I don't know, shifting.00:39:31
I don't know if what I'm saying makes sense, but it's like, you never really know, and it's, like, good to have it.00:39:34
Vignesh Iyer
So, so if I get you correctly, that eval that you created was… it wasn't towards hot or cold, it was just a general one, and you're just now.00:39:41
Shreya Shankar
Weather appropriateness.00:39:49
Vignesh Iyer
Okay, weather appropriateness. Got it.00:39:52
Shreya Shankar
Yeah, and it turns out, like, as something goes from, you know, August to, like, November, I don't know.00:39:54
maybe, like, we realize… the app is basically a stylist app, so it's, like, recommending outfits. Maybe you realize that you're recommending outfits that your LLM judge believes are not weather appropriate, and then it's worth looking into. And it's not like you did anything, it's just the season changed, or some, like, external thing happened.00:40:02
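A minimal sketch of the monitoring idea Shreya describes: keep running the same binary evaluator on production traces and watch its pass rate over time, so that a seasonal shift shows up as a drop even when nothing in the app changed. The function names, the 10-point tolerance, and the data are illustrative assumptions, not figures from the session.

```python
from collections import defaultdict
from datetime import date


def pass_rate_by_month(results):
    """Group (date, passed) evaluator verdicts by month and return pass rates."""
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [passes, total]
    for day, passed in results:
        key = (day.year, day.month)
        totals[key][0] += int(passed)
        totals[key][1] += 1
    return {key: p / n for key, (p, n) in sorted(totals.items())}


def flag_drops(rates, baseline, tolerance=0.10):
    """Return the months whose pass rate fell more than `tolerance` below baseline."""
    return [month for month, rate in rates.items() if baseline - rate > tolerance]


# Made-up verdicts from a hypothetical "weather-appropriate" judge.
results = (
    [(date(2025, 8, d), True) for d in range(1, 9)]      # August: 8 passes
    + [(date(2025, 8, d), False) for d in range(9, 11)]  # August: 2 failures
    + [(date(2025, 11, d), True) for d in range(1, 6)]   # November: 5 passes
    + [(date(2025, 11, d), False) for d in range(6, 11)] # November: 5 failures
)

rates = pass_rate_by_month(results)
flagged = flag_drops(rates, baseline=0.8)
print(rates)    # pass rate per (year, month)
print(flagged)  # months worth a fresh round of error analysis
```

A flagged month is only a prompt to go read traces again; it does not say whether the app, the judge, or the world changed.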
Vignesh Iyer
Okay, so here, again, it's not necessarily that there were failure modes that you kind of… you maybe found failure modes in one kind of weather condition, but then you expect the seasonality kind of trend to be there, and you just want to keep it along, and…00:40:22
Shreya Shankar
I mean, I think that's pretty generous, like, we created an LLM judge. There was a weather-appropriate failure mode. We, like, fixed it, but we had this LLM judge to just, like, measure the prevalence of it. When we fix it, it's never going to go to, like, 100%, right? It's like…00:40:40
It goes from, like, maybe 30% to, like, 75% or 80% or something, and we're like, okay, you know what, this is the best we can do, and we'll just keep it there and see how it progresses.00:40:55
And we didn't think that… maybe if we had spent some time thinking about it, we would have realized it might…00:41:06
behave differently in the winter, or, like, might become helpful again the next season, but we just, like, kept it there.00:41:12
And then…00:41:20
It just was helpful. Yeah, it's like, we had so much going on, right? You're, like, building a product, you can't always, like, anticipate every…00:41:21
data drift in the world.00:41:29
Vignesh Iyer
That makes sense, that definitely makes sense. So, just a quick question on that is,00:41:32
when you say, so you've got these failure modes, and then you said you fixed it. So when you say you fixed it, maybe it was for that weather condition at that point where the failure mode occurred, you kind of went and did some specification kind of thing? But then you still felt that this would generalize, so…00:41:36
You kept it along.00:41:55
Shreya Shankar
Yes, I think I was… when I say we fixed it, like, it wasn't 100%, and I think…00:41:59
the real issue was that some people run hot, some people run cold, and the way to solve this problem was to get extra context from the users. Like, in the summer, for example,00:42:05
it's very hot, and so, like, people… the AI stylist likes to recommend, like, shorts and t-shirts and stuff, but oftentimes these outfits are not work-appropriate, so there's this weird balance of, like.00:42:19
Okay, we have to, like, explicitly go against the default behavior of the model, because this is, like, a real-world styling application.00:42:31
And so, like, what we're gonna do is just try to elicit context from the user of, like, okay, do they work in an office? Do they run hot or cold? Like, get all this context. Some users provide it, obviously some users don't, not everybody checks their notifications. And then fixing it meant using this context to try to get a better outfit recommendation.00:42:40
And so, probably, yeah, you're right, it's, like, very summer-geared, or whatever it was, but, like, that's kind of… we did analyze, measure, improve, and, like, that was the…00:43:00
Natural trajectory of how we attempted to solve the problem.00:43:09
Vignesh Iyer
No, thanks, that was a great example. And just considering there's no other questions, I just would like to go on a follow-up. So, we spoke about, you know, first finding failure modes and using the evaluator for the failure modes, and why you would want to keep it continually, so now I got those two parts.00:43:14
Now, in terms of, like, monitoring, like, in production, I did have a question about this in Discord as well, where…00:43:32
Do the traces that run through your app in production also just…00:43:40
run through the evaluators? Apart from you picking and sampling traces to do the next error analysis, do you just let everything run against all your evaluators? And one of my questions in the Discord, I think Hamel helped answer that, was:00:43:46
How do you know… let's say it was very specific to a trace that you did something, maybe it was a shopping cart that needed to be generated right at the end, but you don't always get to the shopping cart generation in your app, but your,00:44:03
you know, your evaluator was geared towards that. Does your evaluator, like, when you run it just against everything.00:44:21
Is it generalistic? Can you do that?00:44:29
Hamel Husain
Yeah, I like to make the evaluators generalistic, so, like, I mean, you know, you want to have the logic in there, because, like, the evaluators are binary, right? So if it doesn't apply, then it passes. If it applies to that, then it, you know, then it can fail.00:44:37
In… sometimes you can be smart about it and say, like, look, like, you're not gonna even bother computing anything if it's not relevant, if it's not, you know, like, there's no shopping cart, then you don't want to, like, make an LLM call or whatever.00:44:54
But, you know, generally speaking, like, you can do that. You do want to make sure you have a good enough sample size that you're gonna trigger that particular failure.00:45:10
If you're not confident that you do have that sample size, then, you know, you kind of maybe sample more smartly for that, to make sure you're testing that.00:45:19
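A rough sketch of the pattern described above: a binary evaluator that passes when the check does not apply, skips the expensive judge call in that case, and reports how many traces actually exercised the check. The trace shape, the `shopping_cart` key, the stub judge, and the `min_applicable` threshold of 30 are all hypothetical.

```python
from typing import Callable


def evaluate_cart(trace: dict, judge: Callable[[dict], bool]) -> dict:
    """Binary evaluator that passes by default when the check does not apply."""
    if trace.get("shopping_cart") is None:
        # No cart was generated, so skip the (potentially costly) judge call.
        return {"passed": True, "applicable": False}
    return {"passed": judge(trace), "applicable": True}


def summarize(results: list, min_applicable: int = 30) -> dict:
    """Report the failure rate over applicable traces and whether the sample is large enough."""
    applicable = [r for r in results if r["applicable"]]
    failures = sum(not r["passed"] for r in applicable)
    return {
        "applicable": len(applicable),
        "failure_rate": failures / len(applicable) if applicable else None,
        "enough_samples": len(applicable) >= min_applicable,
    }


judge_calls = []


def stub_judge(trace: dict) -> bool:
    """Stand-in for an LLM judge: an empty cart counts as a failure."""
    judge_calls.append(trace)
    return len(trace["shopping_cart"]) > 0


traces = [
    {"messages": ["just browsing"]},                    # never reached the cart
    {"messages": ["buy socks"], "shopping_cart": ["socks"]},
    {"messages": ["buy a hat"], "shopping_cart": []},   # cart generated but empty
]
results = [evaluate_cart(t, stub_judge) for t in traces]
summary = summarize(results)
print(summary)
```

When `enough_samples` is false, the failure rate says little, which is the cue to sample specifically for traces that reach the cart.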
By the way, you did say something earlier on that I just wanted to make sure we clarify. You said something about00:45:29
okay, you create evaluators for things that are not specification. I just want to, like, clarify, like, You…00:45:38
You can still create evaluators for specification issues. It's just a spectrum of, like, what is easy and what is hard.00:45:46
And…00:45:54
Vignesh Iyer
Hmm.00:45:54
Hamel Husain
On one side of the spectrum, there's, like, very easy things. So, like, there's very easy, trivial specification things that are very obvious, and there's, like, harder specification things, like, I'm not really sure how to clean up the prompt. I might have to iterate a lot on this. I'm not really sure.00:45:55
And then there's, you know, other kinds of specification things.00:46:12
And, like, the harder the issue is, like, that…00:46:17
signals that you will benefit from an eval. So, just want to make sure that you know that it's, like.00:46:21
It's not just like, hey, if it's… it's like, you fix your specification, then you do evals, no. I just want to make sure, like, it's clear.00:46:28
Shreya Shankar
Yeah, that's a good point. In the course we say that because most of the specification errors we see00:46:36
are very easily fixable. But there are still sometimes specification problems where it's like, you… you don't know how to articulate or describe the thing that you want the LLM to do, like.00:46:42
this happens a lot in document processing. The best way to describe the task is, like, to show you examples, but documents can be very long, and it's… it's really hard, like, actually what you should do is onboard for hours and, like, understand what I'm trying to say, and then, like, you'll get it.00:46:55
I don't know, there are some, like, really difficult specification issues, and maybe evals are a way to help with that, if you find that you're struggling to articulate that.00:47:11
Vignesh Iyer
That makes a lot of sense, and also, I'm thinking about… somebody had mentioned this, I think, the other day, where00:47:21
you've got 2 or 3 specifications that go against each other, almost. They're, like, very similar, so then you can't…00:47:30
You know how to specify it, but you don't know how to specify it with the other combination thing to make sure they both live, like, harmoniously, so…00:47:37
Yeah, okay. That… that makes sense.00:47:45
Yeah, thank you.00:47:50
Hamel Husain
Abhishek?00:48:00
Abhishek Panda
I think I'm the only one.00:48:01
Shreya Shankar
Award for most questions asked today.00:48:03
Abhishek Panda
Yeah, I just, I mean, I remember one more question I wanted to ask. Yeah. So,00:48:06
I mean, earlier I explained about our agents, right? For certain agents we have existing users, and for certain agents we only have new users, because they are very new to the market. So, from the scenarios, we created synthetic queries.00:48:11
But right now, the engineers have given us access to the external API, where, you know, I mean, we can interact with the assistant via LLM, and00:48:25
we can also see, since it is a website builder agent, what changes are happening within the website. So, Hamel and Shreya, what would you suggest to us? Because right now we have this external API access as well, and sometimes as a user, when we can see and chat, right,00:48:35
that's a different context from when we just have the context in terms of JSON, right? Because scenarios alone are not enough. Now we can also see what changes are happening00:48:51
within the content, okay? I am talking in terms of, like, holistic context. So here, how would you suggest designing the prompt? I mean, to put in that external information, like, whatever changes are happening. Say you mention, "We have this kind of business,00:49:02
and this, and this. Could you design a website?" It will design a website, and then you would suggest some design changes.00:49:18
And now, via those external APIs, like, Playwright has been integrated, so we can access that external information as well. So,00:49:27
in my synthetic query pipeline, right, for offline evaluation, what would you suggest? Should we put the entire information in a prompt, or what is the best practice to design it when you have access to the external API as well?00:49:36
Shreya Shankar
It really depends, I think, on what the API is.00:49:51
like, Playwright or a browser, I don't know why you would use… or you need to have a little bit of understanding of what00:49:55
you're using the tool for, what are the kinds of outputs the tool is going to have, and, like, then only does it make sense to think about, okay, how do I represent that in a prompt, right? Like, if I'm doing web development.00:50:05
And I have a browser tool.00:50:16
the best way to represent that in the prompt is actually showing sequences of, like, screenshots, for example. But if I'm hooking up to a browser tool because my application has something to do with, like, deep research, I don't know if, like, a screenshot is the best way to represent that, or, like, to pull that back, pull that information00:50:18
into the prompt. Does that make sense, what I'm saying?00:50:38
Maybe I'm misunderstanding your question.00:50:42
Abhishek Panda
Yeah, so, what I'm trying to ask is, like, we can create scenarios to, you know, generate our synthetic queries. I'm saying for, you know, agents like, for example, a website builder or designer,00:50:44
where you can see, I mean, that context. Earlier, we didn't have access, but right now we have access to the content as well. So, in my synthetic00:50:58
query generation pipelines, I mean, when we have that, what is the best practice to design them so that, you know, you can better00:51:08
generate queries, because it will create a complete…00:51:17
Hamel Husain
Yeah, so, okay, like, this one, like, I'm imagining it's, like, Lovable, okay? And if you have Lovable… if you're trying to evaluate Lovable, I don't think you're gonna have good success using synthetic queries.00:51:21
Just being honest with you.00:51:34
Because, I mean… a real-life situation is gonna involve a lot of back and forth.00:51:36
Abhishek Panda
Ugh, yeah.00:51:43
Hamel Husain
You know what I mean? Like, you're not gonna just, like,00:51:45
one-shot… one-shot websites, like, I don't know. Like, if you think you can do one-shot websites, great, but that's probably not a thing right now.00:51:48
And so… You really need to get…00:51:57
I mean, you can… so, like, synthetic data, it would involve simulating a human.00:52:01
Which in itself is non-trivial.00:52:07
So, like.00:52:09
you can do it. There's some solutions, but I don't… I would honestly say don't do that, and try to get some early users.00:52:11
That's the only choice, because it's very complex.00:52:21
Abhishek Panda
Are you saying…00:52:24
So, with design experts, whatever rubric you have designed for the scenarios, stick to that,00:52:25
and don't get external information to generate the synthetic queries? Did I understand correctly, Hamel?00:52:31
Hamel Husain
I would just say, I'm not sure that synthetic…00:52:37
synthetic queries are gonna work for you.00:52:41
You know?00:52:43
Because, like, what kind of synthetic query can you…00:52:44
Abhishek Panda
I'm in this…00:52:46
Hamel Husain
Hold on.00:52:47
Abhishek Panda
Yeah. How could we… I mean, that is the need, because it is a new-user problem, right, before getting into the market. So, how can we do offline evaluation for these kinds of agents, where the agent is completely new and we don't have conversations? So, for this kind of application, how can we00:52:48
leverage, I mean, synthetic data? I mean, upfront, you have mentioned don't use synthetic00:53:06
data, or don't trust it, but… what would you suggest in this kind of case,00:53:11
for new users?00:53:15
Hamel Husain
Yeah, I would suggest, like, getting some design partners, getting a few early users, using it yourself, building lots of websites yourself.00:53:16
And then, potentially, you know, you could try to simulate this, you could try to, like…00:53:26
I don't know, simulate those conversations, but it's tricky. That's like a…00:53:34
If you can simulate a human being, that is a product in itself, like, and there are startups that try to do that for this purpose, but, like,00:53:38
I would… I would lean on…00:53:47
Like, scaling, not in terms of…00:53:50
not… don't bootstrap in terms of synthetic data, bootstrap with users. So, like, start with one user, then go to 3 users, 5 users. That's…00:53:53
Because it's pretty complex, I would say. It's… there's a lot of… it's gonna be… It's gonna be hard.00:54:03
Abhishek Panda
So, design partners helped us with a rubric, like, how you can design, I mean, how the users could be, I mean, from various businesses or, you know, small.00:54:11
Hamel Husain
No, no, they need to be… they need to use it. They need to use the product itself.00:54:23
Abhishek Panda
F.00:54:28
Got it, got it.00:54:29
Find it.00:54:31
Hamel Husain
Francesco.00:54:36
Francesco Lanciana
Hello! Yeah, mine's… vague, vaguely, so vaguely related. It's just with…00:54:38
I was discussing with someone last night about… so I have this, like, a tool and an agent which allows you to create an event, and they were kind of like, wait, couldn't you just have a tool00:54:47
that kind of uses, there's, like, a calendar event. Couldn't you have a tool that, like, knows what the OpenAPI spec00:54:57
is for, you know, whatever you're trying to call, and it's just generic, and it's, like, the agent is just kind of passing in what it, like, needs to know to be able to call this external tool, and, you know, so you're not having to create a tool per, like, create, update, delete, all the different things that, like, that service can do. It's just one.00:55:05
Have you seen, like, any success with this?00:55:25
whatsoever, or if, like, I don't know if you've come across similar use cases, or if you found, like, it's better to break it apart? I know you say, like, start simple with a lot of things, I'm like, what is the simple version of that? Like, is it the one that does everything, or not?00:55:28
Shreya Shankar
I think it depends on the API. If these different API functions reuse parameters a lot, then you could probably have, you know, one tool that00:55:41
allows… basically, it's, like, a one-catch-all tool, where the first argument is, like,00:55:54
the subtool name, and then the second argument is, like, a packed00:56:00
list of arguments that you're gonna pass into that tool. Like, think of it as, like, a wrapper tool.00:56:05
But in the case where the tools are actually, like, pretty… like, the schemas for those tools are pretty different, I found that, like, you… you do really want to prompt OpenAI with the right schema, because it's likely to follow the schema if you give a very detailed schema.00:56:10
So some tools, you know, you just…00:56:27
need your tool… your arguments to, like, pass some validation. Just saying, like, list of args.00:56:30
in a hand-wavy way, you know, it's not going to get a correct tool call. I know it's very abstract.00:56:38
does what I'm saying kind of make sense? Like, think about how much you need, like, schema validation, and how complex the schema is.00:56:45
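A minimal sketch of the wrapper-tool idea, assuming a tiny calendar service: one catch-all tool takes a subtool name plus a packed dict of arguments, and validates the arguments against a per-subtool schema before dispatching. The function names, the schema format, and the error shape are illustrative, not part of any real calendar API.

```python
def create_event(title: str, start: str) -> dict:
    """Hypothetical calendar operation."""
    return {"action": "created", "title": title, "start": start}


def delete_event(event_id: str) -> dict:
    """Hypothetical calendar operation."""
    return {"action": "deleted", "event_id": event_id}


# Each subtool maps to (function, expected argument names and types).
SUBTOOLS = {
    "create_event": (create_event, {"title": str, "start": str}),
    "delete_event": (delete_event, {"event_id": str}),
}


def calendar_tool(subtool: str, args: dict) -> dict:
    """Catch-all tool: dispatch to a subtool after validating its arguments."""
    if subtool not in SUBTOOLS:
        return {"error": f"unknown subtool {subtool!r}; expected one of {sorted(SUBTOOLS)}"}
    func, schema = SUBTOOLS[subtool]
    missing = [name for name in schema if name not in args]
    unexpected = [name for name in args if name not in schema]
    wrong_type = [
        name for name, expected in schema.items()
        if name in args and not isinstance(args[name], expected)
    ]
    if missing or unexpected or wrong_type:
        # Return a structured error so the agent can correct its next call.
        return {
            "error": "invalid arguments",
            "missing": missing,
            "unexpected": unexpected,
            "wrong_type": wrong_type,
        }
    return func(**args)


ok = calendar_tool("create_event", {"title": "Standup", "start": "2025-10-24T09:00"})
bad = calendar_tool("delete_event", {})
print(ok)
print(bad)
```

The trade-off is the one raised above: the model only sees "a name and a dict", so the more the subtool schemas differ, the more the validation step has to catch what a detailed per-tool schema would have prevented.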
Francesco Lanciana
Yeah, I think in this case, I mean, you would need it, because if the schema is invalid, the whole thing just wouldn't work, so…00:56:54
But yeah, it was more like, yeah, can you rely on just, like, an API spec, and just feed that as, like, as the description to the tool, and it'd be like, oh yeah, cool, I know how to call that to get what I want, or is that just not going to be reliable?00:57:02
Shreya Shankar
One thing you could do, which is, like, not an AI step, is, like, have some function that turns in00:57:15
like, I don't know, like a decorator you could apply to a function in your codebase that creates, like, an OpenAI schema… sorry, OpenAPI schema.00:57:22
And then, like, kind of creates a new tool on the fly for you when your agent is running.00:57:32
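One way the decorator idea could look, as a sketch only: a `tool` decorator that inspects a plain Python function and registers a JSON-Schema-style description of its parameters, which an agent loop could hand to the model. The registry, the type mapping, and the output shape are assumptions; this does not produce a full OpenAPI document.

```python
import inspect

TOOL_REGISTRY = {}

# Simplified mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool(func):
    """Register a schema description of `func`, derived from its signature."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unrecognized types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    TOOL_REGISTRY[func.__name__] = {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }
    return func  # the function itself is left unchanged


@tool
def create_event(title: str, duration_minutes: int = 30) -> dict:
    """Create a calendar event."""
    return {"title": title, "duration_minutes": duration_minutes}


schema = TOOL_REGISTRY["create_event"]
print(schema)
```

Because the schema comes from the code, it stays in sync with the function, and no model call is involved in generating it.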
Hamel Husain
You have to be cognizant of context rot. Like, for example, like, a really large API, like the GitHub API. There's a lot of endpoints, right?00:57:40
And so, if you don't… you probably don't need two-thirds of those endpoints.00:57:48
Francesco Lanciana
Hmm.00:57:53
Hamel Husain
So… You know, if you're gonna… if you have an application, you should scope it.00:57:54
And like, for my own personal00:57:59
tools that I have,00:58:02
you know, I have, like, CLI tools that are, like, extremely scoped.00:58:03
Because, like, you know, like, with the GitHub API I only use, like, 4 endpoints, or something.00:58:08
And I make a little CLI tool, because that just, like, works better.00:58:13
Francesco Lanciana
book reach.00:58:17
What's that? You just lean towards having, like, a tool for each in that case, like, if it was just.00:58:19
Hamel Husain
to make them…00:58:24
Francesco Lanciana
Extremely scoped, or…00:58:25
Hamel Husain
No, like, I guess she… I was just… not for GitHub exactly, like…00:58:26
you know, I have one for, like.00:58:33
like, YouTube, like, making blog posts?00:58:34
That does, like, very specific things, and it's, like, scoped to, like, go through, like, a very specific flow. For GitHub, I could, you know, I don't think you would… yeah, I would just use GH…00:58:38
CLI and just put it in the prompt, be like, these are the 3-4 things you should consider. I wouldn't even make a tool, so that's probably not the best example, because it already is a tool. It's pretty good. I can tell the AI about it. But…00:58:52
if I was… Yeah, like, I do have my own tools.00:59:04
Like, I have one, like, that kind of wraps Gemini.00:59:09
It's called GEM.00:59:14
That just does stuff that I want. It's, like, a CLI00:59:16
I give to, like, my coding agent.00:59:20
That does, like, some very specific stuff around, like, YouTube transcription stuff, and…00:59:23
you know, I annotated a blog post.00:59:29
But yeah, I would think about it, like, I mean, you know, don't…00:59:34
I don't know, like, you know, you have to do the evals.00:59:38
Good to know. Alai Kam.00:59:41
you know, I think…00:59:44
Just, like, make sure, like, if there's a lot of surface area that you don't need, then that's bad. That's all.00:59:46
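A small sketch of the scoping point, under the assumption that tools are described as dicts with a `name`: start from a large catalog and expose only an explicit allowlist to the agent, so unused surface area never reaches the context window. The catalog and endpoint names are made up and do not correspond to real GitHub endpoints.

```python
def scope_tools(all_tools: list, allowed: list) -> list:
    """Return only the allowlisted tool descriptions, in allowlist order."""
    by_name = {t["name"]: t for t in all_tools}
    unknown = sorted(set(allowed) - set(by_name))
    if unknown:
        # Fail loudly if the allowlist drifts from the catalog.
        raise ValueError(f"allowlist names not found in tool catalog: {unknown}")
    return [by_name[name] for name in allowed]


# Hypothetical catalog standing in for a large API surface.
CATALOG = [
    {"name": f"endpoint_{i}", "description": f"Hypothetical endpoint {i}"}
    for i in range(300)
]
ALLOWED = ["endpoint_3", "endpoint_17", "endpoint_42", "endpoint_99"]

scoped = scope_tools(CATALOG, ALLOWED)
print(len(CATALOG), "->", len(scoped), "tools exposed to the agent")
```

Whether four tools actually beat three hundred for a given application is still something to confirm with evals rather than assume.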
Francesco Lanciana
Hmm.00:59:52
Yeah, okay.00:59:54
Shreya Shankar
Also, these models can actually handle a lot more tools than you think, like, if it's really create, read, update, delete, like, more tools is fine.00:59:56
I think, like, the moment, it's, like, hundreds of tools. It actually, even apparently for hundreds of tools, if you look at, like.01:00:04
I think it's, like, Berkeley Function Calling Leaderboard, like, it'll support hundreds of tools.01:00:11
thousands of tools, maybe that's when you… I don't know. But yeah, maybe it's a little too premature to worry about01:00:17
Like, whether you should wrap some tools if really there's only, like, 5 or 6.01:00:25
Francesco Lanciana
Yeah.01:00:32
Yeah, cool. Thanks.01:00:33
Hamel Husain
Alright, seems like a good time to wrap things up.01:00:43
Thanks, everybody, for coming.01:00:46
See you next time.01:00:48
Dr Katya May
Thanks. Thank you. Thank you.01:00:49
Hamel Husain
Thank you.01:00:52
Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A