How Google Labs Builds AI Products: Lessons from Google Labs' Kelly Schaefer

Highlights on product strategy, eval and team tactics, and more from a leading AI product team.

In our latest episode of Deployed, we talk with Kelly Schaefer, a Product Director at Google Labs, who’s led a portfolio of early-stage AI products including NotebookLM, the Jules coding agent, Stitch, and more. She's responsible for taking experimental AI research from Google DeepMind and turning it into real products and businesses that can work at Google scale. It sounds like one of the most fun jobs in AI! Her work has also earned her recognition as one of the top 100 Women in AI.

What makes this conversation particularly valuable is how many reps Kelly has gotten with the messy realities of building production AI systems. Her teams have launched all sorts of AI products — from coding to voice to visual applications — and she's learned through experience what actually works when you need to ship reliable AI features at scale.

For product leaders, engineering managers, and anyone building AI products, Kelly shares fun stories and strategic perspective; tactical insights on evaluation, team structure, and decision-making that could save months of trial and error; and career advice for anyone working in product management today.

Here’s the full episode if you want to watch, and you can also find it on Spotify, Apple Podcasts, and YouTube. Some of our favorite highlights for builders are below.

Evals Advice, Part 1: Start Early And Keep It Simple

So many software teams are trying to figure out evals, and one of the biggest mistakes we see is too much time spent thinking about what metrics matter in theory, without enough time looking at the data.

Kelly’s got crystal clear advice on this topic: 

"(Your) first eval set should just be a spreadsheet shared in the team of categorized by use cases and some examples that you probably did manually. And the reason I say that is because sometimes folks skip ahead to, ‘Let's formalize the eval process and make sure this is running smoothly.’ And then you can realize a month, maybe six months later, that you've been optimizing for the wrong thing… So step one… (is) having a discussion: Are we optimizing for the right things here? Are these the categories of use cases we care about? How would we know that we’ve hit on the right ones?"

It’s easy to spend time building elaborate evaluation infrastructure before understanding what actually needs to be measured. Teams can spend months optimizing metrics that don't correlate with user experience, leading to the frustrating situation where "your product manager, your marketer, your leadership executive could say, hmm, the product experience sure doesn't seem 37% better. What's going on here?"

Anchoring first in a simple dataset forces teams to manually examine their outputs and have real (human!) conversations about what "good" actually means for their specific use case. It's unglamorous, but it’s what actually works time and time again — even at Google.
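
To make that concrete, here's a minimal sketch of what a first "spreadsheet eval" might contain, written as a short Python script. The categories, inputs, and pass/fail judgments below are hypothetical placeholders, not anything from Kelly's teams; the point is just that a categorized list of hand-graded examples is enough to start the "are we optimizing for the right things?" conversation.

```python
# A minimal sketch of a first eval set kept as a simple spreadsheet/CSV.
# Each row is a hand-written example: a use-case category, an input,
# the behavior the team expects, and a manual pass/fail judgment.
# All values here are hypothetical placeholders.
import csv

rows = [
    ("summarization", "Summarize this meeting transcript...", "Concise, no invented facts", True),
    ("summarization", "Summarize this 50-page report...", "Covers every major section", False),
    ("q_and_a", "What does clause 4.2 of the contract say?", "Quotes the clause accurately", True),
    ("code_help", "Fix the bug in this function...", "Patch compiles and passes tests", False),
]

with open("first_eval_set.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "input", "expected_behavior", "manual_pass"])
    writer.writerows(rows)
```

The format doesn't matter much (a shared Google Sheet works just as well); what matters is that the whole team is looking at the same categorized examples and debating whether they're the right ones.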

Evals Advice, Part 2: Evals Are Central to AI Product Strategy

That might be a simple start, but once they know what to measure, Kelly's teams take evaluation seriously at the organizational level:

"The first slide in our product review deck is our eval scorecard. The reason it's first is not only because we need to talk about it every week or two, but also because it's a message to the team of honestly, nothing else matters if your product does not reliably serve the use cases that you've promised."

This isn't just about tracking metrics; it’s about making product quality core to team culture. By starting every product review with evaluation results, Kelly's team reinforces that quality is non-negotiable, even when shipping fast.

Evals also point the way to what areas of the product need attention. A lot of Freeplay customers have found similar value in a regular rhythm of looking at eval data closely and slicing it up by category — it’s a simple way to decide where to point further optimization energy.

“It could be that your overall score is at 63%, but if you break that down, you’re actually rocking it on one subset of use cases but you have a long way to go on another.”
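
Tooling aside, the underlying move is simple enough to sketch in a few lines of Python. The categories and pass/fail results below are made up for illustration; they just show how a middling overall score can hide one category that's doing great and another that has a long way to go.

```python
# Hypothetical breakdown of eval results by use-case category. An overall
# score in the low 60s can hide one category near 100% and another near 0%.
# All data here is made up for illustration.
from collections import defaultdict

# (category, passed) pairs from a hand-graded or automated eval run.
results = [
    ("summarization", True), ("summarization", True), ("summarization", True),
    ("q_and_a", True), ("q_and_a", True), ("q_and_a", False),
    ("code_help", False), ("code_help", False),
]

totals, passes = defaultdict(int), defaultdict(int)
for category, passed in results:
    totals[category] += 1
    passes[category] += passed

print(f"overall: {sum(passes.values()) / len(results):.0%}")
for category, total in totals.items():
    print(f"  {category}: {passes[category] / total:.0%}")
```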

Kelly’s teams also connect evaluation results to user feedback in real time:

"We'll actually match user quotes that we get from research or from social with what we're seeing in the eval set to show like, ‘Hey, we know this use case is an issue. Here's a couple of examples of people complaining about it on X... And here's the project (we’re doing to fix it).’"

The feedback loop between quantitative measurement and qualitative user experience ensures that improvements to evaluation scores actually translate into better product quality for customers.

The 70% Decision Rule for Shipping

We asked Kelly for her top advice for other product teams, and it was about balancing speed against perfection:

"At a foundational level (I believe) that making the 70% right decision quickly is more valuable than making the 90% decision more slowly. Just because if you're 70% of the way there, as long as you will still make fast decisions, you can course correct pretty easily… People surprise us with all of the many ways that they are using these products. And often, what we see them using actually significantly changes the product direction."

The key insight here is about getting products into the real world faster to learn from customers. This is particularly relevant for AI products, where user behavior can be unpredictable and the best way to understand what works is to get real usage data. 

The alternative — teams swirling around product concepts that haven't made contact with real users — leads to theoretical optimization without clear practical value.

Kelly acknowledges this doesn't apply universally ("there are big exceptions to that for certain types of products or high trust environments"), but for most AI product development, speed of learning is really valuable.

Evolving Roles in AI Product Development

One of the most practical insights Kelly shares is how traditional product roles are changing in the AI era:

"I think of these roles as always having been Venn diagrams, but originally very loosely overlapping. So the PM and engineering role have some overlap, but it's not that significant. Same with PM and UXR or whatever. Now it seems like those Venn diagram overlaps are actually getting a lot bigger."

At Google Labs, this means some PMs are spending more time on evals, reviewing data, and prompt engineering, while UX researchers are "making much more specific product recommendations because they are going through Discord and analyzing the Discord feedback in NotebookLM."

The change is dramatic enough that Kelly's team explicitly looks for people who can bridge domains: "We look for folks who are dot connectors" and people "who can connect across different domains."

For PMs and others adapting to this new reality, Kelly suggests being proactive about evolving your skillset: "As a manager... (it’s) really nice to hear someone say, ‘Hey, is it cool if I carve out an hour and a half every Friday morning where my pings are off, I'm not in meetings, and I'm focused on experimenting with these tools?’"

Check Out The Full Episode for the Rest

Those are just some of the highlights! Listen to the full episode to hear a lot more from Kelly’s experience. For anyone building AI products, her approach provides a practical framework for building lovable products and strong teams.

Want to hear more conversations like this? Subscribe to Deployed on Spotify, Apple Podcasts, or YouTube.
