Insights from Building AI Systems At Google Scale: In Conversation With Kyle Nesbit

Dec 10, 2024

We were fortunate recently to have a conversation on the Deployed podcast with Kyle Nesbit — a technical leader at Google who brings a unique perspective to building AI products. 

Kyle has a PhD in computer engineering and has spent the last 17 years at Google. He's worked on everything from high-performance distributed systems (he helped build the backbone of BigQuery), to machine learning in ads, to a three-year stint on the investing side with Gradient Ventures, Google’s AI-focused venture firm. These days, he's focused on Google Cloud's AI & Data strategy, and he’s been helping bring generative AI capabilities to life within Looker – the BI platform Google acquired in 2019.

Kyle’s perspective on building generative AI products is particularly valuable to the Deployed audience. He was at Google during the early days of transformer models, and he talks about how the applied use of the technology has evolved since. He's lived the transition from RNNs and LSTMs to the current era of large language models, and he’s got years of experience building production ML systems that work at Google-scale.

In this conversation, Kyle shares practical insights about:

  • Why the "scaling era" of generative AI might be ending, and what comes next

  • How to build effective processes for improving generative AI products

  • What it looks like to transform engineering cultures to work well with generative AI products

  • Practical advice on evaluation pipelines and metrics

  • Plus: Some advice for founders & investors on what makes AI products “investable”

A few key themes really stood out, and we dig into each below. You can check out the full episode here: Spotify, Apple Podcasts, and YouTube.

The Return of Engineering Craftsmanship

One of Kyle's most interesting observations is that we may be approaching the end of the "scaling era" in AI – where bigger models and more data were the primary drivers of improvement. He sees us returning to an era where careful engineering and optimization make the difference between good and great products. 

This speaks to the intense levels of experimentation and optimization many Freeplay customers and other generative AI teams are focused on these days — building out effective RAG systems, agent pipelines, etc.

Begin With Your Evaluations

When it comes to building AI products that actually work, Kyle’s number one tip: start with defining the metrics you’ll use for evaluation before writing any code. This mode of development is reminiscent of test-driven development. It means clearly defining what success looks like and how you'll measure it, then building and tuning your systems to hit those goals.
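To make that concrete, here's a minimal sketch in Python of what "evals before code" can look like. Everything in it is illustrative rather than anything Kyle prescribed: the eval set, the toy keyword-match metric, and the answer_question() stub are hypothetical placeholders. The point is simply that the success criteria and target exist before the system does.

```python
# A minimal sketch of "evals before code": the metric and eval set come first.
# The keyword check and answer_question() stub are hypothetical placeholders.

EVAL_SET = [
    {"question": "What was Q3 revenue?", "must_include": "revenue"},
    {"question": "Which region grew fastest last quarter?", "must_include": "region"},
]

TARGET_PASS_RATE = 0.9  # agreed with stakeholders before any prompts are written


def answer_question(question: str) -> str:
    """Stand-in for the AI system you haven't built yet."""
    return "TODO: not built yet"


def pass_rate() -> float:
    """Fraction of eval cases where the answer contains the expected keyword."""
    passed = sum(
        1
        for case in EVAL_SET
        if case["must_include"].lower() in answer_question(case["question"]).lower()
    )
    return passed / len(EVAL_SET)


if __name__ == "__main__":
    rate = pass_rate()
    status = "meets" if rate >= TARGET_PASS_RATE else "does not yet meet"
    print(f"Pass rate {rate:.0%} {status} the {TARGET_PASS_RATE:.0%} target")
```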

But evals are just the start. Kyle digs deeper on all the main parts of a successful production AI system:

  • Defining evals

  • Capturing production data, including the additional signals needed to build a data flywheel

  • Iterating on evals / metrics based on production learnings (“it’s not like you frame the problem once and then move on”)

  • Running any changes through an eval pipeline ahead of shipping, just like a CI/CD pipeline (see the sketch below)
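Here's a hypothetical sketch of that last point: an eval suite acting as a release gate, the same way a test suite gates a CI/CD pipeline. The run_eval_suite() function, revision names, and hard-coded scores are illustrative stand-ins rather than any specific tool's API.

```python
# Hypothetical sketch of an eval pipeline as a CI/CD-style release gate:
# score the candidate change against the shared eval set and fail the job
# if quality regresses past an agreed tolerance.

import sys

REGRESSION_TOLERANCE = 0.02  # allow at most a 2-point drop vs. the baseline


def run_eval_suite(revision: str) -> float:
    """Stand-in for scoring a revision against the shared eval set (0.0-1.0)."""
    fake_scores = {"prompt-v1": 0.86, "prompt-v2": 0.82}  # placeholder data
    return fake_scores[revision]


def gate_release(candidate: str, baseline: str) -> None:
    """Block the release if the candidate's eval score regresses too far."""
    candidate_score = run_eval_suite(candidate)
    baseline_score = run_eval_suite(baseline)
    msg = f"{candidate} scored {candidate_score:.2f} vs baseline {baseline_score:.2f}"
    if candidate_score < baseline_score - REGRESSION_TOLERANCE:
        print(f"FAIL: {msg}")
        sys.exit(1)  # fails the CI job, just like a broken unit test would
    print(f"PASS: {msg}")


if __name__ == "__main__":
    gate_release(candidate="prompt-v2", baseline="prompt-v1")
```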

Transforming Engineering Culture

One of the biggest challenges we hear about from teams who are shifting resources from other types of engineering to AI products isn't technical – it's adapting to the different rhythms of building AI products. Unlike traditional software development where features can be built to spec, AI development is more iterative and experimental. It requires continuous testing and optimization, and the resourcing requirements look different as a result.

Kyle shares some thoughts about what it looks like to build an AI product and engineering culture. Pro tip: Aligning on metrics up front with leadership and other key stakeholders creates a foundation for success — it lets the teams on the ground focus on optimizing those metrics.

Building A Strong AI Team Cadence

Kyle shares some specific recommendations for how to structure AI product development processes and team cadence, based on lessons learned. A few key elements:

  1. Orient wider teams around dashboards that capture the agreed-upon quality metrics or evals, and use those for reviews and tracking progress

  2. Form small teams (2-3 people) focused on specific aspects of model quality

  3. Document experiments and learnings regularly

  4. Organize around sprints, with retrospectives focused on sharing insights from each sprint

When it comes to improving system quality, Kyle recommends treating it like traditional software engineering: "When somebody reports a bug, what's the first thing you do? You write the test, you reproduce the bug." He emphasizes building up representative test sets over time that help you evaluate and improve systematically.
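As a rough illustration of that workflow (not anything specific Kyle described), the sketch below captures a reported failure as an eval case before anyone attempts a fix, so the representative test set grows with every bug report. The JSONL file name and case fields are hypothetical.

```python
# A minimal sketch of "treat a quality bug like a software bug": record the
# failing example as a regression case first, then fix the system until it
# passes. File name and fields are illustrative, not a specific tool's format.

import json
from pathlib import Path

REGRESSION_SET = Path("eval_regression_cases.jsonl")


def record_regression_case(question: str, bad_answer: str, expected: str) -> None:
    """Append a reported failure to the regression set before attempting a fix."""
    case = {
        "question": question,
        "observed_bad_answer": bad_answer,
        "expected_behavior": expected,
    }
    with REGRESSION_SET.open("a") as f:
        f.write(json.dumps(case) + "\n")


if __name__ == "__main__":
    # "Write the test, reproduce the bug": the case now runs in every eval pass.
    record_regression_case(
        question="What does the revenue chart show for Q3?",
        bad_answer="I can't access charts.",
        expected="Summarizes the Q3 revenue trend from the provided data.",
    )
```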

Evals Make Your Product Investable

Kyle leaves us with an interesting insight about the advice he's been giving to VC friends. The best way to cut through all the hype around AI demos (which he notes are often "almost worthless") is to ask teams how they plan to eval:

“If somebody pitches you something related to AI, I would ask them: How are you measuring what's good and bad? … If they come back with an eval that's well principled/well thought through and the results are bad, invest, right? They know how to measure the problem, they'll optimize it, they'll figure it out. But if they have no idea how to even collect the data or how to quantify what's good and what's bad, there's no way in my perspective that it's going to end well.”

Many thanks to Kyle for sharing his experience with us! If you want to hear more conversations like this one, subscribe to Deployed on Spotify, Apple Podcasts, and YouTube.
