Real Talk on Building Coding Agents: A Conversation with Amp's Builder-in-Residence, Ryan Carson

Hear about what makes Amp special, building as a solo founder, the challenge of adopting evals, and more.

In our latest episode of Deployed, we sat down with Ryan Carson, who holds one of the most unique roles in AI: "Builder in Residence" at Amp, the coding agent built by Sourcegraph. You can find Ryan on LinkedIn and X.

Ryan's path to this role is fascinating. After founding and scaling Treehouse (which taught over a million people to code), he took time off, worked on AI devrel for a stint at Intel, and then found himself drawn back to building. He started using AI agents to create his own divorce assistance app called Untangle, inspired by challenges he’d seen some of his family go through. I first sat down with Ryan back in March this year and heard about his solo founder workflow, and I was struck by how thoughtful it was. For all the hype about AI solo founders, Ryan was really doing it and forcing himself to use AI at every step of company building to scale himself.

When Amp's founder Quinn noticed Ryan's engagement with their product, he made an offer Ryan couldn't refuse: join the team, keep building Untangle publicly as a case study, and help make Amp better in the process. The result is a perspective that's rare in AI product development: someone building both as a solo founder using AI tools, and as part of a team creating highly competitive AI coding infrastructure at the same time. Amp is growing fast, and if you haven't tried it yet, check it out!

For our readers especially, this interview also comes at an interesting moment. There's been a lot of debate online recently about whether traditional evaluation approaches work for generative AI products, or whether live A/B testing is more effective. Ryan gives some welcome, honest perspective on his team's struggle to adopt useful evals (even though they want to!).

To us, that debate has felt like a lot of hype, and too simplistic when framed as an either/or argument. In the real world, as with many things, we think “it depends”: the use case really matters. Some applications (like categorization tasks, or narrowly-scoped agents) are well-suited to structured evaluations, while others (like open-ended coding agents, or consumer chatbots) involve so many possibilities that live A/B testing might simply be better. Regardless of your take, the straight talk from Ryan is welcome. 😎

Beyond that, this conversation runs the gamut from other tactical challenges of building coding agents (like model selection!) to the broader implications of AI for society. He offers candid insights into the competitive landscape of coding agents (Cursor, Claude Code, Windsurf, and others), the technical reality of what it takes to build production-grade AI products, and thoughtful reflections on how AI is reshaping who gets to build and create.

Check it out here on Spotify, Apple Podcasts, and YouTube. Some of our favorite highlights are below.

Making Amp Stand Out in a Hyper-Competitive Market

Amp is going head-to-head with some of the most well-funded and capable teams in AI, and Ryan doesn't sugarcoat the challenge:

"All of us who are trying to build agents specific for coding, if we're being honest, we're saying there is no moat, right? We basically call the API. We have custom system prompts, and then there is thousands of hours of what we call elbow grease, right? It's just a grind... Like shaving off this bit of friction for DX, tuning the prompt to do this. It's a total grind." (watch here)

This transparency about the "elbow grease" required to build great developer experience is valuable for anyone building AI products. The technical barriers to entry may be lower than ever, but creating something users love still requires obsessive attention to detail and countless hours of refinement. Ryan also points out the strategic flexibility they have to help customers pick the best models as the landscape evolves (vs. products like Claude Code and Codex).

This mindset about developer experience at Amp fits into a broader trend: the most successful AI products aren't trying to compete on model capabilities alone, but on smart user experience and workflow integration. The best way to compete is simply solving real customer problems better than anyone else.

The Evaluation Challenge: Why Coding Agents Are Hard to Evaluate

One of the most refreshingly honest parts of our conversation was Ryan's discussion of how Amp approaches evaluations. When a new model like GPT-5 is released, the team faces a complex decision about whether to adopt it:

"We do not have a set of evals that we programmatically run to decide if GPT-5 is better than Sonnet-4 yet. That may sound bad and confusing and surprising, but everybody who is in the trenches building these agentic coding tools understands that it is so dynamic and so frustrating to come up with a set of quantitative, qualitative evals that we trust."

The Amp team’s experience illustrates why coding agents are particularly challenging to evaluate: they're given open-ended tasks, the acceptable range of responses is broad, and developer expectations vary significantly based on context.

Ryan talks about how the Amp team can measure things like cost and latency easily, but quality remains elusive: "We talked to all the agent companies and they don't either. So I think that's kind of the dirty secret — all of it is dev vibes."

Defining Quality for AI Products: Externalizing Vibes

Similarly, Ryan talks about the struggle to define "quality" for coding agents. We hear a version of this from a lot of Freeplay customers:

"What is quality, right? If you tell the model to run a bash command and it doesn't do it, that might be great. And if it does do it, it might be great, right? It comes down to very much what the developer expects to happen."

This captures something fundamental about building AI products: for the first time, product builders are having to externalize and articulate the evaluation criteria they use in their own mental models. As Ryan puts it: "It's almost as if the entire human race is now fitting to evals, and all of this has been in our head or various versions of documentation, and we couldn't systematically run these things, right? And now we can."

The examples Ryan gives of potential metrics — distribution of tool calls, error rates, average message count per thread, context window consumption — show the complexity of trying to capture "good developer experience" in measurable terms.
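To make that concrete, here is a minimal sketch of how a team might compute a few of those signals from logged agent threads. This is not Amp's actual tooling; the thread structure and field names are assumptions purely for illustration:

```python
from collections import Counter

# Hypothetical logged threads: each thread is a list of message records with an
# optional tool name, an error flag, and a token count. These field names are
# illustrative assumptions, not any real product's schema.
threads = [
    [
        {"tool": "read_file", "error": False, "tokens": 1200},
        {"tool": "bash", "error": True, "tokens": 800},
        {"tool": None, "error": False, "tokens": 950},
    ],
    [
        {"tool": "edit_file", "error": False, "tokens": 2100},
        {"tool": None, "error": False, "tokens": 400},
    ],
]

messages = [m for t in threads for m in t]

# Distribution of tool calls across all threads
tool_calls = Counter(m["tool"] for m in messages if m["tool"] is not None)

# Error rate across all messages
error_rate = sum(m["error"] for m in messages) / len(messages)

# Average message count per thread
avg_messages_per_thread = len(messages) / len(threads)

# Rough proxy for context window consumption per thread
avg_tokens_per_thread = sum(m["tokens"] for m in messages) / len(threads)

print(tool_calls, error_rate, avg_messages_per_thread, avg_tokens_per_thread)
```

None of these numbers answers "was the agent good?" on its own, which is exactly Ryan's point; they're proxies you watch for drift while the harder quality question stays open.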

This challenge extends beyond coding agents. Many of the teams that start using Freeplay go through a similar process of trying to define what constitutes quality for their specific use cases, and they often discover that their intuitive understanding of "good" is harder to codify than expected. For anyone struggling with this, our view is that discovering the right answer is usually best done bottom-up, through an ongoing process of looking at your data. Read more in this post: Building an LLM Eval Suite That Actually Works in Practice

Practical Advice for Building with AI Agents

Beyond Amp and evals, as a solo founder Ryan has developed systematic approaches to building with AI that other founders and product builders can learn from. His open-source "AI Dev Tasks" framework (which has 5,000 GitHub stars and counting!) breaks down the building process into three structured steps:

  1. Create a PRD: Use voice-to-text to describe what you want to build, then have an AI agent ask clarifying questions and generate a product requirements document.

  2. Generate tasks: Take the PRD and have an agent break it down into atomic, actionable tasks with file specifications.

  3. Iterate on tasks: Execute tasks one at a time, asking for clarification when needed.

As Ryan explains: "Most people don't do or understand or have the discipline to run this process. But when you do, you can build big, big features pretty reliably."
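For illustration only, here is a rough sketch of that three-step loop expressed as code. Ryan's actual framework is a set of prompt files you run through your coding agent rather than a script, and the `ask_agent()` helper below is a made-up stand-in for whatever agent or model API you use:

```python
def ask_agent(prompt: str) -> str:
    """Hypothetical stand-in for a call to your coding agent or model API."""
    raise NotImplementedError("wire this up to your agent of choice")


def build_feature(feature_idea: str) -> None:
    # 1. Create a PRD: describe the feature, let the agent ask clarifying
    #    questions, and produce a product requirements document.
    prd = ask_agent(
        f"Ask me clarifying questions about this feature, then write a PRD:\n{feature_idea}"
    )

    # 2. Generate tasks: break the PRD into small, atomic tasks that name
    #    the files they touch.
    task_list = ask_agent(
        f"Break this PRD into atomic tasks, each listing the files to change:\n{prd}"
    )

    # 3. Iterate on tasks: execute one task at a time, stopping for review
    #    (or clarification) before moving on.
    for task in task_list.splitlines():
        if task.strip():
            ask_agent(f"Complete exactly this one task, then stop for review:\n{task}")
```

The point of the structure isn't the code; it's the discipline of forcing a planning step and a review checkpoint between each unit of work, which is what Ryan credits for being able to ship large features reliably.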

For anyone who wants to go deeper on this part of his workflow, check out his interview with Claire Vo on her How I AI podcast.

———

For anyone building AI products, Ryan's experience offers valuable lessons about the reality of competing in this space, the challenges (and the promise!) of getting evals right, and the systematic approaches that can help teams build more effectively with AI. Most importantly, his honesty about what's working and what isn't provides a grounded perspective in a field often filled with too much hype.

Want to hear more conversations like this? Subscribe to Deployed on Spotify, Apple Podcasts, and YouTube.
