How Help Scout Achieved 75% LLM Cost Savings & Ships AI Features Faster with Freeplay

Feb 28, 2024

Summary

After just three months, Help Scout was able to cut costs and accelerate development velocity across multiple production LLM features by using Freeplay.

Here’s the tl;dr:

Challenge: In 2023, Help Scout quickly launched new LLM features that were working well for early customers, but they needed better ways to optimize them for both quality and costs. They were sending millions of requests to OpenAI, so the costs were adding up and the stakes to get it right were rising. But the process of iterating on prompts and models was manual and tedious. They needed a better way that they could scale across their team.

Solution: Help Scout adopted Freeplay to streamline their process for working with LLMs across their product development teams — enabling faster prototyping, testing, deployment, and optimization of the different AI features in their product. Each team now uses Freeplay both to monitor and improve existing features, and to prototype and take new features to production.

Results: Freeplay has already saved Help Scout’s product development team significant time and helped them cut costs. For example, they made a key model switch a month sooner than planned by using Freeplay to test and gain confidence, which led to 75% cost savings.

“The time we’re saving right now from using Freeplay is invaluable. It’s the first time in a long time we’ve released an LLM feature a month ahead of time.”

  • Luis Morales, VP of Engineering

Intro

Help Scout defines what it looks like to be customer-centric. For the last 12+ years, they’ve been building an all-in-one platform for delightful conversations with customers. More than 12,000 businesses like Mixmax, Buffer, Spindrift, and Compass use Help Scout to manage customer service interactions — from email & chat support to self-serve help center content. 

Help Scout wants to help support agents look like superheroes, and they’ve aggressively adopted generative AI over the last year to that end. AI features in Help Scout now let those agents get context about their customers’ needs, draft better responses, and more quickly answer questions. 

You can see more about some of Help Scout’s AI features here. They continue to invest in those and have more uses of generative AI planned for the future.

We talked with several of their engineering leaders about why they chose Freeplay and how it’s been helping them. They’ve used Freeplay to hit their cost goals, to deliver a key feature update a month ahead of schedule, and to get new engineering team members comfortable working with LLMs roughly 10x faster than before. And they’re just getting started.

Read on to find out more.

Why Freeplay?

For more on why Help Scout chose to work with Freeplay, we asked senior software engineer Kevin Church.

“Building useful reporting tools to help understand how a product performs is always challenging. When adding generative AI into a product, the reporting becomes both more complex and more critical,” Kevin said. “The ability to offload our data to Freeplay allows us to spend more time on developing our core products. It also allows us to invite colleagues who are not engineers to help us shape the product more easily.”

For Help Scout, the process of getting their AI features to market involves product managers and their own in-house customer support experts and product specialists. For less technical team members, Freeplay also allows them to easily see any results logged in production (without writing SQL queries), as well as to see their production prompts and experiment with changes.

Kevin continued, “The ability to look through our LLM session logs and quickly glean insights into cost and quality is extremely valuable to us. If we get a report of a problem with one of our responses, we can quickly find the data we need and analyze it further.” 

This is the kind of practical utility most product teams need when building with LLMs. We then asked how that had turned into impact that matters for the business. They had a couple of answers.

Testing, Comparing & Choosing Cheaper Models with Freeplay

One of the most magical opportunities Help Scout has identified for using generative AI is automatically drafting email responses for customer support agents. They acquired Support Agent AI to make this possible, but it wasn’t cheap to run. 

After extensive prototyping and experimenting in-house, they found that the chain of six prompts with RAG inputs needed to generate high-quality email drafts with GPT-4 cost at least 25% more than made sense for the business model. Even early on, they were sending hundreds of thousands of requests to GPT-4 each month, so the costs added up quickly.

When the new, cheaper GPT-4-Turbo came out in November, it was immediately appealing, but the challenge was how to gain the confidence to move over from the older, more expensive model. The same prompts didn’t yield the same results across different model versions. Given the breadth of use of Support Agent AI across industries and languages, it was important to maintain quality for customers while cutting costs.

“Before Freeplay, we had no visibility to what Support Agent AI was doing. We were logging everything to a huge table of records, but had no idea what was actually happening,” said Luis Morales, VP of Engineering. “To test a new model version, we would have had to enable a flag on a group of users, deploy to some users, let it run for a while in production, export and analyze the data, then go review it…”

Freeplay’s Test Runs feature changed this. “When we got access to test GPT-4-Turbo, we could go very easily into Freeplay, create a sample Test List, change the model, and validate the new model version. It took 30 min to run through and test the new GPT-4-Turbo model, and that gave us all the confidence we needed,” Luis continued. “Without that, we could never have tested, validated & made the production change in two days like we did with Freeplay. We’d originally planned for this process to take a whole month.”
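The workflow Luis describes (a fixed list of test inputs, run once per model, then compared) can be sketched in a few lines of Python. This is an illustrative harness only, not Freeplay's SDK or Help Scout's code. The two "models" are hypothetical stub functions standing in for real LLM calls, and the substring check is an assumed placeholder for whatever evaluation a team actually uses.

```python
from typing import Callable, Dict, List

# A "model" here is any callable that maps a prompt string to a response string.
Model = Callable[[str], str]


def run_test_list(model: Model, test_list: List[Dict[str, str]]) -> float:
    """Return the fraction of test cases whose response contains the expected phrase."""
    passed = 0
    for case in test_list:
        response = model(case["input"])
        if case["must_contain"].lower() in response.lower():
            passed += 1
    return passed / len(test_list)


# Hypothetical stand-ins for the current and candidate models.
def current_model(prompt: str) -> str:
    return "You can request a refund from your account page."


def candidate_model(prompt: str) -> str:
    # Deliberately mishandles one phrasing so the comparison shows a difference.
    if "money back" in prompt.lower():
        return "Please contact billing."
    return "You can request a refund from your account page."


# Hypothetical test list; a real one would be sampled from production traffic.
test_list = [
    {"input": "How do I get a refund?", "must_contain": "refund"},
    {"input": "Can I get my money back?", "must_contain": "refund"},
]

baseline_rate = run_test_list(current_model, test_list)
candidate_rate = run_test_list(candidate_model, test_list)
print(f"current: {baseline_rate:.0%}, candidate: {candidate_rate:.0%}")
```

Keeping the test list fixed while swapping only the model is what makes the two pass rates comparable.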

With the change rolled out, Help Scout cut production costs by 75% on average compared with the old model, which gives them the freedom to scale up production use even faster.
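To see how a per-token price change translates into a percentage saving for a multi-prompt chain, here is a rough sketch of the arithmetic. Only the six-prompt chain length comes from the story above. The token counts and per-1K-token prices are placeholders, not Help Scout's real figures or any provider's actual price list. The placeholder prices are set so the new model costs a quarter of the old one, which is what a 75% reduction corresponds to when token usage stays the same.

```python
def chain_cost(calls: int, input_tokens: int, output_tokens: int,
               price_in: float, price_out: float) -> float:
    """Dollar cost of one draft: `calls` prompts, tokens per call, prices per 1K tokens."""
    per_call = (input_tokens / 1000) * price_in + (output_tokens / 1000) * price_out
    return calls * per_call


def savings_pct(old: float, new: float) -> float:
    """Percentage saved when moving from `old` cost to `new` cost."""
    return (old - new) / old * 100


# Placeholder workload: six prompts per draft, 2,000 input and 300 output tokens each.
old_cost = chain_cost(6, 2000, 300, price_in=0.04, price_out=0.08)  # hypothetical prices
new_cost = chain_cost(6, 2000, 300, price_in=0.01, price_out=0.02)  # hypothetical prices

print(f"old: ${old_cost:.3f} per draft, new: ${new_cost:.3f} per draft")
print(f"savings: {savings_pct(old_cost, new_cost):.1f}%")
```

In practice the saving also depends on whether prompts change during the migration, since a reworked prompt can use more or fewer tokens than the original.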

Accelerating the Path to LLM Expertise

Beyond this example, Help Scout has been using Freeplay as a core part of their workflow for building LLM-powered features. It’s helped get their team on the same page and has helped new employees ramp up faster.

“We’re building an entirely new feature that has yet to be released. It’s shipping to customers as a beta in Q1 2024. Using Freeplay has made it easy to iterate on the prompts, then re-deploy quickly,” said Luis. “We had a new engineer who didn’t know much about LLMs, who was responsible for testing our RAG system. He was able to get things figured out in 2 hours using Freeplay. It probably would have taken him 2 days to get up to speed otherwise.”

That’s roughly a 10x faster onboarding and learning process for Help Scout's engineering team to get up to speed working with LLMs in production features. That kind of speed really matters in a larger engineering organization that’s ramping up LLM use across different areas of its product.

Recommendations & Lessons Learned

We asked the Help Scout team what advice they’d share with other product development teams who are on this path, but maybe not in production yet.

Kevin had this to say: “Building products that interact with LLMs presents some unique challenges, and gaining confidence that responses are high quality is difficult to quantify. I would advise to think early and often about how changes can affect quality and how you would measure and approve such changes.”

Borys Zibrov, another senior software engineer, continued on this theme: 

“So everyone does AI now, right? And from what I see a lot of folks are grappling in the dark: they see a problem, some edge case, a prompt injection, whatever, and they try to address those one by one with prompt changes. But then one can't really be sure that by fixing that single edge case we are better overall. Perhaps those small changes and tweaks we did to the prompt broke it at some other place for some other customer?”

He continued: “This is one of the main problems we're trying to solve at Help Scout: how can we improve and iterate with confidence? Collaboration with Freeplay is one of the most important decisions we took on that path. Freeplay allows us to collect and compare LLM responses, run human and auto-evaluations, monitor quality metrics and make prompt changes with confidence. I absolutely love how responsive and open to feedback the Freeplay team is and we're seeing results already.”
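The concern Borys raises, that a prompt tweak fixing one edge case may quietly break another, comes down to comparing per-case results before and after a change. A minimal sketch of that comparison follows. It is not Freeplay's API. The case names and pass/fail values are made up for illustration, and the function assumes both runs cover the same set of cases.

```python
from typing import Dict, List


def diff_results(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify test cases by how their pass/fail status changed between two runs.

    Assumes `before` and `after` contain the same case names.
    """
    fixed = sorted(c for c in before if not before[c] and after[c])
    broken = sorted(c for c in before if before[c] and not after[c])
    return {"fixed": fixed, "broken": broken}


# Hypothetical evaluation results for the old prompt and the edited prompt.
before = {"refund_request": True, "prompt_injection": False, "german_reply": True}
after = {"refund_request": True, "prompt_injection": True, "german_reply": False}

report = diff_results(before, after)
print("fixed:", report["fixed"])
print("broken:", report["broken"])
```

Looking at the "broken" list, rather than only the case that prompted the change, is what shows whether the edit made things better overall.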

—————

Thanks to Luis, Kevin, and Borys for sharing their experience! You can check out everything Help Scout is building with generative AI at https://www.helpscout.com/ai-features/

Interested in learning more about how Freeplay can help you? We’d love to talk. You can reach us here.

© 228 Labs Inc. 2024