Over the past two years, we’ve seen a range of companies tackling accounting with new AI-first solutions. Accounting is hard! And in theory, at least, there’s no room for error. So how do you learn to trust AI to do the work?
One of the companies that’s actually making AI accounting work well in production is Digits, which helps startups automate their accounting and helps accounting firms better serve their customers.
Digits launched in 2018, and raised a Series C in 2022. They’ve been working on AI accounting problems for a while (before LLMs were cool), so it was especially interesting to hear their perspective. They’ve also been quick to adopt LLMs: they’ve been running autonomous agents in production since last January, and they’ve figured out how to solve complex text-to-SQL questions in a chatbot with smart tool use and humans-in-the-loop.
We invited Digits co-founder and CEO, Jeff Seibert, to join us for this first episode of our new podcast, Deployed. In the episode, Jeff shares what’s worked, what hasn’t, and how the team at Digits approached building with AI to create better customer experiences that work at scale. We covered a range of topics useful for anyone building AI products — the show notes below hit a bunch of the highlights if you want to skim quickly.
A few topics really stood out though, and we go deeper on each of these below:
The critical role of data review & dataset curation: How Digits works with row-level data and builds feedback loops into their product to increase AI product quality
Building autonomous agents for finance: How they’ve gotten autonomous agents to be useful, even in a “no room for error” industry
Making Text-to-SQL prompts work better with tools: How they rethought a text-to-SQL chatbot that required accurate math — using custom tools/function calling instead of passing whole database schemas to an LLM
Check out the full show on Spotify, Apple Podcasts, and YouTube, review the show notes below to see the highlights, or read on for more thoughts on these big topics. We'll have more interviews coming soon so consider subscribing!
Show notes: The highlights
Here are the highlights in case you want to skip ahead!
1:37 - Intro to Jeff & Digits: Pioneers in AI-powered accounting
3:08 - Why NOT AI: When the hype doesn't match reality in finance
5:34 - AI success stories at Digits: Matching the right models to different accounting use cases
10:02 - Building AI feedback loops – combining implicit confirmations and explicit customer feedback
11:24 - “Aggregate (metrics) are not actually that helpful… You have to look at the row-level data.”
15:17 - Autonomous agents in finance: Digits' groundbreaking implementation
17:59 - Aiming for 90% accuracy: Why 100% isn’t possible, and how to make 90% ok
20:45 - Practical learnings: Where AI has worked well, and where it hasn’t
26:11 - Making complex Text-to-SQL work better with tools & custom queries
29:00 - Getting AI systems to work reliably: “90% of our AI effort is systems engineering”
30:24 - AI product thinking: Start with the problem to solve
32:45 - Is all the extra effort worth it? Betting on future improvements
35:24 - What’s under the hood? Digits' tech stack & collaboration process (surprise! They use Kotlin & Go, not Python or Node)
38:31 - Final thoughts & advice: Focus on the customer and the problem to solve
The critical role of data review & dataset curation
The theme of manual data review and dataset curation shows up multiple times throughout the show as a key to Digits’ success with AI. As Jeff says, “Aggregate (metrics) are not actually that helpful… You have to look at the row-level data.” It’s something we’ve found to be essential across our customer base, and it’s helpful to hear how the Digits team works in practice.
They spend lots of time looking at row-level data, including:
Three ML engineers recently spent a couple of days hand-labeling 800 rows of data to increase accuracy — and the CEO considered it entirely worthwhile.
They forward every AI generation that gets customer feedback in their app to a team Slack channel, so they can all see it, and then decide whether to include a new row in their testing and benchmarking datasets. Those datasets are core to how they improve on AI quality with confidence month over month.
They’ve built human-in-the-loop workflows with their in-house accounting experts who look at every result when the AI model generates a low confidence score.
They’ve instrumented feedback mechanisms for building up good ground truth datasets, including as a natural result of double-entry bookkeeping. This ground truth is implicitly created as the result of accountants reviewing and finalizing a company’s books each month. Not every business has such a clear, natural source of ground truth data, but it’s helpful to think about how you might create one.
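The routing logic behind these workflows can be sketched as a simple pipeline: auto-accept high-confidence results, send low-confidence ones to expert review, and queue anything that gets customer feedback as a candidate for the benchmark dataset. Here’s a minimal, hypothetical Python sketch — all names and the 0.9 threshold are illustrative, not Digits’ actual implementation (their stack is Kotlin and Go, per the episode):

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    row_id: str
    label: str
    confidence: float

@dataclass
class ReviewPipeline:
    """Illustrative routing: low-confidence results go to human experts,
    and any row with customer feedback is queued for dataset curation."""
    threshold: float = 0.9
    expert_queue: list = field(default_factory=list)
    benchmark_candidates: list = field(default_factory=list)

    def route(self, pred: Prediction) -> str:
        if pred.confidence < self.threshold:
            # An in-house expert reviews every low-confidence result.
            self.expert_queue.append(pred)
            return "needs_review"
        return "auto_accepted"

    def on_customer_feedback(self, pred: Prediction, feedback: str) -> None:
        # In Digits' workflow this is also posted to a team Slack channel;
        # here we just flag the row as a candidate for the benchmark dataset.
        self.benchmark_candidates.append((pred, feedback))

pipeline = ReviewPipeline(threshold=0.9)
print(pipeline.route(Prediction("txn-1", "Software", 0.97)))  # auto_accepted
print(pipeline.route(Prediction("txn-2", "Travel", 0.42)))    # needs_review
```

The interesting design choice is that the feedback hook and the review queue both feed the same curated datasets, which is what lets the team measure quality improvements month over month.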
Building autonomous agents for finance
A lot of people have told us it’s scary to think about turning on autonomous agents in production. That fear increases for those working in high-stakes industries like finance, legal tech, and government.
It was surprising then to hear how early the Digits team was in making bets on autonomous agents – they were in production as of January 2024. How did they make it happen?
Jeff tells the story in much more detail, but in short, they found a use case where agents were safe to launch: Categorizing the type of transaction for unknown vendor names/transaction strings. Imagine a credit card statement where you see a vendor name but don’t know what it is. What do you do? You Google it. The agent does the same thing and suggests a category – which Digits’ expert accountants can confirm.
It seems to have worked particularly well because:
The agent naturally mimicked human workflows – humans can easily understand what happened to generate a result.
The outputs didn’t need to be 100% correct, because the product design still put a human in the loop to check them.
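In code, that workflow has a simple shape: search for the unknown vendor, let the model suggest a category, then hand the suggestion to a human for confirmation. The sketch below is purely illustrative — the function names are invented, and the search and model calls are stubbed out with plain lambdas rather than real APIs:

```python
from typing import Callable, Optional

def categorize_vendor(
    vendor_string: str,
    web_search: Callable[[str], str],
    classify: Callable[[str, str], tuple],
    confirm: Callable[[str, str], bool],
) -> Optional[str]:
    """Mimic the human workflow: look up an unknown vendor string,
    let the model suggest a category, and require expert confirmation."""
    context = web_search(vendor_string)          # what would a human do? Google it.
    category, confidence = classify(vendor_string, context)
    if confirm(vendor_string, category):         # human in the loop: expert confirms
        return category
    return None                                  # rejected suggestions go back for review

# Stub implementations, for illustration only:
suggestion = categorize_vendor(
    "SQ *BLUE BOTTLE",
    web_search=lambda q: "Blue Bottle Coffee is a coffee roaster and cafe chain.",
    classify=lambda v, ctx: ("Meals & Entertainment", 0.88),
    confirm=lambda v, cat: True,
)
print(suggestion)  # Meals & Entertainment
```

Because each step maps to something a human would do anyway, an expert reviewing the output can follow exactly how the agent got there — which is a big part of why this use case was safe to ship.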
Making Text-to-SQL prompts work better with tools
Another area that stood out was the Digits team’s approach to text-to-SQL in their chatbot. We’ve seen many Freeplay customers attempt text-to-SQL for a variety of use cases, but they generally revolve around pulling metrics or answering other questions about a database.
As Jeff points out, in their case those queries almost always come with an expectation for the LLM to do some math on the results, and today’s LLMs aren’t great at that. They’re additionally not great at making sense of large and complex database schemas to construct the right query.
So the Digits team rearchitected the system and flipped the model on its head. They identified their most common types of queries, turned those into GraphQL queries that talk directly to their financial modeling engine, and then exposed those GraphQL queries to the LLM as tools. We’ve seen this approach work before, and it has a couple of benefits:
Smaller prompts! No more passing the full database schema.
Increased accuracy. LLMs tend to be much better at picking the right tool to use, vs. constructing a relevant SQL query from scratch.
As a bonus, they built a human-in-the-loop fallback, so when customers ask a question that isn’t addressed by an existing GraphQL query/tool, the chatbot pings a human expert to engage and respond.
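One way to picture the architecture: the app registers a handful of vetted queries as named tools, the LLM only picks a tool and fills in parameters, and any question with no matching tool falls through to a human expert. This hypothetical Python sketch invents the tool names and GraphQL strings for illustration — it shows the dispatch shape, not Digits’ actual code:

```python
def run_graphql(query: str, params: dict) -> dict:
    # Stand-in for a real GraphQL call to the financial modeling engine.
    return {"query": query, "params": params}

# Each tool wraps one pre-built, vetted query, so the LLM never writes
# SQL or GraphQL itself -- it only selects a tool and supplies parameters.
TOOLS = {
    "revenue_by_month": lambda params: run_graphql(
        "query($m: String!) { revenue(month: $m) { total } }", params
    ),
    "top_vendors": lambda params: run_graphql(
        "query($n: Int!) { vendors(top: $n) { name spend } }", params
    ),
}

def handle_question(tool_name: str, params: dict) -> dict:
    """Dispatch the tool the LLM selected; unmatched requests escalate to a human."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        return {"status": "escalated_to_human"}  # the human-in-the-loop fallback
    return tool(params)

print(handle_question("revenue_by_month", {"m": "2024-01"}))
print(handle_question("forecast_cash_flow", {}))  # no matching tool -> human expert
```

The prompt only needs to describe the small tool catalog rather than the full database schema, and the math happens inside the financial engine instead of inside the LLM — which is exactly where the accuracy gains come from.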
——
Every conversation with Jeff is a good one; he's always enthusiastic about what he's building, and it shows here. The full episode has much more interesting content to dig into that you won't want to miss.
Give the episode a listen and let us know what you think. As a reminder, you can subscribe on your favorite podcast service to keep up on the next interviews like this: Spotify, Apple Podcasts, and YouTube.
And finally, a huge thanks to Jeff for taking the time to talk with us and share his learnings! If you’re in need of a better accounting tool, check out Digits. 🔥