Prompt Engineering for Product Managers, Part 1: Composing Prompts

Jul 5, 2023

Are you a product manager building generative AI-powered features in your products? Does it feel overwhelming trying to get a large language model like GPT-4 to do what you want consistently, and be able to trust the quality of the experience you’re delivering to customers?

You’re not alone.

Over the past 6 months there’s been explosion of product teams building features with large language models (LLMs). We’ve talked to dozens of them — from startups to large companies — and it’s clear that many teams are struggling to get LLMs to do what they want reliably in production environments. People are learning a new technology, new tactics, building new processes, and discovering a seemingly never-ending set of edge cases and real-world surprises as they go (both from customer use, and from idiosyncrasies in the underlying models).

Product managers need to ramp up quickly. Many are getting drafted in to prompt engineering, especially once initial integrations are set up by engineers. They’re also being asked to make prioritization decisions about how much to invest across their teams in optimization, testing, experimentation, etc. Most of the PMs we’ve talked to feel like they’ve had to learn the hard way what tactics to use & what their team process should look like, and even then they can feel uncertain.

This post is the first in a series for these product managers. We’ll be sharing lessons learned & experiences from your peers to that will summarize some of the basics & good practices learned through experience so PMs can ramp up faster and be more confident shipping features with LLMs.

This post will focus on the first part of the process: crafting prompts. We’ll cover other aspects like testing & evaluation practices in the future.

First, what do we mean by “prompt engineering?”

It’s NOT just about coming up with creative phrasing to coax a model to do what you want. This is a common misconception.

The definition is evolving. When we talk about it in this post, “prompt engineering” means the full set of practices you need to get desired outputs from a model — from crafting the right inputs (including dynamic inputs from other sources), to evaluation & iteration over time.

Beyond simply writing out the instructions & examples you want to send to a model, prompt engineering for your product will also include actual software engineering needed to stitch in the right inputs from your customer experience, inject context from other data sources and consistently format responses in a usable way. It also includes picking the right model, tuning request parameters for the model, and building up the right combination of test cases & evaluation practices to give you the confidence your new feature will work well consistently for your customers. There are also a range of advanced techniques to explore for generating better-quality outputs, like “Chain-of-Thought” and “Self-consistency”.

Prompt engineering always involves iteration on prompts in 100% of cases we’ve encountered — it’s never once-and-done. What follows then are pointers to help guide your initial experimentation.

Getting Started: Plan ahead

When you use an LLM directly like with ChatGPT, you simply write out what you want to ask of the model. When you’re building with LLMs in your product, you need to account for a few extra concepts in the prompt design:

  • What will the input be from the customer? Will they ask a question? Enter some value into a form? Just click a button, and then you’ll pass content in behind the scenes? If you’re putting a prompt in your product, you’ll likely have some variable input inject each time.

  • What does a good response look like? Think about all the dimensions — tone, format, substance, etc. Ideally you can write out a few ideal examples, or you have real samples that were generated without AI in the past. You’ll discover more as you build & test. Keep track of them! Use better/different ones to test & improve in the future.

  • What other data is needed to generate a good answer? LLMs don’t know about your product & business data, so you may need to pass in additional context. (e.g. help center content & order history to answer customer support questions) A lot of people assume the path here is “fine tuning” a model, but at least for now you’ll likely get much better results cheaper & faster by injecting context into the prompt itself and using default foundational models. (A practical example from OpenAI here)

  • What output will developers expect? The output of your prompt has to be compatible with what your application is expecting. You’ll want to define a simple API contract with developers that the LLM can reliably generate, such as a JSON or YAML schema, or even just an agreement to render the response content directly.

Drafting a Prompt: The Basics

The following basic components often go into prompts that are used in product features. Each of the following elements can be combined to create great prompts.

  1. Instructions: The guidelines that tell the LLM what to do. They should be clear, concise, and specific. They can include “do’s” (basics like “You’re a bot that summarizes video transcripts”), as well as “don’ts” (like “don’t say you’re a bot, don’t answer billing questions”). Note that the model won’t always respect your instructions verbatim.

  2. Input Variables: You’ll need placeholders in the prompt for your code to fill in information from your product experience. At a minimum they’ll include customer inputs (e.g. the text value someone enters in a form in your application to trigger an LLM response), and can also included other inputs from your code.

  3. Context: In most cases the LLM doesn’t know about your business or customer data, and you need to give it the context it needs to respond. The best options here will involve working with your engineering team to inject relevant context dynamically from another data store, likely an embeddings database or a search index. (aka “retrieval models” or “in-context learning”)

  4. Examples (aka “Few-Shot” prompting): Not essential, but it can be very helpful to include examples that show the LLM what to do. Few-shot examples can work better sometimes than your instructions, and they can also be used alone. Mature applications inject relevant few-shot examples dynamically.

  5. Output Formatting: A crucial piece of most prompts in production software is specifying how the output should be formatted, so that it can be used by other code. You’ll quickly find that LLMs don’t always stick to the requested output format. You’ll have to be prepared for this with your engineering partners — perhaps by being tolerant of different formats, making re-try requests, or using advanced features like OpenAI’s functions.

Tip: If you read other blog posts about prompt engineering, you’ll often see concepts like “zero shot” (instructions only) and “few shot” (examples only) discussed like they’re opposites. They’re not mutually exclusive! In practice, many of the teams we’ve talked to end up combining these elements together to get the results they want.

An example "chat-style" prompt

An example prompt for an OpenAI chat completion for a service that summarizes basic info out of video transcripts (injected as both an input variable and as context using {{transcript}} below):

Want to go deeper on advanced techniques? Check out Lillian Weng’s great Prompt Engineering post or the Prompt Engineering Guide.

Models & Request Parameters

Once you’ve got a prompt, you then need to decide where to send it and how to tune your requests. Different models have obviously different characteristics and tradeoffs, including quality of responses, latency & cost. But some differences are more subtle, and specific to your prompt. For instance, Anthropic’s approach to “constitutional AI” and their strong RLHF practices might generate higher-quality results in some cases, but also lead to it to defer responding more frequently.

Model-specific settings or request parameters can also radically change the responses to a prompt from the same model.

You’ll need to experiment a bit and explore these trade-offs to choose the right model and settings for your needs. At a minimum, we’d suggest testing out a couple different models to get a feel for differences.

Want to quickly try out a bunch of different models and parameters? Try out nat.dev.

Which providers are most popular today?

We’ve surveyed everyone who’s signed up for our waitlist on what they’re using in production, and a significant minority — just 32% — report using only one model provider in production. Whether for quality reasons, redundancy, latency or cost tradeoffs, or even business terms, most production software teams are finding it helpful to use more than one model at different times.

We found the following self-reported distribution among people using LLMs in their software products:

  • 95% use OpenAI

  • 39% use open source models (collectively)

  • 38% use Google

  • 24% use Anthropic (still waitlisted, growing quickly)

  • 10% use Cohere

  • … plus single-digit percentages for a few other providers

Different settings / request parameters

In addition to the models themselves, there’s a range of request parameters that can be used to tune responses. The most important ones for most product teams?

  • Temperature: Technically a control of randomness, it’s also a proxy for “creativity.” The higher the number, the more random (creative?) the response. This also means it’s harder to get a reliable output format. Since most businesses are looking to tame their LLMs and get reliable and consistent output, the significant majority we’ve seen (~90%) set temperature=0.

  • Max Tokens: Limit (or expand) the length of a reply to your prompt. What’s a token, one might ask? As a rough rule of thumb, think about ~3/4 to 2 tokens per word. (Calculations vary across providers.)

A helpful overview of other parameters like stop words, top-p and top-k from Nvidia here.

Formatting Considerations

  • Chat vs. text formatting: Some LLM APIs including OpenAI’s for GPT-3.5 and GPT-4 require you to format prompts like you’re in a chat conversation, even if your UX isn’t chat-like (see our example prompt above). The general idea is that the single “prompt” you send to the API will be a combination of scripted “messages” that work together. This can be a little awkward! More in-depth examples and suggestions from OpenAI are here.

  • Response Formatting: Keep in mind that the outputs from running your prompt will need to be processed by your application code. If formatting matters, you’ll need to include instructions that tell the LLM how to format a response — and work with your developers to make sure that format can be parsed consistently.

Top 6 Lessons Learned

Finally, what have we found in practice? Here are 6 helpful tips we’ve observed along the way that could help make your life easier — and your product better.

  1. Show, Don’t Tell the LLM: Include few-shot examples in your prompts. Include other instructions too, but with examples. Research and anecdotal evidence supports it — results are better. The most advanced teams dynamically inject few-shot examples that relate to each specific completion.

  2. Version your prompts: You’ll likely iterate on your prompt over time — filling in new instructions to handle edge cases, adjusting few-shot examples, tuning the context provided, etc. Be prepared to save different versions along the way so that you can easily revert if you make changes you don’t like, or want to compare a few options later.

  3. Save all the interesting outputs: You’ll need both good and bad examples as test cases to make sure you’re keeping your quality bar high & addressing failure cases, and you’ll need to know the version of the prompt that generated them. Lots of PMs discover too late that they’ll need these test cases later.

  4. Pick the model for your prompt: The common progression we see is from OpenAI to multiple providers, especially Anthropic, Google & Cohere. There are cost, latency & redundancy benefits from multiple providers, but we frequently hear that some models are just better than others for each prompt. It’s common to find teams using 2-3 providers in production, and some use more.

  5. Keep an eye on production: LLMs aren’t good for “set it and forget it” features. Edge cases emerge for customers, models change (aka “drift”), latencies can spike, and result quality can change drastically as a result. Figure out how to monitor for quality, latency, cost and errors so you

  6. Design for a feedback loop: If this is your first time building with ML tech, you’ll want to think about feedback loops you can create in your customer experience. These might be active/explicit (e.g. thumbs up/down with a comment, like ChatGPT offers), or passive/implicit (e.g. tracking an event for when a customer makes use of an LLM-generated output, or when they request edits). Review a sample from each bucket of data on a regular basis to spot ways to improve.

Freeplay is building solutions to help with each of those recommendations. Interested? Sign up for our waitlist here.

Careers

© 228 Labs Inc. 2024