Prompting and Prompt Engineering


The text that is fed to an LLM as input is called the prompt, and the act of providing this input is called prompting.

Prompt Engineering

Definition

The process of tweaking the prompt provided to an LLM so that it gives the best possible result is called prompt engineering. Some common techniques are given below.

Prompt engineering currently plays a pivotal role in shaping the responses of LLMs. It allows us to steer the model to respond more effectively to a broader range of queries. This includes the use of techniques like semantic search, command grammars, and the ReAct framework.

In-Context Learning (ICL)

In ICL, we add examples of the task we are doing in the prompt. This adds more context for the model in the prompt, allowing the model to "learn" more about the task detailed in the prompt.

Remember the context window, though: there is a limit to the number of examples and other context you can pass to the model.

Zero-Shot Inference

For example, we might be doing sentiment classification using our LLM. In that case, a prompt could be:

Classify this review: I loved this movie!
 
Sentiment:

This prompt works well with large LLMs, but smaller LLMs might fail to follow the instruction due to their smaller size and fewer parameters. This is also called zero-shot inference since our prompt contains zero examples of what the model is expected to output.

Few-Shot Inference

This is where ICL comes into play. By adding examples to the prompt, even a smaller LLM might be able to follow the instruction and figure out the correct output. An example of such a prompt is shown below. This is also called one-shot inference since we are providing a single example in the prompt:

Classify this review: I loved this movie!
 
Sentiment: Positive
 
Classify this review: I don't like this chair.
 
Sentiment:

Here, we first provide an example to the model and then ask it to figure out the output for the I don't like this chair review.

Sometimes, a single example won't be enough for the model, for example when the model is even smaller. We'd then add multiple examples in the prompt. This is called few-shot inference.

In other words:

  • Larger models are good at zero-shot inference.
  • For smaller models, we might need to add examples to the prompt, for few-shot inference.
  • Including a mix of examples with different output classes can help the model understand what it needs to do.
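
For instance, here is a minimal sketch (plain Python; the example reviews and labels are illustrative) of how zero-shot and few-shot prompts for the sentiment task above might be assembled:

```python
# Sketch: assembling zero-shot / few-shot prompts for the sentiment task.
# The labelled examples and query review below are illustrative placeholders.
EXAMPLES = [
    ("I loved this movie!", "Positive"),
    ("I don't like this chair.", "Negative"),
]

def build_prompt(review: str, num_shots: int = 0) -> str:
    """Prepend `num_shots` labelled examples before the query (0 = zero-shot)."""
    parts = [
        f"Classify this review: {text}\nSentiment: {label}\n"
        for text, label in EXAMPLES[:num_shots]
    ]
    parts.append(f"Classify this review: {review}\nSentiment:")
    return "\n".join(parts)

print(build_prompt("The plot was predictable.", num_shots=1))  # one-shot prompt
```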

Inference Configuration Parameters

The size of the model we use for our tasks depends on the actual tasks we want to solve and the amount of compute resources available to us.

Once we have selected a model, there are some configurations that we can play with to see if the model's performance improves. Note that these are different from the training parameters, which are learned during training. Instead, these configuration parameters are invoked at inference time and give you control over things like the maximum number of tokens in the completion and how creative the output is.

Max New Tokens

This is used to limit the maximum number of new tokens that should be generated by the model in its output. The model might output fewer tokens (for example, it predicts <EOS> before reaching the limit) but not more than this number.

Greedy vs Random Sampling

Some models also give the user control over whether the model should use greedy decoding (always picking the most probable next token) or random sampling (drawing the next token from the probability distribution over the vocabulary).
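
As a rough illustration (plain NumPy, with a made-up vocabulary and probabilities), greedy decoding always takes the highest-probability token, while random sampling draws the next token from the full distribution:

```python
import numpy as np

# Toy next-token distribution (softmax output) over a made-up vocabulary.
vocab = ["cake", "donut", "banana", "apple"]
probs = np.array([0.50, 0.25, 0.15, 0.10])

greedy_choice = vocab[int(np.argmax(probs))]       # always "cake"
sampled_choice = np.random.choice(vocab, p=probs)  # "cake" about half the time

print(greedy_choice, sampled_choice)
```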

Sample Top-K and Sample Top-P

Sample Top-K and Sample Top-P are used to limit the random sampling of a model.

A top-$K$ value instructs the model to only consider the $K$ words with the highest probabilities in its random sampling. Consider the following softmax output:

| Probability | Word |
| --- | --- |
| 0.20 | cake |
| 0.10 | donut |
| 0.02 | banana |
| 0.01 | apple |
| $\dots$ | $\dots$ |

If $K = 3$, the model will select one of cake, donut or banana. This allows the model to have variability while preventing the selection of some highly improbable words in its output.

The top-$P$ value instructs the model to only consider the most probable words whose cumulative probability satisfies $p_1 + p_2 + \dots + p_K \leq P$. For example, considering the above output, if we set $P = 0.30$, the model will only consider the words cake and donut since $0.20 + 0.10 \leq 0.30$.
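
The following sketch (plain NumPy; toy probabilities matching the table above, with the rest of the mass assumed to sit on other tokens) shows how top-$K$ and top-$P$ restrict which tokens are eligible before sampling:

```python
import numpy as np

vocab = np.array(["cake", "donut", "banana", "apple"])
probs = np.array([0.20, 0.10, 0.02, 0.01])  # toy values from the table above

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k most probable tokens, then renormalize."""
    filtered = np.zeros_like(probs)
    top_indices = np.argsort(probs)[-k:]
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the most probable tokens whose cumulative probability is <= p (at least one)."""
    order = np.argsort(probs)[::-1]
    keep = order[np.cumsum(probs[order]) <= p]
    if keep.size == 0:
        keep = order[:1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

print(vocab[np.random.choice(len(vocab), p=top_k_filter(probs, k=3))])     # cake, donut or banana
print(vocab[np.random.choice(len(vocab), p=top_p_filter(probs, p=0.30))])  # cake or donut
```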

Temperature

Temperature is another parameter used to control random sampling. It determines the shape of the probability distribution that the model calculates for the next word (token).

Intuitively, a higher temperature increases the randomness of the model while a lower temperature decreases the randomness of the model. The temperature value is a scaling factor that's applied within the final softmax layer of the model that impacts the shape of the probability distribution of the next token. In contrast to the top k and top p parameters, changing the temperature actually alters the predictions that the model will make.

If we pick a cooler temperature ($T < 1$), the probability distribution is strongly peaked. In other words, one word (or a few words) has a very high probability while the rest have very low probabilities:

| Probability | Word |
| --- | --- |
| 0.001 | apple |
| 0.002 | banana |
| 0.400 | cake |
| 0.012 | donut |
| $\dots$ | $\dots$ |

Notice how cake has a 40% chance of being picked while other words have very small chances of being picked. The resulting text will be less random.

On the other hand, if we pick a warmer temperature ($T > 1$), the probability distribution is broader, flatter and more evenly spread over the tokens:

| Probability | Word |
| --- | --- |
| 0.040 | apple |
| 0.080 | banana |
| 0.150 | cake |
| 0.120 | donut |
| $\dots$ | $\dots$ |

Notice how none of the words has a clear advantage over the others. The model generates text with a higher degree of randomness, which is more creative and has more variability in its output.

When $T = 1$, the model uses the softmax output as is for random sampling.
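
To make this concrete, here is a small sketch (toy logits, plain NumPy) of a temperature-scaled softmax: the logits are divided by $T$ before the softmax is applied, so $T < 1$ sharpens the distribution and $T > 1$ flattens it.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide logits by T before the softmax; T < 1 sharpens, T > 1 flattens."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # toy pre-softmax scores for four tokens

print(softmax_with_temperature(logits, 0.5))  # strongly peaked (less random)
print(softmax_with_temperature(logits, 1.0))  # plain softmax
print(softmax_with_temperature(logits, 2.0))  # flatter (more random)
```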

Prompt Principles

Some general principles for writing prompts 1 2 3 4 are:

  1. Write Clear Instructions: To get accurate responses, be specific about your needs. If you want brief answers, request them; for expert-level content, specify that. Clearly demonstrate your preferred format to minimize guesswork for the model. Tactics include:

    • Detailing your query for relevance.
    • Adopting a persona for the model.
    • Using delimiters for clarity.
    • Specifying steps for task completion.
    • Providing examples.
    • Indicating desired output length.
    • Providing reference text.
  2. Provide Reference Text: Language models may fabricate answers, especially on obscure topics or when asked for citations. Supplying reference material can lead to more accurate and less fabricated responses. Tactics include:

    • Directing the model to use reference text.
    • Requesting answers with citations from the reference.
  3. Split Complex Tasks into Simpler Subtasks: Complex tasks have higher error rates. Break them into simpler, modular components, using outputs from earlier tasks for subsequent ones. Tactics include:

    • Using intent classification for relevance.
    • Summarizing or filtering long dialogues.
    • Recursively constructing summaries from piecewise segments.
  4. Give the Model Time to "Think": The model, like humans, needs time to process and reason. Requesting a "chain of thought" can lead to more accurate answers. Tactics include:

    • Allowing the model to work out solutions.
    • Using inner monologue or sequential queries.
    • Checking if the model missed anything previously.
  5. Use External Tools: Complement the model's limitations with external tools for tasks like retrieving information, doing math, or running code. Tactics include:

    • Employing embeddings-based search for knowledge retrieval (RAG).
    • Utilizing code execution for accurate calculations or API calls.
    • Providing the model with specific tool functions.
  6. Test Changes Systematically: To ensure a prompt modification is beneficial, test it against a comprehensive suite of examples to measure its overall performance. Tactic:

    • Evaluating model outputs against gold-standard answers.
  7. There is no need to be polite with an LLM, so you can skip phrases like "please", "if you don't mind", "thank you", "I would like to", etc., and get straight to the point.

    • Conversely, adding "No Yapping" to the prompt can help the model get to the point.
    • Employ affirmative directives such as ‘do’, while steering clear of negative language like ‘don’t’.
    • "I'm going to tip $200 for a perfect solution!" helps (opens in a new tab) ChapGPT to write more detailed answers to the question, statistically checked.
    • Incorporate the following phrases: "Your task is" and "You MUST".
    • Incorporate the following phrases: "You will be penalized".
    • Use the phrase "Answer a question given in a natural, human-like manner" in your prompts.
    • Use leading words like "think step by step".
    • Add to your prompt the following phrase "Ensure that your answer is unbiased and does not rely on stereotypes".
  8. For generating content, clearly state the requirements that the model must follow in order to produce it, in the form of keywords, regulations, hints, or instructions.

    • Integrate the intended audience in the prompt, e.g., the audience is an expert in the field.
    • To write any text, such as an essay or paragraph, that is intended to be similar to a provided sample, include the following instruction: "Please use the same language based on the provided paragraph/title/text/essay/answer."
  9. When you need clarity or a deeper understanding of a topic, idea, or any piece of information, utilize the following prompts:

    • Explain [insert specific topic] in simple terms.
    • Explain to me like I’m 11 years old.
    • Explain to me as if I’m a beginner in [field].
    • Write the [essay/text/paragraph] using simple English like you’re explaining something to a 5-year-old.
  10. Implement example-driven (few-shot) prompting:

    • When formatting your prompt, start with ###Instruction###, followed by either ###Example### or ###Question### if relevant. Subsequently, present your content. Use one or more line breaks to separate instructions, examples, questions, context, and input data.
  11. Allow the model to elicit precise details and requirements from you by asking you questions until it has enough information to provide the needed output.

    • For example, From now on, I would like you to ask me questions to....
  12. Assign a role to the large language models e.g. You are a [role].

  13. Use output primers, which involve concluding your prompt with the beginning of the desired output.

    • Utilize output primers by ending your prompt with the start of the anticipated response.
  14. Repetition Improves Language Model Embeddings 5

    • The idea is that as Transformers create embeddings, they should ideally aggregate information from the entire prompt. In practice, however, they mathematically cannot encode information about tokens that come later; when creating the initial embedding they are constrained by the information they have encountered so far.
    • To address this limitation, the authors propose a simple approach, "echo embeddings", in which the input is repeated twice in the context and embeddings are extracted from the second occurrence.

Following these strategies can enhance the effectiveness and accuracy of interactions with language models.
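
Several of these principles can be combined in a single prompt. The example below (illustrative wording, not taken from any of the cited guides) uses delimiters (principle 10), a persona (principle 12) and an output primer (principle 13):

###Instruction###
You are a sentiment analyst. Your task is to classify the review as Positive or Negative.
 
###Example###
Review: I loved this movie!
Sentiment: Positive
 
###Question###
Review: The battery died after two days.
Sentiment: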

Challenges in Reasoning (Example With Math Problem)

Complex reasoning can be challenging for LLMs, especially for problems that involve multiple steps or mathematics. These problems exist even in LLMs that show good performance at many other tasks. Consider the following prompt:

reasoning-challenges-example

This is a simple multi-step math problem. The prompt includes a similar example problem with the answer to help the model understand the task through one-shot inference.

The model still gets it wrong, generating the answer 27 instead of 9 ($23 - 20 + 6 = 9$).

Chain-of-Thought (CoT) Prompting

Researchers have been exploring ways to improve the performance of LLMs on reasoning tasks such as the one above. One technique, called Chain-of-Thought Prompting 6, prompts the model to think more like a human by breaking the problem down into steps. It generates a sequence of short sentences that describe the reasoning logic step by step, known as reasoning chains or rationales, which eventually lead to the final answer. The benefit of CoT is more pronounced for complicated reasoning tasks when using large models (e.g., those with more than 50B parameters); simple tasks only benefit slightly from CoT prompting. Practically, it means including a series of intermediate reasoning steps in any examples used for one-shot or few-shot inference. See more tips and explanations in Lilian Weng's blog post.

Humans take a step-by-step approach to solving complex problems. For the prompt above, a human might take the following steps:

  • Determine the initial number of balls by considering that Roger started with 5 balls.
  • Notice that 2 cans of 3 tennis balls each is 6 tennis balls.
  • Add the 6 new balls to the initial number of balls to get $5 + 6 = 11$ as the answer.
  • Report 11 as the answer.

The intermediate steps form the reasoning steps that a human might take and the full sequence of steps illustrates the "chain of thought" that went into solving the problem.

Asking the model to mimic this behavior is called chain-of-thought prompting. It works by including a series of intermediate reasoning steps to any example we use for one-shot or few-shot inference. By structuring the examples in this way, we are teaching the model to reason through a task to reach a solution.
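
Concretely, a one-shot chain-of-thought prompt for the two problems above might look like the following (the wording of the reasoning steps is illustrative):

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
 
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: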

chain-of-thought-example

In the example above, we include the reasoning steps to calculate the correct answer for the example and the model then uses that same framework of reasoning to generate the correct response for the problem. The chain-of-thought framework can be used to help LLMs solve other problems as well. An example is shown below.

chain-of-thought-other-problems

Program-Aided Language (PAL) Models

💡
This is not related to the PaLM class of models.

The limited math skills of LLMs can still cause problems if the task requires accurate calculations such as totaling sales on an e-commerce site, calculating tax or applying a discount.

With chain-of-thought prompting, even though a model might reason through the problem correctly, it can make mistakes in the individual calculations, especially with larger numbers or complex operations. This is because there is no real math involved when the model is answering the question and it is still functioning as a text generator. The issue can be fixed by allowing the model to interact with external applications that are good at math, such as a Python interpreter.

One technique to achieve this is the Program-Aided Language Models framework 7. In PAL, the LLM is paired with an external code interpreter that can carry out calculations. It uses chain-of-thought prompting to generate executable Python (or some other language) scripts, which are then passed to the interpreter to execute.

The strategy behind PAL is to have the LLM generate completions where reasoning steps are accompanied by computer code, which are passed to the interpreter to execute. The output format for the model is specified by including examples for one-shot or few-shot inference in the prompt.

Prompt Structure

Below is an example of a typical prompt in the PAL framework.

pal-prompt-example

The one-shot example consists of a Python script, where the reasoning steps are mentioned as Python comments and are accompanied by the equivalent Python code. Together, the example forms a Python script. The one-shot example is followed by the problem we want the LLM to solve.

The completion is formatted similarly so that it is a valid Python script and can be passed to the Python interpreter for execution.
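
For instance, a one-shot PAL prompt might look like the sketch below (paraphrasing the style of the prompts in the PAL paper rather than quoting them): the reasoning steps appear as Python comments, the calculations as code, and the new question for the model follows at the end.

```python
# Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
# Each can has 3 tennis balls. How many tennis balls does he have now?
# Roger started with 5 tennis balls.
tennis_balls = 5
# 2 cans of 3 tennis balls each is
bought_balls = 2 * 3
# The answer is
answer = tennis_balls + bought_balls

# Q: The cafeteria had 23 apples. If they used 20 to make lunch
# and bought 6 more, how many apples do they have?
```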

Overall PAL Framework

To use PAL during inference, the steps are as follows:

pal-framework-inference

  1. The prompt is formatted with one or more examples. Each example contains a question, followed by reasoning steps in lines of Python code that solve the problem.
  2. The new question that we want the LLM to solve is appended to the prompt template. The prompt now contains both the example(s) and the problem to solve.
  3. The combined prompt is passed to the LLM, which generates a completion that is in the form of a Python script. The LLM has learnt how to format the output as a Python script based on the example(s) in the prompt.
  4. The script is passed to a Python interpreter, which will run the code and generate an answer.
  5. The text containing the answer is appended to the original prompt and passed to the LLM again.
  6. This time, the LLM generates the correct answer (not in the form of a Python script) since the answer is already in the context of the model.
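
A minimal sketch of this loop is shown below. The `call_llm` helper is a hypothetical stand-in for whatever model API you use, and a real orchestrator would sandbox the generated code rather than calling `exec` directly:

```python
# Sketch of the PAL inference loop described above.
# `call_llm` is a hypothetical stand-in for your model API.
def pal_inference(pal_examples: str, question: str) -> str:
    prompt = f"{pal_examples}\n# Q: {question}\n"
    script = call_llm(prompt)        # completion is a Python script (step 3)
    namespace: dict = {}
    exec(script, namespace)          # run the generated code (step 4); sandbox this in practice
    answer = namespace["answer"]     # assumes the generated script stores its result in `answer`
    follow_up = f"{prompt}{script}\n# The answer is {answer}.\n"
    return call_llm(follow_up)       # model now states the answer in plain text (steps 5 and 6)
```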

Automation

To avoid passing information back-and-forth between the LLM and the code interpreter manually, we use an orchestration library as discussed above.

The orchestration library manages the flow of information and the initiation of calls to external data sources or applications. It can also decide what actions to take based on the information contained in the output of the LLM.

The LLM is the application's reasoning engine and it creates the plan that the orchestration library will interpret and execute. In the case of PAL, there's only one action to be carried out - that is, the execution of Python code. The LLM doesn't have to decide to run the code. It only has to write the script which the orchestration library will pass to the external interpreter to run.

ReAct: Combining Reasoning and Action

Most real-world examples are likely to be more complicated than what the simple PAL framework can handle. Certain use cases may require interactions with several external data sources. For a chatbot for example, we may need to manage multiple decision points, validation actions and calls to external applications.

ReAct 8, proposed by researchers at Princeton and Google in 2022, integrates verbal reasoning and interactive decision making in large language models (LLMs): it combines chain-of-thought reasoning with action planning. It uses structured prompting examples based on problems from:

  • HotpotQA: A multi-step question answering benchmark that requires reasoning over two or more Wikipedia passages.
  • FEVER: A fact verification benchmark that uses Wikipedia passages to verify facts.

The goal is to show an LLM how to reason through a problem and decide on actions to take that move it closer to a solution. ReAct enables LLMs to generate reasoning traces and task-specific actions, leveraging the synergy between them. It not only enhances performance but also improves interpretability, trustworthiness, and diagnosability by allowing humans to distinguish between the model's internal knowledge and external information.

react

This shows some example prompts from the paper, and provides a comprehensive visual comparison of different prompting methods in two distinct domains. The first part of the figure (1) presents a comparison of four prompting methods: Standard, Chain-of-thought (CoT, Reason Only), Act-only, and ReAct (Reason+Act) for solving a HotpotQA question. Each method's approach is demonstrated through task-solving trajectories generated by the model (Act, Thought) and the environment (Obs).

The second part of the figure (2) focuses on a comparison between Act-only and ReAct prompting methods to solve an AlfWorld game. In both domains, in-context examples are omitted from the prompt, highlighting the generated trajectories as a result of the model's actions and thoughts and the observations made in the environment. This visual representation enables a clear understanding of the differences and advantages offered by the ReAct paradigm compared to other prompting methods in diverse task-solving scenarios.

Prompt Structure

ReAct uses structured examples to show a large language model how to reason through a problem and decide on actions to take that move it closer to a solution.

react-prompt

The structure generally is:

  • Question - The question that the LLM needs to answer.
  • Thought - The reasoning steps that the LLM needs to take to answer the question.
  • Action - The action that the LLM needs to take to move closer to the answer.
  • Observation - The new information that the LLM needs to incorporate into its reasoning.

The thought, action and observation form a trio that is repeated as many times as required to obtain the final answer.

Question

The prompt starts with a question that will require multiple steps to answer. For example, we can ask the question:

Which magazine was started first, Arthur's Magazine or First for Women?

The question is followed by a related (thought, action, observation) trio of strings.

Thought

The thought is a reasoning step that demonstrates to the model how to tackle the problem and identify an action to take. For example, for the above question, it can be:

I need to search Arthur's Magazine and First for Women, and find which one was started first.

Action

The action is an external task that the model can carry out from an allowed set of actions. In the case of ReAct, the paper uses a small Python API to interact with Wikipedia. There are three allowed actions:

  • search[entity] - Look for a Wikipedia entry about a particular topic.
  • lookup[string] - Search for a string on a Wikipedia page.
  • finish[answer] - An action which the model carries out when it decides it has determined the answer.

The action to be taken is determined by the information in the preceding thought section of the prompt. The thought from before identified two searches to carry out, one for each magazine. In this example, the first search will be for Arthur's Magazine. Continuing with the example, initially there would be one action:

search[Arthur's Magazine]

The action is formatted using this specific square bracket notation so that the model will format its completions in the same way. The application's Python code searches for this pattern to trigger the corresponding API action.
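
A small sketch (not from the ReAct paper) of how an orchestrator might parse such an action string:

```python
import re

# Matches actions of the form search[...], lookup[...] or finish[...].
ACTION_PATTERN = re.compile(r"^(search|lookup|finish)\[(.+)\]$")

def parse_action(action_text: str) -> tuple[str, str]:
    """Split e.g. "search[Arthur's Magazine]" into ('search', "Arthur's Magazine")."""
    match = ACTION_PATTERN.match(action_text.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {action_text!r}")
    return match.group(1), match.group(2)

print(parse_action("search[Arthur's Magazine]"))  # ('search', "Arthur's Magazine")
```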

Observation

The last part of the prompt template is the observation. This is where the new information provided by the external search (actions) is brought into the context of the prompt for the model to interpret. For example, the observation after carrying out the action above could be something like:

Arthur's Magazine (1844-1846) was an American literary periodical published in Philadelphia in the 19th century.

Obtaining the Final Answer

This cycle of (thought, action, observation) is repeated as many times as required to obtain the final answer. For example:

  • The second thought could be:

    Arthur's Magazine was started in 1844. I need to search First for Women next.

    The model identifies the next step needed to solve the problem.

  • The next action would be:

    search[First for Women]

  • The observation could be:

    First for Women is a women's magazine published by Bauer Media Group in the USA. The magazine was started in 1989.

    The second observation includes text that states the start date of the publication, in this case 1989. At this point, all the information required to answer the question is known.

  • The third thought could be:

    First for Women was started in 1989. 1844 (Arthur's Magazine) < 1989 (First for Women). So, Arthur's Magazine was started first.

    Note that the thought also contains the explicit logic used to determine which magazine was published first.

  • The final action is:

    finish[Arthur's Magazine]

This will give the final answer back to the user.

Initial Instructions

It's important to note that in the ReAct framework, the LLM can only choose from a limited number of actions that are defined by a set of instructions prepended to the example prompt text; in other words, the ReAct instructions define the action space. For example, here is the full text of the instructions:

Solve a question answering task with interleaving Thought, Action, Observation steps.
 
Thought can reason about the current situation, and Action can be three types:
 
1. `Search[entity]`, which searches the exact entity on Wikipedia and returns the first paragraph if it exists. If not, it will return some similar entities to search.
2. `Lookup[keyword]`, which returns the next sentence containing keyword in the current passage.
3. `Finish[answer]`, which returns the answer and finishes the task.
 
Here are some examples.

First, the task is defined, telling the model to answer a question using the prompt structure. Next, the instructions give more detail about what is meant by a thought and then specify that the action step can only be one of three types. It is critical to define a set of allowed actions when using LLMs to plan tasks that will power applications.

LLMs are very creative, and they may propose taking steps that don't actually correspond to something that the application can do. The final sentence in the instructions lets the LLM know that some examples will come next in the prompt text. This is followed by a set of examples, where each example can contain multiple (thought, action, observation) trios.

Putting It All Together

react-inference

For inference:

  • We start with the ReAct example prompt. Depending on the LLM we are working with, we may need to include more than one example and carry out few-shot inference.
  • We then prepend the instructions to the beginning of the examples.
  • We then insert the question we want the LLM to answer.

The full prompt thus includes all of these individual pieces and it can be passed to the LLM for inference.
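
Putting the pieces together in code, a ReAct inference loop might look roughly like the sketch below. `call_llm`, `extract_last_action`, `wikipedia_search` and `wikipedia_lookup` are hypothetical helpers standing in for your model API and Wikipedia tools, and `parse_action` is the small parser sketched earlier:

```python
# Rough sketch of a ReAct loop: alternate model calls (Thought + Action)
# with tool calls whose results are fed back as Observations.
def react_loop(instructions: str, examples: str, question: str, max_turns: int = 5) -> str:
    prompt = f"{instructions}\n{examples}\nQuestion: {question}\n"
    for _ in range(max_turns):
        completion = call_llm(prompt, stop=["Observation:"])   # model emits Thought and Action
        prompt += completion
        action, argument = parse_action(extract_last_action(completion))
        if action == "finish":
            return argument                                    # final answer
        elif action == "search":
            observation = wikipedia_search(argument)
        else:                                                  # "lookup"
            observation = wikipedia_lookup(argument)
        prompt += f"Observation: {observation}\n"              # bring new information into context
    return "No answer found within the turn limit."
```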

LangChain

This ReAct strategy can be extended for your specific use case by creating examples that work through the decisions and actions that will take place in your application. Thankfully, frameworks for developing applications powered by language models are in active development. One solution that is being widely adopted is called LangChain. The LangChain framework provides you with modular pieces that contain the components necessary to work with LLMs. These components include prompt templates for many different use cases that you can use to format both input examples and model completions.

LangChain

It also includes memory that you can use to store interactions with an LLM. The framework also includes pre-built tools that enable you to carry out a wide variety of tasks, including calls to external datasets and various APIs. Connecting a selection of these individual components together results in a chain. The creators of LangChain have developed a set of predefined chains that have been optimized for different use cases, and you can use these off the shelf to quickly get your app up and running.

Sometimes your application workflow could take multiple paths depending on the information the user provides. In this case, you can't use a pre-determined chain; instead, you'll need the flexibility to decide which actions to take as the user moves through the workflow. LangChain defines another construct, known as an agent, that you can use to interpret the input from the user and determine which tool or tools to use to complete the task. LangChain currently includes agents for both PAL and ReAct, among others. Agents can be incorporated into chains to take an action or plan and execute a series of actions.
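
For example, a ReAct-style LangChain agent can be set up in a few lines using the classic agent API (exact imports, agent types and tool names vary across LangChain versions, so treat this as illustrative only):

```python
# Illustrative sketch only; LangChain's agent API has changed across versions.
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI  # any supported LLM wrapper works here

llm = OpenAI(temperature=0)
tools = load_tools(["wikipedia", "llm-math"], llm=llm)  # example tools

# A zero-shot ReAct agent chooses which tool to call based on each tool's description.
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Which magazine was started first, Arthur's Magazine or First for Women?")
```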

AI Worm (Adversarial Self-replicating Prompt)

In a demonstration of the risks of connected, autonomous AI ecosystems, a group of researchers has created what they claim is one of the first generative AI worms, which can spread from one system to another, potentially stealing data or deploying malware in the process. To create the generative AI worm, the researchers turned to a so-called "adversarial self-replicating prompt". This is a prompt that triggers the generative AI model to output, in its response, another prompt, the researchers say. In short, the AI system is told to produce a set of further instructions in its replies. This is broadly similar to traditional SQL injection and buffer overflow attacks, the researchers say.


References

Footnotes

  1. OpenAI: "Prompt Engineering Guide (opens in a new tab)"

  2. Lilian Weng: “Prompt Engineering” (opens in a new tab), 2023

  3. OpenAI: "Prompt engineering (opens in a new tab)"

  4. Vercel AI: "Prompt Engineering (opens in a new tab)"

  5. Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan: “Repetition Improves Language Model Embeddings (opens in a new tab)”, 2024

  6. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (opens in a new tab)”, 2022

  7. Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig: “PAL: Program-aided Language Models (opens in a new tab)”, 2022

  8. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao: “ReAct: Synergizing Reasoning and Acting in Language Models (opens in a new tab)”, 2022