Zero-shot prompting for the Flan-T5 foundation model in Amazon SageMaker JumpStart Vivek Gangasani AWS Machine Learning Blog

The size and complexity of large language models (LLMs) have exploded in the last few years. LLMs have demonstrated remarkable capabilities in learning the semantics of natural language and producing human-like responses. Many recent LLMs are fine-tuned with a powerful technique called instruction tuning, which helps the model perform new tasks or generate responses to novel prompts without prompt-specific fine-tuning. An instruction-tuned model uses its understanding of related tasks or concepts to generate predictions to novel prompts. Because this technique doesn’t involve updating model weights, it avoids the time-consuming and computationally expensive process required to fine-tune a model for a new, previously unseen task.

In this post, we show how you can access and deploy an instruction-tuned Flan T5 model from Amazon SageMaker Jumpstart. We also demonstrate how you can engineer prompts for Flan-T5 models to perform various natural language processing (NLP) tasks. Furthermore, these tasks can be performed with zero-shot learning, where a well-engineered prompt can guide the model towards desired results. For example, consider providing a multiple-choice question and asking the model to return the appropriate answer from the available choices. We cover prompts for the following NLP tasks:

Text summarization
Common sense reasoning
Question answering
Sentiment classification
Translation
Pronoun resolution
Text generation based on article
Imaginary article based on title

Code for all the steps in this demo is available in the following notebook.

JumpStart is the machine learning (ML) hub of Amazon SageMaker that offers a one-click access to over 350 built-in algorithms; pre-trained models from TensorFlow, PyTorch, Hugging Face, and MXNet; and pre-built solution templates. JumpStart also provides pre-trained foundation models like Stability AI’s Stable Diffusion text-to-image model, BLOOM, Cohere’s Generate, Amazon’s AlexaTM and more.

Instruction tuning

Instruction tuning is a technique that involves fine-tuning a language model on a collection of NLP tasks using instructions. In this technique, the model is trained to perform tasks by following textual instructions instead of specific datasets for each task. The model is fine-tuned with a set of input and output examples for each task, allowing the model to generalize to new tasks that it hasn’t been explicitly trained on as long as prompts are provided for the tasks. Instruction tuning helps improve the accuracy and effectiveness of models and is helpful in situations where large datasets aren’t available for specific tasks.

A myriad of instruction tuning research has been performed since 2020, producing a collection of various tasks, templates, and methods. One of the most prominent instruction tuning methods, Finetuning language models (Flan), aggregates these publicly available collections into a Flan Collection to produce fine-tuned models on a wide variety of instructions. In this way, the multi-task Flan models are competitive with the same models independently fine-tuned on each specific task and can generalize beyond the specific instructions seen during training to following instructions in general.

Zero-shot learning

Zero-shot learning in NLP allows a pre-trained LLM to generate responses to tasks that it hasn’t been specifically trained for. In this technique, the model is provided with an input text and a prompt that describes the expected output from the model in natural language. The pre-trained models can use its knowledge to generate coherent and relevant responses even for prompts it hasn’t specifically been trained on. Zero-shot learning can reduce the time and data required while improving efficiency and accuracy of NLP tasks. Zero-shot learning is used in a variety of NLP tasks, such as question answering, summarization, and text generation.

Few-shot learning involves training a model to perform new tasks by providing only a few examples. This is useful where limited labeled data is available for training. Although this post primarily focuses on zero-shot learning, the referenced models are also capable of generating responses to few-shot learning prompts.

Flan-T5 model

A popular encoder-decoder model known as T5 (Text-to-Text Transfer Transformer) is one such model that was subsequently fine-tuned via the Flan method to produce the Flan-T5 family of models. Flan-T5 is an instruction-tuned model and therefore is capable of performing various zero-shot NLP tasks, as well as few-shot in-context learning tasks. With appropriate prompting, it can perform zero-shot NLP tasks such as text summarization, common sense reasoning, natural language inference, question answering, sentence and sentiment classification, translation, and pronoun resolution. The examples provided in this post are generated with the Flan-T5 family.

JumpStart provides convenient deployment of this model family through Amazon SageMaker Studio and the SageMaker SDK. This includes Flan-T5 Small, Flan-T5 Base, Flan-T5 Large, Flan-T5 XL, and Flan-T5 XXL. Furthermore, JumpStart provides three versions of Flan-T5 XXL at different levels of quantization:

Flan-T5 XXL – The full model, loaded in single-precision floating-point format (FP32).
Flan-T5 XXL FP16 – A half-precision floating-point format (FP16) version of the full model. This implementation consumes less GPU memory and performs faster inference than the FP32 version.
Flan-T5 XXL BNB INT8 – An 8-bit quantized version of the full model, loaded onto the GPU context using the accelerate and bitsandbytes libraries. This implementation provides accessibility to this LLM on instances with less compute, such as a single-GPU ml.g5.xlarge instance.

Prompt engineering for zero-shot NLP tasks on Flan-T5 models

Prompt engineering deals with creating high-quality prompts to guide the model towards the desired responses. Prompts need to be designed based on the specific task and dataset being used. The goal here is to provide the model with necessary information to generate high-quality responses while minimizing noise. This could involve keywords, additional contexts, questions, and more. For example, see the following code:

Input with Prompt: Translate this English sentence to Spanish: Cat loves chicken pizza
Model Output: Gato ama la pizza de pollo

A well-designed prompt can make the model more creative and generalized so that it can easily adapt to new tasks. Prompts can also help incorporate domain knowledge on specific tasks and improve interpretability. Prompt engineering can greatly improve the performance of zero-shot and few-shot learning models. Creating high-quality prompts requires careful consideration of the task at hand, as well as a deep understanding of the model’s strengths and limitations.

In the provided example notebook, each task demonstrates at least seven prompt templates and a comprehensive set of parameters to control the model output, such as maximum sequence length, number of return sequences, and number of beams. In addition, the prompt templates used are from the Flan T5 GitHub repository, which consists of many templates used within the Flan Collection. This collection of templates is helpful to explore when you perform your own prompt engineering.

In the following table, the Flan-T5 XXL model is used to generate responses for various zero-shot NLP tasks. The first column shows the task, the second column contains the prompt provided to the model (where the template text is bold and the non-bold text is the example input), and the third column is the response from the model when queried against the prompt.

Take the summarization task for an example: to create a model prompt, you can concatenate the template Briefly summarize this paragraph: with the text example you want to summarize. All tasks in this table used the same payload parameters: max_length=150 to provide an upper limit on the number of response tokens, no_repeat_ngram_size=5 to discourage n-gram repetition, and do_sample=False to disable sampling for repeatability. We discuss available payload parameter options when querying the endpoint in more detail later.

Task
Prompt (template in bold)
Model output

Summarization
Briefly summarize this paragraph: Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition.
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input.
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend’s Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages.
Understand the capabilities of Amazon Comprehend

Common sense reasoning or natural language reasoning
The world cup has kicked off in Los Angeles, United States.nnBased on the paragraph above can we conclude that ”The world cup takes place in United States.”?nn[”yes”, ”no”]
yes

question answering

Answer based on context:nnThe newest and most innovative Kindle yet lets you take notes on millions of books and documents, write lists and journals, and more.

For readers who have always wished they could write in their eBooks, Amazon’s new Kindle lets them do just that. The Kindle Scribe is the first Kindle for reading and writing and allows users to supplement their books and documents with notes, lists, and more.

Here’s everything you need to know about the Kindle Scribe, including frequently asked questions.

The Kindle Scribe makes it easy to read and write like you would on paper

The Kindle Scribe features a 10.2-inch, glare-free screen (the largest of all Kindle devices), crisp 300 ppi resolution, and 35 LED front lights that automatically adjust to your environment. Further personalize your experience with the adjustable warm light, font sizes, line spacing, and more.

It comes with your choice of the Basic Pen or the Premium Pen, which you use to write on the screen like you would on paper. They also attach magnetically to your Kindle and never need to be charged. The Premium Pen includes a dedicated eraser and a customizable shortcut button.

The Kindle Scribe has the most storage options of all Kindle devices: choose from 8 GB, 16 GB, or 32 GB to suit your level of reading and writing.nnwhat are the key features of new Kindle?

10.2-inch, glare-free screen

Sentence or sentiment classification
Review:nThis moive is so great and once again dazzles and delights usnIs this movie review sentence negative or positive?nOPTIONS:n-positive n-negative
positive

Translation
My name is ArthurnnTranslate to German
Mein Name ist Arthur

Pronoun resolution
Allen made his friend, Xin, laugh when he told a funny joke.nnWho is he referring to?nn(A)Allen n(B)Xin
Allen

Title generation based on an article

The newest and most innovative Kindle yet lets you take notes on millions of books and documents, write lists and journals, and more.

Here’s everything you need to know about the Kindle Scribe, including frequently asked questions.

The Kindle Scribe makes it easy to read and write like you would on paper

The Kindle Scribe has the most storage options of all Kindle devices: choose from 8 GB, 16 GB, or 32 GB to suit your level of reading and writing.

nnGive me a good title for the article above.

Everything You Need to Know About the New Kindle Scribe

Imaginary article generation based on a title
Title: ”University has new facility coming up“\nGiven the above title of an imaginary article, imagine the article.n
The University of the Philippines (UP) is set to construct a new building for its College of Medicine and Health Sciences (CMHS) in the campus of the University of Santo Tomas (UST) in Bacolod City. The CMHS building, which will be located at the UST campus, will be the first building constructed by the university in the UST campus. The CMMS building will be the first of its kind in the country, according to UP Vice President for Administration and Finance Dr. Jose L. Alcala. The CMMH building will be the second building constructed by the UP in the UST. The first building, which was completed in 2008, is the UP

Access Flan-T5 instruction-tuned models in SageMaker

JumpStart provides three avenues to get started using these instruction-tuned Flan models: JumpStart foundation models, Studio, and the SageMaker SDK. The following sections illustrate what each of these avenues look like and describe how to access them.

JumpStart foundation models

Developers can use the visual interface of the JumpStart foundation models, accessed via the SageMaker console, to test instruction-tuned Flan models without writing a single line of code. This playground provides an input prompt textbox along with controls for various parameters used during inference. This feature is currently in a gated preview, and you will see Request Access button instead of models if you don’t have access. As seen in the following screenshots, you can access foundation models in the navigation pane of the SageMaker console. Choose View model on the Flan-T5 XL model card to access the user interface.

You can use this flexible user interface to try a demo of the model.

SageMaker Studio

You can also access these models through the JumpStart landing page in Studio. This page lists available end-to-end ML solutions, pre-trained models, and example notebooks.

You can choose a Flan-T5 model card to deploy a model endpoint through the user interface.

After your endpoint is successfully launched, you can launch an example Jupyter notebook that demonstrates how to query that endpoint.

SageMaker Python SDK

Finally, you can programmatically deploy an endpoint through the SageMaker SDK. You will need to specify the model ID of your desired model in the SageMaker model hub and the instance type used for deployment. The model URI, which contains the inference script, and the URI of the Docker container are obtained through the SageMaker SDK. These URIs are provided by JumpStart and can be used to initialize a SageMaker model object for deployment. See the following code:

from sagemaker import image_uris, model_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.session import Session

aws_role = Session().get_caller_identity_arn()
model_id, model_version = “huggingface-text2text-flan-t5-xxl”, “*”
endpoint_name = f”jumpstart-example-{model_id}”
instance_type = “ml.g5.12xlarge”

# Retrieve the inference docker container URI.
deploy_image_uri = image_uris.retrieve(
region=None,
framework=None, # automatically inferred from model_id
image_scope=”inference”,
model_id=model_id,
model_version=model_version,
instance_type=instance_type,
)

# Retrieve the model URI.
model_uri = model_uris.retrieve(
model_id=model_id, model_version=model_version, model_scope=”inference”
)

# Create a SageMaker Model object.
model = Model(
image_uri=deploy_image_uri,
model_data=model_uri,
role=aws_role,
predictor_cls=Predictor,
name=endpoint_name,
)

# Deploy the Model. Provide a predictor_cls to use the SageMaker API for inference.
model_predictor = model.deploy(
initial_instance_count=1,
instance_type=inference_instance_type,
predictor_cls=Predictor,
endpoint_name=endpoint_name,
)

Now that the endpoint is deployed, you can query the endpoint to produce generated text. Consider a summarization task as an example, where you want to produce a summary of the following text:

text = “””Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition.
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input.
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend. Amazon Comprehend’s Dominant language capability can examine documents and determine the dominant language for a far wider selection of languages.”””

You should supply this text within a JSON payload when invoking the endpoint. This JSON payload can include any desired inference parameters that help control the length, sampling strategy, and output token sequence restrictions. While the transformers library defines a full list of available payload parameters, many important payload parameters are defined as follows:

max_length – The model generates text until the output length (which includes the input context length) reaches max_length. If specified, it must be a positive integer.
num_return_sequences – The number of output sequences returned. If specified, it must be a positive integer.
num_beams – The number of beams used in the greedy search. If specified, it must be an integer greater than or equal to num_return_sequences.
no_repeat_ngram_size – The model ensures that a sequence of words of no_repeat_ngram_size is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
temperature – Controls the randomness in the output. Higher temperature results in output sequence with low-probability words, and lower temperature results in output sequence with high-probability words. If temperature equals 0, it results in greedy decoding. If specified, it must be a positive float.
early_stopping – If True, text generation is finished when all beam hypotheses reach the end of stence token. If specified, it must be Boolean.
do_sample – If True, sample the next word as per the likelihood. If specified, it must be Boolean.
top_k – In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.
top_p – In each step of text generation, sample from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0–1.
seed – Fix the randomized state for reproducibility. If specified, it must be an integer.

We can specify any subset of these parameters while invoking an endpoint. Next, we show an example of how to invoke an endpoint with these arguments:

import boto3
import json

def query_endpoint_and_parse_response(payload_dict, endpoint_name):
   encoded_json = json.dumps(payload_dict).encode(“utf-8”)
   client = boto3.client(“runtime.sagemaker”)
   response = client.invoke_endpoint(
   EndpointName=endpoint_name, ContentType=”application/json”, Body=encoded_json
   )
   model_predictions = json.loads(response[“Body”].read())
   return model_predictions[“generated_texts”]

prompt_template = “Write a short summary for this text: {text}”

parameters = {
   “max_length”: 200,
   “num_return_sequences”: 1,
   “top_k”: 50,
   “top_p”: .95,
   “do_sample”: True,
   “early_stopping”: False,
   “num_beams”: 1,
   “no_repeat_ngram_size”: 3,
   “temperature”: 1
}

payload = {“text_inputs”: prompt_template.replace(“{text}”, text), **parameters}
generated_texts = query_endpoint_and_parse_response(payload, endpoint_name)
print(f”For prompt: ‘{prompts}'”)
print(f”Result: {generated_texts}”)

This code block generates an output sequence sample that resembles the following text:

# For prompt: ‘Write a short summary for this text: {text}’
# Result: [‘Amazon Comprehend is a service that uses natural language processing to extract insights about the content of documents. Using Amazon Comprehend, you can find new products and services by understanding the structure of documents, and then use the information to create new offerings.’]

Clean up

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the Studio notebook using the following commands:

model_predictor.delete_model()
model_predictor.delete_endpoint()

Conclusion

In this post, we gave an overview of the benefits of zero-shot learning and described how prompt engineering can improve the performance of instruction-tuned models. We also showed how to easily deploy an instruction-tuned Flan T5 model from JumpStart and provided examples to demonstrate how you can perform different NLP tasks using the deployed Flan T5 model endpoint in SageMaker.

We encourage you to deploy a Flan T5 model from JumpStart and create your own prompts for NLP use cases.

To learn more about JumpStart, check out the following:

Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart
Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart
Run image segmentation with Amazon SageMaker JumpStart
Run text classification with Amazon SageMaker JumpStart using TensorFlow Hub and Hugging Face models
Amazon SageMaker JumpStart models and algorithms now available via API
Incremental training with Amazon SageMaker JumpStart
Transfer learning for TensorFlow object detection models in Amazon SageMaker
Transfer learning for TensorFlow text classification models in Amazon SageMaker
Transfer learning for TensorFlow image classification models in Amazon SageMaker

About the authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with Machine Learning Startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML Inference and low-code ML. He has worked on projects in different domains, including Natural Language Processing and Computer Vision.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.