
Llama 2 is an open-source large language model (LLM) developed by Meta, arguably competitive with closed models like GPT-3.5 and PaLM 2. It comes in three pre-trained and fine-tuned generative text model sizes: 7 billion, 13 billion, and 70 billion parameters.

In this guide, you will explore Llama 2's conversational capabilities by using it with Streamlit to build a chatbot.

Understanding Llama 2: Features and Benefits

How different is Llama 2 from its predecessor large language model, Llama 1?

  • Larger model size: The model is larger, with up to 70 billion parameters. This enables it to learn more intricate associations between words and sentences.
  • Improved conversation abilities: Fine-tuning with Reinforcement Learning from Human Feedback (RLHF) improves its conversational skills, allowing the model to generate humanlike responses even in complex interactions.
  • Quicker inference: It introduces grouped-query attention, a technique that speeds up inference and makes it practical to build responsive applications like chatbots and virtual assistants.
  • More efficient: It uses memory and computational resources more efficiently than its predecessor.
  • Open license: Meta releases Llama 2 under a community license that permits both research and commercial use (with some restrictions for very large companies), so researchers and developers can freely use and modify it.

Llama 2 significantly outperforms its predecessor across these dimensions, making it a potent tool for many applications, such as chatbots, virtual assistants, and natural language comprehension.

Setting Up a Streamlit Environment for Chatbot Development

To start building your application, set up a development environment that isolates your project from the existing projects on your machine.

First, create a virtual environment using Pipenv as follows:

 pipenv shell 

Next, install the necessary libraries to build the chatbot. The python-dotenv package will load the secrets you store in a .env file later.

 pipenv install streamlit replicate python-dotenv 

Streamlit: It is an open-source framework for quickly building machine learning and data science web apps.

Replicate: It is a cloud platform that hosts large open-source machine-learning models and exposes them through an API.

Get Your Llama 2 API Token From Replicate

To get a Replicate API token, you must first register an account on Replicate using your GitHub account.

Replicate only allows sign-in through a GitHub account.

Once you have accessed the dashboard, click the Explore button and search for "Llama 2 chat" to find the llama-2-70b-chat model.

Replicate Dashboard

Click on the llama-2-70b-chat model to view the Llama 2 API endpoints. Click the API button on the llama-2-70b-chat model's navigation bar, then click the Python button on the right side of the page. This will give you access to the API token for Python applications.

llama-2-70b-chat API token page for Python

Copy the REPLICATE_API_TOKEN and store it safely for future use.
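To confirm the token works, you can run a quick one-off test from a Python shell. This is a minimal sketch rather than part of the tutorial code; it assumes the a16z-infra/llama13b-v2-chat endpoint you will configure later and passes the token directly to a replicate.Client:

 import replicate

# Hypothetical sanity check; paste your actual token here
client = replicate.Client(api_token="Paste_Your_Replicate_Token")
output = client.run(
    "a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5",
    input={"prompt": "User: Say hello in one sentence.\nAssistant: "},
)

# The model streams tokens, so join them into a single string
print("".join(output))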

The full source code is available in this GitHub repository.

Building the Chatbot

First, create a Python file called llama_chatbot.py and an env file (.env). You will write your code in llama_chatbot.py and store your secret keys and API tokens in the .env file.

In the llama_chatbot.py file, import the libraries as follows. Because os.environ does not read .env files on its own, the code first loads them with python-dotenv.

 import os

import replicate
import streamlit as st
from dotenv import load_dotenv

# Load the token and endpoints from the .env file into the environment
load_dotenv()

Next, set the global variables for the Replicate token and the Llama 2 model endpoints.

 # Global variables
REPLICATE_API_TOKEN = os.environ.get('REPLICATE_API_TOKEN', default='')

# Define model endpoints as independent variables
LLaMA2_7B_ENDPOINT = os.environ.get('MODEL_ENDPOINT7B', default='')
LLaMA2_13B_ENDPOINT = os.environ.get('MODEL_ENDPOINT13B', default='')
LLaMA2_70B_ENDPOINT = os.environ.get('MODEL_ENDPOINT70B', default='')

In the .env file, add the Replicate token and model endpoints in the following format:

 REPLICATE_API_TOKEN='Paste_Your_Replicate_Token' 
MODEL_ENDPOINT7B='a16z-infra/llama7b-v2-chat:4f0a4744c7295c024a1de15e1a63c880d3da035fa1f49bfd344fe076074c8eea'
MODEL_ENDPOINT13B='a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5'
MODEL_ENDPOINT70B='replicate/llama70b-v2-chat:e951f18578850b652510200860fc4ea62b3b16fac280f83ff32282f87bbd2e48'

Paste your Replicate token and save the .env file.

Designing the Chatbot’s Conversational Flow

Create a pre-prompt that steers the Llama 2 model toward the task you want it to perform. In this case, you want the model to act as an assistant.

 # Set the pre-prompt 
PRE_PROMPT = "You are a helpful assistant. You do not respond as " \
            "'User' or pretend to be 'User'." \
            " You only respond once as Assistant."

Set up the page configuration for your chatbot as follows:

 # Set initial page configuration
st.set_page_config(
   page_title="LLaMA2Chat",
   page_icon=":volleyball:",
   layout="wide"
)

Write a function that initializes and sets up session state variables.

 # Constants
LLaMA2_MODELS = {
   'LLaMA2-7B': LLaMA2_7B_ENDPOINT,
   'LLaMA2-13B': LLaMA2_13B_ENDPOINT,
   'LLaMA2-70B': LLaMA2_70B_ENDPOINT,
}

# Session State Variables
DEFAULT_TEMPERATURE = 0.1
DEFAULT_TOP_P = 0.9
DEFAULT_MAX_SEQ_LEN = 512
DEFAULT_PRE_PROMPT = PRE_PROMPT

 def setup_session_state():
   st.session_state.setdefault('chat_dialogue', [])
   selected_model = st.sidebar.selectbox(
       'Choose a LLaMA2 model:', list(LLaMA2_MODELS.keys()), key='model')
   # Assign directly so that changing the model in the sidebar takes effect
   st.session_state['llm'] = LLaMA2_MODELS.get(
       selected_model, LLaMA2_70B_ENDPOINT)
   st.session_state.setdefault('temperature', DEFAULT_TEMPERATURE)
   st.session_state.setdefault('top_p', DEFAULT_TOP_P)
   st.session_state.setdefault('max_seq_len', DEFAULT_MAX_SEQ_LEN)
   st.session_state.setdefault('pre_prompt', DEFAULT_PRE_PROMPT)

The function sets the essential variables, such as chat_dialogue, pre_prompt, llm, top_p, max_seq_len, and temperature, in the session state. It also assigns the Llama 2 model endpoint based on the user's choice in the sidebar, so switching models takes effect immediately.

Write a function to render the sidebar content of the Streamlit app.

 def render_sidebar():
   st.sidebar.header("LLaMA2 Chatbot")
   st.session_state['temperature'] = st.sidebar.slider('Temperature:',
         min_value=0.01, max_value=5.0, value=DEFAULT_TEMPERATURE, step=0.01)
   st.session_state['top_p'] = st.sidebar.slider('Top P:', min_value=0.01,
         max_value=1.0, value=DEFAULT_TOP_P, step=0.01)
   st.session_state['max_seq_len'] = st.sidebar.slider('Max Sequence Length:',
         min_value=64, max_value=4096, value=DEFAULT_MAX_SEQ_LEN, step=8)
   new_prompt = st.sidebar.text_area(
         'Prompt before the chat starts. Edit here if desired:',
         DEFAULT_PRE_PROMPT, height=60)
   if (new_prompt != DEFAULT_PRE_PROMPT and new_prompt != ""
           and new_prompt is not None):
       st.session_state['pre_prompt'] = new_prompt + "\n"
   else:
       st.session_state['pre_prompt'] = DEFAULT_PRE_PROMPT

The function displays the header and lets users adjust the Llama 2 chatbot's settings.

Write the function that renders the chat history in the main content area of the Streamlit app.

 def render_chat_history():
   for message in st.session_state.chat_dialogue:
       with st.chat_message(message["role"]):
           st.markdown(message["content"])

The function iterates through the chat_dialogue saved in the session state, displaying each message with the corresponding role (user or assistant).
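Each entry in chat_dialogue is a plain dictionary with role and content keys; the message texts below are just illustrative examples:

 {"role": "user", "content": "What is Llama 2?"}
{"role": "assistant", "content": "Llama 2 is an open-source LLM from Meta."}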

Handle the user's input using the function below.

 def handle_user_input():
   user_input = st.chat_input(
       "Type your question here to talk to LLaMA2")
   if user_input:
       st.session_state.chat_dialogue.append(
           {"role": "user", "content": user_input})
       with st.chat_message("user"):
           st.markdown(user_input)

This function presents the user with an input field where they can enter their messages and questions. The message is added to the chat_dialogue in the session state with the user role once the user submits the message.

Write a function that generates responses from the Llama 2 model and displays them in the chat area.

 def generate_assistant_response():
   # Only generate a reply when the latest message came from the user
   if not st.session_state.chat_dialogue or \
           st.session_state.chat_dialogue[-1]["role"] != "user":
       return

   string_dialogue = st.session_state['pre_prompt']
   for dict_message in st.session_state.chat_dialogue:
       speaker = "User" if dict_message["role"] == "user" else "Assistant"
       string_dialogue += f"{speaker}: {dict_message['content']}\n"

   with st.chat_message("assistant"):
       message_placeholder = st.empty()
       full_response = ""
       output = debounce_replicate_run(
           st.session_state['llm'],
           string_dialogue + "Assistant: ",
           st.session_state['max_seq_len'],
           st.session_state['temperature'],
           st.session_state['top_p'],
           REPLICATE_API_TOKEN
       )

       # Stream tokens into the placeholder as they arrive
       for item in output:
           full_response += item
           message_placeholder.markdown(full_response + "▌")

       message_placeholder.markdown(full_response)

   st.session_state.chat_dialogue.append(
       {"role": "assistant", "content": full_response})

The function builds a conversation history string containing both user and assistant messages before calling the debounce_replicate_run function to obtain the assistant's response. It streams the response into the UI as tokens arrive to give a real-time chat experience, and it only runs when the most recent message came from the user, so the app does not generate replies on every rerun.
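For a short exchange, the prompt string handed to the model is simply the pre-prompt followed by alternating turns, ending with the cue for the next assistant reply (the question and answer here are illustrative):

 You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as Assistant.
User: What is Llama 2?
Assistant: Llama 2 is an open-source LLM from Meta.
User: What sizes does it come in?
Assistant: 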

Write the main function responsible for rendering the entire Streamlit app.

 def render_app():
   setup_session_state()
   render_sidebar()
   render_chat_history()
   handle_user_input()
   generate_assistant_response()

It calls all the defined functions in order: setting up the session state, rendering the sidebar and the chat history, handling user input, and generating the assistant's response.

Write a function to invoke the render_app function and start the application when the script is executed.

 def main():
   render_app()

if __name__ == "__main__":
   main()

The chatbot logic is now in place, but the debounce_replicate_run helper it calls does not exist yet.

Handling API Requests

Create a utils.py file in your project directory and add the function below:

 import replicate
import time

# Initialize debounce variables
last_call_time = 0
debounce_interval = 2 # Set the debounce interval (in seconds)


def debounce_replicate_run(llm, prompt, max_len, temperature, top_p,
                           api_token):
    global last_call_time
    print("last call time: ", last_call_time)

    current_time = time.time()
    elapsed_time = current_time - last_call_time

    if elapsed_time < debounce_interval:
        print("Debouncing")
        return "Hello! Your requests are too fast. Please wait a few" \
               " seconds before sending another request."

    last_call_time = time.time()

    # The caller already appends "Assistant: " to the prompt, so pass it
    # through unchanged to avoid duplicating the cue
    client = replicate.Client(api_token=api_token)
    output = client.run(llm, input={
        "prompt": prompt,
        "max_length": max_len,
        "temperature": temperature,
        "top_p": top_p,
        "repetition_penalty": 1,
    })
    return output

The function implements a debounce mechanism that prevents frequent and excessive API queries from a user's input. Because last_call_time is a module-level global, the limit applies across all sessions of the running app rather than per user.

Next, import the debounce_replicate_run function at the top of your llama_chatbot.py file, alongside the other imports:

 from utils import debounce_replicate_run 

Now run the application:

 streamlit run llama_chatbot.py 

Expected output:

The output shows a conversation between the model and a human.

Real-world Applications of Streamlit and Llama 2 Chatbots

Some real-world examples of Llama 2 applications include:

  • Chatbots: You can use it to build chatbots that hold humanlike, real-time conversations on a range of topics.
  • Virtual assistants: You can build assistants that understand and respond to natural-language queries.
  • Language translation: It can translate text between languages.
  • Text summarization: It can condense long documents into short, easy-to-digest summaries (see the sketch after this list).
  • Research: You can apply Llama 2 to research by having it answer questions across a range of topics.
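As an illustration of the summarization use case, here is a minimal sketch that reuses the same Replicate setup as the chatbot. The 13B chat endpoint matches the one configured earlier; the placeholder article text and the prompt wording are assumptions for the example:

 import os

import replicate
from dotenv import load_dotenv

load_dotenv()

# Same 13B chat endpoint used in the chatbot's .env file
MODEL = ("a16z-infra/llama13b-v2-chat:"
         "df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5")

# Placeholder; substitute the long text you want summarized
article = "Paste the long text you want summarized here."

client = replicate.Client(api_token=os.environ["REPLICATE_API_TOKEN"])
output = client.run(MODEL, input={
    "prompt": f"User: Summarize the following text in three sentences:\n"
              f"{article}\nAssistant: ",
    "max_length": 512,
    "temperature": 0.1,
    "top_p": 0.9,
})
print("".join(output))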

The Future of AI

With closed models like GPT-3.5 and GPT-4, it is pretty difficult for small players to build anything of substance using LLMs since accessing the GPT model API can be quite expensive.

Opening up advanced large language models like Llama 2 to the developer community is just the beginning of a new era of AI. It will lead to more creative and innovative implementation of the models in real-world applications, leading to an accelerated race toward achieving Artificial Super Intelligence (ASI).