Deploying Open-Source LLMs As APIs
Open-source LLMs are all the rage, along with concerns about data privacy with closed-source LLM APIs. This tutorial walks through how to deploy your own open-source LLM API using Hugging Face + AWS
Skanda Vivek · 7 min read · Jul 9

Super intelligent AI llama prompted by author, generated using Leonardo.AI
While ChatGPT and GPT-4 have taken the world of AI by storm in the last half year, open-source models are catching up, slowly but surely. There is still a lot of ground to cover to reach OpenAI-level performance; in many cases, ChatGPT and GPT-4 are clear winners on both quality and competitive pricing.

But open-source models will always have value over closed APIs like ChatGPT/GPT-4 for certain business cases. I have spoken with folks in industries like legal, healthcare, and finance who have concerns over data and customer privacy. These companies would rather spend thousands of dollars a month (or more) to run open-source models on their own cloud instances (think AWS, Google Cloud, Azure) than send data through OpenAI APIs that are used by everyone. These folks understand that, right now, open-source LLMs might not perform as well as ChatGPT/GPT-4, and may end up being 10X more expensive due to the costs of training, deploying, and hosting models with tens or hundreds of billions of parameters. But they are willing either to test open-source models on their use cases now, or to wait until open-source models catch up to closed-source models, and don't mind the delay of a few months.

If you are concerned about potential data-sharing issues with closed-source APIs, or just want to understand how to make open-source LLMs available to users, this article is for you.

Let’s dive in!

Deploying Hugging Face LLMs On AWS SageMaker
Over the past few months, there has been a boom in open-source LLMs, with a number of them achieving near-ChatGPT quality. For this example, I'm going to walk through deploying a relatively small LLM, at 7 billion parameters: Dolly, released by Databricks. According to Databricks, Dolly was the world's first open, instruction-tuned LLM.

The training data consisted of 13k demonstrations of instruction-following behavior: basically 13k question-and-answer pairs, spanning tasks as different as summarization, classification, and so on.

Many open-source LLMs are hosted on Hugging Face — a universal platform for open-source ML models. Hugging Face provides many ways to interface with these models.


Deploying Dolly 7B LLM From Hugging Face On AWS SageMaker | Hugging Face
AWS SageMaker is a cloud machine-learning platform for creating, training, and deploying ML models. Hugging Face makes it really easy to deploy open-source models that users have created and uploaded to its platform to AWS SageMaker as an endpoint, as shown below:


Here is the full code taken from Hugging Face and modified slightly:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# get the SageMaker execution role (fall back to a named IAM role
# when running outside of a SageMaker notebook)
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'databricks/dolly-v2-7b',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.8.2"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference (the health-check timeout
# allows up to an hour for the container to start up)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    container_startup_health_check_timeout=3600,
)

# send a test request to the deployed endpoint
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
Create a notebook in AWS SageMaker (I created an ml.m5.xlarge notebook) and run the above code. The important parameters here are instance_type and container_startup_health_check_timeout. The instance_type needs to be a GPU instance with enough memory to host the underlying model, plus extra headroom for inference. The Dolly 7B model is ~14 GB (7 billion parameters at 2 bytes each in fp16), so I chose the ml.g4dn.2xlarge instance (32 GB memory and 8 vCPUs). Once you run this code, the model is automatically deployed as an endpoint.

Creating A Serverless Function Using AWS Lambda For Accepting Requests
Once the endpoint is created, AWS Lambda can be used to call this endpoint in a serverless manner as below:


Creating an AWS Lambda Function | Skanda Vivek

AWS Lambda Function | Skanda Vivek
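
For reference, a minimal sketch of such a handler is below. This is not the exact function from the screenshots; the endpoint name and the "query" field are assumptions, so substitute the endpoint name SageMaker printed when your deployment finished.

import json
import boto3

# SageMaker runtime client for invoking the deployed endpoint
runtime = boto3.client("sagemaker-runtime")

# hypothetical endpoint name; use the one from your own deployment
ENDPOINT_NAME = "huggingface-pytorch-tgi-inference-2023-07-09"

def lambda_handler(event, context):
    # parse the incoming request body (API Gateway proxy integration)
    body = json.loads(event.get("body") or "{}")
    payload = {"inputs": body.get("query", "")}

    # forward the prompt to the SageMaker endpoint
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read().decode("utf-8"))

    return {"statusCode": 200, "body": json.dumps(result)}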
Adding An API Gateway For Posting Requests
Finally, adding an API Gateway in front of the Lambda function ensures that client requests are passed through properly and that responses are returned to the caller.


AWS Lambda + API Gateway Integration | Skanda Vivek
With the API Gateway in place, you can make requests using the standard Python requests package or cURL, as sketched below. Further below is the model's response to asking, "What is the time?"
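
The invoke URL and the "query" field in this sketch are placeholders matching the hypothetical Lambda handler above, not values from the deployed gateway:

import requests

# hypothetical API Gateway invoke URL; replace with your own stage URL
API_URL = "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/predict"

response = requests.post(API_URL, json={"query": "What is the time?"})
print(response.json())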


Dolly LLM API Example | Skanda Vivek
As you can see, it gives a cute (even philosophical) answer above, but might not be the best response. ChatGPT might have said something like “I’m sorry, I don’t have access to real-time information.”

Next, I check the performance of Dolly-7B in answering questions based on the following context:

context = """Transformers is a media franchise produced by
American toy company Hasbro and Japanese toy company Takara Tomy.
It primarily follows the heroic Autobots and the villainous Decepticons,
two alien robot factions at war that can transform into other forms,
such as vehicles and animals. The franchise encompasses toys, animation,
comic books, video games and films. As of 2011, it generated more than ¥2
trillion ($25 billion) in revenue,[1] making it one of the highest-grossing
media franchises of all time."""
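
As a rough sketch of such a request (the exact prompt assembly here is an assumption, reusing the hypothetical API_URL and "query" field from above):

question = "How much revenue has the Transformers franchise generated?"
prompt = f"{context}\n\n{question}"

# send the context-plus-question prompt through the API Gateway
response = requests.post(API_URL, json={"query": prompt})
print(response.json())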

Dolly LLM API Example | Skanda Vivek
As you can see, the model basically just adds to the question and doesn't really answer it. You can see the same issue when I ask a different question; the part that is appended is highlighted in blue. Notice that the LLM is acting as a plain autoregressive model and is confused about the task: it thinks it is doing a completion task, not a question-answering task.


Dolly LLM API Example | Skanda Vivek
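
One possible mitigation (an assumption on my part, based on the prompt format in the databricks/dolly-v2 model card, and not something tested here) is to wrap the question in the instruction template Dolly was trained on, so it treats the input as a task rather than text to complete:

INSTRUCTION_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# wrap the context and question in Dolly's training-time prompt format
prompt = INSTRUCTION_TEMPLATE.format(
    instruction=f"Answer using this context.\n{context}\nWhat factions are at war in Transformers?"
)
response = requests.post(API_URL, json={"query": prompt})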
Open-Source LLM Economics
During my experiments, I encountered some significant costs, stemming almost entirely from SageMaker rather than Lambda or the API Gateway, as shown below:


AWS Costs | Skanda Vivek
This is consistent with a previous article I wrote, which found that for smaller-scale request volumes, the AWS costs of deploying and running inference on LLMs are dominated by SageMaker:

LLM Economics: ChatGPT vs Open-Source
How much does it cost to deploy LLMs like ChatGPT? Are open-source LLMs cheaper to deploy? What are the tradeoffs?
towardsdatascience.com

Below, you can see the detailed costs. The main cost is the ~$1/hr associated with the ml.g4dn.2xlarge instance type; at that rate, an always-on endpoint comes to roughly $700 or more per month. The other ml.m5.2xlarge cost stemmed from the notebook used to create the ml.g4dn.2xlarge instance. The good news is that this cost accrues only while notebooks are open, so make sure to turn off any open notebooks after deploying your SageMaker endpoints!


AWS SageMaker Costs | Skanda Vivek
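
Beyond stopping notebooks, it is also worth tearing down the endpoint itself once you are done experimenting, since the instance bills by the hour whether or not it is serving requests. A minimal sketch using the predictor object from the deployment code earlier:

# delete the model and the endpoint to stop the hourly instance charges
predictor.delete_model()
predictor.delete_endpoint()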
Takeaways
While this tutorial showed the relative ease with which you can deploy open-source LLMs on cloud architectures like AWS, it also highlighted the considerable cloud costs, as well as performance issues with one such model, Dolly-7B. Keep in mind that Dolly-7B is by no means the state of the art in open-source LLMs, although it was easier to deploy on AWS than larger LLMs. AWS restricts which instances and models are available by default, and you need to request a limit increase and provide a valid reason (I guess too many folks ran expensive models, unknowingly racked up significant bills, and refused to pay). I tried deploying other models, like Salesforce/xgen-7b-8k-base, which has some of the best performance among open-source models as of now. However, this deployment failed because container startup exceeded the one-hour timeout.

Overall, I wouldn’t be discouraged by these findings. I fully expect in a few months, cloud support for hosting LLMs to significantly improve. Hosting and fine-tuning smaller language models are quite cheap and easy as I describe in this tutorial. I also expect significant performance gains in open-source LLMs. Recently, open-source LLMs have achieved performance on par and even better than ChatGPT.

Now that you know how to deploy LLMs as an API using AWS, go forth and experiment with open-source LLM APIs and see if they solve your needs!

If you like this post, follow me — I write on topics related to applying state-of-the-art NLP in real-world applications and, more generally, on the intersections between AI and society.

Feel free to connect with me on LinkedIn!

If you are not yet a Medium member and want to support writers like me, feel free to sign-up through my referral link: https://skanda-vivek.medium.com/membership
