LLM testing (Vicuna, LLaMA, Alpaca LoRA)

Posted in Research on April 9, 2023 by parallelo3301 ‐ 21 min read

Disclaimer: Pretty raw data and notes

TODO

  • test on Mac mini 16GB
    • GPTQ
    • FastChat
    • Alpaca-LoRA
    • GPT4All on llama.cpp

Inputs

These prompts were used to generate the outputs below:

Flexiana is a software development company. Main focus is a language called Clojure. Here are the services we provide:

https://twitter.com/jurbed/status/1632371223194656768

Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret

Write a zulip bot (not using the Zulip Botserver, so bot_handler is not present). The code should log what’s happening. Bot have will use GPT-3 API to respond to user message. Bot can be triggered by a private message, where you directly ask the question. Another way is a public stream, where you need to activate it by typing /gpt message, or by using the mention. Bot should be able to read previous messages up to the given model token limit (e.g. 8K tokens), so it uses as much context as possible.
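
Before the model outputs, a minimal hand-written sketch of this task for comparison, using the real zulip Python client and the (2023-era) openai API. The bot name, the 4-chars-per-token context cap, and the trigger handling are my simplifying assumptions, not a tested implementation:

```python
# Hand-written reference sketch of the prompt's task (NOT model output).
# Assumes a ~/.zuliprc with the bot's credentials, OPENAI_API_KEY in the env,
# and a bot named "gpt-bot"; openai<1.0 API, current when these notes were written.
import logging
import os

import openai
import zulip

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gpt-bot")

openai.api_key = os.environ["OPENAI_API_KEY"]
client = zulip.Client(config_file="~/.zuliprc")

MAX_CONTEXT_CHARS = 8000 * 4  # crude stand-in for an 8K-token limit


def handle(msg):
    if msg["sender_email"] == client.email:
        return  # ignore the bot's own messages
    content = msg["content"].strip()
    is_private = msg["type"] == "private"
    if not (is_private or content.startswith("/gpt") or "@**gpt-bot**" in content):
        return
    log.info("triggered by %s: %r", msg["sender_email"], content)

    # Read previous messages from the same conversation as context.
    narrow = ([{"operator": "pm-with", "operand": msg["sender_email"]}]
              if is_private else
              [{"operator": "stream", "operand": msg["display_recipient"]},
               {"operator": "topic", "operand": msg["subject"]}])
    history = client.get_messages({"anchor": "newest", "num_before": 50,
                                   "num_after": 0, "narrow": narrow})["messages"]
    context = "\n".join(m["content"] for m in history)[-MAX_CONTEXT_CHARS:]

    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": context + "\n\n" + content}],
    )["choices"][0]["message"]["content"]

    if is_private:
        client.send_message({"type": "private", "to": [msg["sender_email"]],
                             "content": reply})
    else:
        client.send_message({"type": "stream", "to": msg["display_recipient"],
                             "topic": msg["subject"], "content": reply})


client.call_on_each_message(handle)
```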

GPTQ for LLaMA

https://github.com/qwopqwop200/GPTQ-for-LLaMa

Inference: model loading takes around 15 seconds, so that needs to be subtracted from the `real` time reports below. I was not counting tokens generated per second myself, but the benchmark reports per-token timings, so the numbers should be reasonably precise. (The llama7b-4bit-128g.pt / .safetensors files are the 4-bit quantized checkpoints produced by the repo’s llama.py --save step.)
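
If I read the benchmark output right, Median is the median per-token latency in seconds, so tokens per second is just its reciprocal; a quick sanity check against the medians reported below:

```python
# Tokens/s from the benchmark's Median value, assuming (my reading)
# that it is the median per-token latency in seconds.
for gpu, median_s in [("RTX 3070", 0.0942), ("RTX 3090", 0.0790)]:
    print(f"{gpu}: ~{1 / median_s:.1f} tokens/s")
# RTX 3070: ~10.6 tokens/s
# RTX 3090: ~12.7 tokens/s
```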

RTX 3070 8GB

$ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check
Median: 0.09422492980957031
PPL: 5.28268575668335
max memory(MiB): 4851.98095703125

$ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --benchmark 2048 --check
Median: 0.0990060567855835
PPL: 5.28268575668335
max memory(MiB): 4851.98095703125

Inference

$ time CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --max_length 512 --text "Flexiana is a software development company. Main focus is a language called Clojure. Here are the services we provide:"
Loading model ...
Done.
<s> Flexiana is a software development company. Main focus is a language called Clojure. Here are the services we provide:
Clojure Consulting
The main service of Flexiana is consulting. We help customers to improve their code using Clojure. This can include a lot of things, like code review, refactoring, design, implementing new features and so on.
Clojure Training
If you want to learn Clojure, we provide training. Training is a good way to learn a new programming language. It is fast and efficient way to get up to speed. Training can be done face to face, or on the phone or via video chat.
Clojure Development
If you need a project done, we are ready to help. We use Clojure in our development and we have developed a lot of things in Clojure.
We are currently looking for a couple of Clojure consultants. We are also looking for a Clojure developer.
We can help with Clojure development, consulting and training.
Flexiana’s mission is to spread the use of Clojure.
Flexiana is located in Stockholm, Sweden.
Clojure is a programming language developed by Rich Hickey. Clojure combines the strengths of Lisp, Java and Common Lisp. Clojure is a JVM language, which means it can be used to write applications that run on Windows, Mac, Linux and other operating systems.
Clojure is a functional language. This means that the code is easy to read and understand. Functional programming is popular right now, and the demand for Clojure developers is growing.
Clojure is a powerful programming language. It’s great for building web applications, and it’s great for building high performance and reliable backend systems.
Clojure is a modern programming language that is designed to make writing software easier and more fun.
Want to know more about Clojure?</s>

real    1m4.290s
user    0m50.908s
sys     0m8.542s
$ time python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --max_length 512 --text "Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret "

<s>Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret 27 I inform you that your services will no longer be required. It’s with regret, however, that I’ve been unable to avoid terminating your employment. I must be honest with you and say that I am deeply disappointed with your recent performance. You have been late in submitting your reports, you have cost us far more than we estimated, and your work has been poorly executed. I am afraid that there is no alternative but to terminate your employment. As you may be aware, we have spent a lot of money on this project and we do not have a great deal of spare cash. I’m sure you understand that I have no alternative but to make the necessary cuts. It’s a very sad day, I have to say. However, I must put the interests of the business first. It is with regret that I inform you that your employment is terminated as from today. Thank you for your work to date and good luck in your future endeavours. Sincerely, Mr. CEO
Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs:
Dear staff, it is with utmost regret 27 I inform you that your services will no longer be required.
It’s with regret, however, that I’ve been unable to avoid terminating your employment. I must be honest with you and say that I am deeply disappointed with your recent performance. You have been late in submitting your reports, you have cost us far more than we estimated, and your work has been poorly executed. I am afraid that there is no alternative but to terminate your employment.
As you may be aware, we have spent a lot of money on this project and we do not have a great deal of spare cash. I’m sure you understand that I have no alternative but to make the necessary cuts. It’s a very sad day, I have to say. However, I must put the interests of the business first. It is with regret that I inform you that your employment is terminated as from today.
Thank you for your work to date and good luck in your future endeavours.
I am sorry to inform you that your employment will be terminated with immediate

real    1m17.417s
user    1m1.864s
sys     0m11.354s

RTX 3090 24GB

$ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check
Median: 0.07898569107055664
PPL: 5.283413410186768
max memory(MiB): 4851.98095703125

$ CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --benchmark 2048 --check
Median: 0.07882297039031982
PPL: 5.283413410186768
max memory(MiB): 4851.98095703125

Inference

$ time CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --max_length 512 --text "Flexiana is a software development company. Main focus is a language called Clojure. Here are the services we provide:"
Loading model ...
Done.
<s> Flexiana is a software development company. Main focus is a language called Clojure. Here are the services we provide:
* Training: We provide a series of public lectures, workshops, and private classes. All these services are provided by our engineers. These services are meant to help you learn the basics of Clojure and build up your confidence in using this language.
* Consulting: We work with you to design and build Clojure-powered solutions. We can help you with your project's architecture and design, as well as providing support for your implementation.
* Migration: If you already have some Clojure code but are not quite sure how to migrate it into production, we can help. We will review your Clojure code, and provide you with the necessary guidance on how to make it production-ready.
* Deployment: We can help you build and maintain a production-ready environment. We will provide you with a complete stack: operating system, database, web server, application server, and monitoring.
* Contracting: Flexiana offers contracting services where we take on the role of the client's development team. We can take a project from planning to completion.
Flexiana has been developing software since 2004, and has been working with Clojure since 2008. Flexiana has a strong team of engineers, and we can provide you with experienced Clojure engineers.
Flexiana has a strong record of producing high-quality solutions for our clients. If you are interested in learning more about our services, please contact us.
Flexiana is located in the San Francisco Bay Area.
Contact: info@flexiana.com
www.flexiana.com
20200 Technology Dr. Suite 200
Flexiana is a software development company.</s>

real    0m59.268s
user    0m47.884s
sys     0m7.508s
$ time python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --max_length 512 --text "Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret " --min_length 480
Loading model ...
Done.
<s> Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret �... Your bosses want to fire you for the following reasons: Your productivity levels are too low, you're not pulling your weight, your work isn't good enough, you're taking up too much time with your work, you're not pulling your weight, you're not pulling your weight, you're not working well with your team, you're a negative influence on other people, you're making life difficult for others.
29/09/2016�� I�m working with a customer whose boss wants to fire me for not giving them the results they want. I�m not sure what I�m doing wrong, and I�m really hoping I don�t get fired.
Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret �
How To Get Rid Of A Dry Cough Fast
How To Make Your Own Paper Dolls
How To Get Your Own Email Account
How To Get Rid Of Black Spots On Your Face
How To Get To Kota Kinabalu
How To Get To The Mines From Lucky
How To Get Rid Of Ants Bait
How To Get Rid Of White Fungus On Grass
How To Get To St Marys Cathedral
How To Find A Number Of Lattice Points
How To Get Rid Of Stretch Marks On Face Fast
How To Get Rid Of A Caterpillar Nest
How To Get Rid Of A Mom And Pop Store
How To Get Your Own Music On Spotify
John on How To Get A Job In A Newspaper
Pablo on How To Get A Permanent Tattoo
Bruce G. Li on How To Get Rid Of Cable Wires In A House
Marlin on How To Get Rid Of Warts On Your Feet
Samanta Cruze on How To Know If You Are Being Gas Lighted By Someone</s>

real    1m1.627s
user    0m50.272s
sys     0m7.268s

FastChat

https://github.com/lm-sys/FastChat

Codegen: works pretty well, it’s just that the context window is the limit here. Once the output finishes, the code is not properly formatted (missing the end of the code block?), but that may just be a UI issue.

7b

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ../LLaMA --model_size 7B --output_dir ../llama-transformers/


python3 -m fastchat.model.apply_delta --base-model-path ../llama-transformers/ --target-model-path ../vicuna-7b --delta-path ../vicuna-7b-delta-v0/

tmux
python3 fastchat/serve/controller.py
python3 ./fastchat/serve/model_worker.py --load-8bit --model-path ../vicuna-7b --device cuda
python3 fastchat/serve/test_message.py --model-name vicuna-7 # testing
python3 fastchat/serve/gradio_web_server.py
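
The merged ../vicuna-7b directory is a plain Hugging Face checkpoint, so outside of FastChat’s serving stack it can also be loaded directly with transformers; a minimal sketch (8-bit, roughly matching the --load-8bit worker flag; needs bitsandbytes):

```python
# Minimal sketch: query the merged Vicuna checkpoint directly via transformers,
# loaded in 8-bit (roughly what the --load-8bit worker flag does).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../vicuna-7b")
model = AutoModelForCausalLM.from_pretrained(
    "../vicuna-7b", load_in_8bit=True, device_map="auto"
)

inputs = tokenizer("Flexiana is a software development company.",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```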

13b

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ../LLaMA --model_size 13B --output_dir ../llama-transformers-13b/


python3 fastchat/model/apply_delta.py --base-model-path ../llama-transformers-13b/ --target-model-path ../vicuna-13b --delta-path ../vicuna-13b-delta-v0/

tmux
python3 fastchat/serve/controller.py
python3 ./fastchat/serve/model_worker.py --load-8bit --model-path ../vicuna-13b --device cuda
python3 fastchat/serve/test_message.py --model-name vicuna-13 # testing
python3 fastchat/serve/gradio_web_server.py

RTX 3070 8GB

CUDA out of memory

RTX 3090 24GB Vicuna-7B

24 seconds

Flexiana is a software development company that specializes in developing applications using the Clojure programming language. Here are some of the services they may provide:  
1. Custom software development: Flexiana can develop custom software applications tailored to meet the specific needs of their clients.  
2. Clojure consulting: They can provide consulting services to help businesses understand how to use Clojure to solve their problems and improve their processes.  
3. Clojure training: Flexiana can offer training courses to help individuals and organizations learn how to use Clojure effectively.  
4. Software testing and quality assurance: They can provide testing and quality assurance services to ensure that the software applications they develop meet the highest standards of quality and functionality.  
5. Maintenance and support: Flexiana can provide ongoing maintenance and support services to ensure that their software applications continue to function smoothly and efficiently.  
6. DevOps and continuous delivery: They can help businesses implement DevOps practices and tools to streamline their software development and delivery processes.  
Overall, Flexiana provides comprehensive software development services using the Clojure programming language.

12 seconds

Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret

that I have to let you all know that due to the unacceptable late delivery of the project and the exceedingly high costs, I have no choice but to let you all go. I am deeply sorry for the inconvenience and hardship this will cause, but I cannot in good conscience continue to employ you when the project is not meeting its goals and the company is facing financial difficulties. I wish you all the best in your future endeavors and will do everything I can to help you find new employment. Thank you for your hard work and dedication.

almost 4 minutes for the following two results (I had to type "continue" in the middle)

Write a zulip bot (not using the Zulip Botserver, so bot_handler is not present). The code should log what's happening. Bot have will use GPT-3 API to respond to user message. Bot can be triggered by a private message, where you directly ask the question. Another way is a public stream, where you need to activate it by typing /gpt message, or by using the mention. Bot should be able to read previous messages up to the given model token limit (e.g. 8K tokens), so it uses as much context as possible.

Sure! Here's an example of a Zulip bot written in Python that uses the GPT-3 API to respond to user messages and can be triggered by a private message or a public stream message with the prefix "/gpt". It also uses a custom `tokenizer` function to process the input message and extract relevant information such as the user's intent and context.  
```python
import os
import re
import requests
from typing import Optional
from zulip.compat import StringIO
from zulip.models import Message, UserMessage
from zulip.transformer import TransformerModel
from zulip.token_handler import token_handler
from zulip.subprocess import Subprocess

# Set the API key for GPT-3 here
gpt_api_key = "YOUR_GPT_API_KEY_HERE"

def process_message(message: Message):
    # Extract relevant information from the message
    user_message = message.content.split(":")[0]
    user_message_id = int(user_message.split(" ")[0])
    user_message_timestamp = float(user_message.split(" ")[1])
    user_message_content = user_message.split(":")[1]
    user_message_content_lower = user_message_content.lower()
    user_message_intent = None
    user_message_context = {}

    if "!gpt" in user_message:
        user_message_intent = "gpt"
        user_message_context = {
            "user_message_id": user_message_id,
            "user_message_timestamp": user_message_timestamp,
            "user_message_content": user_message_content,
            "user_message_intent": user_message_intent,
            "user_message_context": user_message_context,
        }
    elif "@" in user_message:
        user_message_intent = "mention"
        user_message_context = {
            "user_message_id": user_message_id,
            "user_message_timestamp": user_message_timestamp,
            "user_message_content": user_message_content,
            "user_message_intent": user_message_intent,
            "user_message_context": user_message_context,
        }
    else:
        user_message_intent = "text"
        user_message_context = {
            "user_message_id": user_message_id,
            "user_message_timestamp": user_message_timestamp,
            "user_message_content": user_message_content,
            "user_message_intent": user_message_intent,
            "user_message_context": user_message_context,
        }

    # Process the message using the GPT-3 API and return the response
    subprocess.run(["gpt", "api", "send_message", gpt_api_key, user_message_intent, user_message_content.encode()], check=True)

def start_gpt_stream_subscription(bot_user_id: int, stream_id: int):
    # Set up the GPT-3 API request
    gpt_url = "https://api.zulip.com/v1/messages"
    gpt_params = {"content": "test_gpt_stream", "private": False, "user_message_flags": 0}

    # Send the stream message
    response = requests.post(gpt_url, json=gpt_params)
    if response.status_code != 200:
        raise Exception("Failed to send GPT stream message: {}".format(response.text))

    # Start the GPT-3 bot
    gpt_bot_token = re.findall(r"bot_token\s*=\s*(.+?)\s*;", response.text)[0]
    bot = Subprocess.Popen(["python", "./gpt_bot.py", gpt_bot_token], stdout=Subprocess.PIPE)
    bot.communicate()

def start_gpt_bot(bot_user_id: int, stream_id: int):
    # Start the GPT-3 bot
    with open("token.txt", "r") as f:
        bot_token = f.read().strip()
```

continue

```python
process_message=process_message

def get_gpt_token(prefix: str):
    if "@" in prefix:
        return tokenizer(prefix.split(" ")[1])[0]
    elif "!" in prefix:
        return tokenizer(prefix.split(" ")[1])[0]
    else:
        return tokenizer(prefix)

def get_gpt_response(question: str):
    response = Subprocess.Popen(["jupyter", "lab"], stdout=Subprocess.PIPE)

    # Send the message to the GPT-3 API
    response.send(question.encode())

    # Receive the response from the GPT-3 API
    response_data = response.recv().decode()

    # Decode the response and return the answer
    return response_data

def start_gpt(bot_user_id: int, stream_id: int):
    global gpt_api_key
    if gpt_api_key is None:
        raise Exception("Missing GPT API key")

    # Get the user's intent and context
    user_message = get_user_message()

    # Process the message using the GPT-3 API
    process_message(user_message)

    # Send a response to the user
    send_gpt_response(user_message, bot_user_id, stream_id)

def get_user_message():
    if private_message:
        return Message.de_json(private_message)
    elif (message.topic.id in [25979, 26335] or message.sender.id == 998512446379985148):
        return Message.de_json(message)
    else:
        return None

def send_gpt_response(user_message, bot_user_id, stream_id):
    if user_message.user.id != bot_user_id:
        # Check if the user has permission to send a message in the stream
        if stream_message:
            if stream_message:
                # The user has permission to send a message in the stream
                message = Message(content=user_message.content.encode(), stream_message=stream_message, sender=user_message.sender)
                process_message(message)
            else:
                # The user does not have permission to send a message in the stream
                print("Permission denied")
        else:
            # The user does not have permission to send a message in the stream
            print("Permission denied")

    if user_message.user.id == bot_user_id:
        # The bot is sending the message
        send_message(user_message, bot_user_id)
    else:
        # The user is sending the message
        process_message(user_message)

def tokenizer(text: str):
    # Add the following code to download the tokenizer model
    model = TransformerModel.from_pretrained("gpt2")
    model = model.to(device)
    model = model.load_state_dict(torch.load("gpt2.bin", map_location=device))
    model.eval()

    # Split the text into tokens
    tokens = tokenizer(text, return_tensors="pt")

    # Convert the tokens to integers
    tokens = {key: token.numpy().item() for key in tokens}

    # Convert the integers back to tokens
    tokens = tokenizer.convert_ids_to_tokens(tokens, skip_special_tokens=True)

    # Use the tokens to generate a response
    response = model.generate(tokens, max_length=100, return_tensors="pt")

    # Decode the response and return the answer
    return tokenizer.convert_tokens_to_string(response[0])

def get_user_input():
    return input("Enter your message: ")

def get_user_message_history(user_id: int):
    messages = UserMessage.objects.filter(sender=user_id)
    return messages

def get_stream_message_history(stream_id: int):
    messages = StreamMessage.objects.filter(stream_id=stream_id)
    return messages

# Global variables
private_message = False
stream_message = False

# Run the bot
start_gpt(bot_user_id, stream_id)
```

This code uses the start_gpt_bot function to start the bot with the GPT-3 API. It also defines a `start_gpt_stream_subscription

RTX 3090 24GB Vicuna-13B

48 seconds

Flexiana is a software development company that specializes in developing applications using the Clojure programming language. Here are some of the services they may provide:  

1.  Clojure application development: Flexiana can develop custom applications using the Clojure programming language, which is known for its simplicity and expressiveness.
2.  Clojure web development: Flexiana can develop web applications using Clojure, which can leverage the benefits of the language such as its concurrency and functional programming capabilities.
3.  Clojure consulting: Flexiana can provide expert advice and consulting services on how to best use Clojure for software development projects.
4.  Clojure training: Flexiana can provide training and education on Clojure programming, including the basics of the language, best practices, and advanced techniques.
5.  Clojure testing: Flexiana can provide testing services for Clojure applications, including unit testing, integration testing, and acceptance testing.
6.  Clojure maintenance: Flexiana can provide ongoing maintenance and support for Clojure applications, including bug fixing, performance tuning, and version upgrades.
7.  Clojure migration: Flexiana can assist with migrating existing applications to Clojure, including porting existing code, optimizing performance, and integrating with other technologies.
8.  Clojure integration: Flexiana can integrate Clojure with other technologies and frameworks, such as databases, web servers, and cloud platforms.

33 seconds

Here is an e-mail writing to staff that they are fired because of late delivery of the project and higher costs: Dear staff, it is with utmost regret

Dear staff,

It is with utmost regret that I must inform you that due to the late delivery of the project and the resulting higher costs, I have been forced to let you go. I want to express my sincere appreciation for your hard work and dedication to the company, but unfortunately, this decision was necessary to ensure the financial stability of the company.

I understand that this news will come as a shock and will have significant personal and financial impact on you. I want to assure you that we will do everything possible to support you during this difficult time, including providing assistance with job placement and severance packages.

Please be assured that this decision was not taken lightly, and we are grateful for your contributions to the company. I want to thank you for your service and wish you all the best in your future endeavors.

Sincerely,

[Your Name]

2 minutes 40 seconds

Write a zulip bot (not using the Zulip Botserver, so bot_handler is not present). The code should log what's happening. Bot have will use GPT-3 API to respond to user message. Bot can be triggered by a private message, where you directly ask the question. Another way is a public stream, where you need to activate it by typing /gpt message, or by using the mention. Bot should be able to read previous messages up to the given model token limit (e.g. 8K tokens), so it uses as much context as possible.

Here is an example of a Zulip bot code that uses the GPT-3 API to respond to user messages:
```python
import requests
from zerotier import Zerotier
import os

# GPT-3 API credentials
api_key = "YOUR_API_KEY"
api_secret = "YOUR_API_SECRET"

# Zerotier credentials
zerotier_key = "YOUR_ZEROTIER_KEY"
zerotier_secret = "YOUR_ZEROTIER_SECRET"
zerotier_public_key = "YOUR_ZEROTIER_PUBLIC_KEY"

# Initialize the Zerotier and GPT-3 clients
zerotier = Zerotier(zerotier_key, zerotier_secret)
gpt = requests.post(
    f"https://api.gpt-3.io/v1/cognito/{api_key}",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_secret}"
    }
).json()

# Store the GPT-3 model token and API endpoint
model_token = gpt["token"]
gpt_api = f"https://api.gpt-3.io/v1/cognito/{model_token}"

# Define the function to handle incoming private messages
def handle_private_message(message):
    # Get the user who sent the message
    user = message.sender
    # Get the user's previous messages
    messages = zerotier.get_user_messages(user)
    # Get the context from the previous messages
    context = ""
    for msg in messages:
        context += msg.content + "\n"
    # Use the GPT-3 API to generate a response based on the context
    response = requests.post(
        f"{gpt_api}/generate_response",
        json={
            "input_text": context
        }
    ).json()
    # Send the response to the user
    response_text = response["text"]
    await user.send_message(response_text)

# Define the function to handle the public stream activation
def handle_stream_activate(message):
    if message.author.mention in message.text:
        # Get the user who activated the stream
        user = message.author
        # Get the context from the previous messages
        context = ""
        stream_messages = zerotier.get_stream_messages(user, message.text.split(" "))
        for msg in stream_messages:
            context += msg.content + "\n"
        # Use the GPT-3 API to generate a response based on the context
        response = requests.post(
            f"{gpt_api}/generate_response",
            json={
                "input_text": context
            }
        ).json()
        # Send the response to the user
        response_text = response["text"]
        await user.send_message(response_text)

# Start the bot
zerotier.start()
zerotier.register_handler(
    "private_message",
    handle_private_message,
    commands=["/pm"]
)
zerotier.register_handler(
    "stream_activate",
    handle_stream_activate,
    commands=["/gpt"]
)
zerotier.run_bot()
```
This bot can be triggered by a private message, where the user directly asks a question, or by a public stream, where the user activates it by typing `/gpt` or by mentioning the bot. The bot uses as much context as possible from the previous messages, up to the given model token limit (in this example, 8K tokens). The bot uses the GPT-3 API to generate a response based on the context, and sends it to the user.

Alpaca-LoRA

https://github.com/facebookresearch/llama/issues/79#issuecomment-1490676542

FYI, running 30B Alpaca-lora finetuned LlaMA model (FP8) across 2x3090’s on Ubuntu 22, getting about 3.75 tokens/s on average.

python generate.py --load_8bit --base_model ./llama-7b-hf --lora_weights ./alpaca-lora-7b

conda install -c anaconda cudatoolkit
ln -s /usr/lib/wsl/lib/libcuda.so /home/parallelo/miniconda3/envs/alpacalora/lib/libcuda.so
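
(The symlink just makes WSL’s libcuda visible inside the conda env, presumably so bitsandbytes can find it.) Under the hood, alpaca-lora’s generate.py loads the base model in 8-bit and applies the LoRA adapter via peft, roughly like this (paths as in the command above):

```python
# Sketch of the loading step inside alpaca-lora's generate.py:
# 8-bit base LLaMA plus the LoRA adapter, applied with peft.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "./llama-7b-hf",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "./alpaca-lora-7b", torch_dtype=torch.float16)
model.eval()
```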

RTX 3070 8GB

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.22 GiB already allocated; 0 bytes free; 7.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

RTX 3090 24GB

7b

loading takes around 1 minute 15 seconds

30 seconds (temperature = 0.1 and the same for 0.7)

Flexiana is a software development company that specializes in the development of applications in the Clojure programming language. We offer a range of services, including software development, consulting, training, and support. Our services are tailored to meet the needs of our clients, from small startups to large enterprises. We

26 seconds (temperature = 0.1 and the same for 0.7)

Dear staff, it is with utmost regret that we have to inform you that due to late delivery of the project and higher costs, we have decided to terminate your employment. We apologize for any inconvenience caused and wish you all the best in your future endeavors.</s>

1 minute

```python
import gpt
import logging

def gpt_handler(message):
    token = message.get("model_token")
    if token is not None:
        model = gpt.GPTModel(token)
    else:
        model = gpt.GPTModel()
    context = message.get("context")
    if context is not None:
        model.set_context(context)
    else:
        model.set_context(message)
    response = model.answer()
    logging.info("GPT Response: %s", response)
    return response</s>
```

13b

loading takes around 2 minutes

54 seconds

Flexiana is a software development company specializing in the Clojure programming language. We offer a wide range of services, including software development, consulting, training, and support. Our team of experts can help you with any project, no matter the size or complexity. We pride ourselves on delivering high-quality solutions on time and within budget. Our

1 minute

Flexiana is a software development company specializing in Clojure. We offer a wide range of services, including software development, consulting, training, and support. Our team of experienced developers can help you build and maintain your Clojure applications, as well as provide guidance and advice on how to best utilize the language.</s>

1 minute

Dear staff, 

It is with utmost regret that I must inform you that your employment with our company has been terminated due to the late delivery of the project and the higher costs associated with it. 

Thank you for your service and we wish you the best of luck in your future endeavors. 

Sincerely, 
[Name]</s>

1 minute 35 seconds

```python
import gpt_3
import zulip
from zerver.lib.actions import do_send_message
from zerver.decorators import require_realm_admin

@require_realm_admin
@do_send_message
def gpt_bot_handler(message, bot_handler):
    model = gpt_3.GPT3Model.from_pretrained(token_limit=8000)
    response = model.generate(message.content)
    bot_handler.send_message(message.sender, response)
```