How I Managed to Reduce LLM API Costs By 85%
As a developer who frequently uses large language models (LLMs) for prototyping and small-scale applications, I found myself hitting a wall with my OpenAI API costs. What started as a small experiment quickly turned into a recurring monthly expense that I couldn’t justify for side projects.
---
The Problem
I was using OpenAI’s GPT-3.5 model (often called "davinci") for a variety of tasks: summarizing text, generating code snippets, and even handling basic natural language processing (NLP) tasks. While the model was powerful and easy to use, the cost kept climbing.
For example, in a single month, I used around 10,000 tokens. At OpenAI’s rate of $0.002 per 1,000 tokens, that came out to about $20. But as the number of requests increased, so did the cost. I realized that for my use case, the model was overkill—especially since many of my tasks didn’t require high accuracy or complex reasoning.
The pain points were:
- High cost per API call – even for simple tasks.
- Limited control – no way to tweak the model or cache responses.
- Dependency on external services – if OpenAI had downtime, my app would break.
---
The Solution
After some research, I decided to switch to a self-hosted model using Ollama, a lightweight LLM server that runs locally. I deployed it on a $20/month Hetzner VPS (a 4GB RAM, 2CPU, 50GB SSD instance) running Ubuntu 22.04.
Ollama makes it easy to run models like Llama2 locally, and it supports various models through its API. This allowed me to replace OpenAI’s API entirely with a local instance, significantly cutting costs.
---
Step-by-Step Implementation
Here’s how I set it up:
#### 1. Provision the VPS
I used Hetzner’s Cloud Console to spin up a Ubuntu 22.04 instance with the following specs:
- 4 GB RAM
- 2 CPU
- 50 GB SSD
- $20/month
ssh root@your-vps-ip
#### 2. Install Ollama
Ollama provides a simple install script:
curl -fsSL https://ollama.com/install.sh | sh
After installation, start the Ollama service:
sudo systemctl enable ollama
sudo systemctl start ollama
Check the status:
systemctl status ollama
#### 3. Pull a Model
I chose Llama2 for its balance of performance and size. You can pull it with:
ollama pull llama2
This downloads the model to your VPS and makes it available via the API.
#### 4. Configure the API
By default, Ollama runs on localhost:11434. To access it from outside, I configured a reverse proxy using Nginx.
Install Nginx:
sudo apt update
sudo apt install nginx
Create a new Nginx config file at /etc/nginx/sites-available/ollama:
server {
listen 80;
location / {
proxy_pass http://localhost:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
Enable the site and restart Nginx:
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
#### 5. Update Your Application
I modified my application code to point to the new API endpoint. For example, in Python:
import requests
response = requests.post(
json={"model": "llama2", "prompt": "Explain quantum computing in simple terms."}
)
print(response.json()["response"])
---
The Results
After a month of running the new setup, I saw the following improvements:
- Cost reduction: From ~$20/month to $0 (excluding the VPS cost, which is now a fixed $20).
- Response time: Average latency dropped from 2–3 seconds (OpenAI) to 1–1.5 seconds (local).
- Uptime: 100% since the VPS is always running, and I have a basic monitoring script in place.
---
Lessons Learned
1. Understand your use case – Not every project needs a high-end model. Llama2 is more than sufficient for many tasks and is way cheaper to run locally. 2. Leverage open-source tools – Ollama is a game-changer for developers looking to reduce dependency on paid APIs. It’s fast, simple, and well-documented.
3. Optimize infrastructure – A small VPS can handle a lot of LLM workloads. Choose hardware that matches your usage pattern and don’t overpay for capabilities you don’t need.
---
Switching to a self-hosted model was a no-brainer once I saw the cost savings and performance gains. It’s not perfect for every scenario, but for my use case, it’s been a huge win. If you’re looking to cut down on API costs, I highly recommend giving Ollama and a VPS a try.
Written by the Wingman Protocol team — developers building with AI APIs, cloud infrastructure, and automation tools daily. Our guides are based on hands-on experience running production systems.
Spin up cloud servers, managed databases, and Kubernetes clusters. New users get $200 in free credit.
Claim $200 Free Credit →· Fact-checked against official documentation and primary sources.
Related Services
Free Printable Resources
- Browse 20 free printables → — budget trackers, meal planners, home checklists & more. Print at home, free forever.