5 min read
February 23, 2025 · Technical

Making AI Faster: The Story Behind Bhumi

How I built Bhumi, a client that improves client-side AI inference performance.

Rach Pradhan

Design Engineer

Hey everyone, Rach here. Welcome back to the blog! Today, I want to share the journey behind Bhumi—a tool I built to make inference clients faster and more efficient.

If you've ever waited for an app to load, a chatbot to respond, or an AI tool to generate something, you know the pain of slow inference. Bhumi fixes that by optimizing the client side of AI interactions. In this post, I'll break down how it works, with the technical details.


Why I Built Bhumi

A while back, while working on finbro.ai, a project where we built AI-powered agents, I ran into a frustrating problem: latency. Every time we asked an AI to do something, it took forever to respond. And since we had multiple agents working together, the delays stacked up, making everything painfully slow.

I knew AI could be faster. But the existing solutions weren't cutting it. So, I built Bhumi—a fast AI inference client that optimizes speed and efficiency, letting AI models run as smoothly as possible.

[Bhumi diagram 1]

What Bhumi does is speed up the client part of the LLM pipeline; nothing changes on your provider's side. It does this by optimizing buffers and making the client more efficient.[^1]


Why Is AI Slow in the First Place?

Think of streaming a movie. You don't want to wait for the entire movie to download before you start watching, right? You just want it to start playing instantly while the rest loads in the background.

Most AI models don't work that way. Instead of "streaming" small chunks of information as they become available, they often wait to generate everything at once before showing you results. That's like downloading a whole movie before you can watch the first scene. Inefficient and frustrating.
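To make the analogy concrete, here's a tiny pure-Python sketch (not Bhumi's code) contrasting the two approaches. The token list and sleep are stand-ins for a model generating output.

```python
import time

def generate_tokens():
    """Simulate a model producing tokens one at a time."""
    for token in ["AI", " can", " stream", " output", "."]:
        time.sleep(0.01)  # stand-in for per-token generation latency
        yield token

# Batch: wait for everything before showing anything
batch_output = "".join(generate_tokens())

# Streaming: hand each chunk over as soon as it exists
streamed_parts = []
for chunk in generate_tokens():
    streamed_parts.append(chunk)  # in a real client: render immediately
```

Both paths produce the same text; the difference is that the streaming loop has something to show after the first 10 ms instead of the last.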

Another problem? The tools that manage AI requests (like LiteLLM) weren't handling multiple requests well, leading to even more delays.


As such, I had a few hypotheses on how to speed up the client part of the LLM pipeline.

Hypothesis 1: A Rust-based streaming solution, exposed to Python via PyO3, would natively provide a performance boost

[Benchmark diagram]

Well, if you look at the diagram above, it all seems to work fine. Right? Not quite: everything fell apart when I introduced types.

[Benchmark diagram]

Specifically, Pydantic was the bottleneck: it used a lot of memory and couldn't handle requests as fast as I would have liked.

Hypothesis 1.5: Implementing a similar streaming pattern in a validation library might make it faster

The question was whether it was worthwhile to implement a similar streaming pattern in a validation library. Maybe we could also use Rust and map it to Python with PyO3? So I opened an issue and started working on it.
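To make the pattern concrete, here's a plain-Python sketch of streaming validation: each object is validated the moment it completes instead of buffering the whole response first. This is an illustration of the idea only, not Satya's actual API; the `validate_item` check and the NDJSON framing are made up for the example.

```python
import json

def validate_item(obj: dict) -> dict:
    """Minimal hand-rolled validator: check required fields and types."""
    if not isinstance(obj.get("text"), str):
        raise ValueError("'text' must be a string")
    return obj

def stream_validated(chunks):
    """Validate newline-delimited JSON objects as each one completes,
    instead of buffering the full response before validating."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield validate_item(json.loads(line))

# Chunks arrive split at arbitrary byte boundaries, as on a real stream
chunks = ['{"text": "he', 'llo"}\n{"te', 'xt": "world"}\n']
results = list(stream_validated(chunks))
```

The point of the pattern is that validated objects become available while the stream is still in flight, so downstream code never waits on the full payload.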


It took about an hour, and a week's worth of refining the dev experience for myself, but I got it working! And boy was it fast!

[Benchmark results]

And with that, Hypothesis 1.5 was proven true!


Even though we now had a pretty performant library, it still wasn't fast enough for my liking.

The buffer sizes were now a bottleneck, and I had to find a way to make them faster. And then I crafted my second hypothesis.

Hypothesis 2: There could be buffer sizes that are optimal for LLM data outputs

[Buffer size benchmark charts]

I ran a few benchmarks and found that some buffer sizes seemed optimal for the LLM's data output (or so I thought).
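For context, a buffer-size benchmark along those lines can be sketched in a few lines. This is illustrative only, not the actual harness: the payload and candidate sizes here are made up, and a real run would measure against live provider streams.

```python
import io
import time

def read_with_buffer(payload: bytes, buffer_size: int) -> float:
    """Time how long it takes to consume a payload in fixed-size chunks."""
    stream = io.BytesIO(payload)
    start = time.perf_counter()
    while stream.read(buffer_size):
        pass
    return time.perf_counter() - start

payload = b"x" * (1 << 20)  # 1 MiB of fake streamed output
timings = {size: read_with_buffer(payload, size) for size in (64, 1024, 16384)}
```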

It turned out to be totally wrong, and I had to scrap the whole idea. Ergo, the second hypothesis was proven false.


I had an idea

Is there a way to create optimal buffer conditions that adapt to different providers over time? There was one class of algorithms I knew that prioritised both quality and diversity.

Hypothesis 3: Using quality-diversity algorithms, could we map out the optimal chunk sizes for the type of buffer that comes through?

Quality-diversity algorithms are a class of algorithms that not only explore a search space but also seek to discover a diverse set of high-performing solutions across multiple dimensions. If we could map out the optimal chunk sizes for each kind of buffer that comes through, we could potentially speed up the client part of the pipeline.

The algorithm I chose was MAP-Elites. It divides the behavior space into a grid of cells and keeps the best-performing solution (the "elite") found so far in each cell, which made it a natural fit for mapping buffer configurations to throughput. It's also quite performant. (And again, this was just a hypothesis.)
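To make the idea concrete, here's a minimal MAP-Elites loop in Python. Everything in it is a toy stand-in, not Bhumi's actual optimizer: the fitness function fakes a throughput curve, the behavior descriptor is just the order of magnitude of the buffer size, and the mutation scheme is the simplest thing that works.

```python
import random

random.seed(0)

def evaluate(buffer_size: int) -> float:
    """Toy fitness: throughput peaks at a made-up 'optimal' buffer size.
    A real setup would measure characters/second against a provider."""
    return 1400 - abs(buffer_size - 8192) / 10

def descriptor(buffer_size: int) -> int:
    """Behavior descriptor: which order-of-magnitude bin the size falls in."""
    return len(str(buffer_size))

grid = {}  # cell -> (solution, fitness): one elite per behavior cell

for _ in range(200):
    if grid and random.random() < 0.5:
        # Exploit: mutate a randomly chosen existing elite
        parent, _ = random.choice(list(grid.values()))
        candidate = max(1, parent + random.randint(-1024, 1024))
    else:
        # Explore: sample a fresh buffer size
        candidate = random.randint(1, 65536)
    fitness = evaluate(candidate)
    cell = descriptor(candidate)
    # Keep the candidate only if it beats the current elite in its cell
    if cell not in grid or fitness > grid[cell][1]:
        grid[cell] = (candidate, fitness)
```

The grid is the output: a map from behavior bins to the best buffer size found in each, rather than a single global optimum.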

So I started working on it, and I got it working! After 15 iterations, the throughput had improved from 600 characters per second to over 1400 characters per second.

Here was the Grid at 15 iterations:

[MAP-Elites grid at 15 iterations]

Furthermore, as more iterations are run and more of the grid is explored, we observed an interesting phenomenon in the MAP-Elites algorithm's performance. The rate of improvement begins to plateau, and in some cases, we see a decrease in overall performance. This behavior stems from several key factors:

  1. Search Space Saturation

    • Initially, the algorithm easily finds high-performing solutions because the behavior space is largely unexplored
    • As the grid fills up, finding better solutions becomes exponentially harder
    • The algorithm must work harder to discover solutions that outperform existing elites
  2. Local Optima Traps

    • The algorithm can get stuck in local optima within certain regions of the behavior space
    • Mutations and crossovers start producing increasingly similar solutions
    • Breaking out of these local optima requires larger, potentially disruptive variations
  3. Exploration-Exploitation Balance

    • Early iterations benefit from broad exploration of the behavior space
    • Later iterations tend to focus more on exploitation (refining existing solutions)
    • This shift can lead to decreased diversity in the candidate pool

In our specific case with buffer optimization, we observed:

  • Peak performance around iteration 15 (~1400 characters/second)
  • Gradual decline in improvement rate after iteration 20
  • Increased computational cost per improvement as the grid filled up

[Chart: performance over iterations]

This pattern is actually expected in quality-diversity algorithms like MAP-Elites, and it helped us identify the optimal point to stop training and deploy the solution.


How Bhumi Makes AI Faster

Bhumi fixes these issues with three key optimizations:

1. Optimized Request Handling with MAP-Elites

Instead of traditional HTTP request handling, Bhumi uses an adaptive optimization approach. Using the MAP-Elites algorithm, it dynamically adjusts buffer sizes and processing patterns based on:

  • Provider-Specific Optimization: Different buffer sizes for different AI providers
  • Adaptive Processing: Buffer management that evolves based on performance data
  • Quality-Diversity Balance: Maintaining both speed and reliability
  • Continuous Improvement: Learning from each request to optimize future ones

Our testing showed throughput improvements from 600 to over 1400 characters per second after just 15 iterations.
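Conceptually, the provider-specific adaptation looks something like this sketch. The class and method names here are hypothetical, chosen for the example rather than taken from Bhumi's API, and the "learning" is reduced to remembering the best-performing size per provider.

```python
class AdaptiveBuffer:
    """Tracks a buffer size per provider and keeps whichever size
    has produced the best measured throughput so far (illustrative)."""

    def __init__(self, initial: int = 8192):
        self.initial = initial
        self.sizes = {}  # provider -> current buffer size
        self.best = {}   # provider -> (size, best observed throughput)

    def size_for(self, provider: str) -> int:
        return self.sizes.setdefault(provider, self.initial)

    def record(self, provider: str, size: int, throughput: float) -> None:
        # Remember the best-performing size seen so far for this provider
        best = self.best.get(provider)
        if best is None or throughput > best[1]:
            self.best[provider] = (size, throughput)
            self.sizes[provider] = size

buffers = AdaptiveBuffer()
buffers.record("openai", 16384, 1400.0)
buffers.record("openai", 4096, 900.0)      # worse: ignored
buffers.record("anthropic", 8192, 1100.0)
```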

2. Rust + Python Architecture

Bhumi's core is built in Rust for maximum performance, with a Python interface for ease of use. This hybrid approach delivers:

  • Native-speed processing with PyO3
  • Developer-friendly API
  • Minimal overhead

3. Optimized Validation with Satya

We replaced the standard Pydantic validation with Satya, our custom validation library that:

  • Reduces memory overhead
  • Processes types faster
  • Maintains full type safety

Results and Impact

These optimizations deliver significant performance improvements:

Response Time Improvements

  • OpenAI: 2.5x faster than raw implementation, 1.9x faster than native
  • Gemini: 1.5x faster than raw, 1.6x faster than native
  • Anthropic: 1.8x faster than raw, 1.4x faster than native

Memory Efficiency

  • Only 1.1x memory overhead vs native implementations
  • Stable performance under load
  • Efficient resource utilization

Real-World Impact

These improvements translate to real-world benefits:

  • Faster response times for user interactions
  • More efficient resource utilization
  • Better scaling for multi-agent systems
  • Reduced operational costs

The combination of MAP-Elites optimization, Rust-based streaming, and efficient buffer management has created a solution that's not just incrementally better, but fundamentally more efficient. And we're just getting started—there's still room for optimization and improvement.


Supported AI Providers & Structured Outputs

Bhumi supports multiple AI providers, allowing seamless switching between them. Currently supported providers include:

  • OpenAI (openai/{model_name})
  • Anthropic (anthropic/{model_name})
  • Gemini (gemini/{model_name})
  • Groq (groq/{model_name})
  • SambaNova (sambanova/{model_name})
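The provider/model convention above can be handled with a small helper like this one. It's a sketch based on the naming scheme shown in the list, not Bhumi's internal routing code.

```python
def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model' identifier such as 'openai/gpt-4o-mini'."""
    provider, _, model = model_id.partition("/")
    if not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, model

provider, model = parse_model_id("openai/gpt-4o-mini")
```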

Bhumi also supports structured outputs and tool use, making it easy to integrate external functions into AI responses.


Using Bhumi for Tool Use & Structured Outputs

Bhumi allows AI models to call external tools for better interactivity. Here's an example that registers a weather tool and lets AI call it dynamically:

import asyncio
from bhumi.base_client import BaseLLMClient, LLMConfig
import os
import json
from dotenv import load_dotenv

load_dotenv()

# Example weather tool function
async def get_weather(location: str, unit: str = "f") -> str:
    result = f"The weather in {location} is 75°{unit}"
    print(f"\nTool executed: get_weather({location}, {unit}) -> {result}")
    return result

async def main():
    config = LLMConfig(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="openai/gpt-4o-mini"
    )
    
    client = BaseLLMClient(config)
    
    # Register the weather tool
    client.register_tool(
        name="get_weather",
        func=get_weather,
        description="Get the current weather for a location",
        parameters={
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["c", "f"], "description": "Temperature unit (c for Celsius, f for Fahrenheit)"}
            },
            "required": ["location", "unit"],
            "additionalProperties": False
        }
    )
    
    print("\nStarting weather query test...")
    messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
    
    print(f"\nSending messages: {json.dumps(messages, indent=2)}")
    
    try:
        response = await client.completion(messages)
        print(f"\nFinal Response: {response['text']}")
    except Exception as e:
        print(f"\nError during completion: {e}")

if __name__ == "__main__":
    asyncio.run(main())

With Bhumi, AI models can generate structured responses and interact with external tools effortlessly.


Final Thoughts

Bhumi isn't just about speed—it's about flexibility and efficiency. Whether you need to switch AI providers on the fly or enable structured outputs and tool use, Bhumi makes it seamless.

Drop a comment below—I'd love to hear your thoughts! 🚀

There are a few things that are being worked on!

  • Structured Outputs
  • Tool Use
  • More Providers
  • More Models

If you'd like to help out, please reach out to me at me@rachit.ai
