Learnitweb

Getting Started with Open-Source Models Using the Groq API

1. What Is Groq?

Groq is an AI infrastructure company focused entirely on high-performance inference for large language models (LLMs).

Groq does not train models. Instead, it:

  • Hosts popular open-source LLMs
  • Provides ultra-low latency inference APIs
  • Uses custom hardware called LPU (Language Processing Unit)

In simple terms, Groq makes LLMs run extremely fast in production.

2. Models Available on Groq

Groq hosts popular open-source LLMs, including:

  • Llama 3.1 by Meta
  • Gemma family by Google (older variants now deprecated)
  • Mistral models

3. Groq API: How Developers Use It

Groq provides:

  • REST APIs
  • SDK-style integrations (via frameworks like LangChain)
  • Playground for experimentation

Key properties:

  • Simple authentication via API key
  • No model training required
  • No infrastructure management
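Because Groq's API is OpenAI-compatible, a plain HTTP call is enough to get started. Below is a minimal sketch using only the Python standard library; the endpoint URL and model name follow Groq's documentation at the time of writing and may change, so treat them as assumptions:

```python
import json
import os
import urllib.request

GROQ_API_KEY = os.getenv("GROQ_API_KEY", "")

# Assumed endpoint; check Groq's docs for the current URL and model IDs.
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_payload(question, model="llama-3.1-8b-instant"):
    # Groq's API is OpenAI-compatible, so the payload mirrors
    # the familiar chat-completions format.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.4,
    }

def ask(question):
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={
            "Authorization": f"Bearer {GROQ_API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__" and GROQ_API_KEY:
    print(ask("What is an LPU?"))
```

Authentication is just the API key in a Bearer header; there is no infrastructure to provision.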

4. Why Inference Is the Real Bottleneck in GenAI

Most discussions about Generative AI focus on:

  • Bigger models
  • Better benchmarks
  • New architectures

But in real systems, the biggest challenges appear after the model is already trained. The real bottleneck in GenAI is inference — the process of running a trained model to generate outputs.

5. Training vs Inference: A Critical Distinction

Training (One-Time or Infrequent)

  • Happens once or occasionally
  • Done on massive GPU clusters
  • Offline process
  • Expensive but amortized over time

Inference (Continuous and Real-Time)

  • Happens every user request
  • Must respond in milliseconds
  • Scales with traffic
  • Directly impacts user experience and cost

A model is trained once, but inference runs millions or billions of times.

What Happens During LLM Inference?

When a user asks a question:

  1. The input text is tokenized
  2. The model processes tokens
  3. Output tokens are generated one by one
  4. Each token depends on the previous token

This sequential nature is the root of the problem.
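The loop above can be sketched with a toy stand-in for the model. The `next_token` rule here is a trivial placeholder (a real LLM runs a full forward pass with attention over the whole context), but the control flow is the same:

```python
def next_token(context):
    # Stand-in for a real model's forward pass; a trivial "+1" rule here.
    # A real LLM would run attention over ALL tokens in `context`.
    return context[-1] + 1

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        # Step t+1 needs the token produced at step t, so the steps
        # cannot run in parallel: this loop is inherently sequential.
        tokens.append(next_token(tokens))
    return tokens

# Tokenization/detokenization are elided; we work on token ids directly.
```

No matter how the inside of `next_token` is accelerated, the outer loop stays serial.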

Sequential Token Generation: The Core Issue

Unlike image or matrix workloads:

  • LLMs cannot generate tokens in parallel
  • Each token requires:
    • Memory access
    • Compute
    • Attention over previous tokens

Even if you have massive parallel hardware:

  • Only one token can be finalized at a time

This makes latency per token extremely important.
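In other words, decode time is bounded by the sequential chain of tokens, not by available compute. A small sketch of that arithmetic:

```python
def total_decode_time(n_tokens, per_token_latency_s):
    # Token t+1 cannot start until token t is finalized, so per-token
    # latencies add up regardless of how much parallel hardware exists.
    return n_tokens * per_token_latency_s

# Halving per-token latency halves response time; adding more parallel
# devices to a single request does not.
fast = total_decode_time(200, 0.01)   # 2 seconds
slow = total_decode_time(200, 0.02)   # 4 seconds
```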

Why GPUs Are Not Ideal for Inference

GPUs were designed for:

  • Parallel numerical computation
  • Batch processing
  • High throughput

LLM inference needs:

  • Low latency
  • Fast memory access
  • Deterministic execution

GPU Bottlenecks in Inference

1. Memory Bandwidth Bottleneck

  • Attention layers constantly read large tensors
  • GPUs struggle to feed compute units fast enough

2. Scheduling Overhead

  • GPUs dynamically schedule thousands of threads
  • This adds latency for token-by-token execution

3. Poor Utilization for Small Batches

  • Real-time apps often use batch size = 1
  • GPUs are inefficient at small batch sizes

6. Latency vs Throughput: The Hidden Trade-off

Most GPU systems optimize for throughput:

  • Tokens per second across many requests

But GenAI apps need low latency:

  • Time to first token
  • Time between tokens

Example

Metric              GPU-optimized system    User cares?
Throughput          High                    Rarely
Latency             Variable                Yes
First token delay   Often high              Yes

Users perceive slowness even if throughput is high.
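A hypothetical comparison makes this concrete (all numbers invented for illustration): a throughput-tuned batch server versus a latency-tuned one, both streaming a 240-token answer:

```python
def response_timeline(ttft_s, tokens_per_second, n_tokens):
    # When the user sees the first word, and when the full answer is done.
    first_token = ttft_s
    complete = ttft_s + n_tokens / tokens_per_second
    return first_token, complete

# Batch server: high aggregate throughput, but a slow start.
batched = response_timeline(ttft_s=2.0, tokens_per_second=120, n_tokens=240)

# Latency-tuned server: lower raw throughput, yet it feels faster
# because words start appearing almost immediately.
low_latency = response_timeline(ttft_s=0.3, tokens_per_second=80, n_tokens=240)
```

The batch server finishes at 4.0 s but shows nothing for 2 s; the latency-tuned one starts streaming at 0.3 s and finishes at 3.3 s, so it both feels and is faster for this single request.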

Cost Explosion at Inference Time

Inference cost grows with:

  • Number of users
  • Length of prompts
  • Length of responses
  • Number of API calls

Why This Is Dangerous

  • GPU inference is expensive per token
  • Costs scale linearly with usage
  • Free or freemium apps become unsustainable

Many GenAI startups fail not because models are bad, but because inference costs kill them.
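To see how fast this adds up, here is a back-of-envelope sketch; every number below is hypothetical:

```python
def monthly_inference_cost(users, requests_per_user, tokens_per_request,
                           price_per_million_tokens):
    # Cost scales linearly in every factor: double any one, double the bill.
    total_tokens = users * requests_per_user * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million_tokens

# 100k users x 30 requests/month x 1,000 tokens at $0.50 per million tokens:
cost = monthly_inference_cost(100_000, 30, 1_000, 0.50)   # $1,500 per month

# Grow to 1M users and the same app costs $15,000 per month.
```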

Determinism: An Overlooked Requirement

Production systems require:

  • Predictable response times
  • Stable latency under load
  • No random slowdowns

GPU inference is:

  • Non-deterministic
  • Sensitive to batching and scheduling
  • Unpredictable at scale

This makes system design and SLAs difficult.


Why Inference Dominates Total System Cost

Let’s compare:

Phase        Cost pattern
Training     One-time / rare
Inference    Continuous / unbounded

Even a modest model:

  • Can cost more in inference than training over time
  • Especially at scale

This flips the traditional ML cost model.
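A quick break-even sketch makes the flip concrete (all figures hypothetical):

```python
def days_until_inference_exceeds_training(training_cost_usd, tokens_per_day,
                                          price_per_million_tokens):
    # Days until cumulative inference spend passes the one-time training bill.
    daily_cost = tokens_per_day / 1_000_000 * price_per_million_tokens
    return training_cost_usd / daily_cost

# $200k training run, 1B tokens served per day at $0.50 per million tokens:
days = days_until_inference_exceeds_training(200_000, 1_000_000_000, 0.50)
# -> 400 days; at 10B tokens/day the crossover arrives in just 40 days.
```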


The Shift in GenAI Optimization Priorities

Modern GenAI optimization focuses on:

  • Faster inference hardware
  • Lower latency per token
  • Efficient memory access
  • Deterministic execution
  • Cost-per-token reduction

This is why inference-first companies are emerging.


How Groq Addresses the Inference Bottleneck

Groq was built specifically to solve inference problems.

Groq’s Key Ideas

  • Custom hardware (LPU)
  • Deterministic execution model
  • Token-optimized pipelines
  • No dynamic scheduling overhead

Result

  • Faster time-to-first-token
  • Faster token generation
  • Predictable latency
  • Lower cost per request

Groq optimizes execution, not model ownership.


Inference Is Where User Experience Lives

Users don’t care about:

  • Parameter count
  • Training dataset size
  • Research papers

They care about:

  • How fast the answer appears
  • How smooth the interaction feels
  • Whether the system keeps up under load

All of this depends on inference performance.

What Is an LPU (Language Processing Unit)?

An LPU (Language Processing Unit) is a special-purpose AI processor designed specifically for running large language models (LLMs) efficiently during inference.

Unlike GPUs or CPUs, which are general-purpose processors, an LPU is purpose-built for language workloads, especially token-by-token text generation.

LPUs were introduced and popularized by Groq to solve the core performance and cost problems of LLM inference.


Why LPUs Exist

To understand LPUs, you first need to understand a key problem:

LLMs generate text sequentially, one token at a time.

Most existing hardware (GPUs, CPUs) is optimized for parallel computation, not for sequential token generation with heavy memory access.

This mismatch creates:

  • High latency
  • Poor utilization
  • High cost per request

LPUs exist to fix this mismatch.

7. GenAI Example Using Groq

The following Streamlit app uses LangChain's ChatGroq integration to answer questions with an open-source model hosted on Groq. It expects a GROQ_API_KEY in a .env file.

import os
import streamlit as st
from dotenv import load_dotenv

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


# ------------------------------------------------------
# Load environment variables
# ------------------------------------------------------
load_dotenv()

GROQ_API_KEY = os.getenv("GROQ_API_KEY")

if not GROQ_API_KEY:
    st.error("GROQ_API_KEY is missing. Please set it in the .env file.")
    st.stop()

# ------------------------------------------------------
# Initialize LLM (cached)
# ------------------------------------------------------
@st.cache_resource
def load_llm():
    return ChatGroq(
        model="llama-3.1-8b-instant",
        groq_api_key=GROQ_API_KEY,
        temperature=0.4
    )

llm = load_llm()

# ------------------------------------------------------
# Prompt Template (Very Simple)
# ------------------------------------------------------
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI assistant."),
        ("human", "{question}")
    ]
)

# ------------------------------------------------------
# Output Parser
# ------------------------------------------------------
output_parser = StrOutputParser()

# ------------------------------------------------------
# LCEL Chain
# ------------------------------------------------------
chain = prompt | llm | output_parser

# ------------------------------------------------------
# Streamlit UI
# ------------------------------------------------------
st.set_page_config(page_title="Simple GenAI App", page_icon="🤖")

st.title("🤖 Simple GenAI App (LangChain + Groq)")
st.write("Ask any question and get an instant answer from an open-source LLM.")

question = st.text_area(
    "Your Question",
    placeholder="What is LangChain?",
    height=120
)

if st.button("Ask"):
    if not question.strip():
        st.warning("Please enter a question.")
    else:
        with st.spinner("Thinking..."):
            try:
                answer = chain.invoke({"question": question})
                st.success("Answer")
                st.write(answer)
            except Exception as e:
                st.error(f"Error: {e}")

st.markdown("---")
st.caption("Powered by LangChain + Groq")