Learnitweb

Getting Started with Open-Source Models Using the Groq API

1. What Is Groq?

Groq is an AI infrastructure company focused entirely on high-performance inference for large language models (LLMs).

Groq does not train models. Instead, it:

  • Hosts popular open-source LLMs
  • Provides ultra-low latency inference APIs
  • Uses custom hardware called LPU (Language Processing Unit)

In simple terms, Groq makes LLMs run extremely fast in production.

2. Models Available on Groq

Groq hosts popular open-source LLMs, including:

  • Llama 3.1 by Meta
  • Gemma family by Google (older variants now deprecated)
  • Mistral models

3. Groq API: How Developers Use It

Groq provides:

  • REST APIs
  • SDK-style integrations (via frameworks like LangChain)
  • Playground for experimentation

Key properties:

  • Simple authentication via API key
  • No model training required
  • No infrastructure management
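Because Groq's API is OpenAI-compatible, a plain HTTP call is enough to get started. Below is a minimal sketch using only the Python standard library; the endpoint URL and model name follow Groq's documentation at the time of writing and may change, so treat them as assumptions:

```python
import json
import os
import urllib.request

GROQ_API_KEY = os.getenv("GROQ_API_KEY", "")

# Assumed endpoint; check Groq's docs for the current URL and model IDs.
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_payload(question, model="llama-3.1-8b-instant"):
    # Groq's API is OpenAI-compatible, so the payload mirrors
    # the familiar chat-completions format.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.4,
    }

def ask(question):
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={
            "Authorization": f"Bearer {GROQ_API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__" and GROQ_API_KEY:
    print(ask("What is an LPU?"))
```

Authentication is just the API key in a Bearer header; there is no infrastructure to provision.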

4. Why Inference Is the Real Bottleneck in GenAI

Most discussions about Generative AI focus on:

  • Bigger models
  • Better benchmarks
  • New architectures

But in real systems, the biggest challenges appear after the model is already trained. The real bottleneck in GenAI is inference — the process of running a trained model to generate outputs.

5. Training vs Inference: A Critical Distinction

Training (One-Time or Infrequent)

  • Happens once or occasionally
  • Done on massive GPU clusters
  • Offline process
  • Expensive but amortized over time

Inference (Continuous and Real-Time)

  • Happens every user request
  • Must respond in milliseconds
  • Scales with traffic
  • Directly impacts user experience and cost

A model is trained once, but inference runs millions or billions of times.

What Happens During LLM Inference?

When a user asks a question:

  1. The input text is tokenized
  2. The model processes tokens
  3. Output tokens are generated one by one
  4. Each token depends on the previous token

This sequential nature is the root of the problem.
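The loop above can be sketched with a toy stand-in for the model. The `next_token` rule here is a trivial placeholder (a real LLM runs a full forward pass with attention over the whole context), but the control flow is the same:

```python
def next_token(context):
    # Stand-in for a real model's forward pass; a trivial "+1" rule here.
    # A real LLM would run attention over ALL tokens in `context`.
    return context[-1] + 1

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        # Step t+1 needs the token produced at step t, so the steps
        # cannot run in parallel: this loop is inherently sequential.
        tokens.append(next_token(tokens))
    return tokens

# Tokenization/detokenization are elided; we work on token ids directly.
```

No matter how the inside of `next_token` is accelerated, the outer loop stays serial.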

Sequential Token Generation: The Core Issue

Unlike image or matrix workloads:

  • LLMs cannot generate tokens in parallel
  • Each token requires:
    • Memory access
    • Compute
    • Attention over previous tokens

Even if you have massive parallel hardware:

  • Only one token can be finalized at a time

This makes latency per token extremely important.
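In other words, decode time is bounded by the sequential chain of tokens, not by available compute. A small sketch of that arithmetic:

```python
def total_decode_time(n_tokens, per_token_latency_s):
    # Token t+1 cannot start until token t is finalized, so per-token
    # latencies add up regardless of how much parallel hardware exists.
    return n_tokens * per_token_latency_s

# Halving per-token latency halves response time; adding more parallel
# devices to a single request does not.
fast = total_decode_time(200, 0.01)   # 2 seconds
slow = total_decode_time(200, 0.02)   # 4 seconds
```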

Why GPUs Are Not Ideal for Inference

GPUs were designed for:

  • Parallel numerical computation
  • Batch processing
  • High throughput

LLM inference needs:

  • Low latency
  • Fast memory access
  • Deterministic execution

GPU Bottlenecks in Inference

1. Memory Bandwidth Bottleneck

  • Attention layers constantly read large tensors
  • GPUs struggle to feed compute units fast enough

2. Scheduling Overhead

  • GPUs dynamically schedule thousands of threads
  • This adds latency for token-by-token execution

3. Poor Utilization for Small Batches

  • Real-time apps often use batch size = 1
  • GPUs are inefficient at small batch sizes

6. Latency vs Throughput: The Hidden Trade-off

Most GPU systems optimize for throughput:

  • Tokens per second across many requests

But GenAI apps need low latency:

  • Time to first token
  • Time between tokens

Example

Metric              GPU-optimized system    User cares?
Throughput          High                    Rarely
Latency             Variable                Yes
First token delay   Often high              Yes

Users perceive slowness even if throughput is high.
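A hypothetical comparison makes this concrete (all numbers invented for illustration): a throughput-tuned batch server versus a latency-tuned one, both streaming a 240-token answer:

```python
def response_timeline(ttft_s, tokens_per_second, n_tokens):
    # When the user sees the first word, and when the full answer is done.
    first_token = ttft_s
    complete = ttft_s + n_tokens / tokens_per_second
    return first_token, complete

# Batch server: high aggregate throughput, but a slow start.
batched = response_timeline(ttft_s=2.0, tokens_per_second=120, n_tokens=240)

# Latency-tuned server: lower raw throughput, yet it feels faster
# because words start appearing almost immediately.
low_latency = response_timeline(ttft_s=0.3, tokens_per_second=80, n_tokens=240)
```

The batch server finishes at 4.0 s but shows nothing for 2 s; the latency-tuned one starts streaming at 0.3 s and finishes at 3.3 s, so it both feels and is faster for this single request.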

Cost Explosion at Inference Time

Inference cost grows with:

  • Number of users
  • Length of prompts
  • Length of responses
  • Number of API calls

Why This Is Dangerous

  • GPU inference is expensive per token
  • Costs scale linearly with usage
  • Free or freemium apps become unsustainable

Many GenAI startups fail not because models are bad, but because inference costs kill them.
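To see how fast this adds up, here is a back-of-envelope sketch; every number below is hypothetical:

```python
def monthly_inference_cost(users, requests_per_user, tokens_per_request,
                           price_per_million_tokens):
    # Cost scales linearly in every factor: double any one, double the bill.
    total_tokens = users * requests_per_user * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million_tokens

# 100k users x 30 requests/month x 1,000 tokens at $0.50 per million tokens:
cost = monthly_inference_cost(100_000, 30, 1_000, 0.50)   # $1,500 per month

# Grow to 1M users and the same app costs $15,000 per month.
```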

Determinism: An Overlooked Requirement

Production systems require:

  • Predictable response times
  • Stable latency under load
  • No random slowdowns

GPU inference is:

  • Non-deterministic
  • Sensitive to batching and scheduling
  • Unpredictable at scale

This makes system design and SLAs difficult.


Why Inference Dominates Total System Cost

Let’s compare:

Phase        Cost pattern
Training     One-time / rare
Inference    Continuous / unbounded

Even a modest model:

  • Can cost more in inference than training over time
  • Especially at scale

This flips the traditional ML cost model.
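A quick break-even sketch makes the flip concrete (all figures hypothetical):

```python
def days_until_inference_exceeds_training(training_cost_usd, tokens_per_day,
                                          price_per_million_tokens):
    # Days until cumulative inference spend passes the one-time training bill.
    daily_cost = tokens_per_day / 1_000_000 * price_per_million_tokens
    return training_cost_usd / daily_cost

# $200k training run, 1B tokens served per day at $0.50 per million tokens:
days = days_until_inference_exceeds_training(200_000, 1_000_000_000, 0.50)
# -> 400 days; at 10B tokens/day the crossover arrives in just 40 days.
```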


The Shift in GenAI Optimization Priorities

Modern GenAI optimization focuses on:

  • Faster inference hardware
  • Lower latency per token
  • Efficient memory access
  • Deterministic execution
  • Cost-per-token reduction

This is why inference-first companies are emerging.


How Groq Addresses the Inference Bottleneck

Groq was built specifically to solve inference problems.

Groq’s Key Ideas

  • Custom hardware (LPU)
  • Deterministic execution model
  • Token-optimized pipelines
  • No dynamic scheduling overhead

Result

  • Faster time-to-first-token
  • Faster token generation
  • Predictable latency
  • Lower cost per request

Groq optimizes execution, not model ownership.


Inference Is Where User Experience Lives

Users don’t care about:

  • Parameter count
  • Training dataset size
  • Research papers

They care about:

  • How fast the answer appears
  • How smooth the interaction feels
  • Whether the system keeps up under load

All of this depends on inference performance.

What Is an LPU (Language Processing Unit)?

An LPU (Language Processing Unit) is a special-purpose AI processor designed specifically for running large language models (LLMs) efficiently during inference.

Unlike GPUs or CPUs, which are general-purpose processors, an LPU is purpose-built for language workloads, especially token-by-token text generation.

LPUs were introduced and popularized by Groq to solve the core performance and cost problems of LLM inference.


Why LPUs Exist

To understand LPUs, you first need to understand a key problem:

LLMs generate text sequentially, one token at a time.

Most existing hardware (GPUs, CPUs) is optimized for parallel computation, not for sequential token generation with heavy memory access.

This mismatch creates:

  • High latency
  • Poor utilization
  • High cost per request

LPUs exist to fix this mismatch.

7. GenAI Example Using Groq

The following Streamlit app uses LangChain's ChatGroq integration to answer questions with an open-source model hosted on Groq. It expects a GROQ_API_KEY in a .env file.

import os
import streamlit as st
from dotenv import load_dotenv

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


# ------------------------------------------------------
# Load environment variables
# ------------------------------------------------------
load_dotenv()

GROQ_API_KEY = os.getenv("GROQ_API_KEY")

if not GROQ_API_KEY:
    st.error("GROQ_API_KEY is missing. Please set it in the .env file.")
    st.stop()

# ------------------------------------------------------
# Initialize LLM (cached)
# ------------------------------------------------------
@st.cache_resource
def load_llm():
    return ChatGroq(
        model="llama-3.1-8b-instant",
        groq_api_key=GROQ_API_KEY,
        temperature=0.4
    )

llm = load_llm()

# ------------------------------------------------------
# Prompt Template (Very Simple)
# ------------------------------------------------------
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI assistant."),
        ("human", "{question}")
    ]
)

# ------------------------------------------------------
# Output Parser
# ------------------------------------------------------
output_parser = StrOutputParser()

# ------------------------------------------------------
# LCEL Chain
# ------------------------------------------------------
chain = prompt | llm | output_parser

# ------------------------------------------------------
# Streamlit UI
# ------------------------------------------------------
st.set_page_config(page_title="Simple GenAI App", page_icon="🤖")

st.title("🤖 Simple GenAI App (LangChain + Groq)")
st.write("Ask any question and get an instant answer from an open-source LLM.")

question = st.text_area(
    "Your Question",
    placeholder="What is LangChain?",
    height=120
)

if st.button("Ask"):
    if not question.strip():
        st.warning("Please enter a question.")
    else:
        with st.spinner("Thinking..."):
            try:
                answer = chain.invoke({"question": question})
                st.success("Answer")
                st.write(answer)
            except Exception as e:
                st.error(f"Error: {e}")

st.markdown("---")
st.caption("Powered by LangChain + Groq")