
Building Production-Ready LLM Applications with Azure OpenAI and Azure AI Search



Large Language Models (LLMs) are powerful but inherently limited by two major constraints:

  1. Static knowledge cutoff – models cannot access new or proprietary data.

  2. Hallucinations – models may generate plausible but incorrect answers.

To address these limitations, modern enterprise AI applications rely on Retrieval-Augmented Generation (RAG). RAG combines LLM reasoning with external knowledge retrieval, grounding responses in authoritative data sources.

In the Microsoft ecosystem, the most common enterprise implementation uses:

  • Azure OpenAI Service

  • Azure AI Search

This architecture enables scalable, secure, and production-ready AI applications that integrate private enterprise data with generative AI capabilities.

This article explores:

  • Production RAG architecture on Azure

  • Vector search with embeddings

  • Indexing pipelines

  • Prompt grounding

  • Implementation examples

  • Performance and cost optimization

Understanding the RAG Architecture on Azure

Retrieval-Augmented Generation integrates three primary components:

  1. Embedding pipeline

  2. Vector search index

  3. LLM generation layer


The workflow typically follows these steps:

User Query → Query Embedding → Vector Search (Azure AI Search) → Relevant Documents Retrieved → Prompt Grounding → Azure OpenAI Completion → Final Response


The LLM does not directly access the database. Instead, relevant knowledge is retrieved first and then injected into the prompt context.

This approach significantly reduces hallucinations and improves factual accuracy.

Azure Services Used in a Production RAG Pipeline

A robust architecture typically includes the following services.

  • LLM inference – Azure OpenAI

  • Vector database – Azure AI Search

  • Embedding generation – Azure OpenAI Embeddings

  • Data ingestion – Azure Functions / Data Factory

  • Storage – Azure Blob Storage

  • Monitoring – Azure Monitor

Each component serves a specific role in the overall architecture.


Document Ingestion and Chunking

Before documents can be searched semantically, they must be preprocessed.

Typical ingestion pipeline:

  1. Document ingestion

  2. Text extraction

  3. Chunking

  4. Embedding generation

  5. Vector indexing


Why Chunking Matters

LLMs and embedding models have token limits. Large documents must be split into smaller segments.

Example strategy:

  • chunk size: 500 tokens

  • overlap: 50 tokens

This improves semantic continuity during retrieval.


Example: Chunking in Python

def chunk_text(text, chunk_size=500, overlap=50):
    # Character counts approximate tokens here; swap in a tokenizer
    # (e.g. tiktoken) for exact token-based chunking.
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap

    return chunks


Chunking improves recall and ensures vector embeddings remain meaningful.
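A quick way to sanity-check the overlap behavior is to run the splitter on a synthetic string (character counts stand in for tokens here):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Character-based splitter; consecutive chunks share `overlap` characters.
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)

print(len(chunks))                        # 3 chunks: [0:500], [450:950], [900:1200]
print(chunks[0][-50:] == chunks[1][:50])  # True: the 50-character overlap
```

The shared suffix/prefix is what preserves semantic continuity: a sentence cut at a chunk boundary still appears whole in one of the two chunks.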


Generating Embeddings with Azure OpenAI


Embeddings convert text into high-dimensional vectors that capture semantic meaning.

Using Azure OpenAI embeddings:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_KEY"],  # avoid hardcoding keys in source
    api_version="2024-02-15-preview",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

response = client.embeddings.create(
    input="Azure AI enables powerful enterprise search",
    model="text-embedding-3-large"
)

embedding = response.data[0].embedding

These vectors are stored inside Azure AI Search vector indexes.


Creating a Vector Index in Azure AI Search

Azure AI Search supports native vector search, enabling similarity search across embeddings.

Example index schema:

{
  "name": "documents-index",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true},
    {"name": "content", "type": "Edm.String"},
    {
      "name": "embedding",
      "type": "Collection(Edm.Single)",
      "dimensions": 3072,
      "vectorSearchConfiguration": "vector-config"
    }
  ]
}

The dimensions value must match the embedding model: text-embedding-3-large produces 3,072-dimensional vectors by default, while text-embedding-3-small and text-embedding-ada-002 produce 1,536.

The embedding field stores vector representations of text chunks.

Azure AI Search then uses approximate nearest neighbor (ANN) algorithms to perform semantic retrieval.
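ANN indexes approximate what an exhaustive scan computes exactly. A minimal pure-Python sketch of the exact baseline (toy 2-dimensional vectors stand in for real embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, docs, k=2):
    # Exhaustive scan over (doc_id, vector) pairs: O(N * dims).
    # ANN structures (e.g. HNSW) trade a little recall for far less work.
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

docs = [
    ("a", [1.0, 0.0]),
    ("b", [0.7, 0.7]),
    ("c", [0.0, 1.0]),
]
print(top_k([1.0, 0.1], docs, k=2))  # "a" ranks first, then "b"
```

At enterprise scale the exhaustive scan becomes the bottleneck, which is why the index precomputes an ANN graph instead.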

Performing Vector Search

When a user query arrives:

  1. Query is converted into an embedding

  2. Vector similarity search is performed

  3. Top-K documents are retrieved

Example search request (azure-search-documents SDK, version 11.4+):

from azure.search.documents.models import VectorizedQuery

results = search_client.search(
    search_text=None,
    vector_queries=[
        VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=5,
            fields="embedding"
        )
    ]
)

This returns the most semantically relevant document chunks.

Prompt Grounding

Retrieved documents are injected into the prompt context before sending it to the LLM.

Example prompt template:

You are an AI assistant answering based on company knowledge.


Context:

{retrieved_documents}


Question:

{user_question}


Answer using only the provided context.

Grounding ensures the LLM uses authoritative data instead of generating unsupported answers.
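A minimal helper that fills the template above (the function name and chunk separator are illustrative choices):

```python
PROMPT_TEMPLATE = """You are an AI assistant answering based on company knowledge.

Context:
{retrieved_documents}

Question:
{user_question}

Answer using only the provided context."""

def build_grounded_prompt(chunks, question):
    # Join retrieved chunks with a visible separator so the model
    # can tell individual sources apart.
    context = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(retrieved_documents=context, user_question=question)

prompt = build_grounded_prompt(
    ["Refund policy: 30 days.", "Shipping: 2-5 business days."],
    "What is the refund window?",
)
print(prompt)
```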

Generating the Final Answer with Azure OpenAI

Once the prompt is constructed, it is sent to the LLM.

Example completion call:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful enterprise assistant."},
        {"role": "user", "content": grounded_prompt}
    ],
    temperature=0.2
)

answer = response.choices[0].message.content

A low temperature reduces randomness in the output, improving consistency for factual question answering.

Production Architecture (Reference Diagram)

Below is a typical enterprise architecture for RAG systems on Azure.

┌─────────────────────┐
│  Enterprise Data    │
│  PDFs / Docs / DB   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ Ingestion Pipeline  │
│ (Functions / ETL)   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ Embedding Generation│
│    Azure OpenAI     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│    Vector Index     │
│   Azure AI Search   │
└─────────────────────┘

┌─────────────────────┐
│   Query Embedding   │ ◄─── User Query
│    Azure OpenAI     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Vector Retrieval   │
│   Azure AI Search   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Prompt Grounding   │
│  Context Injection  │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│    LLM Response     │
│    Azure OpenAI     │
└─────────────────────┘


Performance Optimization Strategies

Production systems require careful optimization.

1. Hybrid Search

Combine vector search with keyword search.

Benefits:

  • better recall

  • improved ranking
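Azure AI Search fuses the keyword and vector result lists with Reciprocal Rank Fusion (RRF). A minimal sketch of the scoring idea:

```python
def rrf_fuse(keyword_ranking, vector_ranking, k=60):
    # Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
    # k = 60 is the conventional damping constant; documents that rank
    # well in BOTH lists accumulate the highest fused score.
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["doc3", "doc1", "doc2"]
vector = ["doc1", "doc4", "doc3"]
fused = rrf_fuse(keyword, vector)
print(fused)  # doc1 and doc3, present in both lists, rise to the top
```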

2. Caching LLM Responses

Using Redis cache reduces latency and cost.

Recommended Azure service:

  • Azure Cache for Redis
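A sketch of the keying scheme, with a plain dict standing in for Azure Cache for Redis (the class and method names are illustrative):

```python
import hashlib

class ResponseCache:
    """Cache completions keyed by model + prompt; a dict stands in for Redis."""
    def __init__(self):
        self._store = {}

    def _key(self, model, prompt):
        # Hash the model/prompt pair so keys stay fixed-length.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model, prompt):
        return self._store.get(self._key(model, prompt))

    def put(self, model, prompt, answer):
        self._store[self._key(model, prompt)] = answer

cache = ResponseCache()
cache.put("gpt-4o", "What is RAG?", "Retrieval-Augmented Generation...")
print(cache.get("gpt-4o", "What is RAG?"))      # cache hit
print(cache.get("gpt-4o", "Different prompt"))  # None -> call the LLM
```

Exact-match keying only helps with repeated questions; semantic caching (keying on query embeddings) catches paraphrases at the cost of extra complexity.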

3. Reduce Prompt Size

Large prompts increase token cost.

Strategies:

  • top-k retrieval

  • summarization

  • semantic filtering
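One simple lever, assuming retrieved chunks arrive ranked by relevance: keep adding them to the context until a size budget is exhausted (characters approximate tokens here):

```python
def fit_context(chunks, budget_chars=2000):
    # Greedily keep the highest-ranked chunks that fit within the budget.
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected

chunks = ["a" * 900, "b" * 900, "c" * 900]
selected = fit_context(chunks)
print(len(selected))  # 2: the third chunk would exceed the 2000-char budget
```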

4. Parallel Retrieval

Retrieve documents asynchronously to improve response time.
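A sketch with asyncio.gather, using stub coroutines to stand in for calls against multiple search indexes:

```python
import asyncio

async def query_index(index_name, query):
    # Stand-in for an async Azure AI Search call.
    await asyncio.sleep(0.05)  # simulated network latency
    return f"{index_name} results for {query!r}"

async def retrieve_all(query):
    # Launch all index queries concurrently; total time is roughly
    # the slowest single call, not the sum of all calls.
    return await asyncio.gather(
        query_index("documents-index", query),
        query_index("faq-index", query),
    )

results = asyncio.run(retrieve_all("refund policy"))
print(results)
```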

Security and Governance

Enterprise deployments must ensure:

  • data isolation

  • private networking

  • identity-based authentication

Recommended configurations:

  • Managed Identity

  • Private Endpoints

  • Azure Key Vault for secrets

Monitoring and Observability

Production AI systems require monitoring.

Recommended tools:

  • Azure Monitor

  • Application Insights

  • OpenTelemetry tracing

Metrics to track:

  • token usage

  • latency

  • hallucination rate

  • retrieval precision
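A minimal in-process accumulator for the first two metrics (token usage and latency); a production system would export these to Application Insights rather than hold them in memory:

```python
class RagMetrics:
    def __init__(self):
        self.total_tokens = 0
        self.latencies = []

    def record(self, tokens, latency_s):
        # Called once per request with usage from the completion response.
        self.total_tokens += tokens
        self.latencies.append(latency_s)

    def p95_latency(self):
        # 95th-percentile latency over recorded requests.
        ordered = sorted(self.latencies)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx]

metrics = RagMetrics()
for tokens, latency in [(900, 1.2), (1100, 0.8), (700, 2.5)]:
    metrics.record(tokens, latency)
print(metrics.total_tokens)  # 2700
print(metrics.p95_latency())
```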

Conclusion

Retrieval-Augmented Generation is the dominant architecture pattern for enterprise AI applications. By combining Azure OpenAI with Azure AI Search, organizations can build scalable, secure, and accurate AI systems that leverage proprietary knowledge.

A production-ready RAG system requires:

  • robust ingestion pipelines

  • vector indexing

  • prompt grounding

  • monitoring and optimization

As the Microsoft AI ecosystem continues to evolve, this architecture will remain a foundational pattern for enterprise generative AI solutions.


