
Spring / Spring AI interview questions

What are the performance tuning strategies for a Spring AI RAG application at scale?

When a RAG application moves from prototype to production load, several bottlenecks emerge. Addressing them requires tuning at the ingestion layer, retrieval layer, LLM call layer, and infrastructure layer.

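Ingestion layer: Run chunking and embedding in parallel using a thread pool or Spring Batch. Batch embedding requests: most providers accept many texts per API call, though the maximum batch size varies by provider, so check the limit before fixing a number such as 100. Persist ingestion state (for example, a content hash per document) so that unchanged documents are not re-embedded on restart.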
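The batching and skip-unchanged ideas above can be sketched in plain Java. This is a minimal illustration, not Spring AI API: the class name `IngestionBatcher`, its methods, and the in-memory hash set are all assumptions made for the example, and a real pipeline would keep the hashes in a durable store and pass each batch to its embedding client.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: names are illustrative, not part of Spring AI.
final class IngestionBatcher {

    // Hashes of chunk texts already handed to the embedder. Kept in memory
    // here for brevity; it must be persisted to survive a restart.
    private final Set<String> embeddedHashes = new HashSet<>();

    // Hex-encoded SHA-256 of the chunk text, used to detect unchanged content.
    static String contentHash(String text) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns only the chunks whose hash has not been seen, and records them.
    // Simplification: a chunk is marked as seen before embedding succeeds.
    List<String> filterUnchanged(List<String> chunks) {
        List<String> fresh = new ArrayList<>();
        for (String chunk : chunks) {
            if (embeddedHashes.add(contentHash(chunk))) {
                fresh.add(chunk);
            }
        }
        return fresh;
    }

    // Splits items into consecutive batches of at most batchSize elements,
    // so each batch can be sent as one embedding request.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(List.copyOf(items.subList(i, end)));
        }
        return batches;
    }
}
```

A caller would run `filterUnchanged` on the chunks of a document, then `partition` the result with a batch size that fits the provider's limit, and submit the batches to a thread pool.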

Retrieval layer: Use HNSW indexes on PgVector or equivalent ANN indexes on other stores. Tune topK conservatively: sending 10 chunks to the LLM when 3 would suffice inflates prompt size and increases cost. If you add a reranker step (a cross-encoder model), retrieve a somewhat larger candidate set, reorder it by relevance, and truncate to the top few chunks (for example 3) before building the prompt.
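The retrieve-then-rerank step can be sketched as follows. This is a minimal illustration under stated assumptions: `RerankStep` and its `rerank` method are hypothetical names, and the `scorer` parameter stands in for a real cross-encoder call, which this example does not implement.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper: not a Spring AI class.
final class RerankStep {

    // Scores every candidate once against the query (the scorer stands in for
    // a cross-encoder), sorts by descending score, and keeps the best `keep`.
    static List<String> rerank(String query,
                               List<String> candidates,
                               ToDoubleBiFunction<String, String> scorer,
                               int keep) {
        record Scored(String text, double score) {}
        return candidates.stream()
                .map(c -> new Scored(c, scorer.applyAsDouble(query, c)))
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .limit(keep)
                .map(Scored::text)
                .toList();
    }
}
```

Scoring each candidate once before sorting matters here because a cross-encoder call is far more expensive than a comparison.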

LLM call layer: Cache responses to identical or near-identical prompts using a semantic cache backed by a VectorStore. If the cosine similarity between a new query embedding and a cached query embedding exceeds a threshold, return the cached answer rather than calling the LLM. This can reduce API cost by roughly 30-70% for FAQ-style workloads, depending on how often queries repeat; set the threshold high enough that merely similar questions with different answers do not collide.
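The threshold check can be sketched in plain Java. This is an illustration only: `SemanticCache` is a hypothetical class, the linear scan over an in-memory list stands in for a VectorStore similarity search, and the embeddings are assumed to come from whatever embedding model the application already uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory semantic cache; a real one would query a VectorStore.
final class SemanticCache {

    private record Entry(float[] embedding, String answer) {}

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    SemanticCache(double threshold) {
        this.threshold = threshold;
    }

    // Cosine similarity of two vectors of equal length; 0 if either is all zeros.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the answer of the most similar cached query whose similarity
    // is at or above the threshold, or empty on a cache miss.
    Optional<String> lookup(float[] queryEmbedding) {
        Entry best = null;
        double bestSim = threshold;
        for (Entry e : entries) {
            double sim = cosine(queryEmbedding, e.embedding());
            if (sim >= bestSim) {
                best = e;
                bestSim = sim;
            }
        }
        return Optional.ofNullable(best).map(Entry::answer);
    }

    // Stores the answer under a defensive copy of the query embedding.
    void put(float[] queryEmbedding, String answer) {
        entries.add(new Entry(queryEmbedding.clone(), answer));
    }
}
```

On a miss, the caller would invoke the LLM and then `put` the new query embedding and answer into the cache.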

Parallel and async calls: For workflows that need multiple independent LLM calls (e.g. analysing several documents separately), use Flux merging or virtual threads to fire calls concurrently rather than sequentially.
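A minimal sketch of firing independent calls concurrently, assuming a blocking `call` function that stands in for an LLM request. `ParallelCalls` and `callAll` are hypothetical names. The example uses a bounded fixed thread pool so it runs on older JDKs; on Java 21 or later, `Executors.newVirtualThreadPerTaskExecutor()` could be swapped in for the virtual-thread variant.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper: not a Spring AI class.
final class ParallelCalls {

    // Runs `call` for every input concurrently (at most maxConcurrency at a
    // time) and returns the results in the same order as the inputs.
    static <I, O> List<O> callAll(List<I> inputs, Function<I, O> call, int maxConcurrency) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
        try {
            // Submit everything first so the calls overlap...
            List<CompletableFuture<O>> futures = inputs.stream()
                    .map(in -> CompletableFuture.supplyAsync(() -> call.apply(in), pool))
                    .toList();
            // ...then wait for each result; join() rethrows a failed call
            // as a CompletionException.
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }
}
```

Bounding the concurrency is a deliberate choice in this sketch: provider rate limits usually make an unbounded fan-out counterproductive.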

Model selection: Use the cheapest model that meets quality requirements for each step. Metadata extraction during ingestion can use a cheap model, while final answer generation can use the flagship model. This is commonly called model routing (cascading is a related pattern that escalates to a stronger model only when a cheaper one falls short).
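Per-step model selection can be as simple as a lookup table. The sketch below is an assumption-laden illustration: `ModelRouter`, the `Step` enum (including the `QUERY_REWRITE` step, which is not from the text above), and the model names are all placeholders rather than real identifiers.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical router: step names and model names are placeholders.
final class ModelRouter {

    enum Step { METADATA_EXTRACTION, QUERY_REWRITE, ANSWER_GENERATION }

    private final Map<Step, String> modelByStep = new EnumMap<>(Step.class);
    private final String defaultModel;

    ModelRouter(String defaultModel) {
        this.defaultModel = defaultModel;
    }

    // Assigns a model to a pipeline step; returns this for chaining.
    ModelRouter route(Step step, String model) {
        modelByStep.put(step, model);
        return this;
    }

    // Returns the model configured for the step, or the default if none is set.
    String modelFor(Step step) {
        return modelByStep.getOrDefault(step, defaultModel);
    }
}
```

The returned name would then be passed as the model option of whichever chat call handles that step.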

