The Gap Between Demo and Production
Every AI demo looks impressive. A chatbot that answers questions. A document summariser. An AI that generates reports.
Then you ship it to real users and discover:
- It hallucinates facts your legal team cannot defend
- It costs ₹8 per query at scale, not ₹0.08
- It is 4 seconds slower than your users will tolerate
- It does not work for queries in Hinglish
We have shipped AI features into five products this year. Here is what we actually learned.
What Works: Constrained Tasks
AI works best when you constrain what it can say. The worst AI features are open-ended chatbots. The best are highly scoped tools.
Example: Instead of "ask our AI anything about your portfolio," we built "AI explains why your portfolio is down today" — a constrained task using only the portfolio data we control.
The result: no hallucinations, consistent quality, and users who trust it because it only speaks about what it knows.
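The pattern is simple enough to show. Here is a minimal sketch, assuming the OpenAI Python SDK; the prompt wording, model choice, and data shape are illustrative, not our production code:

```python
# Sketch of the constrained-task pattern: the model only sees data we
# control, and the prompt forbids anything outside it.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You explain portfolio movements. Use ONLY the holdings data provided. "
    "If the data does not explain the movement, say so. Never speculate "
    "about news, macro events, or anything outside the data."
)

def explain_portfolio_move(holdings: list[dict]) -> str:
    """holdings: [{'symbol': 'INFY', 'weight': 0.12, 'day_change_pct': -3.1}, ...]"""
    data = "\n".join(
        f"{h['symbol']}: weight {h['weight']:.0%}, today {h['day_change_pct']:+.1f}%"
        for h in holdings
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Holdings today:\n{data}\n\nWhy is the portfolio down?"},
        ],
        temperature=0,  # keep the explanation deterministic and grounded
    )
    return response.choices[0].message.content
```

The load-bearing part is not the code, it is the system prompt: the model is told what it may not do, and the only context it ever receives is data we already trust.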
What Doesn't Work: Replacing Human Judgment
We were asked to build an AI that would approve loan applications. We declined and explained why: LLMs are pattern-matchers trained on historical data. They will encode historical biases. For high-stakes decisions affecting people's financial lives, AI should assist humans, not replace them.
Instead, we built a system that highlights the key risk factors and gives the loan officer a structured summary. The officer makes the call. Approval time went from 4 days to 6 hours.
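The shape of that system looks roughly like this; a sketch assuming the OpenAI SDK's JSON mode, with hypothetical field names. Note what is absent: there is no "approve" field for the model to fill in.

```python
# Sketch of the assist-not-decide shape: the model extracts and structures
# risk factors; the decision stays human-only. Field names are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def summarise_application(application_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract risk factors from a loan application as JSON: "
                    '{"risk_factors": [{"factor": str, "evidence": str}], '
                    '"missing_documents": [str]}. Do NOT recommend approval '
                    "or rejection; that decision belongs to the loan officer."
                ),
            },
            {"role": "user", "content": application_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```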
The Cost Problem Nobody Talks About
OpenAI pricing looks cheap in a demo. It does not look cheap when you are processing 50,000 documents per day.
Our approach to cost control, with a sketch of the caching and budget pieces after the list:
- Cache aggressively. The same question gets asked by thousands of users. Cache the answer.
- Use smaller models where possible. GPT-4o is overkill for classification tasks; gpt-4o-mini is 20x cheaper and good enough.
- Batch where latency allows. Document processing does not need real-time responses.
- Set hard cost budgets per user. Track token usage. Alert before you hit limits.
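Here is a minimal sketch of the caching and per-user budget items, with in-process dicts standing in for whatever store you actually run (Redis or similar); the limits are illustrative:

```python
# Sketch: an answer cache keyed on the normalised query, and a per-user
# token budget with an alert threshold before the hard cutoff.
import hashlib
from collections import defaultdict

answer_cache: dict[str, str] = {}
tokens_used: dict[str, int] = defaultdict(int)

DAILY_TOKEN_BUDGET = 50_000   # per user, illustrative
ALERT_THRESHOLD = 0.8         # warn at 80% of budget

def cache_key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def answer(user_id: str, query: str, call_llm) -> str:
    key = cache_key(query)
    if key in answer_cache:                      # thousands of users ask the same thing
        return answer_cache[key]
    if tokens_used[user_id] >= DAILY_TOKEN_BUDGET:
        raise RuntimeError("user over budget")   # route to the non-AI fallback
    text, tokens = call_llm(query)               # returns (answer, tokens spent)
    tokens_used[user_id] += tokens
    if tokens_used[user_id] >= ALERT_THRESHOLD * DAILY_TOKEN_BUDGET:
        print(f"ALERT: {user_id} at {tokens_used[user_id]} tokens")  # hook into real alerting
    answer_cache[key] = text
    return text
```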
RAG Is Not Magic
Retrieval-Augmented Generation (RAG) is the right approach for knowledge base Q&A. But a bad RAG implementation is worse than no AI at all.
The mistakes we see most often:
- Chunking documents arbitrarily rather than at semantic boundaries (see the sketch after this list)
- Not re-ranking retrieved chunks by relevance to the actual query
- Skipping evaluation entirely and hoping the LLM gets it right
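A sketch of the first fix, chunking at paragraph boundaries instead of fixed character offsets; the size cap is illustrative:

```python
# Split on blank lines (paragraphs), then pack paragraphs into chunks
# under a size cap, so no chunk ever cuts a thought in half.
def chunk_by_paragraph(text: str, max_chars: int = 1500) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)        # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```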
Build an evaluation dataset. Test with real queries from real users. Measure precision and recall. This is not optional.
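The evaluation itself is not much code. A sketch, assuming you have labelled which chunk IDs are relevant for each real query; the dataset shape and `retrieve` function are placeholders for your own:

```python
# Measure retrieval precision and recall of the top-k set against
# human-labelled relevant chunk IDs for each real user query.
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# eval_set: [{"query": ..., "relevant_ids": {...}}, ...] built from real queries
def evaluate(eval_set: list[dict], retrieve) -> None:
    for case in eval_set:
        retrieved = retrieve(case["query"], k=5)   # your retriever, top-5
        p, r = precision_recall(retrieved, case["relevant_ids"])
        print(f"{case['query'][:40]!r}: precision={p:.2f} recall={r:.2f}")
```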
The One Rule
Ship AI features like you ship any other feature: with monitoring, fallbacks, and the ability to turn them off. If your AI feature goes down, your product should still work. AI is an enhancement, not a dependency.
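In code, the rule is a flag and a fallback; a sketch with an environment variable standing in for your real feature-flag system:

```python
# The AI path sits behind a flag and a try/except; the non-AI path
# always exists, so the product works even when the AI does not.
import os

def ai_enabled() -> bool:
    return os.environ.get("AI_SUMMARY_ENABLED", "true") == "true"

def get_summary(document: str, summarise_with_llm, summarise_heuristic) -> str:
    if not ai_enabled():                    # the off switch
        return summarise_heuristic(document)
    try:
        return summarise_with_llm(document)
    except Exception:
        # log to your monitoring here; the user still gets a summary
        return summarise_heuristic(document)
```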