Voice-based assistance systems have seen a significant boost in development. More and more interactions that once had to be handled by employees are now being taken over by digital voice assistants. Whether in the call center, in technical support or at information points, voicebots offer the potential to automate communication while raising the quality of service processes. Further fields of application will soon follow, for example in humanoid robotics: voice-capable assistants guide tourists through cities or provide support in care settings with both physical strength and language skills.
AI voicebots today: rarely deliver what they promise
Despite these attractive prospects, the reality of many voicebot projects still falls short of expectations. Companies introduce voicebots with clearly formulated goals: higher customer satisfaction, relief for employees, more efficient processes or a reduction in staff turnover in stressful telephone roles. But as soon as standard voicebots are confronted with real, company-specific requirements, fundamental technical limits quickly become apparent, as our AKQUINET team has found in many consultations. Why is that? A major reason is that many systems rely solely on large language models optimized for generic conversation. These handle small talk and general questions well. However, as soon as concrete process data is required or a system-side action needs to be triggered, functionality breaks down or is severely limited. For example, if a customer not only wants to know when her order will arrive but also wants to change the delivery date, many voicebots are already out of their depth. Some store static information in the form of lists or scripts, but the depth and currency of this data is inadequate.
Hallucinations: When AI prefers to invent something rather than not answer
This creates a second, equally relevant risk: hallucinations. Large language models tend to produce answers that sound convincing but do not correspond to reality. A 2023 study shows that almost 20 percent of all responses from large language models (LLMs) contain hallucinations (see HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, October 2023). Of course, LLMs evolve very quickly. Without knowing exactly how high the hallucination rate of today's common LLMs is, one can still say: hallucinations exist, and almost everyone encounters them when working with an LLM. They can have serious consequences. Invented system states, alleged process decisions or fabricated contract details can confuse customers and pose risks for companies.
The AI is supposed to access data, but how?
In addition, there is the technical question of system connection. If a voicebot accesses operational systems directly, this is problematic for several reasons. First, it increases the risk of the AI unintentionally interfering with productive data. Second, it often leads to performance bottlenecks: ERP or ticketing systems are not designed to handle the additional load from AI requests. The result is pauses of five seconds or more in the conversation, and such delays are fatal for a phone call. Practical operation also shows that rigid voicebot architectures quickly reach their limits: as soon as requirements change, for example when processes are adapted or new topics are added, classic voicebots are difficult or expensive to extend.
Principles of the AKQUINET AI Voicebot
To address these challenges, our AKQUINET team's new technical approach was based on three interrelated principles:
1. Receive reliable answers
2. Write back information securely
3. Well-defined decoupling of the system components
What did we do?
The central element of the AKQUINET AI voicebot is the introduction of an intermediate level, a so-called staging area. Enterprise systems no longer deliver their data directly to the AI but to this intermediate layer, which is permanently synchronized in real time. The voicebot accesses only this level. This prevents the AI model from interfering with operational systems in an uncontrolled manner while keeping up-to-date data accessible. This approach is supplemented by a short-term memory based on a retrieval method. In contrast to classic long-term training, the AI receives the required information for the situation at hand and then discards it again. This reduces effort and at the same time minimizes hallucination risks, because the AI always works with precise, contextual information.
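A minimal sketch of this staging-and-retrieval pattern, with an in-memory dictionary standing in for the synchronized staging layer. All names, keys and data here are illustrative assumptions, not the actual AKQUINET implementation:

```python
from dataclasses import dataclass

# Hypothetical staging layer: enterprise systems sync their records here;
# the voicebot reads only this copy, never the source system itself.
STAGING_AREA = {
    "order:4711": {"status": "shipped", "delivery_date": "2024-06-12"},
}

@dataclass
class TurnContext:
    """Short-term memory: built per dialogue turn, then discarded."""
    facts: list[str]

def retrieve_context(entity_id: str) -> TurnContext:
    """Fetch only the facts needed for this turn from the staging area."""
    record = STAGING_AREA.get(entity_id, {})
    facts = [f"{key} = {value}" for key, value in sorted(record.items())]
    return TurnContext(facts=facts)

def build_prompt(question: str, context: TurnContext) -> str:
    """Ground the model in retrieved facts to reduce hallucination risk."""
    grounding = "\n".join(context.facts) or "No data available."
    return f"Known facts:\n{grounding}\n\nCustomer question: {question}"

prompt = build_prompt("When will my order arrive?", retrieve_context("order:4711"))
```

Because the context object is rebuilt for every turn and discarded afterwards, no conversation carries stale data into the next one.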
The AI never interacts with the source system
Another core component concerns the triggering of actions. The AI does not make changes in systems itself. Instead, it has previously defined tools at its disposal, such as functions or automated process modules. The voicebot decides which of these tools is needed and triggers it, but never interacts directly with the source systems. This ensures that security requirements are met and all operations are fully logged.
Key advantages of this approach are:
- Reduction of misconduct through clear separation of data access and action logic
- Auditability of all triggered processes
- Stability even with complex requests
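The tool pattern above can be sketched as a small registry: the model may only name a vetted tool and its arguments, every call is logged for auditing, and unknown tool names are rejected. The registry, the `reschedule_delivery` module and the logging setup are illustrative assumptions, not the actual implementation:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("voicebot.tools")

# Hypothetical registry of predefined tools (process modules).
TOOLS = {}

def tool(name):
    """Decorator that registers a function as a callable tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("reschedule_delivery")
def reschedule_delivery(order_id: str, new_date: str) -> str:
    # In production this would enqueue a change request for the ERP;
    # here the process module is only simulated.
    return f"Delivery for {order_id} rescheduled to {new_date}"

def dispatch(tool_name: str, **kwargs) -> str:
    """Execute a model-selected tool and log the call for auditability."""
    if tool_name not in TOOLS:
        raise ValueError(f"Unknown tool: {tool_name}")  # misbehaviour is rejected
    log.info("tool=%s args=%s", tool_name, kwargs)
    return TOOLS[tool_name](**kwargs)

result = dispatch("reschedule_delivery", order_id="4711", new_date="2024-06-20")
```

The separation is the point: the model chooses from a fixed menu of actions, while the source system is reached only through the audited `dispatch` layer.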
We have also further developed the technical implementation of communication. In the AKQUINET AI voicebot, we use a real-time audio model: the spoken word is not first converted into text but processed directly. Instead of chaining speech-to-text and then text-to-speech, there is only direct speech-to-speech. This leads to much more natural dialogues, higher speed and better robustness against accents, dialects and speech variants. A fallback mechanism ensures reliability: if the audio model is interrupted, the system automatically switches to a predefined text-based language model and continues the dialogue, with a slight delay but without interrupting the call.
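A minimal sketch of such a fallback, with placeholder functions standing in for the real-time audio model and the classic text chain (all names are illustrative; the placeholder audio model is assumed to fail here to demonstrate the switch):

```python
class AudioModelUnavailable(Exception):
    """Raised when the real-time audio model drops the stream."""

def speech_to_speech(audio_chunk: bytes) -> bytes:
    # Placeholder for the real-time audio model; simulated as interrupted.
    raise AudioModelUnavailable("stream interrupted")

def text_pipeline(audio_chunk: bytes) -> bytes:
    # Classic fallback chain: speech-to-text -> text LLM -> text-to-speech.
    text = "transcribed request"   # stand-in for the STT step
    reply = f"answer to: {text}"   # stand-in for the text-based model
    return reply.encode()          # stand-in for the TTS step

def answer(audio_chunk: bytes) -> bytes:
    """Prefer the audio model; fall back to the text chain without dropping the call."""
    try:
        return speech_to_speech(audio_chunk)
    except AudioModelUnavailable:
        return text_pipeline(audio_chunk)

response = answer(b"\x00\x01")
```

The caller only ever sees a reply; whether it came from the audio model or the slower text chain is invisible to the ongoing call.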
Finally, integration into existing IT landscapes plays a central role. The AKQUINET AI voicebot runs entirely on the Azure stack. If the voicebot is operated within an existing Microsoft tenant, company data does not leave the infrastructure. At the same time, organizations benefit from the stability and documentation of standardized technologies.
When integrating the LLM, the architecture is crucial
Overall, it turns out that the success of a voicebot depends less on the size of the underlying language model and more on architectural questions: How is data provided? How is the AI prevented from acting uncontrollably? How is speed guaranteed? And how does a system remain flexible enough to grow with dynamic corporate environments? The approaches described above make it clear that powerful voicebots emerge only when technological possibilities and operational requirements are carefully interlinked. They thus mark an important step towards secure, scalable and realistically usable voice assistance systems in everyday business. With the AKQUINET AI voicebot, we have already taken these steps for your company.