Evaluating Enterprise-Grade AI Assistants: A Benchmark for Complex, Voice-Enabled Workflows

by Brenden Burgess


As companies increasingly adopt AI assistants, evaluating how well these systems perform real-world tasks, particularly through voice interactions, is essential. Existing evaluation methods focus either on broad conversational skills or on narrow, tool-specific use cases. These benchmarks fall short when measuring an AI agent's ability to manage complex, specialized workflows across domains. This gap highlights the need for more comprehensive evaluation frameworks that reflect the challenges AI assistants face in practical business contexts, ensuring they can genuinely support complex, voice-enabled operations in real environments.

To address the limitations of existing benchmarks, Salesforce AI Research & Engineering has developed a robust evaluation system tailored to assessing AI agents on complex enterprise tasks across both text and voice interfaces. This internal tool supports the development of products such as Agentforce. It offers a standardized framework for assessing AI assistant performance in four key business areas: managing healthcare appointments, handling financial transactions, processing inbound sales, and fulfilling e-commerce orders. Using carefully curated, human-verified test cases, the benchmark requires agents to complete multi-step operations, use domain-specific tools, and adhere to strict security protocols across both communication modes.
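The benchmark's internal schema has not been published, but a hedged sketch of how one of these human-verified test cases might be declared is shown below. The `TaskCase` structure, its field names, and tool names such as `verify_identity` are illustrative assumptions, not Salesforce's actual API:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    """One expected tool invocation within a multi-step task."""
    name: str        # e.g. "verify_identity" (hypothetical tool name)
    arguments: dict

@dataclass
class TaskCase:
    """A human-verified test case: goal, required steps, and guardrails."""
    domain: str                     # one of the four business areas
    goal: str                       # what the simulated customer wants
    expected_calls: list[ToolCall]  # ordered tool calls the agent must make
    security_rules: list[str]       # protocols the agent must not violate
    modes: tuple = ("text", "voice")

# Illustrative healthcare-scheduling case (all values are assumptions).
case = TaskCase(
    domain="healthcare_appointments",
    goal="Reschedule a patient's appointment to next Tuesday",
    expected_calls=[
        ToolCall("verify_identity", {"patient_id": "P-1042"}),
        ToolCall("schedule_appointment", {"patient_id": "P-1042",
                                          "slot": "2025-06-10T09:00"}),
    ],
    security_rules=["never disclose records before identity verification"],
)
```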

Traditional AI benchmarks often focus on general knowledge or simple instruction-following, but business settings demand more advanced capabilities. AI agents in these contexts must integrate with multiple tools and systems, follow strict safety and compliance procedures, and understand specialized terminology and workflows. Voice interactions add another layer of complexity due to potential speech recognition and synthesis errors, especially in multi-step tasks. By addressing these needs, the benchmark steers AI development toward more reliable and effective assistants suited to enterprise use.

Salesforce's benchmark uses a modular framework with four key components: domain-specific environments, predefined tasks with clear objectives, simulated interactions that mirror real-world conversations, and measurable performance metrics. It evaluates AI across four business domains: healthcare appointment management, financial services, sales, and e-commerce. Tasks range from simple requests to complex operations involving conditional logic and multiple system calls. With human-verified test cases, the benchmark ensures realistic challenges that test an agent's reasoning, accuracy, and tool handling across both text and voice interfaces.
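A minimal sketch of how those four components could be wired together, assuming a simple tool-calling conversation loop; the class names and the agent's `respond` method are placeholders invented for illustration, not the framework's real interfaces:

```python
class DomainEnvironment:
    """Holds the tools and state for one business domain."""
    def __init__(self, domain: str, tools: dict):
        self.domain = domain
        self.tools = tools  # tool name -> callable

    def call_tool(self, name: str, **kwargs):
        return self.tools[name](**kwargs)

class SimulatedUser:
    """Replays a scripted customer side of the conversation."""
    def __init__(self, script: list[str]):
        self.script = iter(script)

    def next_utterance(self) -> str | None:
        return next(self.script, None)

def run_episode(agent, env: DomainEnvironment, user: SimulatedUser) -> list:
    """Drive one simulated conversation and collect the transcript."""
    transcript = []
    while (utterance := user.next_utterance()) is not None:
        reply = agent.respond(utterance, env)  # agent may call env tools
        transcript.append((utterance, reply))
    return transcript
```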

The evaluation framework measures an AI agent's performance on two main criteria: accuracy, i.e., how correctly the agent completes the task, and efficiency, evaluated through conversation length and token usage. Both text and voice interactions are assessed, with the option of injecting audio noise to test the system's resilience. Implemented in Python, the modular benchmark supports realistic customer-agent dialogues, multiple AI model providers, and configurable voice processing built from text-to-speech and speech-to-text components. An open-source release is planned, allowing developers to extend the framework to new use cases and communication formats.
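The scoring code itself has not been released; a rough sketch of the two criteria described above, with accuracy as ordered tool-call matching and efficiency as a penalty on dialogue length and token usage, might look like the following. The weighting and thresholds are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    tool_calls: list   # (name, arguments) pairs the agent actually made
    turns: int         # number of conversational turns
    tokens_used: int   # total tokens consumed by the agent

def score_accuracy(result: EpisodeResult, expected_calls: list) -> float:
    """Fraction of required tool calls the agent made, in order."""
    matched, idx = 0, 0
    for call in result.tool_calls:
        if idx < len(expected_calls) and call == expected_calls[idx]:
            matched += 1
            idx += 1
    return matched / len(expected_calls) if expected_calls else 1.0

def score_efficiency(result: EpisodeResult,
                     max_turns: int = 20, max_tokens: int = 4000) -> float:
    """Penalize long dialogues and heavy token use (illustrative weighting)."""
    turn_score = max(0.0, 1.0 - result.turns / max_turns)
    token_score = max(0.0, 1.0 - result.tokens_used / max_tokens)
    return 0.5 * (turn_score + token_score)
```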

Initial tests across leading models such as GPT-4 variants and Llama showed that financial tasks were the most error-prone, owing to their strict verification requirements. Voice-based tasks also saw a 5-8% drop in performance compared to text. Accuracy declined most on multi-step tasks, particularly those requiring conditional logic. These findings highlight ongoing challenges in tool usage, protocol compliance, and speech processing. Although robust, the benchmark currently lacks personalization, the behavioral diversity of real-world users, and multilingual capabilities. Future work will address these gaps by broadening the domains, introducing user modeling, and incorporating more subjective and cross-lingual evaluations.


Check out the technical details. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit and subscribe to our newsletter.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
