
Microsoft tool lets devs create AI tests with text descriptions

Summary

– Microsoft released ASSERT, an open source framework that uses AI to turn natural-language descriptions of intended behaviors into scored tests for application-specific AI evaluation.
– ASSERT generates test cases from plain-language rules, runs them against the target system, and records the AI’s decision paths to help developers inspect failures.
– Developers can customize evaluations by providing system context, tools, and constraints, such as rules about emailing outside the company or limiting confidential data.
– The framework addresses a gap left by general evaluations, focusing on behaviors shaped by an application’s context, policies, and tools.
– ASSERT supports evaluation during development, after deployment, and for continuous monitoring, aligning with the industry’s shift toward repeatable testing and regression checks.

AI researchers and labs have made tremendous strides in evaluating models for safety, compliance, sycophancy, and alignment. Yet a more targeted challenge has emerged: ensuring that an AI system behaves exactly as intended within the context of a specific product or service. To simplify that process, Microsoft unveiled ASSERT on Tuesday, an open source framework whose name stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.

The tool leverages AI to transform high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests. Developers simply provide plain-language rules about how their AI should act, and ASSERT converts them into a structured set of acceptable and unacceptable behaviors. It then generates problem scenarios and test cases, runs them against the target system, scores the results, and records the paths the AI took, including intermediate actions and tool calls. This allows developers to pinpoint exactly where failures occur.
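The run, score, and record loop described above can be pictured with a small Python sketch. This is not ASSERT's API, which the article does not show; every name here (`Step`, `TestCase`, `run_suite`, `toy_agent`, the `delete_file` tool) is a hypothetical stand-in, and the test cases are written by hand rather than generated by a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    kind: str    # "message" or "tool_call" (hypothetical categories)
    detail: str  # message text or tool name


@dataclass
class TestCase:
    prompt: str
    # Returns True when the recorded trace breaks the rule under test.
    is_violation: Callable[[list[Step]], bool]


@dataclass
class Result:
    prompt: str
    passed: bool
    trace: list[Step] = field(default_factory=list)


def run_suite(system, cases):
    """Run each case against the system and keep the full trace for inspection."""
    results = []
    for case in cases:
        trace = system(case.prompt)
        results.append(Result(case.prompt, not case.is_violation(trace), trace))
    return results


def score(results):
    """Fraction of cases that passed (0.0 for an empty suite)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def toy_agent(prompt):
    """Stand-in for the system under test: deletes files when asked to clean up."""
    trace = [Step("message", f"received: {prompt}")]
    if "clean up" in prompt.lower():
        trace.append(Step("tool_call", "delete_file"))
    trace.append(Step("message", "done"))
    return trace


def calls_delete(trace):
    """Illustrative rule: the agent must never call the delete_file tool."""
    return any(s.kind == "tool_call" and s.detail == "delete_file" for s in trace)


cases = [
    TestCase("Summarize the report", calls_delete),
    TestCase("Clean up the old drafts", calls_delete),
]
results = run_suite(toy_agent, cases)
print(score(results))
for r in results:
    if not r.passed:
        print(r.prompt, [(s.kind, s.detail) for s in r.trace])
```

Keeping the whole trace on each result, rather than only a pass/fail flag, is what lets a developer see which intermediate step caused a failure.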

Developers can further customize evaluations by supplying system context, tools, and constraints. For instance, a developer might specify that a document research AI agent should not send emails to people outside the company, must limit confidential data to C-level executives, and should deliver concise summaries that account for prior context. ASSERT would then generate test cases to verify the system consistently follows those rules.
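As a rough illustration of how the first of those rules could be checked against recorded tool calls, consider the sketch below. It does not use ASSERT itself; the `ToolCall` type, the `send_email` tool name, its `to` argument, and the `example.com` company domain are all assumptions made for the example.

```python
from dataclasses import dataclass

COMPANY_DOMAIN = "example.com"  # hypothetical company domain


@dataclass
class ToolCall:
    name: str
    args: dict


def external_recipients(trace, domain=COMPANY_DOMAIN):
    """Return addresses outside `domain` found in any send_email call.

    Matching is exact on the domain, so subdomains such as
    mail.example.com are treated as external in this sketch.
    """
    bad = []
    for call in trace:
        if call.name != "send_email":
            continue
        for addr in call.args.get("to", []):
            if not addr.lower().endswith("@" + domain):
                bad.append(addr)
    return bad


trace = [
    ToolCall("search_docs", {"query": "Q3 plan"}),
    ToolCall("send_email", {"to": ["ana@example.com", "bob@partner.org"]}),
]
violations = external_recipients(trace)
print(violations)
```

A rule like "deliver concise summaries that account for prior context" is harder to reduce to a deterministic check of this kind, which is presumably where model-based scoring comes in.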

Microsoft sees ASSERT as filling a critical gap that broader, more general evaluations cannot address. When AI models are shaped by an application’s unique context, policies, and tools, standard benchmarks often fall short. “One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” said Sarah Bird, chief product officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar … What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”

Bird noted that ASSERT can evaluate systems during development, after deployment, and even for continuous monitoring. The launch arrives amid a broader industry shift toward repeatable testing and regression checks, with efforts such as Stanford’s HELM, MLCommons’ AILuminate, and the evaluation group METR rolling out benchmarks to measure model behavior under various conditions.

(Source: TechCrunch)
