
Microsoft Launches ASSERT: Text-Driven AI Behavior Testing Framework


Microsoft has released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open source framework that enables developers to create AI behavior tests using plain-language descriptions.

What ASSERT Does

ASSERT turns natural-language specifications into structured, scored AI evaluation tests:

  • Input: High-level descriptions of goals, policies, or intended behaviors
  • Process: Generates structured acceptable/unacceptable behavior scenarios
  • Output: Scored test results with detailed execution paths for debugging
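The input, process, and output steps above can be sketched in a few lines of Python. This is an illustration only and does not use ASSERT's actual API: the names `Scenario`, `Result`, `run_scenarios`, `pass_rate`, and `toy_agent` are hypothetical, and the scenarios are written by hand here, whereas ASSERT is described as generating them from the specification.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    """Hypothetical test scenario derived from a plain-language rule."""
    request: str
    should_comply: bool  # True if complying with the request is acceptable


@dataclass(frozen=True)
class Result:
    """Outcome of running one scenario against the system under test."""
    scenario: Scenario
    complied: bool

    @property
    def passed(self) -> bool:
        # A scenario passes when observed behavior matches the expectation.
        return self.complied == self.scenario.should_comply


def run_scenarios(agent: Callable[[str], bool],
                  scenarios: list[Scenario]) -> list[Result]:
    """Run every scenario through the agent and collect the results."""
    return [Result(s, agent(s.request)) for s in scenarios]


def pass_rate(results: list[Result]) -> float:
    """Fraction of scenarios that passed (0.0 for an empty run)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def toy_agent(request: str) -> bool:
    """Stand-in for a real AI system: refuses anything mentioning 'external'."""
    return "external" not in request


scenarios = [
    Scenario("Summarize the Q3 report for the team", should_comply=True),
    Scenario("Email the Q3 report to an external address", should_comply=False),
]
results = run_scenarios(toy_agent, scenarios)
print(pass_rate(results))
```

Pairing each scenario with an explicit expectation keeps the score easy to interpret: a failure points at one request where the system either complied when it should not have, or refused when it should have helped.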

Key Capabilities

Test Generation & Execution

  • Converts plain-language rules into test cases automatically
  • Runs scenarios against target AI systems
  • Records intermediate actions and tool calls for failure investigation
  • Supports custom system context, tools, and constraints
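To make the idea of recording intermediate actions concrete, here is a minimal sketch of a tool-call recorder. It is not ASSERT's implementation: `TraceRecorder`, `wrap`, and `search_docs` are hypothetical names chosen for illustration, and a real framework would likely capture far more detail.

```python
from typing import Any, Callable


class TraceRecorder:
    """Hypothetical recorder that logs each tool call an agent makes."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def wrap(self, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of `tool` that logs its inputs and outcome."""
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            entry: dict[str, Any] = {"tool": name, "args": args, "kwargs": kwargs}
            self.calls.append(entry)
            try:
                entry["result"] = tool(*args, **kwargs)
            except Exception as exc:
                # Keep the failure in the trace, then let it propagate.
                entry["error"] = repr(exc)
                raise
            return entry["result"]
        return wrapped


def search_docs(query: str) -> list[str]:
    """Toy tool standing in for a real document search."""
    return [f"doc about {query}"]


recorder = TraceRecorder()
search = recorder.wrap("search_docs", search_docs)
search("quarterly revenue")
print(recorder.calls)
```

Because the entry is appended before the tool runs, a call that raises still leaves a record behind, which is the case where a trace matters most.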

Example Use Case

A developer specifies that a document research agent should:

  • Not send emails outside the company
  • Share confidential information only with C-level executives
  • Provide concise summaries with context

ASSERT then automatically generates test cases that check each of these behaviors.
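As a rough illustration of what a check for the first rule might look like, the sketch below scans a recorded list of tool calls for emails sent outside the company. Everything in it is an assumption made for the example: the trace format, the `send_email` tool name, the `external_recipients` helper, and the `example.com` company domain are hypothetical and are not taken from ASSERT.

```python
INTERNAL_DOMAIN = "example.com"  # assumed company domain, for illustration only


def external_recipients(trace: list[dict]) -> list[str]:
    """Return every email recipient in the trace outside the company domain."""
    offenders = []
    for call in trace:
        if call.get("tool") != "send_email":
            continue  # only email calls are relevant to this rule
        for address in call.get("to", []):
            domain = address.rsplit("@", 1)[-1].lower()
            if domain != INTERNAL_DOMAIN:
                offenders.append(address)
    return offenders


# Hand-written traces standing in for recorded agent runs.
compliant_trace = [
    {"tool": "search_docs", "query": "Q3 summary"},
    {"tool": "send_email", "to": ["alice@example.com"]},
]
violating_trace = [
    {"tool": "send_email", "to": ["bob@example.com", "eve@partner.test"]},
]

print(external_recipients(compliant_trace))
print(external_recipients(violating_trace))
```

Returning the offending addresses rather than a bare pass/fail flag gives a developer something specific to investigate when the rule is broken.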

Why This Matters

Fills an Application-Specific Gap

Sarah Bird, Microsoft's Chief Product Officer of Responsible AI, explains:

"What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific."

General-purpose AI evaluations cannot capture behavior shaped by the specifics of a deployment:

  • Application context
  • Product policies
  • Custom tools and workflows

Multi-Stage Testing

ASSERT supports evaluation at:

  • Build time
  • Post-deployment
  • Continuous monitoring

Industry Context

This release aligns with broader industry trends toward systematic AI testing:

  • Stanford HELM (Holistic Evaluation of Language Models): Holistic evaluation framework
  • MLCommons AILuminate: Standardized benchmarks
  • METR: Behavioral evaluation under different conditions

As models grow more capable, repeatable regression testing and behavior verification are becoming critical for production AI systems.

Availability

ASSERT is available as an open source framework on GitHub.