by ericflo • Testing & QA
An open-source MCP server for measuring, comparing, and improving AI conversation quality with persistent storage, standardized metrics, and multi-model judging.
Objectively evaluate and score assistant responses against customizable quality criteria (helpfulness, clarity, accuracy).
Compare different prompts, models, or response strategies over time and generate exportable reports (HTML/Markdown/CSV).
Persistent, paginated storage of evaluation runs and the ability to retrieve and analyze historical judgment data.
AgentOptim provides a two-tool architecture (manage_evalset_tool and manage_eval_runs_tool) for creating evaluation criteria and running and storing conversation evaluations. It supports multiple judge models (OpenAI, Anthropic, local LM Studio), parallel processing, and persistent on-disk storage of evaluation runs with pagination and metadata. The project includes a user-friendly CLI, export options (HTML/Markdown/CSV), and automation, comparison, and reporting features to help teams track and improve agent performance over time.
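To make the evalset-plus-runs model concrete, here is a minimal sketch of the flow the description implies: an evalset holds yes/no quality criteria, a judge answers each one for a conversation, and the run records the verdicts and a score. All names, structures, and the stub judge below are illustrative assumptions, not AgentOptim's actual API.

```python
# Hypothetical sketch of the evalset/eval-run flow; names and
# structures are illustrative, not AgentOptim's real API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalSet:
    name: str
    questions: list[str]  # yes/no quality criteria put to the judge

@dataclass
class EvalRun:
    evalset: str
    verdicts: dict[str, bool] = field(default_factory=dict)

    @property
    def score(self) -> float:
        # Fraction of criteria the judge answered "yes" to.
        return sum(self.verdicts.values()) / len(self.verdicts)

def run_eval(evalset: EvalSet, conversation: list[dict],
             judge: Callable[[str, list[dict]], bool]) -> EvalRun:
    run = EvalRun(evalset=evalset.name)
    for question in evalset.questions:
        run.verdicts[question] = judge(question, conversation)
    return run

# Stub judge for demonstration only; a real judge model (OpenAI,
# Anthropic, or a local LM Studio model) would answer each question.
def stub_judge(question: str, conversation: list[dict]) -> bool:
    return "clear" in question

evalset = EvalSet("response-quality", [
    "Is the answer clear?",
    "Is the answer accurate?",
])
conversation = [
    {"role": "user", "content": "What is MCP?"},
    {"role": "assistant", "content": "A protocol for connecting tools to models."},
]
run = run_eval(evalset, conversation, stub_judge)
print(f"{run.evalset}: {run.score:.2f}")  # → response-quality: 0.50
```

A persistent version of this would serialize each `EvalRun` to disk with metadata, which is what makes the historical comparison and export features possible.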
Scores are informational only and provided “as is” without warranty. AgentHotspot assumes no liability for actions taken based on these ratings.