by ericflo • Testing & QA
An open-source MCP server for measuring, comparing, and improving AI conversation quality with persistent storage, standardized metrics, and multi-model judging.
Objectively evaluate and score assistant responses against customizable quality criteria (helpfulness, clarity, accuracy).
Compare different prompts, models, or response strategies over time and generate exportable reports (HTML/Markdown/CSV).
Persistent, paginated storage of evaluation runs and the ability to retrieve and analyze historical judgment data.
AgentOptim provides a two-tool architecture (manage_evalset_tool and manage_eval_runs_tool) for creating evaluation criteria and running and storing conversation evaluations. It supports multiple judge models (OpenAI, Anthropic, local LM Studio), parallel processing, and persistent on-disk storage of evaluation runs with pagination and metadata. The project includes a user-friendly CLI, export options (HTML/Markdown/CSV), and automation, comparison, and reporting features to help teams track and improve agent performance over time.
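To make the evalset-plus-runs model concrete, here is a minimal sketch of the flow the description implies: an evalset holds yes/no quality criteria, a judge answers each one for a conversation, and the run records the verdicts and a score. All names, structures, and the stub judge below are illustrative assumptions, not AgentOptim's actual API.

```python
# Hypothetical sketch of the evalset/eval-run flow; names and
# structures are illustrative, not AgentOptim's real API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalSet:
    name: str
    questions: list[str]  # yes/no quality criteria put to the judge

@dataclass
class EvalRun:
    evalset: str
    verdicts: dict[str, bool] = field(default_factory=dict)

    @property
    def score(self) -> float:
        # Fraction of criteria the judge answered "yes" to.
        return sum(self.verdicts.values()) / len(self.verdicts)

def run_eval(evalset: EvalSet, conversation: list[dict],
             judge: Callable[[str, list[dict]], bool]) -> EvalRun:
    run = EvalRun(evalset=evalset.name)
    for question in evalset.questions:
        run.verdicts[question] = judge(question, conversation)
    return run

# Stub judge for demonstration only; a real judge model (OpenAI,
# Anthropic, or a local LM Studio model) would answer each question.
def stub_judge(question: str, conversation: list[dict]) -> bool:
    return "clear" in question

evalset = EvalSet("response-quality", [
    "Is the answer clear?",
    "Is the answer accurate?",
])
conversation = [
    {"role": "user", "content": "What is MCP?"},
    {"role": "assistant", "content": "A protocol for connecting tools to models."},
]
run = run_eval(evalset, conversation, stub_judge)
print(f"{run.evalset}: {run.score:.2f}")  # → response-quality: 0.50
```

A persistent version of this would serialize each `EvalRun` to disk with metadata, which is what makes the historical comparison and export features possible.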
Scores are informational only and provided “as is” without warranty. AgentHotspot assumes no liability for actions taken based on these ratings.