THE AGENSI STORE

Browse The Skill Store

20 skills found

benchmarking ai agents beyond models

Free

Published AI benchmarks measure brains in jars. They test models in isolation or within a single reference harness — and then attribute all performance to the model. This skill teaches you to decompose agent performance into its two actual components: model capability and harness multiplier. The result is evaluations that predict real-world behavior instead of benchmark theater.

15 · 5.0 (1)

evaluating ai harness dimensions

by loreto

$10

Evaluates AI coding agent platforms across five structural dimensions that determine real-world performance independently of model quality, so teams select on architectural fit rather than benchmark scores.

0 · No reviews

prompt engineer pro

by Roy Yuen

Professional prompt engineering, audit, and evaluation system for production-grade AI agents and workflows.

0 · No reviews

production agent architect

by Roy Yuen

Architect, scaffold, and harden production-grade AI agents with battle-tested patterns and systematic evaluation.

2 · No reviews

agent eval coverage audit

by Roy Yuen

Audit your AI agent's evaluation coverage to identify missing release gates and production risks.

0 · No reviews

Skill Health Scanner

by Markus Isaksson

Free

Instantly diagnose any skill or prompt and get a clear, prioritized report on what’s wrong and how to fix it — across any agent.

14 · No reviews

AI Eval & Test Suite Quality Gate

by PubsProToolkit

$14

An adversarial gate that audits an AI eval or test suite — LLM-judge rubrics, datasets, regression tests, metrics — for gameable criteria, data leakage, missing edge cases, and non-determinism, then returns one PASS/REVISE/FAIL verdict.

1 · No reviews

Optimization Loop

by Martin Gunderman

$19

Autonomous loop that iteratively modifies, evaluates, and selects the best version of any text resource — skills, prompts, or campaigns — using a modify-measure-keep/discard cycle.

1 · No reviews

Investment Analysis Engine

by Joker

Free

Investment analysis across stocks, funds, bonds, real estate, and crypto, covering valuation methods, risk frameworks, and portfolio construction.

2 · No reviews

KOL Marketing Engine

by Joker

Free

6-tier KOL (key opinion leader) classification, 6-platform mapping, 5-dimension screening, ROI evaluation, and risk management for influencer campaigns.

1 · No reviews

Financial Analysis Decision Engine

by Joker

Free

Financial analysis engine with valuation decision tree (DCF/Comparable/Precedent/VC), 3-statement model, 5-stage due diligence SOP, and industry benchmarks.

1 · No reviews

werfveiligheid rondgang

by Nex AI

Structures a construction-site safety walkthrough into a draft findings report: risk points per zone, severity, responsible party, and follow-up. Works alongside (not in place of) the safety coordinator.

0 · No reviews

vruchtgebruik waardering

by Nex AI

Automated Belgian fiscal valuation of usufruct and bare ownership for notarial deeds and tax calculations.

0 · No reviews

rag eval

by Ifásola

Diagnose RAG bottlenecks with retrieval metrics (Recall, MRR, nDCG) to pinpoint whether failures come from retrieval or from ranking.

0 · No reviews

AI Feature Eval Writer

by PubsProToolkit

$14

Design and write the eval suite for your LLM-powered feature — the metrics that match your failure modes, a golden dataset plan with starter cases, anchored rubrics, LLM-as-judge prompts with the known bias mitigations, and pass/fail gates wired for CI.

0 · No reviews

Agent Harness Architect

by PubsProToolkit

$14

Model quality is table stakes — the harness is where agents win or fail. This designs yours: it writes a structured, testable system prompt (role, tools, boundaries, method, output contract, failure handling) and maps every concern to the right layer — prompt, tool, guardrail, or evaluation — so the pieces reinforce each other instead of fighting.

0 · No reviews

RAG Knowledge Base Auditor

by PromptWagon

$10

Reviews document sets, source quality, chunking logic, metadata, retrieval coverage, citation traceability, answer grounding, source gaps, stale content, duplicate content, and failure patterns for RAG knowledge-base chatbots. Helps AI, product, support, governance, and engineering teams diagnose common and costly RAG quality problems before deployment or after incidents.

0 · No reviews

Agentic Engineering Patterns Library

by PubsProToolkit

$49

A deep reference library of production agent patterns — orchestration, context, tool design, failure and recovery, oversight, and evaluation. Every pattern states when it applies, when it's the wrong answer, what it costs, and the failure it prevents. Seven reference files, not a checklist.

0 · No reviews

Model Evaluation Report Builder

by PromptWagon

$10

Turns model test results, prompts, outputs, benchmarks, scoring notes, evaluation datasets, failure examples, comparison results, and reviewer observations into clear model evaluation reports with findings, recommendations, evidence gaps, deployment considerations, and repeatable evaluation documentation for AI teams.

0 · No reviews

Prompt Change Log Generator

by PromptWagon

$10

Turns prompt edits, version notes, testing observations, failure fixes, evaluation outcomes, approval notes, and rollback details into clean prompt change logs, release notes, testing notes, version history, impact summaries, and rollback instructions for prompt-heavy teams, AI workflow owners, agencies, and product teams.

0 · No reviews