EVA-LLM

From AI Experiments to Industrial-Grade Validation

A comprehensive open-source ecosystem for professional AI Quality Assurance.

Created by the author of the G-Eval implementation in Promptfoo, the AI automated testing and security platform

Don't just hope your AI is safe — prove it with data.

In the era of the EU AI Act and evolving global regulations, the gap between experimental R&D and production-ready verification is widening. To meet strict transparency and safety requirements, enterprises need automated AI QA tooling.

EVA-LLM provides the ecosystem to run automated testing at massive scale, transforming unpredictable model behavior into a Statistical SLA: a measurable, data-backed guarantee of how often the model meets your quality and safety criteria.

Explore GitHub Hub
Release

eva-judge

The Brain. A unified abstraction over LLM-as-a-Judge methods: G-Eval, B-Eval, and LLM-Rubric.
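
For illustration, a minimal sketch of what a unified judge abstraction can look like. The type and method names below are assumptions made for this example, not eva-judge's actual API.

```typescript
// Illustrative sketch only: these names are assumptions, not eva-judge's real API.

// A single verdict shape shared by every judging strategy.
interface JudgeVerdict {
  score: number;      // normalized to the 0..1 range
  reasoning: string;  // the judge model's explanation of the score
}

// One interface, many strategies (G-Eval, B-Eval, LLM-Rubric, ...).
interface Judge {
  evaluate(sample: {
    input: string;     // what was sent to the model under test
    output: string;    // what the model answered
    criteria: string;  // natural-language evaluation criteria
  }): Promise<JudgeVerdict>;
}

// Callers depend only on the Judge interface, so strategies stay interchangeable.
async function assertQuality(judge: Judge, input: string, output: string): Promise<void> {
  const verdict = await judge.evaluate({
    input,
    output,
    criteria: "The answer is factually correct and cites no fabricated sources",
  });
  if (verdict.score < 0.8) {
    throw new Error(`Judge rejected the output: ${verdict.reasoning}`);
  }
}
```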

Release

llm-as-a-jest

AI evaluation for complex agentic scenarios inside the industry-standard Jest workflow.
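
As a hedged example, this is how such an evaluation could sit inside an ordinary Jest test. Here runAgent and judgeOutput are hypothetical helpers standing in for your agent under test and the judging call; they are not llm-as-a-jest's actual exports.

```typescript
// Hypothetical example: runAgent and judgeOutput stand in for your own agent
// and for the judging call; they are not the real llm-as-a-jest API.
import { describe, expect, test } from "@jest/globals";

declare function runAgent(userMessage: string): Promise<string>;
declare function judgeOutput(args: {
  output: string;
  criteria: string;
}): Promise<{ score: number; reasoning: string }>;

describe("refund-support agent", () => {
  test(
    "stays polite and proposes a concrete next step",
    async () => {
      const transcript = await runAgent("I want a refund for my broken headphones.");
      const verdict = await judgeOutput({
        output: transcript,
        criteria:
          "The agent remains polite, acknowledges the problem, and proposes a concrete next step.",
      });
      // Fail the test when the judge's score drops below the agreed threshold.
      expect(verdict.score).toBeGreaterThanOrEqual(0.8);
    },
    60_000 // LLM calls are slow; extend the default Jest timeout.
  );
});
```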

Release

dark-teaming

A manifesto and methodology for measuring LLM Epistemic Honesty via Symmetry Deviation.

MVP

eva-run

The Heart. A high-performance, "fire-and-forget" I/O-bound server that scales horizontally from thousands to millions of tests.
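
As a rough illustration of the fire-and-forget pattern (the URL, endpoints, and payload shapes here are assumptions, not eva-run's actual HTTP API): the client submits a batch, immediately receives a run id, and collects results later instead of holding a connection open per test.

```typescript
// Hypothetical client-side sketch of a fire-and-forget submission.
// The URL, endpoints, and payload shape are assumptions, not eva-run's real API.
const EVA_RUN_URL = "http://localhost:8080"; // assumed local eva-run instance

async function submitBatch(tasks: Array<{ prompt: string; criteria: string }>): Promise<string> {
  // 1. Fire: enqueue the whole batch and return immediately with a run id.
  const res = await fetch(`${EVA_RUN_URL}/runs`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ tasks }),
  });
  const { runId } = (await res.json()) as { runId: string };

  // 2. Forget: no connection is held open per test; results are fetched later.
  return runId;
}

async function fetchResults(runId: string): Promise<unknown> {
  // Collect the aggregated pass/fail statistics once the batch has been processed.
  const res = await fetch(`${EVA_RUN_URL}/runs/${runId}/results`);
  return res.json();
}
```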

Release

eva-parser

A bridge for the ecosystem, converting the industry-standard Promptfoo format into internal eva-run tasks.
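
To make the conversion concrete, here is a sketch of the kind of mapping involved. The Promptfoo-style test shape follows the public promptfooconfig format (vars plus assertions), while the EvaRunTask shape is purely an assumption for this example, not eva-run's actual schema.

```typescript
// Sketch of mapping a Promptfoo-style test case onto an internal task.
// The EvaRunTask shape is an assumption; it is not eva-run's actual schema.

// A simplified subset of a Promptfoo test case (vars + assertions).
interface PromptfooTest {
  vars: Record<string, string>;
  assert: Array<{ type: string; value?: string }>;
}

interface EvaRunTask {
  prompt: string;
  checks: Array<{ kind: string; expectation?: string }>;
}

function toEvaRunTask(promptTemplate: string, test: PromptfooTest): EvaRunTask {
  // Substitute {{var}} placeholders the way Promptfoo prompts reference vars.
  const prompt = promptTemplate.replace(
    /\{\{(\w+)\}\}/g,
    (_match: string, name: string) => test.vars[name] ?? ""
  );
  return {
    prompt,
    checks: test.assert.map((a) => ({ kind: a.type, expectation: a.value })),
  };
}

// Example: one Promptfoo-style test case becomes one eva-run task.
const task = toEvaRunTask("Summarize: {{article}}", {
  vars: { article: "The EU AI Act introduces risk-based obligations..." },
  assert: [{ type: "llm-rubric", value: "Summary is faithful and under 3 sentences" }],
});
console.log(task);
```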

Release

eva-cli

A terminal interface for local debugging and seamless CI/CD integration, supporting Promptfoo formats.

MVP

eva-guard

An MCP-compatible guardrails server built on eva-judge for production runtime.

WIP

eva-audit

A Red Teaming tool focused on security and adversarial probes.

WIP

eva-web

A visual dashboard to manage high-volume test runs and analyze historical performance.