EVA-LLM

From AI Experiments to Industrial-Grade Validation

A comprehensive open-source ecosystem for professional AI Metrology

Created by the author of the G-Eval implementation in Promptfoo, the automated AI testing and security platform

Don't just hope your AI is safe — prove it with data.

In the era of the EU AI Act and evolving global regulations, the gap between experimental R&D and production-ready verification is widening. To meet strict transparency and safety requirements, enterprises need automated AI QA instruments.

EVA-LLM provides the ecosystem to run massive-scale automated testing, transforming unpredictable model behavior into a Statistical SLA.

Release

eva-judge

The Brain. Unified abstraction for LLM-as-a-Judge: G-Eval, B-Eval, and LLM-Rubric.

Release

dark-teaming

Manifesto and methodology for measuring LLM Epistemic Honesty.

Release

llm-as-a-jest

AI evaluation for complex agentic scenarios in an industry-standard Jest workflow.

Beta

eva-run

The Heart. High-performance, scalable "Fire & Forget" server for I/O-bound AI QA and Metrology workloads.

Release

eva-cli

A terminal interface for local debugging and seamless CI/CD integration.

Release

eva-parser

A bridge for the ecosystem that converts the industry-standard Promptfoo format for use with eva-run.

Beta

eva-desk

Lightweight logic-level orchestrator for eva-run and test load balancer with a Redis-based bus.

WIP

eva-web

A visual dashboard to manage high-volume test runs and analyze historical performance.

Beta

eva-guard

MCP-compatible guardrails server based on eva-judge for production runtime.

Freeze

eva-audit

A tool for Red Teaming, focusing on security and adversarial probes.

Manifesto

AI Regulation: Where Physics Meets Legislation

For centuries, humans have been trying to conquer nature.

We learned to tame fire, control electricity, and eventually describe the physical world through laws and equations. Physics became our way of turning chaos into something predictable - something we could build on.

Today, we are attempting something similar with Artificial Intelligence.

But AI is fundamentally different.

At its core, modern AI systems are non-deterministic. They rely on probabilistic processes, often driven by randomness. No one can fully trace what happens inside every single neuron of a neural network, or predict how each microscopic interaction contributes to the final output.

Trying to predict AI behavior at that level is like trying to predict the motion of a fluid by tracking every molecule.

And yet - physics faced the same problem.

We never learned to predict individual molecules. Instead, we developed thermodynamics and statistical mechanics - a statistical framework that made large-scale behavior understandable and reliable.

James Clerk Maxwell took a revolutionary step by applying a probabilistic approach to the motion of gas molecules, and Ludwig Boltzmann then linked entropy to probability. That is how engines, turbines, and entire industries became possible.

AI is heading in the same direction.

We may not achieve strict mathematical predictability. But we can achieve Statistical Reliability.

Not deterministic guarantees - but Statistical SLAs.

This shift has a direct implication for compliance.

In the age of AI, as we cross a bifurcation point, compliance is no longer just about rules and documentation. It is becoming a question of measurement at scale: of proving, with high confidence, how a system behaves across massive variations of inputs.

This is why businesses increasingly need professional AI testing infrastructure - systems capable of running millions of test scenarios, exploring edge cases, and generating statistically meaningful evidence.
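
To make "statistically meaningful evidence" concrete, here is a minimal sketch of one way a Statistical SLA could be checked. It is not part of EVA-LLM: the function names `wilson_lower_bound` and `meets_sla`, the 95% confidence level, and the choice of the Wilson score interval are all assumptions made for illustration. The idea is to treat each test scenario as a pass/fail trial and to ask whether the lower confidence bound on the pass rate clears the target, rather than the raw observed rate.

```python
import math


def wilson_lower_bound(passed: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed pass rate.

    z = 1.96 corresponds to a two-sided 95% interval (an assumed default).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p_hat = passed / total
    denom = 1 + z * z / total
    centre = p_hat + z * z / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def meets_sla(passed: int, total: int, target: float, z: float = 1.96) -> bool:
    """Hypothetical SLA check: the lower bound, not the raw rate, must reach the target."""
    return wilson_lower_bound(passed, total, z) >= target


# Illustrative numbers only, not measured results.
small_run = meets_sla(passed=99, total=100, target=0.995)
large_run = meets_sla(passed=9990, total=10000, target=0.995)
print(small_run, large_run)
```

Under these assumptions, a run of 100 scenarios with a single failure does not support a 99.5% claim, while 10,000 scenarios with ten failures does. That is the practical reason scale matters: the same kind of observed behavior only becomes evidence once the sample is large enough.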

But infrastructure alone is not enough.

The harder challenge is defining what "good" looks like. What are the reference tests? What constitutes acceptable behavior? What is the equivalent of a calibrated measurement in this new domain?

In other words, we are rediscovering Metrology - but for AI.

And just like in physics, the future of regulation will not be built on perfect understanding, but on robust measurement.