A converter that translates Promptfoo test files into eva-run tasks for the EVA-LLM ecosystem.

NOTE! It supports a restricted subset of the Promptfoo format and extends it with its own features (see the examples below).

Quick Start

```sh
npm i @eva-llm/eva-parser
```

```ts
import { parsePromptfoo } from '@eva-llm/eva-parser';

const evaTests = parsePromptfoo(promptfooYamlContent);
```
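Here, `promptfooYamlContent` is the raw YAML text of a Promptfoo-style config. A minimal example, using only the provider and assert types documented below, might look like:

```yaml
providers:
  - openai:gpt-4.1-mini
prompts:
  - What is the capital of France?
test:
  - assert:
      - type: contains
        value: Paris
```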

Supported Promptfoo Items

Providers

```yaml
providers:
  - openai:gpt-5-mini
  - openai:gpt-4.1-mini
```

A provider can also carry an id and a config block:

```yaml
providers:
  - id: openai:gpt-5.2
    config:
      temperature: 0
```

Prompts

```yaml
prompts:
  - Hello, how are you?
  - What is the capital of France?
```

Prompts may contain `{{variable}}` placeholders:

```yaml
prompts:
  - What is the capital of {{country}}
```

Variables

```yaml
test:
  - vars:
      country: France
```
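Variables are substituted into prompt templates. Combining the two sections above, the following config renders the prompt as "What is the capital of France":

```yaml
prompts:
  - What is the capital of {{country}}
test:
  - vars:
      country: France
```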

Asserts

NOTE! All LLM asserts natively support Dark Teaming to measure Epistemic Honesty via Symmetry Deviation, and extend the Promptfoo format with the must_fail field.

b-eval (binary g-eval - eva-llm specific)

```yaml
test:
  - assert:
      - type: b-eval
        value: answer is coherent to question # can be an array as well
        threshold: 0.5 # optional (default is 0.5 in eva-run)
        provider: # optional (default is the test provider)
          - id: openai:gpt-4.1-mini
            config:
              temperature: 0 # optional (default is 0 in eva-run, the factual standard for better judging)
        must_fail: true # optional (default false, eva-run specific) - Dark Teaming field
        answer_only: true # optional (default false, eva-run specific) - analyze only the LLM answer without prompt involvement
```

g-eval

```yaml
test:
  - assert:
      - type: g-eval
        value: answer is coherent to question # can be an array as well
        threshold: 0.5 # optional (default is 0.5 in eva-run)
        provider: # optional (default is the test provider)
          - id: openai:gpt-4.1-mini
            config:
              temperature: 0 # optional (default is 0 in eva-run, the factual standard for better judging)
        must_fail: true # optional (default false, eva-run specific) - Dark Teaming field
        answer_only: true # optional (default false, eva-run specific) - analyze only the LLM answer without prompt involvement
```

llm-rubric

```yaml
test:
  - assert:
      - type: llm-rubric
        value: answer is polite # can be an array as well
        threshold: 0.5 # optional (default is 0.5 in eva-run)
        provider: # optional (default is the test provider)
          - id: openai:gpt-4.1-mini
            config:
              temperature: 0 # optional (default is 0 in eva-run, the factual standard for better judging)
        must_fail: true # optional (default false, eva-run specific) - Dark Teaming field
```

equals

```yaml
test:
  - assert:
      - type: equals
        value: Paris
        case_sensitive: false # optional (default true, eva-run specific)
```

not-equals

```yaml
test:
  - assert:
      - type: not-equals
        value: Chicago
        case_sensitive: false # optional (default true, eva-run specific)
```

contains

```yaml
test:
  - assert:
      - type: contains
        value: Paris
        case_sensitive: false # optional (default true, eva-run specific)
```

not-contains

```yaml
test:
  - assert:
      - type: not-contains
        value: Chicago
        case_sensitive: false # optional (default true, eva-run specific)
```

regex

```yaml
test:
  - assert:
      - type: regex
        value: /paris/i
```
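Putting the pieces together, a single config can combine providers, prompts, variables, and asserts. A complete sketch, reusing only the fields documented above (whether vars and assert may share one test entry follows the usual Promptfoo conventions):

```yaml
providers:
  - openai:gpt-4.1-mini
prompts:
  - What is the capital of {{country}}
test:
  - vars:
      country: France
    assert:
      - type: contains
        value: Paris
        case_sensitive: false
      - type: llm-rubric
        value: answer is polite
```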

License

MIT