Introduction

BenchX is a platform designed to streamline the evaluation and benchmarking of AI applications. At its core, BenchX:

  • Offers a universal framework with a standardized interface for evaluating AI applications across diverse benchmarks.
  • Orchestrates inference and evaluation of benchmark task samples, distributing them across isolated containers in the cloud for scalability, with configurable concurrency.
  • Gives complete control over benchmark execution, with options to stop, pause, and resume runs as needed.
  • Provides real-time reports, offering detailed insights into individual evaluated samples and enabling side-by-side comparisons of the model's prediction against the ground truth.
  • Integrates seamlessly with your model's source control provider, allowing you to track performance across different versions of your model.
  • Offers a public API, enabling you to run benchmarks as part of your CI/CD pipelines.

Core Concepts

AI Model

Also known as an AI agent or AI application. It is the intelligent system designed to perform a specific task, and it is the subject of evaluation on BenchX.

Task execution

The act of running the AI model on a sample task to generate an output.

Task output

Also known as the prediction. It is the output generated by the AI model when it is run on a sample task.

Expected output

The ideal output expected to be generated by the AI model when it is run on a sample task.

Task handler image

The Docker image, compatible with the BenchX interface, that contains the logic to run an AI model on a sample task.

Task handler repository

The source code repository that publishes a task handler image to BenchX upon each commit.

Evaluation

The process of comparing the model's output with the expected output to determine how well the model performed on a sample task.
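As a rough illustration of such a comparison, the sketch below scores a single task output against its expected output using a normalized exact match. This is only an assumption for illustration: the source does not describe which metrics BenchX uses, and the names `normalize` and `evaluate_exact_match` are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def evaluate_exact_match(task_output: str, expected_output: str) -> float:
    """Return 1.0 if the prediction equals the expected output after normalization, else 0.0.

    Hypothetical metric for illustration only; real benchmarks often
    need task-specific scoring instead of exact match.
    """
    return 1.0 if normalize(task_output) == normalize(expected_output) else 0.0
```

Exact match suits tasks with a single short correct answer; open-ended tasks would likely need a softer metric.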

Project

Defines a profile for running a certain benchmark against an AI model.

Benchmark

A collection of task samples that are collectively used to evaluate the overall performance of an AI model on a certain task.
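One simple way to turn per-sample results into an overall benchmark figure is to average them. The sketch below assumes each sample has already been scored between 0.0 and 1.0; the function name `benchmark_score` and the choice of an unweighted mean are assumptions, not something the source specifies.

```python
from statistics import mean


def benchmark_score(sample_scores: list[float]) -> float:
    """Aggregate per-sample scores into a single benchmark score.

    Assumes an unweighted mean (hypothetical choice); a real benchmark
    may weight samples or report several metrics.
    """
    if not sample_scores:
        raise ValueError("a benchmark needs at least one scored task sample")
    return mean(sample_scores)
```

For example, `benchmark_score([1.0, 0.0, 1.0, 1.0])` gives 0.75, meaning three of four samples were solved.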

Task sample

A single instance of a task which consists of:

  • Task instructions: A description of the task that the model is expected to perform.
  • Context: A collection of input data or information associated with the given task sample. This could be unique for each task or shared across multiple tasks.
  • Expected output: The ideal output expected from the model when it is run on the given task sample.

[Figure: Tasks with shared context]

[Figure: Tasks with unique context]
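The structure of a task sample, including the difference between shared and unique context, can be sketched as plain data classes. The class and field names below (`Context`, `TaskSample`, `context_id`, `data`) are hypothetical and do not reflect an actual BenchX schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Input data associated with one or more task samples (hypothetical shape)."""
    context_id: str
    data: str


@dataclass(frozen=True)
class TaskSample:
    """A single task instance: instructions, context, and expected output."""
    instructions: str
    context: Context
    expected_output: str


# Shared context: two samples ask different questions about the same input.
shared = Context("doc-1", "The Eiffel Tower is in Paris.")
samples = [
    TaskSample("Which city is the tower in?", shared, "Paris"),
    TaskSample("Which tower is mentioned?", shared, "Eiffel Tower"),
]

# Unique context: this sample carries input that no other sample uses.
unique_sample = TaskSample(
    "What is the capital of Japan?",
    Context("doc-2", "Tokyo is the capital of Japan."),
    "Tokyo",
)
```

Referencing one `Context` object from several samples avoids duplicating large inputs when many tasks operate on the same data.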