
Introduction

BenchX is a platform designed to streamline the evaluation and benchmarking of AI applications. At its core, BenchX:

  • Offers a universal framework with a standardized interface for evaluating AI applications across diverse benchmarks.
  • Orchestrates the execution and evaluation of benchmark task samples, distributing them across isolated containers in the cloud for scalability and configurable concurrency.
  • Gives complete control over benchmark execution, with options to stop, pause, and resume runs as needed.
  • Provides real-time reports, offering detailed insights into individual evaluated samples and enabling side-by-side comparisons of the model's prediction against the ground truth.
  • Integrates seamlessly with your model's source control provider, allowing you to track performance across different versions of your model.
  • Offers a public API, enabling you to run benchmarks as part of your CI/CD pipelines (a hypothetical example follows this list).
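
For illustration only, a CI step that starts a benchmark run through such an API might look like the sketch below. The base URL, endpoint path, request fields, and the BENCHX_API_TOKEN variable are assumptions made for this example, not the documented BenchX API.

```python
# Hypothetical sketch: starting a benchmark run from a CI/CD step.
# The endpoint, payload fields, and auth scheme are illustrative assumptions.
import os
import requests

API_BASE = "https://api.benchx.example/v1"   # assumed base URL
TOKEN = os.environ["BENCHX_API_TOKEN"]       # assumed auth token set in CI secrets

response = requests.post(
    f"{API_BASE}/projects/my-project/runs",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"image_tag": os.environ.get("GIT_COMMIT", "latest")},
    timeout=30,
)
response.raise_for_status()
print("Started benchmark run:", response.json().get("run_id"))
```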

Core Concepts

Task

Let's define a task as a specific piece of work whose completion requires intelligence, such as finding the documents that are semantically most relevant to a query, or generating code that implements a given development instruction.

Intelligent System

You may have designed a system, application or agent that performs one or more intelligent tasks.

Benchmark

A benchmark is a collection of task samples (of a certain type) and their expected outputs, used as test cases to evaluate how well your system (or a part of your system) handles that type of task.

Task handler repository

In order for BenchX to feed your system (or a component of your system) with task samples, you need to provide a Docker image that handles tasks on behalf of your system (or the component under test). This image is built from a source code repository that implements the task handling logic.

Depending on your system's design and your use case, the task handling logic could be implemented by reusing your existing code, calling your system's underlying services or APIs, or implementing a simplified simulation of your system's behavior.
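
As a rough, non-authoritative sketch, a handler that forwards each task sample to your own service could look like the following. The JSON-over-stdin/stdout interface and all field and function names here are assumptions for illustration; they stand in for whatever interface BenchX actually expects from the image.

```python
# Illustrative task handler sketch, assuming the handler receives a task sample
# as JSON on stdin and writes its prediction as JSON on stdout. Every name here
# (fields, functions) is a placeholder, not BenchX's actual handler contract.
import json
import sys


def handle_task(instructions: str, context: dict) -> str:
    """Call your system (or the component under test) to perform the task."""
    # e.g. forward the task to your service's API, or reuse existing code:
    # return my_service.generate(instructions, context)
    raise NotImplementedError("wire this up to your system under test")


def main() -> None:
    sample = json.load(sys.stdin)  # {"instructions": ..., "context": ...}
    output = handle_task(sample["instructions"], sample.get("context", {}))
    json.dump({"output": output}, sys.stdout)


if __name__ == "__main__":
    main()
```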

Project

Defines a profile for running a certain benchmark against images published from a task handler repository.

Evaluation

The process of comparing your task handler's output with the expected output of a sample task to determine how well the system under test handles that task.
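
As a minimal, purely illustrative sketch, evaluating a single sample could be as simple as an exact-match comparison; real benchmarks typically rely on task-specific metrics (relevance scores, test suites for generated code, and so on).

```python
# Illustrative only: a naive exact-match evaluation of one task sample.
def evaluate(prediction: str, expected_output: str) -> float:
    """Return 1.0 if the handler's output matches the expected output, else 0.0."""
    return 1.0 if prediction.strip() == expected_output.strip() else 0.0
```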

Task sample

A single instance of a task, which consists of the following (an illustrative example appears after this list):

  • Task instructions: A description of the task that the model is expected to perform.
  • Context: A collection of input data or information associated with the given task sample. This could be unique for each task or shared across multiple tasks.
  • Expected Output: The ideal output expected from the model when it is run on the given task sample.
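
To make this structure concrete, here is an invented task sample for a hypothetical code-generation benchmark; the field names and contents are illustrative only.

```python
# Purely illustrative task sample showing the three parts described above.
task_sample = {
    "instructions": "Implement a function that returns the n-th Fibonacci number.",
    "context": {
        # Input data unique to (or shared across) task samples, e.g. a code skeleton.
        "skeleton": "def fibonacci(n: int) -> int:\n    ...",
    },
    "expected_output": (
        "def fibonacci(n: int) -> int:\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
}
```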

Diagrams: tasks with shared context; tasks with unique context.