Introduction
BenchX is a platform designed to streamline the evaluation and benchmarking of AI applications. At its core, BenchX:
- Offers a universal framework with a standardized interface for evaluating AI applications across diverse benchmarks.
- Orchestrates inference and evaluation of benchmark task samples, distributing them across isolated cloud containers for scalability and configurable concurrency.
- Gives complete control over benchmark execution, with options to stop, pause, and resume runs as needed.
- Provides real-time reports, offering detailed insights into individual evaluated samples and enabling side-by-side comparisons of the model's predictions against the ground truth.
- Integrates seamlessly with your model's source control provider, allowing you to track performance across different versions of your model.
- Offers a public API, enabling you to run benchmarks as part of your CI/CD pipelines.
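As a sketch of what invoking the public API from a CI/CD pipeline might look like, the snippet below builds a request that starts a benchmark run for a freshly published task handler image. The endpoint URL, path, payload fields, and auth scheme are all illustrative assumptions, not the actual BenchX API contract.

```python
import json
import os
import urllib.request

# ASSUMPTION: the base URL, /runs path, and payload fields below are
# hypothetical -- consult the BenchX API reference for the real contract.
BENCHX_API = "https://api.benchx.example/v1"

def build_run_request(project_id: str, image_tag: str, token: str) -> urllib.request.Request:
    """Build a POST request that starts a benchmark run for one image tag."""
    payload = json.dumps({"project_id": project_id, "image_tag": image_tag}).encode()
    return urllib.request.Request(
        f"{BENCHX_API}/runs",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    # In CI, the token and image tag would come from pipeline variables.
    req = build_run_request("my-project", os.environ.get("GIT_SHA", "latest"), "ci-token")
    # urllib.request.urlopen(req)  # uncomment to actually start the run
    print(req.full_url)
```

In a real pipeline this would run after the task handler image is built and pushed, so each commit's model version is benchmarked automatically.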
Core Concepts
AI Model
Also known as an AI Agent or AI Application, the AI model is the intelligent system designed to perform a specific task; it is the subject of evaluation on BenchX.
Task execution
The act of running the AI model on a task sample to generate an output.
Task output
Also known as the prediction, the task output is what the AI model generates when it is run on a task sample.
Expected output
The ideal output that the AI model is expected to generate when it is run on a task sample.
Task handler image
The Docker image, compatible with the BenchX interface, containing the logic to run an AI model on a task sample.
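To make the task handler's role concrete, here is a minimal sketch of the logic such an image might package. The transport (how the sample reaches the container) and the field names are assumptions; only the shape of the job is illustrated: receive a task sample, run the model, emit a prediction.

```python
import json

def run_model(instructions: str, context: dict) -> str:
    """Placeholder for the model invocation -- swap in your actual inference code."""
    return f"echo: {instructions}"

def handle_sample(sample: dict) -> dict:
    """Run the model on one task sample and package the prediction."""
    prediction = run_model(sample["instructions"], sample.get("context", {}))
    return {"prediction": prediction}

# ASSUMPTION: inside the container, BenchX would supply the sample (e.g. via
# stdin or a mounted file -- the exact mechanism is not specified here).
sample = {"instructions": "Summarize the document.", "context": {"doc": "..."}}
print(json.dumps(handle_sample(sample)))
```

Because BenchX runs each sample in its own isolated container, the handler only ever needs to deal with one task sample at a time.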
Task handler repository
The source code repository that publishes a task handler image to BenchX upon each commit.
Evaluation
The process of comparing the model's output with the expected output to determine how well the model performed on a task sample.
Project
Defines a profile for running a certain benchmark against an AI model.
Benchmark
A collection of task samples that are collectively used to evaluate the overall performance of an AI model on a certain task.
Task sample
A single instance of a task which consists of:
- Task instructions: A description of the task that the model is expected to perform.
- Context: A collection of input data or information associated with the given task sample. This could be unique to each task sample or shared across multiple samples.
- Expected Output: The ideal output expected from the model when it is run on the given task sample.
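The three parts listed above can be modeled as a small record type. The field names here are illustrative, not the actual BenchX schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSample:
    """One benchmark task sample: instructions, context, and expected output.
    Field names are assumptions -- the real BenchX schema may differ."""
    instructions: str
    expected_output: str
    context: dict = field(default_factory=dict)

sample = TaskSample(
    instructions="Translate 'bonjour' to English.",
    expected_output="hello",
    context={"source_language": "fr"},
)
```

A benchmark is then just a collection of such records, each executed and evaluated independently.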