In ChEF, all evaluation pipelines are managed by the Evaluator class (src/ChEF/evaluator.py). This class serves as the control center for evaluation tasks and incorporates various components, including a scenario, an instruction, an inferencer, and a metric. These components are defined through recipe configurations.
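The recipe configuration is what ties these components together. As a rough sketch only (the key names below are illustrative assumptions, not the verbatim ChEF schema; consult the recipe files shipped with ChEF for the authoritative layout), a recipe might look like:

```yaml
# Illustrative recipe sketch -- key names are assumptions, not the verbatim ChEF schema
scenario_cfg:
  dataset_name: YourDataset        # which scenario/dataset to evaluate on
eval_cfg:
  instruction_cfg:
    query_type: standard_query     # how queries are generated (see InstructionHandler below)
  inferencer_cfg:
    inferencer_type: Direct        # which inferencer performs model inference
  metric_cfg:
    metric_type: Accuracy          # which metric scores the saved results
```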
The evaluation process in ChEF follows a structured workflow:

1. Evaluation starts by calling the evaluate method of the Evaluator class.
2. The inferencer is used to perform model inference. During dataset traversal, the InstructionHandler processes each sample, generating queries that serve as inputs to the model.
3. The inference results are saved to the results_path.
4. The metric evaluates the results file, calculating various performance metrics specific to the evaluation task.

In ChEF, you can employ your own custom models by following these steps:
1. Put your model code in the src/ChEF/models/ folder in ChEF.
2. Add a test file for your model, for example test_your_model.py.
3. Make your test class inherit from the TestBase class defined in src/ChEF/models/test_base.py. The TestBase class provides a set of interfaces that you should implement for testing your model (a minimal sketch is shown below, after the config examples).
4. Register your model in src/ChEF/models/__init__.py.
5. Add the model config in src/config/ChEF/models/. For example, the config for KOSMOS-2 (src/config/ChEF/models/kosmos2.yaml):

```yaml
model_name: Kosmos2
model_path: ../model_zoo/kosmos/kosmos-2.pt
if_grounding: False # set True for detection and grounding evaluation
```
The config for KOSMOS-2 on detection tasks evaluation:

```yaml
model_name: Kosmos2
model_path: ../model_zoo/kosmos/kosmos-2.pt
if_grounding: True
```
Use the provided recipes for evaluation:

```bash
python tools/eval.py --model_cfg configs/ChEF/models/your_model.yaml --recipe_cfg recipe_cfg
```
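As a rough illustration of step 3 above, a custom model wrapper might be structured as follows. This is only a sketch: the method name generate, the constructor arguments, and the relative import are assumptions made for illustration; the actual set of interfaces to implement is declared in src/ChEF/models/test_base.py.

```python
# Hypothetical sketch of src/ChEF/models/test_your_model.py.
# Method names are placeholders -- implement the interfaces actually
# declared in src/ChEF/models/test_base.py.
from .test_base import TestBase


class TestYourModel(TestBase):
    def __init__(self, model_path, **kwargs) -> None:
        super().__init__(**kwargs)
        # model_path comes from your model config (e.g. your_model.yaml);
        # load your checkpoint here. Stored as-is in this placeholder.
        self.model_path = model_path

    def generate(self, image, question, **kwargs):
        # Placeholder: run your model on one image/question pair and
        # return the generated answer as a string.
        raise NotImplementedError
```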
In ChEF, the InstructionHandler class (src/ChEF/instruction/__init__.py) plays a central role in managing instructions for generating queries when iterating through the dataset in the inferencer. These queries are then used as inputs to the model for various tasks.
ChEF supports three main query types: standard query, query pool, and multiturn query. For each query type, various query statements are defined based on the dataset’s task type. For more details, refer to src/ChEF/instruction/query.py.
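Conceptually, the three query types differ in how the final prompt is assembled. The toy example below is purely illustrative and does not mirror ChEF's internal data structures; the real query statements live in src/ChEF/instruction/query.py:

```python
# Purely illustrative -- not ChEF's actual query definitions.
standard_query = "What is the category of the object in the image?"   # one statement for the task

query_pool = [                                                         # several candidate statements;
    "What is the category of the object in the image?",               # a recipe selects one by its id
    "Identify the object shown in the image.",
]

multiturn_query = [                                                    # a sequence of questions asked
    "Describe the image.",                                             # over multiple dialogue turns
    "Based on your description, what is the category of the object?",
]
```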
InstructionHandler also supports generating in-context examples for queries using an ice_retriever (src/ChEF/instruction/ice_retriever/). ChEF supports four types of ice_retrievers: random, fixed, topk_text, and topk_img. The generate_ices function in the InstructionHandler class outputs several in-context examples for the input query.
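In a recipe, the retriever type and the number of in-context examples are typically chosen alongside the instruction settings. The key names below are assumptions for illustration; check the recipes shipped with ChEF for the actual names:

```yaml
# Illustrative sketch -- key names are assumptions
instruction_cfg:
  ice_retriever: topk_text   # one of: random, fixed, topk_text, topk_img
  ice_num: 3                 # how many in-context examples to retrieve
```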
Employ Your Instruction: You can add special queries to the Query Pool and define the assigned ids in the recipe configuration to use the new queries. You can also define a new type of query by defining the query in src/ChEF/instruction/query.py and adding a new function in InstructionHandler.
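For example, after adding a new query statement to the Query Pool, a recipe could point at it by id roughly as follows; the key names here are assumptions for illustration and should be checked against the recipes shipped with ChEF:

```yaml
# Illustrative sketch -- key names are assumptions
instruction_cfg:
  query_type: query_pool
  assigned_ids: 5   # id of the query statement you added to the pool
```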
In ChEF, the Inferencer component is responsible for model inference. ChEF provides eight pre-defined inferencers that cover a range of use cases; you can use the desired inferencer by specifying its category and the required settings in the recipe configuration. Users also have the flexibility to define custom inferencers.
For advanced users and specific requirements, ChEF offers the option to create custom inferencers. The basic structure of an inferencer is defined in the src/ChEF/inferencer/Direct.py file (Direct_inferencer). You can extend this structure to implement your custom inferencer logic.
```python
from torch.utils.data import DataLoader
from tqdm import tqdm

# Direct_inferencer is defined in src/ChEF/inferencer/Direct.py; this relative
# import assumes your inferencer file sits in the same package.
from .Direct import Direct_inferencer


class Your_inferencer(Direct_inferencer):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def inference(self, model, dataset):
        predictions = []
        # Step 1: build the dataloader (collate a batch of dicts into a dict of lists)
        dataloader = DataLoader(
            dataset,
            batch_size=self.batch_size,
            collate_fn=lambda batch: {key: [sample[key] for sample in batch] for key in batch[0]},
        )
        for batch in tqdm(dataloader, desc="Running inference"):
            # Step 2: get the input queries for this batch
            prompts = self.instruction_handler.generate(batch)
            # Step 3: run the model
            outputs = model.generate(prompts)
            # Step 4: collect the results
            predictions = predictions + outputs
        # Step 5: write the output file
        self._after_inference_step(predictions)
```
In ChEF, the Metric component plays a crucial role in evaluating and measuring the performance of models across various scenarios and protocols. ChEF offers a wide range of pre-defined metrics, each tailored to different evaluation needs. Detailed information about these metrics can be found in the src/ChEF/metric/__init__.py file.
ChEF also allows users to define their custom metrics. The basic structure of a metric is defined in the src/ChEF/metric/utils.py file (Base_Metric). You can extend this structure to implement your custom metric logic.
```python
# Base_Metric is defined in src/ChEF/metric/utils.py; this relative import
# assumes your metric file sits in the same package.
from .utils import Base_Metric


class Your_metric(Base_Metric):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def metric_func(self, answers):
        '''
        answers: List[sample], each sample is a dict
        sample: {
            'answer' : str,
            'gt_answers' : str,
        }
        '''
        # Evaluation: as a simple illustration, score exact-match accuracy.
        # Check the pre-defined metrics in src/ChEF/metric/ for the return
        # format your recipe expects.
        correct = sum(
            sample['answer'].strip().lower() == sample['gt_answers'].strip().lower()
            for sample in answers
        )
        return dict(ACC=correct / len(answers) * 100)
```