In ChEF, all evaluation pipelines are managed by the Evaluator class (src/ChEF/evaluator.py). This class serves as the control center for evaluation tasks and incorporates various components, including a scenario, an instruction, an inferencer, and a metric. These components are defined through recipe configurations.
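The recipe configuration is what ties these components together. As a rough sketch only (the key names below are illustrative assumptions, not the verbatim ChEF schema; consult the recipe files shipped with ChEF for the authoritative layout), a recipe might look like:

```yaml
# Illustrative recipe sketch -- key names are assumptions, not the verbatim ChEF schema
scenario_cfg:
  dataset_name: YourDataset        # which scenario/dataset to evaluate on
eval_cfg:
  instruction_cfg:
    query_type: standard_query     # how queries are generated (see InstructionHandler below)
  inferencer_cfg:
    inferencer_type: Direct        # which inferencer performs model inference
  metric_cfg:
    metric_type: Accuracy          # which metric scores the saved results
```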
The evaluation process in ChEF follows a structured workflow:

1. Evaluation starts by calling the evaluate method of the Evaluator class.
2. The inferencer is used to perform model inference. During dataset traversal, the InstructionHandler processes each sample, generating queries that serve as inputs to the model.
3. The inference results are saved to the results_path.
4. The metric evaluates the results file, calculating various performance metrics specific to the evaluation task.

In ChEF, you can employ your own custom models by following these steps:
1. Put your model code in the src/ChEF/models/ folder in ChEF.
2. Add a test file for your model, for example test_your_model.py.
3. Make your test class inherit from the TestBase class defined in src/ChEF/models/test_base.py. The TestBase class provides a set of interfaces that you should implement for testing your model (a minimal sketch is shown below, after the config examples).
4. Register your model in src/ChEF/models/__init__.py.
5. Add the model config in src/config/ChEF/models/. For example, the config for KOSMOS-2 (src/config/ChEF/models/kosmos2.yaml):

```yaml
model_name: Kosmos2
model_path: ../model_zoo/kosmos/kosmos-2.pt
if_grounding: False # set True for detection and grounding evaluation
```
The config for KOSMOS-2 on detection tasks evaluation:

```yaml
model_name: Kosmos2
model_path: ../model_zoo/kosmos/kosmos-2.pt
if_grounding: True
```
Use the provided recipes for evaluation:

```bash
python tools/eval.py --model_cfg configs/ChEF/models/your_model.yaml --recipe_cfg recipe_cfg
```
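As a rough illustration of step 3 above, a custom model wrapper might be structured as follows. This is only a sketch: the method name generate, the constructor arguments, and the relative import are assumptions made for illustration; the actual set of interfaces to implement is declared in src/ChEF/models/test_base.py.

```python
# Hypothetical sketch of src/ChEF/models/test_your_model.py.
# Method names are placeholders -- implement the interfaces actually
# declared in src/ChEF/models/test_base.py.
from .test_base import TestBase


class TestYourModel(TestBase):
    def __init__(self, model_path, **kwargs) -> None:
        super().__init__(**kwargs)
        # model_path comes from your model config (e.g. your_model.yaml);
        # load your checkpoint here. Stored as-is in this placeholder.
        self.model_path = model_path

    def generate(self, image, question, **kwargs):
        # Placeholder: run your model on one image/question pair and
        # return the generated answer as a string.
        raise NotImplementedError
```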
In ChEF, the InstructionHandler class (src/ChEF/instruction/__init__.py) plays a central role in managing instructions for generating queries when iterating through the dataset in the inferencer. These queries are then used as inputs to the model for various tasks.
ChEF supports three main query types: standard query, query pool, and multiturn query. For each query type, various query statements are defined based on the dataset’s task type. For more details, refer to src/ChEF/instruction/query.py.
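Conceptually, the three query types differ in how the final prompt is assembled. The toy example below is purely illustrative and does not mirror ChEF's internal data structures; the real query statements live in src/ChEF/instruction/query.py:

```python
# Purely illustrative -- not ChEF's actual query definitions.
standard_query = "What is the category of the object in the image?"   # one statement for the task

query_pool = [                                                         # several candidate statements;
    "What is the category of the object in the image?",               # a recipe selects one by its id
    "Identify the object shown in the image.",
]

multiturn_query = [                                                    # a sequence of questions asked
    "Describe the image.",                                             # over multiple dialogue turns
    "Based on your description, what is the category of the object?",
]
```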
InstructionHandler also supports generating in-context examples for queries using an ice_retriever (src/ChEF/instruction/ice_retriever/). ChEF supports four types of ice_retrievers: random, fixed, topk_text, and topk_img. The generate_ices function in the InstructionHandler class outputs several in-context examples for the input query.
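In a recipe, the retriever type and the number of in-context examples are typically chosen alongside the instruction settings. The key names below are assumptions for illustration; check the recipes shipped with ChEF for the actual names:

```yaml
# Illustrative sketch -- key names are assumptions
instruction_cfg:
  ice_retriever: topk_text   # one of: random, fixed, topk_text, topk_img
  ice_num: 3                 # how many in-context examples to retrieve
```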
Employ Your Instruction: You can add special queries to the Query Pool and define the assigned ids in the recipe configuration to use the new queries. You can also define a new type of query by defining the query in src/ChEF/instruction/query.py and adding a new function in InstructionHandler.
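For example, after adding a new query statement to the Query Pool, a recipe could point at it by id roughly as follows; the key names here are assumptions for illustration and should be checked against the recipes shipped with ChEF:

```yaml
# Illustrative sketch -- key names are assumptions
instruction_cfg:
  query_type: query_pool
  assigned_ids: 5   # id of the query statement you added to the pool
```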
In ChEF, the Inferencer component is responsible for model inference. ChEF provides eight pre-defined inferencers that cover a range of use cases; you can use the desired inferencer by specifying its category and the required settings in the recipe configuration. Users also have the flexibility to define custom inferencers.
For advanced users and specific requirements, ChEF offers the option to create custom inferencers. The basic structure of an inferencer is defined in the src/ChEF/inferencer/Direct.py file (Direct_inferencer). You can extend this structure to implement your custom inferencer logic.
```python
from torch.utils.data import DataLoader
from tqdm import tqdm

# Direct_inferencer is defined in src/ChEF/inferencer/Direct.py; this relative
# import assumes your inferencer file sits in the same package.
from .Direct import Direct_inferencer


class Your_inferencer(Direct_inferencer):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def inference(self, model, dataset):
        predictions = []
        # Step 1: build the dataloader (collate a batch of dicts into a dict of lists)
        dataloader = DataLoader(
            dataset,
            batch_size=self.batch_size,
            collate_fn=lambda batch: {key: [sample[key] for sample in batch] for key in batch[0]},
        )
        for batch in tqdm(dataloader, desc="Running inference"):
            # Step 2: get the input queries for this batch
            prompts = self.instruction_handler.generate(batch)
            # Step 3: run the model
            outputs = model.generate(prompts)
            # Step 4: collect the results
            predictions = predictions + outputs
        # Step 5: write the output file
        self._after_inference_step(predictions)
```
In ChEF, the Metric component plays a crucial role in evaluating and measuring the performance of models across various scenarios and protocols. ChEF offers a wide range of pre-defined metrics, each tailored to different evaluation needs. Detailed information about these metrics can be found in the src/ChEF/metric/__init__.py file.
ChEF also allows users to define their custom metrics. The basic structure of a metric is defined in the src/ChEF/metric/utils.py file (Base_Metric). You can extend this structure to implement your custom metric logic.
```python
# Base_Metric is defined in src/ChEF/metric/utils.py; this relative import
# assumes your metric file sits in the same package.
from .utils import Base_Metric


class Your_metric(Base_Metric):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def metric_func(self, answers):
        '''
        answers: List[sample], each sample is a dict
        sample: {
            'answer' : str,
            'gt_answers' : str,
        }
        '''
        # Evaluation: as a simple illustration, score exact-match accuracy.
        # Check the pre-defined metrics in src/ChEF/metric/ for the return
        # format your recipe expects.
        correct = sum(
            sample['answer'].strip().lower() == sample['gt_answers'].strip().lower()
            for sample in answers
        )
        return dict(ACC=correct / len(answers) * 100)
```