This folder contains scripts for evaluating models.
Similar to data collection, we uniformly use the OpenAI interface to generate.
Note: We leverages extra parameters specific to vllm's OpenAI-compatible server for handling custom chat templates and special tokens for our models. Other OpenAI-compatible inference services may not be directly applicable.
Example script to deploy CursorCore-Yi-1.5B using vllm:
python -m vllm.entrypoints.openai.api_server --port 10086 --model TechxGenus/CursorCore-Yi-1.5BWe define the model inference service parameters in model_map.json. An example configuration is as follows:
{
"TechxGenus/CursorCore-Yi-1.5B": {
"base": "http://127.0.0.1:10086/v1",
"api": "sk-xxx"
}
}Run the following program to generate predicted code:
# WF Format (Default)
python eval/eval_apeval.py --model_map model_map.json --input_path benchmark/apeval.json --output_path eval/generations.jsonl --temperature 0.0 --use_wf
# LC Format
python eval/eval_apeval.py --model_map model_map.json --input_path benchmark/apeval.json --output_path eval/generations.jsonl --temperature 0.0 --use_lc
# SR Format
python eval/eval_apeval.py --model_map model_map.json --input_path benchmark/apeval.json --output_path eval/generations.jsonl --temperature 0.0 --use_sr
# Instruct Models
python eval/eval_apeval.py --model_map model_map.json --input_path benchmark/apeval.json --output_path eval/generations.jsonl --temperature 0.0 --use_instruct
# Base Models
python eval/eval_apeval.py --model_map model_map.json --input_path benchmark/apeval.json --output_path eval/generations.jsonl --temperature 0.0 --use_baseRun the following script to execute programs:
evalplus.evaluate --dataset humaneval --samples eval/generations.jsonlRun the following script to get evaluation results for each type:
python eval/extract_results.py --dataset_path benchmark/apeval.json --result_path eval/generations_eval_results.jsonRun the following program to generate predicted code:
# Tab
python eval/eval_humaneval.py --model_map model_map.json --input_path evalplus/humanevalplus --output_path eval/generations.jsonl --temperature 0.0 --use_tab
python eval/eval_mbpp.py --model_map model_map.json --input_path evalplus/mbppplus --output_path eval/generations.jsonl --temperature 0.0 --use_tab
# Inline
python eval/eval_humaneval.py --model_map model_map.json --input_path evalplus/humanevalplus --output_path eval/generations.jsonl --temperature 0.0 --use_inline
python eval/eval_mbpp.py --model_map model_map.json --input_path evalplus/mbppplus --output_path eval/generations.jsonl --temperature 0.0 --use_inline
# Chat
python eval/eval_humaneval.py --model_map model_map.json --input_path evalplus/humanevalplus --output_path eval/generations.jsonl --temperature 0.0 --use_chat
python eval/eval_mbpp.py --model_map model_map.json --input_path evalplus/mbppplus --output_path eval/generations.jsonl --temperature 0.0 --use_chatRun the following script to execute programs:
evalplus.evaluate --dataset humaneval --samples eval/generations.jsonl
evalplus.evaluate --dataset mbpp --samples eval/generations.jsonl