AlpacaEval is an automatic evaluator for instruction-following language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval…
ANN-Benchmarks is a benchmarking environment for approximate nearest neighbor search algorithms, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/erikbern/ann-benc…
ARES is an automated evaluation framework for retrieval-augmented generation systems, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/stanford-futuredata/ARES.sv…
BEIR is a heterogeneous benchmark for information retrieval models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/beir-cellar/beir.svg?cacheS…
C-Eval is a Chinese-language evaluation suite for foundation models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/hkust-nlp/ceval.svg?cache…
Code Generation LM Evaluation Harness is a framework for evaluating code generation models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/…
COMET is a neural framework for machine translation evaluation, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/Unbabel/COMET.svg?cacheSec…
Deepchecks is a library for testing and validating machine learning models and data, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/deepchecks/deepchecks…
DeepEval is an evaluation framework for large language model applications, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/confident-ai/deepeval.s…
DomainBed is a test suite for benchmarking domain generalization algorithms, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/facebookresearch/Domai…
EvalAI is an open-source platform for hosting and evaluating AI challenges, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/Cloud-CV/EvalAI.svg?cache…
Evalchemy is a toolkit for evaluating post-trained language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/mlfoundations/evalchem…
EvalPlus is an evaluation framework for code generated by large language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/evalplus/evalplus.svg?c…
Evals is OpenAI's framework for evaluating large language models, with a registry of benchmarks, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/openai/evals.svg?cacheSeco…
EvalScope is a model evaluation and benchmarking framework from ModelScope, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/modelscope/evalscope.s…
Evaluate is a Hugging Face library for evaluating machine learning models and datasets, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/huggingface/evaluate.sv…
Future AGI is an AI agent in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/future-agi/future-agi…
GAOKAO-Bench is an evaluation benchmark built from Chinese college entrance examination questions, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/OpenLMLab/GAOKAO-Be…
guidellm is a tool for evaluating the performance of large language model deployments, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/vllm-project/guidellm.s…
Helicone is an open-source observability platform for large language model applications, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/Helicone/helicone.svg?c…
HumanEval is a benchmark of hand-written programming problems for evaluating code generation models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/openai/human-eval.svg?…
Inspect is an open-source framework for large language model evaluations, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/UKGovernmentBEIS/inspect…
JiWER is a Python package for computing speech recognition error metrics such as word error rate, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/jitsi/jiwer.svg?cacheSecon…
Laminar is an open-source platform for tracing and evaluating large language model applications, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/lmnr-ai/lmnr.svg?cacheSe…
LangTest is a library for testing language models for robustness, bias, and related properties, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/JohnSnowLabs/langtest.s…
Language Model Evaluation Harness is a framework for few-shot evaluation of language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/star…
LLMPerf is a library for benchmarking large language model APIs, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/ray-project/llmperf.svg?…
lmms-eval is an evaluation suite for large multimodal models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/EvolvingLMMs-Lab/lmms-…
Massive Text Embedding Benchmark is a benchmark for evaluating text embedding models across many tasks, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars…
Melting Pot is a suite of test scenarios for multi-agent reinforcement learning, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/google-deepmind/melt…
Meta-World is a benchmark of robotic manipulation tasks for meta-reinforcement learning and multi-task learning, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/Farama-Foundation/Met…
mir_eval is a Python library for evaluating music information retrieval systems, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/mir-evaluation/mir_eval…
MLPerf Inference is a benchmark suite for measuring machine learning inference performance, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/mlcommons/infer…
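NannyML is a library for monitoring deployed machine learning models, including performance estimation and data drift detection, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/NannyML/nannyml.svg?cach…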
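OGB (Open Graph Benchmark) is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/snap-stanford/ogb.svg?cacheS…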
Ollama Grid Search is a desktop application for comparing models and prompts served through Ollama, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/dezoito/ollam…
OpenCompass is an evaluation platform for large language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/open-compass/OpenCom…
Overcooked-AI is a benchmark environment for cooperative human-AI task performance, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/HumanCompatibleAI/…
Prometheus-Eval is a toolkit for evaluating language model outputs with open-source evaluator models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/prometheus-eval/…
PromptBench is a library for evaluating large language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/microsoft/promptbenc…
RagaAI Catalyst is an AI agent in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/raga-ai-hub/Raga…
RewardBench is a benchmark for evaluating reward models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/allenai/reward-bench…
RLBench is a robot learning benchmark and learning environment, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/stepjam/RLBench.svg?cach…
SimplerEnv is a set of simulated environments for evaluating real-world robot manipulation policies, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/simpler-env/SimplerEn…
Speech-to-Text Benchmark is a framework for benchmarking speech-to-text engines, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/Picovoi…
SwanLab is an open-source tool for tracking and visualizing machine learning experiments, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/SwanHubX/SwanLab.svg?cac…
TorchBench is a collection of open-source benchmarks for measuring PyTorch performance, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/pytorch/benchmark.svg…
TruLens is a library for evaluating and tracking large language model applications, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/truera/trulens.svg?cache…
TrustLLM is a benchmark and toolkit for assessing the trustworthiness of large language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/HowieHwong/TrustLLM.svg…
VBench is a benchmark suite for video generative models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/Vchitect/VBench.svg?cache…
VLMEvalKit is an evaluation toolkit for large vision-language models, listed in the Evaluation and Monitoring category. ![](https://img.shields.io/github/stars/open-compass/VLMEvalK…