An extensible benchmark for evaluating large language models on planning
Updated Mar 16, 2025 - PDDL
🔥 A list of tools, frameworks, and resources for building AI web agents
A comprehensive set of LLM benchmark scores and provider prices.
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
How good are LLMs at chemistry?
Benchmark that evaluates LLMs using 601 NYT Connections puzzles extended with extra trick words
Language Model for Mainframe Modernization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
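The task format described above (infer a hidden theme from examples and anti-examples, then spot the one candidate that fits) can be sketched in a few lines. This is a hypothetical illustration of the setup, not the benchmark's actual code; the data structure, the toy heuristic, and all item contents are invented here.

```python
from dataclasses import dataclass

@dataclass
class ThemeItem:
    examples: list        # items that fit the hidden theme
    anti_examples: list   # near-misses that do not fit
    candidates: list      # pool containing exactly one true fit
    answer_index: int     # index of the true fit in `candidates`

def score(items, pick_fn):
    """Fraction of items where the model's pick matches the gold index."""
    correct = sum(pick_fn(it) == it.answer_index for it in items)
    return correct / len(items)

# Toy stand-in for an LLM: picks the candidate whose characters overlap
# most with the positive examples. A real harness would prompt a model.
def overlap_pick(item):
    example_chars = set("".join(item.examples))
    overlaps = [len(set(c) & example_chars) for c in item.candidates]
    return overlaps.index(max(overlaps))

item = ThemeItem(
    examples=["oboe", "flute", "clarinet"],   # hidden theme: woodwinds
    anti_examples=["trumpet", "violin"],
    candidates=["drum", "bassoon", "gong"],
    answer_index=1,
)
print(score([item], overlap_pick))  # → 1.0 on this toy item
```

The `anti_examples` are what make the task hard for a real model: they rule out broader themes (here, "instruments") that the positive examples alone would also support.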
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
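A pairwise-comparison record in the style CompBench describes (two images, one question, one of eight comparison dimensions) might look like the following. This is a hypothetical sketch of the record layout, not CompBench's actual schema; the function and field names are assumptions.

```python
# The eight relative-comparison dimensions named in the description above.
DIMENSIONS = [
    "visual attribute", "existence", "state", "emotion",
    "temporality", "spatiality", "quantity", "quality",
]

def make_question(image_a, image_b, dimension, prompt, answer):
    """Build one pairwise-comparison item; `answer` names the winning image."""
    assert dimension in DIMENSIONS
    assert answer in ("A", "B")
    return {
        "images": (image_a, image_b),
        "dimension": dimension,
        "question": prompt,
        "answer": answer,
    }

q = make_question("scene_001.jpg", "scene_002.jpg", "quantity",
                  "Which image shows more animals?", "A")
```

An evaluation loop would then show both images and the question to an MLLM and check its "A"/"B" choice against `answer`, reporting accuracy per dimension.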
Develop reliable AI apps
Training and Benchmarking LLMs for Code Preference.
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Restore safety in fine-tuned language models through task arithmetic
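Task arithmetic, as referenced in the entry above, generally means treating the weight-space displacement caused by fine-tuning as a vector that can be scaled and subtracted. A toy sketch of that idea, with tiny lists standing in for full parameter tensors (all numbers hypothetical, and not this repository's implementation):

```python
# Toy parameter vectors standing in for full model weights.
base = [0.0, 1.0, 2.0]        # pre-trained (safety-aligned) weights
fine_tuned = [0.5, 1.5, 1.0]  # after a fine-tune that eroded safety behavior

# The "task vector" is the weight-space displacement caused by fine-tuning.
task_vector = [ft - b for ft, b in zip(fine_tuned, base)]

# Task arithmetic: step back along the unwanted direction.
lam = 1.0  # scaling factor; in practice tuned to trade safety against task skill
restored = [ft - lam * tv for ft, tv in zip(fine_tuned, task_vector)]
print(restored)  # with lam = 1.0 this exactly recovers the base weights
```

With `lam` between 0 and 1, the restored weights interpolate between the fine-tuned and base models, which is what lets such methods recover safety while keeping some of the fine-tuned capability.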
A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"