Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Updated Aug 7, 2025 · HTML
A comprehensive guide to LLM evaluation methods: it helps identify the most suitable evaluation techniques for various use cases, promotes best practices in LLM assessment, and critically examines the effectiveness of these evaluation methods.