Answer Engines powered by Generative AI are reshaping how people access and interact with online knowledge and information. This repository provides the code and data needed to reproduce the experimental results of the papers listed below, advancing research on the evaluation of Answer Engines and their underlying Retrieval-Augmented Generation (RAG) systems.
If you use this code or data, please cite the corresponding paper(s):

Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search

```bibtex
@inproceedings{narayanan2025search,
  title={Search Engines in the AI Era: A Qualitative Understanding to the False Promise of Factual and Verifiable Source-Cited Responses in LLM-based Search},
  author={Narayanan Venkit, Pranav and Laban, Philippe and Zhou, Yilun and Mao, Yixin and Wu, Chien-Sheng},
  booktitle={Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages={1325--1340},
  year={2025}
}
```
DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
```bibtex
@article{venkit2025deeptrace,
  title={DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence},
  author={Venkit, Pranav Narayanan and Laban, Philippe and Zhou, Yilun and Huang, Kung-Hsiang and Mao, Yixin and Wu, Chien-Sheng},
  journal={arXiv preprint arXiv:2509.04499},
  year={2025}
}
```
Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
```bibtex
@article{xie2024rag,
  title={Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage},
  author={Xie, Kaige and Laban, Philippe and Choubey, Prafulla Kumar and Xiong, Caiming and Wu, Chien-Sheng},
  journal={arXiv preprint arXiv:2410.15531},
  year={2024}
}
```
This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying these artifacts. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.