- Project Overview
- Architecture Overview
- Project Structure
- Benchmark Results
- Requirements
- Setup and Reproducibility
- Evaluation
- Project Status
## Project Overview
HyDRA (Hierarchical Deep Researcher Agent) is an agentic research framework designed to execute complex, multi-step information-seeking and reasoning tasks with high reliability. It combines hierarchical agent delegation, strict action–observation execution loops, and aggressive context management to prevent failure modes commonly observed in long-horizon LLM workflows.
At its core, HyDRA separates planning, execution, research, and synthesis into well-defined agents and tools. Each component operates under strict output contracts (JSON-only actions), ensuring deterministic downstream processing and eliminating ambiguity during tool invocation.
A central Context Manager continuously validates, summarizes, and minimizes tool and agent outputs. This makes HyDRA resilient to:
- Context window exhaustion
- Tool output verbosity
- JSON decoding failures due to context pollution
HyDRA is designed for:
- Deep web and archive research
- Benchmark-driven evaluation (GAIA, SimpleQA, HLE)
- Structured, verifiable final answers
- Extensibility via modular toolchains
The system is suitable for research automation, evaluation of reasoning agents, and experimentation with hierarchical agent architectures.
## Architecture Overview
HyDRA follows a hierarchical, contract-driven agent architecture with strict separation of concerns. Delegation between the Planning Agent and its managed agents follows a strict JSON contract:

```
Action: [Planning_agent > Managed_agent]
{
  "name": "<agent_name>",
  "task": "<delegated_task>"
}

Final_answer: [Planning_agent < Managed_agent]
{
  "answer": "<observation_from_managed_agent>"
}
```
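A minimal sketch of how such a JSON-only contract can be enforced before dispatch; the function below is illustrative, not HyDRA's actual code, and only the field names come from the contract above:

```python
import json

REQUIRED_FIELDS = {"name", "task"}

def parse_action(raw: str) -> dict:
    """Reject anything that is not a well-formed delegation action."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Non-JSON action emitted: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("Action must be a JSON object")
    missing = REQUIRED_FIELDS - action.keys()
    if missing:
        raise ValueError(f"Action missing required fields: {missing}")
    return action

# A well-formed delegation parses cleanly; anything else fails fast.
print(parse_action('{"name": "deep_researcher_agent", "task": "survey X"}'))
```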
Components:
- Planning Agent
  - Interprets the task
  - Delegates actions to managed agents while reviewing every observation to solve the user's task
- Execution Agents
  - `browser_use_agent`: Handles live web interaction
  - `deep_researcher_agent`: Performs multi-source synthesis and deeper reasoning with web use
  - `deep_analyzer_agent`: Analyzes a given source (files / images) against the delegated task [In Progress]
  - `code_agent`: Dynamically creates tools as functions and places them in the agent context, where they can be invoked through the provided `python_interpreter` tool [In Progress]
  - All agents operate in an Action → Observation loop (sketched after this component list)
- Tool Layer
  - Web search, page visiting and navigation, archive search, Python execution
  - Each tool produces validated outputs
- Context Manager
  - Intercepts every tool and agent output
  - Summarizes verbose outputs
  - Minimizes context when token limits are reached
- Final Answer Tool
  - Produces a single, structured final response
  - Terminates execution deterministically
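A minimal, runnable sketch of that Action → Observation loop; the stub planner, stub tool, and function names are invented for illustration and are not HyDRA's actual API:

```python
# Stub planner: delegate once, then produce a final answer.
# HyDRA's real planner is an LLM emitting JSON-only actions.
def plan_next_action(history: list[dict]) -> dict:
    if len(history) == 1:
        return {"name": "web_search", "task": history[0]["content"]}
    return {"name": "final_answer", "task": history[-1]["content"]}

def web_search(query: str) -> str:
    return f"stub search results for: {query}"   # placeholder tool

def run_agent(task: str, tools: dict) -> str:
    history = [{"role": "task", "content": task}]
    for _ in range(20):                           # bounded step budget
        action = plan_next_action(history)        # JSON-only contract
        if action["name"] == "final_answer":
            return action["task"]                 # deterministic termination
        observation = tools[action["name"]](action["task"])
        history.append({"role": "observation", "content": observation})
    raise RuntimeError("Step budget exhausted without a final answer")

print(run_agent("example query", {"web_search": web_search}))
```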
The Context Manager provides three critical functions (a code sketch follows the list):
- Validate
  - Ensures tool/agent outputs conform to expected formats
  - Prevents invalid data from propagating
- Summarize
  - Compresses long tool outputs into task-relevant facts
  - Preserves decision-critical information only
- Minimize
  - Shrinks historical context using validation rules
  - Automatically triggered when the context budget is exceeded
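As a compact illustration, the three functions can be sketched as a single class; the character-based budget and truncation heuristics below are stand-ins for HyDRA's model-backed validation and summarization, and none of these names are the project's real API:

```python
class ContextManager:
    """Illustrative validate/summarize/minimize pipeline."""

    def __init__(self, token_budget: int = 8_000):
        self.token_budget = token_budget   # crude character-based budget

    def validate(self, output: str) -> str:
        # Reject empty or low-signal outputs before they reach history.
        if not output or not output.strip():
            raise ValueError("Empty or low-signal tool output rejected")
        return output.strip()

    def summarize(self, output: str, max_chars: int = 1_000) -> str:
        # Placeholder: real summarization extracts task-relevant facts.
        return output if len(output) <= max_chars else output[:max_chars] + " …"

    def minimize(self, history: list[str]) -> list[str]:
        # Triggered when the running context exceeds the budget.
        while sum(len(h) for h in history) > self.token_budget and len(history) > 1:
            history[0] = self.summarize(history[0], max_chars=200)
            if sum(len(h) for h in history) > self.token_budget:
                history.pop(0)   # drop the oldest entry as a last resort
        return history
```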
HyDRA implements explicit and layered failure-handling mechanisms to mitigate common failure modes in long-horizon agentic workflows.
1. Context Pollution Prevention

   Tools may occasionally return verbose, poorly structured, or markdown-heavy outputs that can pollute the agent context and degrade downstream reasoning.

   - Mitigation:
     - Every tool output is validated before being appended to the observation history. Invalid or low-signal outputs are filtered, summarized, or rejected, preventing unnecessary noise from entering the context.

2. Invalid Structured Output from Models

   Some instruction-following models, particularly freely available Groq-hosted models, may emit malformed or non-structured outputs, including mismatched tool arguments or broken JSON, especially under polluted or compressed context conditions.

   - Mitigation (a recovery sketch follows this section):
     - The system first summarizes the most recent context segment; if the issue persists, the last context entry is removed entirely.
     - A maximum number of retries is enforced per model invocation.
     - If repeated attempts fail to recover valid output, the affected agent instance is terminated while the overall system continues execution. The Planning Agent can re-delegate the task to an alternative agent or execution path.

3. Context Limit Management

   When the context window approaches its limit, HyDRA proactively reduces context size.

   - Mitigation:
     - The Context Manager shrinks historical context to a minimal, information-preserving representation that retains key facts, decisions, and constraints required for task continuation.
These mechanisms substantially improve robustness in long-running research tasks. While failures may still occur due to external constraints such as request-rate exhaustion or rare Planning Agent errors, such failures are isolated and do not compromise the stability of the overall system.
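As referenced under mitigation 2, the recovery ladder can be sketched as follows; `model_call`, the retry bound, and the truncation stand-in are all illustrative assumptions, not HyDRA's actual implementation:

```python
import json

MAX_RETRIES = 3   # illustrative bound; the real limit may differ

def invoke_with_recovery(model_call, context: list[str]) -> dict:
    """Recovery ladder sketch: summarize the newest context segment,
    then drop it entirely, then terminate after bounded retries.
    `model_call` stands in for an LLM invocation returning raw text."""
    for attempt in range(MAX_RETRIES):
        try:
            return json.loads(model_call(context))
        except json.JSONDecodeError:
            if attempt == 0 and context:
                context[-1] = context[-1][:500]   # stand-in for summarization
            elif context:
                context.pop()                     # remove last context entry
    # The Planning Agent can now re-delegate to another agent or path.
    raise RuntimeError("Agent instance terminated: repeated invalid output")
```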
## Project Structure
```
.
├── src/ # Core agent implementation
│ ├── agent.py # Entry point for the hierarchical research agent
│ ├── core/ # Agent core logic
│ │ ├── agent.py # Base Agent logic
│ │ ├── context_manager.py # Context and token budget management: validate tool outputs, minimize context, summarize particular agent / tool outputs
│ │ ├── state.py # Agent state representation
│ │ ├── utils.py # Shared utilities
│ │ └── prompts/ # Prompt templates (YAML + Markdown)
│ ├── tools/ # Tooling layer used by the agent
│ │ ├── registry.py # Tool registration logic
│ │ ├── registry.yaml # Tool configuration
│ │ ├── tools_registry.py # Tool resolution and dispatch
│ │ ├── deep_researcher/ # Deep research toolchain
│ │ ├── archive_searcher/ # Academic / archive search tools
│ │ ├── web_browser/ # Web interaction and parsing tools
│ │ ├── python_interpreter/ # Local Python execution tool
│ │ └── final_answer.py # Final answer tool
│ └── tool_builder.py # Tools registry construction
│
├── Evaluation_Suite/ # Evaluation and benchmarking framework
│ ├── evaluation_gaia.py # GAIA benchmark
│ ├── evaluation_simpleQA.py # SimpleQA benchmark
│ ├── evaluation_hle.py # HLE benchmark
│ ├── evaluation_utils.py # Shared evaluation utilities
│ ├── evaluation_config.yaml # Evaluation configuration
│ └── Evaluation_results/ # Stored evaluation outputs
│
├── notebooks/ # Analysis and benchmarking notebooks
├── docs/ # Project documentation
├── app.py # Frontend demo backend
├── Dockerfile # Containerized execution environment
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── LICENSE # License
```
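The registry files above (`registry.py`, `registry.yaml`, `tools_registry.py`) suggest a name-based registration and dispatch pattern. Here is a minimal sketch of that pattern with invented helper names; the project's actual registry implementation may differ:

```python
# Name-based tool registry sketch; `register` and `dispatch` are
# invented for illustration, not HyDRA's real registry API.
from typing import Callable

TOOL_REGISTRY: dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that records a tool under a stable name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register("web_search")
def web_search(query: str) -> str:
    return f"stub results for {query!r}"   # placeholder body

def dispatch(name: str, argument: str) -> str:
    """Resolve a tool by name and invoke it, failing loudly on unknowns."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    return TOOL_REGISTRY[name](argument)

print(dispatch("web_search", "HyDRA agent"))
```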
## Requirements
The following environment variables are required to run or evaluate the agent:
| Variable | Description |
|---|---|
| `GROQ_API_KEY` | Primary agent inference |
| `GROQ_API_KEY_2` | Context manager (token-efficient planning) |
| `GROQ_BASE_URL` | Groq API base URL |
| `HF_TOKEN` | Hugging Face token (required for evaluation only) |
Example:
```bash
export GROQ_API_KEY=your_key_here
export GROQ_API_KEY_2=your_key_here
export GROQ_BASE_URL=https://api.groq.com
export HF_TOKEN=your_hf_token_here
```

## Setup and Reproducibility
This multi-agent system is fully reproducible and can be run in an isolated virtual environment with no cached dependencies.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Set required environment variables
export GROQ_API_KEY="your_key_here"
export GROQ_API_KEY_2="your_key_here"
export GROQ_BASE_URL="https://api.groq.com"
# Install dependencies without caching (low disk usage)
python -m pip install --no-cache-dir -r requirements.txt
# Run the agent
python src/agent.py
```

```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Set required environment variables
export GROQ_API_KEY="your_key_here"
export GROQ_API_KEY_2="your_key_here"
export GROQ_BASE_URL="https://api.groq.com"
# OPTION 1: Without Docker
# Install dependencies without caching (low disk usage)
python -m pip install --no-cache-dir -r requirements.txt
# Run the server
uvicorn app:app --host 0.0.0.0 --port 10000
# OPTION 2: Run Using Docker
docker build --no-cache --progress=plain -t app:latest .
docker run -p 10000:10000 app:latest
# Run Frontend
git clone https://github.com/Se00n00/Chat/tree/main
git checkout Hydra
npm install
export NG_APP_BACKEND="http://127.0.0.1:10000/chat"   # Important: must point at the backend /chat endpoint
ng serve
```

## Evaluation
Evaluation scripts benchmark the agent on established datasets.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Set environment variables
export GROQ_API_KEY="your_key_here"
export GROQ_API_KEY_2="your_key_here"
export GROQ_BASE_URL="https://api.groq.com"
export HF_TOKEN="your_hf_token_here"
# Install dependencies without caching
python -m pip install --no-cache-dir -r requirements.txt
# Run evaluation
python Evaluation_Suite/evaluation_<dataset>.py <evaluation_name> --time [sleep_seconds]
```

Parameters:
- `<dataset>`: `gaia`, `simpleQA`, `hle`
- `<evaluation_name>`: Unique identifier for the run
- `--time` (optional): Sleep interval in seconds between examples (rate-limit control)
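For example, a GAIA run with a 5-second pause between examples (the run name is an arbitrary placeholder):

```bash
python Evaluation_Suite/evaluation_gaia.py baseline_run_01 --time 5
```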
## Project Status
This project is fully functional. Core agent capabilities are stable and usable.
Evaluation benchmarks are currently in progress.
Completed:

- Hierarchical multi-agent architecture
- Tool registry and dynamic tool invocation
- Web browsing and document analysis tools
- Local Python execution tool
- Deterministic agent execution (seeded runs)
- Reproducible environment setup (no cache, isolated venv)
In progress:

- Code agent
- GAIA benchmark evaluation
- SimpleQA benchmark evaluation
- HLE benchmark evaluation
- Project structure documentation
- Dataset documentation
- Example run notebooks
- Architecture diagram (final)
- Comprehensive evaluation report