A Serverless AI Agent system built on Claude Agent SDK, implementing stateful conversation persistence across stateless containers using S3+DynamoDB.
A Serverless AI Agent system built on Claude Agent SDK, implementing stateful conversation persistence across stateless containers using S3+DynamoDB.
Exploratory Project | This project explores how to achieve stateful AI Agent sessions using FileSystem + Stateless Containers (AWS Lambda with Firecracker runtime as the foundation). It demonstrates how to maintain conversation persistence across stateless function invocations.
Telegram User → Bot API → API Gateway → Producer Lambda → SQS Queue → Consumer Lambda
↓ ↓
Return 200 agent-server Lambda
immediately ↓
DynamoDB (Session mapping) + S3 (Session files) + Bedrock (Claude)
Core Design:
- Uses the Hybrid Sessions pattern recommended by Claude Agent SDK
- SQS Async Architecture: Producer returns 200 immediately to Telegram, Consumer processes requests asynchronously
- Session Persistence: DynamoDB for mapping storage, S3 for conversation history, cross-request recovery support
- Multi-tenant Isolation: Client isolation based on Telegram chat_id + thread_id
- SubAgent Support: Configurable specialized Agents (e.g., AWS support) with example implementations
- Skills Support: Reusable skill modules with hello-world example
- MCP Integration: Support for HTTP and local command-based MCP servers
- Auto Cleanup: 25-day TTL + S3 lifecycle management
- SQS Queue: Async processing + auto retry + dead letter queue
- Quick Start: Provides example Skill/SubAgent/MCP configurations for adding other components
├── agent-sdk-server/ # Agent Runtime (Docker Container)
│ ├── handler.py # Lambda Entry Point
│ ├── agent_session.py # SDK Wrapper
│ ├── session_store.py # Session Persistence
│ └── claude-config/ # Configuration Files
│ ├── agents.json # SubAgent Definitions
│ ├── mcp.json # MCP Server Configuration
│ ├── skills/ # Skills Definitions
│ │ └── hello-world/ # Example Skill
│ └── system_prompt.md # System Prompt
│
├── agent-sdk-client/ # Telegram Client (ZIP Deployment)
│ ├── handler.py # Producer: Webhook receiver, writes to SQS
│ ├── consumer.py # Consumer: SQS consumer, calls Agent
│ └── config.py # Configuration management
│
├── docs/ # Documentation
│ └── anthropic-agent-sdk-official/ # SDK Official Docs Reference
│
├── template.yaml # SAM Deployment Template
└── samconfig.toml # SAM Configuration
- AWS CLI + SAM CLI
- Docker
- Amazon Bedrock access (Claude models)
- Telegram Bot Token
- Copy and modify configuration files:
cp .env.example .env
# Edit .env to fill in required environment variables- Build and deploy:
sam build
sam deploy --guided| Variable | Description |
|---|---|
SESSION_BUCKET |
S3 bucket name (auto-created) |
SESSION_TABLE |
DynamoDB table name (auto-created) |
BEDROCK_ACCESS_KEY_ID |
Bedrock access key |
BEDROCK_SECRET_ACCESS_KEY |
Bedrock secret key |
SDK_CLIENT_AUTH_TOKEN |
Internal authentication token |
TELEGRAM_BOT_TOKEN |
Telegram Bot Token |
QUEUE_URL |
SQS queue URL (auto-created) |
- Runtime: Python 3.12 + Claude Agent SDK
- Computing: AWS Lambda (ARM64)
- Storage: S3 + DynamoDB
- Message Queue: AWS SQS (Standard Queue + DLQ)
- AI: Claude via Amazon Bedrock
- Orchestration: AWS SAM
- Integration: Telegram Bot API + MCP
Problem Solved: Telegram Webhook times out and retries after ~27s, while Agent processing may take 30-70s, causing duplicate responses.
Solution:
- Producer Lambda receives Webhook, writes to SQS, returns 200 immediately (<1s)
- Consumer Lambda consumes from SQS, calls Agent Server, sends response to Telegram
- Retry 3 times on failure, then move to dead letter queue (DLQ)
Queue Configuration:
- VisibilityTimeout: 900s (= Lambda timeout)
- maxReceiveCount: 3 (retry 3 times)
- DLQ Alarm: CloudWatch alarm triggers when messages enter DLQ
Lifecycle:
- New message → Query DynamoDB mapping
- Mapping exists → Download
conversation.jsonlfrom S3 → Restore session - No mapping → Create new session → Save mapping to DynamoDB
- Processing done → Upload updates to S3
Persistent Files:
conversation.jsonl- Conversation history (required for restoration)debug.txt- Debug logstodos.json- Task status
Edit agent-sdk-server/claude-config/agents.json:
{
"agent-name": {
"description": "Agent description",
"prompt_file": "agents/prompt.md",
"tools": ["specific tool name"],
"model": "haiku"
}
}Note: The tools field does not support wildcards; you must specify complete tool names.
Create a new Skill in the agent-sdk-server/claude-config/skills/ directory:
- Create a folder:
skills/your-skill/ - Create a
SKILL.mdfile with YAML frontmatter and Markdown description - Claude Agent SDK will auto-discover and use these Skills
Example: skills/hello-world/SKILL.md
Edit agent-sdk-server/claude-config/mcp.json, supporting two types:
- HTTP MCP: HTTP endpoint pointing to remote MCP servers
- Command-line MCP: Start local MCP servers via
commandandargs
Examples include AWS knowledge base MCP servers. Refer to existing configurations to add more MCP servers.
The project includes the following example components; follow these examples to add other components:
- SubAgent Example:
aws-supportAgent inagents.json - Skill Example:
skills/hello-world/SKILL.md - MCP Example: AWS knowledge base and documentation MCP servers in
mcp.json
- Multi-tenant TenantID isolation
MIT
基于 Claude Agent SDK 构建的 Serverless AI Agent 系统,通过 S3+DynamoDB 实现无状态容器的"有状态"会话持久化。
探索性项目 | 本项目旨在探索如何通过 FileSystem + 无状态容器(以 Firecracker 为底层的 AWS Lambda)实现有状态 AI Agent 会话。项目展示了在无状态函数调用间维持对话持久化的实现方式。
Telegram User → Bot API → API Gateway → Producer Lambda → SQS Queue → Consumer Lambda
↓ ↓
立即返回 200 agent-server Lambda
↓
DynamoDB (Session映射) + S3 (Session文件) + Bedrock (Claude)
核心设计:
- 采用 Claude Agent SDK 官方推荐的 Hybrid Sessions 模式
- SQS 异步架构:Producer 立即返回 200 给 Telegram,Consumer 异步处理请求
- Session 持久化:DynamoDB 存储映射,S3 存储对话历史,支持跨请求恢复
- 多租户隔离:基于 Telegram chat_id + thread_id 实现客户端隔离
- SubAgent 支持:可配置多个专业 Agent(如 AWS 支持),包含示例实现
- Skills 支持:可复用的技能模块,包含 hello-world 示例
- MCP 集成:支持 HTTP 和本地命令类型的 MCP 服务器
- 自动清理:25天 TTL + S3 生命周期管理
- SQS 队列:异步处理 + 自动重试 + 死信队列
- 快速开始:提供示例 Skill/SubAgent/MCP 配置,可按照示例添加其他组件
├── agent-sdk-server/ # Agent Runtime (Docker容器)
│ ├── handler.py # Lambda入口
│ ├── agent_session.py # SDK包装器
│ ├── session_store.py # Session持久化
│ └── claude-config/ # 配置文件
│ ├── agents.json # SubAgent定义
│ ├── mcp.json # MCP服务器配置
│ ├── skills/ # Skills定义
│ │ └── hello-world/ # 示例 Skill
│ └── system_prompt.md # 系统提示
│
├── agent-sdk-client/ # Telegram客户端 (ZIP部署)
│ ├── handler.py # Producer: Webhook接收,写入SQS
│ ├── consumer.py # Consumer: SQS消费,调用Agent
│ └── config.py # 配置管理
│
├── docs/ # 文档
│ └── anthropic-agent-sdk-official/ # SDK官方文档参考
│
├── template.yaml # SAM部署模板
└── samconfig.toml # SAM配置
- AWS CLI + SAM CLI
- Docker
- Amazon Bedrock 访问权限(Claude模型)
- Telegram Bot Token
- 复制并修改配置文件:
cp .env.example .env
# 编辑 .env 填入必要的环境变量- 构建和部署:
sam build
sam deploy --guided| 变量 | 说明 |
|---|---|
SESSION_BUCKET |
S3桶名称(自动创建) |
SESSION_TABLE |
DynamoDB表名(自动创建) |
BEDROCK_ACCESS_KEY_ID |
Bedrock访问密钥 |
BEDROCK_SECRET_ACCESS_KEY |
Bedrock密钥 |
SDK_CLIENT_AUTH_TOKEN |
内部认证Token |
TELEGRAM_BOT_TOKEN |
Telegram Bot Token |
QUEUE_URL |
SQS队列URL(自动创建) |
- Runtime: Python 3.12 + Claude Agent SDK
- 计算: AWS Lambda (ARM64)
- 存储: S3 + DynamoDB
- 消息队列: AWS SQS (Standard Queue + DLQ)
- AI: Claude via Amazon Bedrock
- 编排: AWS SAM
- 集成: Telegram Bot API + MCP
解决的问题:Telegram Webhook 在 ~27s 后超时重试,而 Agent 处理可能需要 30-70s,导致重复响应。
解决方案:
- Producer Lambda 接收 Webhook,写入 SQS,立即返回 200(<1s)
- Consumer Lambda 从 SQS 消费,调用 Agent Server,发送响应给 Telegram
- 失败重试 3 次,最终失败进入死信队列(DLQ)
队列配置:
- VisibilityTimeout: 900s(= Lambda 超时)
- maxReceiveCount: 3(重试 3 次)
- DLQ 告警:消息进入 DLQ 时触发 CloudWatch 告警
生命周期:
- 新消息 → 查询 DynamoDB 映射
- 存在映射 → 从 S3 下载
conversation.jsonl→ 恢复会话 - 不存在 → 创建新 session → 保存映射到 DynamoDB
- 处理完成 → 上传更新到 S3
持久化文件:
conversation.jsonl- 对话历史(恢复必需)debug.txt- 调试日志todos.json- 任务状态
编辑 agent-sdk-server/claude-config/agents.json:
{
"agent-name": {
"description": "Agent描述",
"prompt_file": "agents/prompt.md",
"tools": ["具体工具名称"],
"model": "haiku"
}
}注意:tools 字段不支持通配符,必须指定完整工具名称。
在 agent-sdk-server/claude-config/skills/ 目录下创建新 Skill:
- 创建文件夹:
skills/your-skill/ - 在文件夹中创建
SKILL.md文件,包含 YAML 前置和 Markdown 描述 - Claude Agent SDK 会自动发现并使用这些 Skills
参考示例:skills/hello-world/SKILL.md
编辑 agent-sdk-server/claude-config/mcp.json,支持两种类型:
- HTTP MCP:指向远程 MCP 服务器的 HTTP 端点
- 命令行 MCP:通过
command和args启动本地 MCP 服务器
示例中配置了 AWS 知识库 MCP 服务器。可参考现有配置添加更多 MCP 服务器。
项目已包含以下示例组件,可按照这些示例添加其他组件:
- SubAgent 示例:
agents.json中的aws-supportAgent - Skill 示例:
skills/hello-world/SKILL.md - MCP 示例:
mcp.json中的 AWS 知识库和文档 MCP 服务器
- 多租户 TenantID 隔离
MIT