Enterprise-grade document parsing service with asynchronous queue processing based on Celery, featuring a fully decoupled API/Worker architecture.
- Asynchronous Processing: Distributed task queue based on Celery
- Multi-format Support: PDF, Office, images, and various document formats
- High Availability: Supports task retry and fault recovery
- Real-time Monitoring: Task status tracking and queue statistics
- Priority Queue: Supports task priority scheduling
- Easy to Extend: Modular design, easy to add new parsing engines
- Docker and Docker Compose
- (Optional) NVIDIA GPU for GPU worker
- Copy environment configuration:
  cp .env.example .env
- Start Redis and API:
  cd docker && docker compose up -d redis mineru-api
- Start Worker (choose CPU or GPU):
  # CPU Worker (recommended for development)
  cd docker && docker compose --profile mineru-cpu up -d
  # GPU Worker (requires NVIDIA GPU)
  cd docker && docker compose --profile mineru-gpu up -d
- Verify services:
  curl http://localhost:8000/api/v1/health
That's it! The API is now running at http://localhost:8000.
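In scripts, the health check can be repeated until the API is ready, since the containers may need a few seconds to start. The following is a minimal sketch using only the Python standard library. The function names, retry count, and delay are illustrative assumptions, and it assumes a healthy service answers with HTTP 200 (this README does not describe the response body).

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/api/v1/health"  # endpoint from the verify step


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 503).
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_healthy(url=HEALTH_URL, attempts=30, delay=2.0,
                       fetch=fetch_status, sleep=time.sleep):
    """Poll url until it answers 200. Return True on success, False otherwise."""
    for attempt in range(attempts):
        if fetch(url) == 200:  # assumption: 200 means healthy
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` after `docker compose up` blocks for at most about one minute with these defaults. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running service.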
MinerU-API provides two API interfaces to suit different use cases:
The /file_parse endpoint is compatible with the official MinerU API format. It submits tasks to the worker and waits for completion, returning results directly in the response.
Reference: MinerU Official API
curl -X POST "http://localhost:8000/file_parse" \
-F "files=@document.pdf" \
-F "backend=pipeline" \
-F "lang_list=ch" \
-F "parse_method=auto" \
-F "return_md=true"
Use cases: Simple integration, immediate results needed, compatible with existing MinerU clients.
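The same synchronous request can be issued from Python. The sketch below builds the multipart form by hand with the standard library, so it needs no third-party packages. The helper names are hypothetical, the form fields mirror the curl command above, and the `application/octet-stream` part type is an assumption.

```python
import urllib.request
import uuid

FILE_PARSE_URL = "http://localhost:8000/file_parse"


def encode_multipart(fields, files):
    """Encode text fields and (field, filename, bytes) parts as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    for name, filename, content in files:
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            b"Content-Type: application/octet-stream",
            b"",
            content,
        ]
    parts += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def build_file_parse_request(filename, content, url=FILE_PARSE_URL):
    """Build (but do not send) a POST request equivalent to the curl example."""
    fields = {
        "backend": "pipeline",
        "lang_list": "ch",
        "parse_method": "auto",
        "return_md": "true",
    }
    body, content_type = encode_multipart(fields, [("files", filename, content)])
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

To send it, read the document bytes (for example with `pathlib.Path("document.pdf").read_bytes()`) and pass the request to `urllib.request.urlopen` with a generous timeout, since this endpoint blocks until parsing finishes. The response schema is not described here, so check http://localhost:8000/docs before relying on specific fields.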
The /api/v1/tasks/submit and /api/v1/tasks/{task_id} endpoints provide an asynchronous queue-based API, compatible with the mineru-tianshu project format.
Reference: mineru-tianshu API
Submit a Task:
curl -X POST "http://localhost:8000/api/v1/tasks/submit" \
-F "file=@document.pdf" \
-F "backend=pipeline" \
-F "lang=ch"
Query Task Status:
curl "http://localhost:8000/api/v1/tasks/{task_id}"
Use cases: Production deployments, batch processing, long-running tasks, better scalability.
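With the asynchronous API, a client typically polls the status endpoint after submitting. The sketch below shows one way to do that with exponential backoff. It rests on assumptions that this README does not confirm: that the status response is JSON with a `status` field, and that `completed` and `failed` are the terminal values. The function names and timing defaults are illustrative.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"

# Assumption: these status values mark a finished task. Check /docs for the real schema.
TERMINAL_STATES = frozenset({"completed", "failed"})


def fetch_task(task_id, base_url=BASE_URL, timeout=10.0):
    """GET /api/v1/tasks/{task_id} and decode the JSON body."""
    url = f"{base_url}/api/v1/tasks/{task_id}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_for_task(task_id, fetch=fetch_task, interval=2.0, max_interval=30.0,
                  timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the task reaches a terminal state, doubling the wait each round."""
    deadline = clock() + timeout
    while True:
        task = fetch(task_id)
        if task.get("status") in TERMINAL_STATES:
            return task
        if clock() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {task.get('status')!r} after {timeout}s"
            )
        sleep(interval)
        interval = min(interval * 2, max_interval)
```

Backoff keeps the request rate low for long-running documents while still returning quickly for short ones. The `fetch`, `sleep`, and `clock` parameters are injection points for testing and can be left at their defaults in normal use.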
Visit http://localhost:8000/docs for interactive API documentation with full parameter details.
The most important configuration options (see .env.example for all options):
# Redis Configuration
REDIS_URL=redis://redis:6379/0
# Storage Type: local or s3
MINERU_STORAGE_TYPE=local
# For S3 storage (distributed deployment)
MINERU_S3_ENDPOINT=http://minio:9000
MINERU_S3_ACCESS_KEY=minioadmin
MINERU_S3_SECRET_KEY=minioadmin
# CORS Configuration (production)
CORS_ALLOWED_ORIGINS=http://localhost:3000
ENVIRONMENT=production
# File Upload Limits
MAX_FILE_SIZE=104857600 # 100MB
- Full Documentation - Complete guide and configuration (English | Chinese)
- Deployment Guide - Production deployment (Chinese)
- Configuration Reference - All configuration options (Chinese)
- API Examples - Code examples in multiple languages (Chinese)
- Troubleshooting - Common issues and solutions (Chinese)
- Storage & Cleanup - Storage configuration and cleanup (Chinese)
- API Service: Handles task submission and status queries (api/app.py)
- Worker Service: Processes documents using MinerU/MarkItDown (worker/tasks.py)
- Redis: Message queue and result storage
- Shared Config: Unified configuration in shared/celeryconfig.py
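To illustrate how a single configuration module can serve both the API and the Worker, here is a hypothetical sketch of what a shared Celery configuration might contain. It is not the project's actual `shared/celeryconfig.py`. The setting names are standard Celery options, but the chosen values, the default priority, and the queue name are assumptions made for illustration.

```python
import os

# Redis acts as both message broker and result store, matching the component list.
broker_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")
result_backend = broker_url

# JSON-only payloads keep the API and Worker independent of each other's Python objects.
task_serializer = "json"
result_serializer = "json"
accept_content = ["json"]

# Report a STARTED state so status queries can tell queued tasks from running ones.
task_track_started = True

# Acknowledge a message only after the task finishes, and requeue it if the
# worker process is lost, so an interrupted parse can be redelivered.
task_acks_late = True
task_reject_on_worker_lost = True

# Parsing jobs are long-running, so reserve one task at a time per worker process.
worker_prefetch_multiplier = 1

# The Redis transport emulates priorities with several lists; 0 is the highest.
broker_transport_options = {
    "priority_steps": list(range(10)),
    "queue_order_strategy": "priority",
}
task_default_priority = 5  # assumed default, for illustration

# Hypothetical queue name, for illustration only.
task_default_queue = "mineru"
```

Both services could then load the same module with `app.config_from_object("shared.celeryconfig")`, which is what lets the API enqueue tasks without importing any worker code.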
# Install dependencies
pip install -r api/requirements.txt
pip install -r worker/requirements.txt
pip install -r cleanup/requirements.txt
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is built on top of the following excellent open-source projects:
- MinerU - The core document parsing engine that powers this service
- mineru-tianshu - Inspiration and reference for the API architecture
We are grateful to the developers and contributors of these projects for their valuable work.
MIT License - see LICENSE file for details.
This project uses the following open-source libraries:
MinerU is used as an external library and its source code is not included in this repository.