wzdavid/mineru-api

MinerU Parsing Service

CI License: MIT Python 3.10+

Enterprise-grade document parsing service with asynchronous queue processing based on Celery, featuring a fully decoupled API/Worker architecture.

Features

  • 🚀 Asynchronous Processing: Distributed task queue based on Celery
  • 📄 Multi-format Support: PDF, Office, images, and various document formats
  • 🔄 High Availability: Supports task retry and fault recovery
  • 📊 Real-time Monitoring: Task status tracking and queue statistics
  • 🎯 Priority Queue: Supports task priority scheduling
  • 🔧 Easy to Extend: Modular design, easy to add new parsing engines

Quick Start

Prerequisites

  • Docker and Docker Compose
  • (Optional) NVIDIA GPU for GPU worker

Start Services

  1. Copy environment configuration:

    cp .env.example .env
  2. Start Redis and API:

    cd docker && docker compose up -d redis mineru-api
  3. Start Worker (choose CPU or GPU):

    # CPU Worker (recommended for development)
    cd docker && docker compose --profile mineru-cpu up -d
    
    # GPU Worker (requires NVIDIA GPU)
    cd docker && docker compose --profile mineru-gpu up -d
  4. Verify services:

    curl http://localhost:8000/api/v1/health

That's it! The API is now running at http://localhost:8000.
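Containers can take a few seconds to accept connections after `docker compose up -d` returns, so a script that depends on the API may want to poll the health endpoint instead of calling it once. The following is a minimal sketch using only the Python standard library; the function names, retry count, and delay are illustrative choices, and it assumes only that the health endpoint answers with HTTP 200 once the service is ready.

```python
import time
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def wait_until_healthy(
    url: str = "http://localhost:8000/api/v1/health",
    attempts: int = 10,
    delay: float = 2.0,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll the health endpoint until it answers 200 or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Connection refused, timeout, or HTTP error: the container
            # may still be starting, so fall through and retry.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` returns `True` as soon as the API responds and `False` if it never does within the configured attempts. The `fetch` parameter exists only so the retry logic can be exercised without a running server.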

API Usage

MinerU-API exposes two interfaces for different use cases:

1. Official MinerU API (Synchronous)

The /file_parse endpoint is compatible with the official MinerU API format. It submits tasks to the worker and waits for completion, returning results directly in the response.

Reference: MinerU Official API

curl -X POST "http://localhost:8000/file_parse" \
  -F "files=@document.pdf" \
  -F "backend=pipeline" \
  -F "lang_list=ch" \
  -F "parse_method=auto" \
  -F "return_md=true"

Use cases: Simple integration, immediate results needed, compatible with existing MinerU clients.
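For clients that cannot shell out to curl, the same request can be assembled in Python. The sketch below uses only the standard library and mirrors the form fields from the curl example above. The helper names `build_multipart` and `build_file_parse_request` are illustrative and not part of this project. It builds the request without sending it, because the response schema is not described in this README.

```python
import uuid
import urllib.request
from pathlib import Path


def build_multipart(
    fields: dict[str, str], file_field: str, filename: str, content: bytes
) -> tuple[bytes, str]:
    """Encode form fields plus one file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    Assumes names and values contain no quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_file_parse_request(
    pdf_path: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    path = Path(pdf_path)
    fields = {
        "backend": "pipeline",
        "lang_list": "ch",
        "parse_method": "auto",
        "return_md": "true",
    }
    body, content_type = build_multipart(
        fields, "files", path.name, path.read_bytes()
    )
    return urllib.request.Request(
        f"{base_url}/file_parse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(request, timeout=...)` sends it. Because this endpoint waits for parsing to finish, a generous timeout is advisable for large documents.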

2. Async Queue API (Asynchronous)

The /api/v1/tasks/submit and /api/v1/tasks/{task_id} endpoints provide an asynchronous queue-based API, compatible with the mineru-tianshu project format.

Reference: mineru-tianshu API

Submit a Task:

curl -X POST "http://localhost:8000/api/v1/tasks/submit" \
  -F "file=@document.pdf" \
  -F "backend=pipeline" \
  -F "lang=ch"

Query Task Status:

curl "http://localhost:8000/api/v1/tasks/{task_id}"

Use cases: Production deployments, batch processing, long-running tasks, better scalability.
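With the asynchronous API, the client is responsible for checking the status endpoint until the task finishes. The sketch below polls with exponential backoff using only the standard library. The response schema is not documented in this README, so the `status` field and the terminal state names `completed` and `failed` are assumptions; check http://localhost:8000/docs for the actual names and adjust accordingly.

```python
import json
import time
import urllib.request
from typing import Any, Callable


def get_json(url: str, timeout: float = 10.0) -> dict[str, Any]:
    """GET `url` and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def wait_for_task(
    task_id: str,
    base_url: str = "http://localhost:8000",
    # Assumed state names; replace with the values the API really returns.
    terminal_states: frozenset[str] = frozenset({"completed", "failed"}),
    interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    fetch: Callable[[str], dict[str, Any]] = get_json,
) -> dict[str, Any]:
    """Poll the task endpoint until a terminal state or the deadline."""
    url = f"{base_url}/api/v1/tasks/{task_id}"
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch(url)
        # "status" is an assumed field name, not confirmed by this README.
        if payload.get("status") in terminal_states:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still running after {timeout} s")
        time.sleep(interval)
        # Double the wait each round, capped at max_interval.
        interval = min(interval * 2, max_interval)
```

Backing off keeps a batch of waiting clients from hammering the API while long documents are processed. The `fetch` parameter is there so the loop can be tested without a live service.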

View API Documentation

Visit http://localhost:8000/docs for interactive API documentation with full parameter details.

Basic Configuration

Environment Variables

The most important configuration options (see .env.example for all options):

# Redis Configuration
REDIS_URL=redis://redis:6379/0

# Storage Type: local or s3
MINERU_STORAGE_TYPE=local

# For S3 storage (distributed deployment)
MINERU_S3_ENDPOINT=http://minio:9000
MINERU_S3_ACCESS_KEY=minioadmin
MINERU_S3_SECRET_KEY=minioadmin

# CORS Configuration (production)
CORS_ALLOWED_ORIGINS=http://localhost:3000
ENVIRONMENT=production

# File Upload Limits
MAX_FILE_SIZE=104857600  # 100 MiB (100 * 1024 * 1024 bytes)
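To illustrate how these variables fit together, here is a minimal sketch of reading and validating them in Python. It is not the project's actual configuration loader: the `Settings` class, the `load_settings` function, and the fallback defaults are hypothetical. Only the variable names and the two storage types come from the listing above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variables the S3 backend needs, taken from the listing above.
S3_KEYS = ("MINERU_S3_ENDPOINT", "MINERU_S3_ACCESS_KEY", "MINERU_S3_SECRET_KEY")


@dataclass(frozen=True)
class Settings:
    redis_url: str
    storage_type: str
    max_file_size: int  # bytes


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the core settings from an environment mapping and validate them."""
    storage_type = env.get("MINERU_STORAGE_TYPE", "local")
    if storage_type not in {"local", "s3"}:
        raise ValueError(f"unsupported MINERU_STORAGE_TYPE: {storage_type!r}")
    if storage_type == "s3":
        missing = [key for key in S3_KEYS if not env.get(key)]
        if missing:
            raise ValueError(f"S3 storage selected but missing: {', '.join(missing)}")
    return Settings(
        # Fallback values below are assumptions for this sketch.
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        storage_type=storage_type,
        max_file_size=int(env.get("MAX_FILE_SIZE", str(100 * 1024 * 1024))),
    )
```

Failing at startup when S3 is selected without its credentials is usually easier to diagnose than a storage error on the first upload.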

Documentation

Architecture

  • API Service: Handles task submission and status queries (api/app.py)
  • Worker Service: Processes documents using MinerU/MarkItDown (worker/tasks.py)
  • Redis: Message queue and result storage
  • Shared Config: Unified configuration in shared/celeryconfig.py
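A shared Celery configuration module is what lets the API and the worker agree on the broker, the result backend, and priority handling without importing each other's code. The fragment below is a hypothetical sketch of what such a module could contain, not the contents of this repository's shared/celeryconfig.py. The setting names are standard Celery options, while the chosen values are assumptions.

```python
import os

# Broker and result backend both point at the Redis instance from .env.
broker_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")
result_backend = broker_url

# Acknowledge a task only after it finishes, and requeue it if the worker
# process dies, so a crashed worker does not silently drop a document.
task_acks_late = True
task_reject_on_worker_lost = True

# Fetch one task at a time so long-running parses do not hoard the queue.
worker_prefetch_multiplier = 1

# Report a "started" state so status queries can tell queued from running.
task_track_started = True

# Redis has no native priorities; Celery emulates them with one list per step.
broker_transport_options = {
    "priority_steps": list(range(10)),
    "queue_order_strategy": "priority",
}
```

Both services could then load the same module with `app.config_from_object("shared.celeryconfig")`, assuming the `shared` package is importable from each service.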

Development

# Install dependencies
pip install -r api/requirements.txt
pip install -r worker/requirements.txt
pip install -r cleanup/requirements.txt

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

This project is built on top of the following excellent open-source projects:

  • MinerU - The core document parsing engine that powers this service
  • mineru-tianshu - Inspiration and reference for the API architecture

We are grateful to the developers and contributors of these projects for their valuable work.

License

MIT License - see LICENSE file for details.

Third-Party Licenses

This project uses the following open-source libraries:

  • MinerU - Licensed under AGPL-3.0
  • MarkItDown - Licensed under MIT

MinerU is used as an external library and its source code is not included in this repository.
