Parallel parsing of multiple documents #2757

Us3r369 · 2025-12-08T21:12:29Z

Us3r369
Dec 8, 2025

Hi,

I'm building a Databricks pipeline to parse a few thousand scanned (OCR required) PDFs into Markdown. Quality looks quite promising, but a test document took about 18 minutes to process, which I'm afraid won’t scale well. I’m exploring using Spark to parallelize document parsing, but if the Docling OCR model is instantiated across multiple threads, I'm pretty sure my runtime will crash.

Has anyone found an efficient approach to parallelizing Docling PDF parsing at scale in Databricks? I already tried switching the PDF backend as recommended in a different thread, but that only reduced the time slightly. Since ORC and table parsing are needed too, we can't turn those off the save some time either.

omkumar40 · 2025-12-08T21:16:01Z

dosubot[bot]
bot Dec 8, 2025

For scaling Docling PDF parsing with OCR and table extraction in Databricks, the most efficient and robust approach is to run Docling in distributed mode—either by deploying docling-serve as a containerized service or by ensuring each Spark worker instantiates its own Docling pipeline instance. Docling’s threaded PDF pipeline (StandardPdfPipeline/ThreadedStandardPdfPipeline) is designed for thread safety: heavy models like OCR are initialized once per pipeline instance and shared read-only by worker threads, so you can safely parallelize at the document level (one pipeline per worker/process) without runtime crashes or shared-state issues [source].

For best throughput, tune pipeline options: set ocr_batch_size, layout_batch_size, and table_batch_size (default is 4, but experiment for your workload), and use accelerator options to leverage all available CPU cores or GPU if present. You can control thread count via DOCLING_NUM_THREADS or OMP_NUM_THREADS environment variables [source]. GPU acceleration (CUDA-enabled images) can drastically speed up OCR and vision-language tasks [source].

If you use EasyOCR, be aware of a known issue in Spark/Databricks: EasyOCR tries to create a ~/.EasyOCR directory, which may be read-only. Fix this by pre-creating a user_network directory and setting the EASYOCR_MODULE_PATH environment variable to its parent [source].

For large-scale distributed workloads, docling-serve is recommended—it’s built for cluster environments and supports persistent model storage, environment variable configuration, and multiple compute engines (local, Redis Queue, Kubeflow Pipelines) [source], [source].

Switching to the pypdfium2 backend can improve PDF conversion speed and stability, especially for large or complex documents [source].

Finally, monitor memory usage and batch sizes—processing very large PDFs or too many pages in a batch can exhaust memory, so consider splitting documents or limiting batch sizes if you hit resource constraints [source].

Let me know if you need example code or deployment patterns for Spark/Databricks!

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

1 reply

omkumar40 Dec 20, 2025

@dosu Please share all the codes for using Docling/docling-serve in Databricks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Parallel parsing of multiple documents #2757

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Parallel parsing of multiple documents #2757

Uh oh!

Us3r369 Dec 8, 2025

Replies: 1 comment · 1 reply

Uh oh!

dosubot[bot] bot Dec 8, 2025

Uh oh!

omkumar40 Dec 20, 2025

Us3r369
Dec 8, 2025

Replies: 1 comment 1 reply

dosubot[bot]
bot Dec 8, 2025