Replies: 1 comment 1 reply
For scaling Docling PDF parsing with OCR and table extraction in Databricks, the most efficient and robust approach is to run Docling in distributed mode: either deploy docling-serve as a containerized service, or make sure each Spark worker instantiates its own Docling pipeline. Docling's PDF pipelines (StandardPdfPipeline/ThreadedStandardPdfPipeline) are designed for thread safety: heavy models such as OCR are initialized once per pipeline instance and shared read-only by worker threads, so you can safely parallelize at the document level (one pipeline per worker/process) without runtime crashes or shared-state issues [source].

For best throughput, tune the pipeline options so you only run the models you actually need (OCR and table structure in your case) and take advantage of any available accelerator.

If you use EasyOCR, be aware of a known issue in Spark/Databricks: EasyOCR tries to create a model storage directory and download its models at runtime, which can fail on worker nodes, so point it at a writable location with the models pre-downloaded.

For large-scale distributed workloads, docling-serve is recommended: it is built for cluster environments and supports persistent model storage, environment variable configuration, and multiple compute engines (local, Redis Queue, Kubeflow Pipelines) [source], [source].

Finally, monitor memory usage and batch sizes: processing very large PDFs or too many pages in a batch can exhaust memory, so consider splitting documents or limiting batch sizes if you hit resource constraints [source].

Let me know if you need example code or deployment patterns for Spark/Databricks! To reply, just mention @dosu.
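For reference, here is a minimal sketch of a per-worker converter setup, assuming the EasyOCR models have been pre-downloaded to a DBFS path; the paths are hypothetical and the `model_storage_directory`/`download_enabled` options may differ across Docling versions:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def build_converter() -> DocumentConverter:
    """Build one converter (and thus one set of heavy models) per worker process."""
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True              # scanned PDFs need OCR
    pipeline_options.do_table_structure = True  # keep table extraction on

    # Assumed/hypothetical path: point EasyOCR at a writable, pre-populated
    # model directory so workers don't try to download models at runtime.
    pipeline_options.ocr_options = EasyOcrOptions(
        model_storage_directory="/dbfs/models/easyocr",
        download_enabled=False,
    )

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )


if __name__ == "__main__":
    converter = build_converter()
    result = converter.convert("/dbfs/pdfs/sample.pdf")  # hypothetical input path
    print(result.document.export_to_markdown())
```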
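And a sketch of document-level parallelism with Spark's mapInPandas, assuming the PDF paths are readable from the workers (e.g. on DBFS or a Volume) and reusing the hypothetical build_converter() helper above; each task builds its own converter once and reuses it for every document in its partition:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

result_schema = StructType([
    StructField("path", StringType()),
    StructField("markdown", StringType()),
])


def parse_pdfs(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # One converter per task: the OCR and table-structure models load once
    # and are reused for every document handled by this worker.
    converter = build_converter()  # hypothetical helper from the sketch above
    for batch in batches:
        markdowns = []
        for path in batch["path"]:
            result = converter.convert(path)
            markdowns.append(result.document.export_to_markdown())
        yield pd.DataFrame({"path": batch["path"], "markdown": markdowns})


# Hypothetical input: one row per PDF path visible to the workers.
paths_df = spark.createDataFrame(
    [("/dbfs/pdfs/doc-0001.pdf",), ("/dbfs/pdfs/doc-0002.pdf",)], ["path"]
)
parsed_df = paths_df.repartition(64).mapInPandas(parse_pdfs, schema=result_schema)
parsed_df.write.mode("overwrite").saveAsTable("parsed_pdfs")
```

Using noticeably more partitions than executor cores helps balance the load when some scanned PDFs take much longer to OCR than others.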
Hi,
I'm building a Databricks pipeline to parse a few thousand scanned (OCR required) PDFs into Markdown. Quality looks quite promising, but a test document took about 18 minutes to process, which I'm afraid won’t scale well. I’m exploring using Spark to parallelize document parsing, but if the Docling OCR model is instantiated across multiple threads, I'm pretty sure my runtime will crash.
Has anyone found an efficient approach to parallelizing Docling PDF parsing at scale in Databricks? I already tried switching the PDF backend as recommended in a different thread, but that only reduced the time slightly. Since OCR and table parsing are needed too, we can't turn those off to save some time either.