Closed
Labels
enhancement (New feature or request)
Description
When parsing certain PDF documents, some pages fail to parse and are silently skipped. This causes two primary problems:
- Debugging Difficulty: The error log does not include which page number failed, making it hard to identify problematic pages.
- Export Inconsistency: The `export_to_doctags()` function does not generate `<page_break>` tags for skipped pages, causing page count mismatches in downstream processing.
Steps to Reproduce
- Convert the attached `test.pdf` using `DocumentConverter`.
- Check the error logs.
- Export using `export_to_doctags()`.
- Count the `<page_break>` tags.
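from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumption: default pipeline options and the attached file; the report does not show these.
pipeline_options = PdfPipelineOptions()
source = "test.pdf"

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        ),
    },
)
doc = converter.convert(source).document
doctags_output = doc.export_to_doctags()
print(doctags_output)
Expected Behavior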
- Error Logs: Should clearly indicate which page number failed to parse.
- Export: `export_to_doctags()` should generate `<page_break>` tags for all pages, including failed ones, so that page numbering remains consistent.
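As an illustration of the requested export behavior, the sketch below assembles page segments so that a failed page still contributes a `<page_break>`. It is not docling's implementation: `join_pages` is a hypothetical helper, and representing a failed page as `None` (rendered as an empty segment) is an assumption made only for this example.

```python
from typing import Optional

PAGE_BREAK = "<page_break>"


def join_pages(pages: dict[int, Optional[str]]) -> str:
    """Join per-page content in page order, keeping a slot for failed pages.

    `pages` maps a page number to its content, or to None when parsing failed
    (hypothetical representation, not a docling data structure).
    """
    segments = [pages[n] or "" for n in sorted(pages)]
    return f"\n{PAGE_BREAK}\n".join(segments)


# Page 3 failed, but it still occupies a position in the output.
result = join_pages({
    1: "... content ... 1page",
    2: "... content ... 2page",
    3: None,
    4: "... content ... 4page",
    5: "... content ... 5page",
})

print(result.count(PAGE_BREAK))  # 4 breaks for 5 pages
```

With this layout, the n-th segment always corresponds to page n, even when a segment is empty.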
Actual Behavior
1. Log output shows only run ID, not page number:
ERROR - Stage preprocess failed for run 1: Invalid code point
2. Missing <page_break> for failed pages:
If pages 1, 2, 4, 5 succeed but page 3 fails, the output currently looks like this:
... content ... 1page
<page_break>
... content ... 2page
<page_break>
... content ... 4page
<page_break>
... content ... 5page
Note: This breaks any downstream logic that relies on `<page_break>` tags to track page positions (e.g., page counting, content alignment).
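To make the mismatch concrete, here is a minimal sketch of the kind of downstream logic that breaks. It assumes a consumer that numbers pages by splitting the DocTags string on `<page_break>`; the helper name `split_pages` and the sample string are hypothetical and only mirror the example output above.

```python
PAGE_BREAK = "<page_break>"


def split_pages(doctags: str) -> dict[int, str]:
    """Number page segments by position, starting at 1 (hypothetical consumer logic)."""
    segments = doctags.split(PAGE_BREAK)
    return {i: seg.strip() for i, seg in enumerate(segments, start=1)}


# Sample mirroring the output above: page 3 failed and left no <page_break>.
actual = "\n<page_break>\n".join(
    f"... content ... {n}page" for n in (1, 2, 4, 5)
)

pages = split_pages(actual)
print(len(pages))  # 4 segments, although the source PDF has 5 pages
print(pages[3])    # "... content ... 4page": page 4 is mistaken for page 3
```

Every page after the first failure is shifted by one position, so the error compounds with each additional skipped page.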
Environment
| Component | Version / Details |
|---|---|
| docling version | 2.31.1 |
| docling-core version | 2.31.0 |
| Python | 3.11 |
| OS | macOS |
Attachments
`test.pdf`: A minimal reproduction file derived from a larger document. The content has been intentionally corrupted for confidentiality, but the parsing error still reproduces.
- Structure: Consists of 4 pages (Pages 78, 79, 83, 84).
- Behavior: Parsing succeeds for pages 78 and 84, but fails for pages 79 and 83.
- Current Output:
<doctag>
... content ... 78page
<page_break>
... content ... 84page
</doctag>