Failed pages are silently skipped without page number in logs and missing page_break in exports #2857

@jhchoi1182

Description

When parsing certain PDF documents, some pages fail to parse and are silently skipped. This causes two primary problems:

  1. Debugging Difficulty: The error log does not include which page number failed, making it hard to identify problematic pages.
  2. Export Inconsistency: The export_to_doctags() function does not generate <page_break> tags for skipped pages, causing page count mismatches in downstream processing.

Steps to Reproduce

  1. Convert the attached test.pdf using DocumentConverter.
  2. Check the error logs.
  3. Export using export_to_doctags().
  4. Count the <page_break> tags.
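from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

source = "test.pdf"  # the attached reproduction file
pipeline_options = PdfPipelineOptions()  # defaults assumed; the original options were not shown

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        ),
    },
)
doc = converter.convert(source).document
doctags_output = doc.export_to_doctags()

print(doctags_output)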
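
Step 4 can be checked with a short helper. This is a minimal sketch in plain Python: `count_pages_from_doctags` is a hypothetical name, and it assumes exactly one <page_break> between consecutive pages, so N pages yield N - 1 tags. The sample string only imitates the shape of the exported output.

```python
def count_pages_from_doctags(doctags: str) -> int:
    """Infer the page count from a DocTags string (hypothetical helper).

    Assumes pages are separated by <page_break>, so N pages give N - 1 tags.
    """
    return doctags.count("<page_break>") + 1


# Imitates an export where 5 pages went in but one page was skipped.
sample = "\n<page_break>\n".join(["page 1", "page 2", "page 4", "page 5"])
expected_pages = 5
inferred_pages = count_pages_from_doctags(sample)

# A mismatch between the two numbers signals a silently skipped page.
print(inferred_pages, expected_pages)
```

Comparing the inferred count with the known page count of the input PDF is enough to detect the problem, although it cannot tell which page is missing.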

Expected Behavior

  • Error Logs: Should clearly indicate which page number failed to parse.
  • Export: export_to_doctags() should generate <page_break> tags for all pages, including failed ones, so that page numbering remains consistent.
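
One way the log line could carry the page number is sketched below. This is not docling's actual code: `preprocess`, `run_stage`, and the message format are hypothetical stand-ins, and only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger("pipeline_sketch")


def preprocess(page_no: int) -> str:
    """Hypothetical stage that fails on page 3, standing in for a real parser."""
    if page_no == 3:
        raise ValueError("Invalid code point")
    return f"content of page {page_no}"


def run_stage(run_id: int, page_numbers: list[int]) -> tuple[dict[int, str], list[int]]:
    """Run the stage page by page and log the failing page number with the run ID."""
    parsed: dict[int, str] = {}
    failed: list[int] = []
    for page_no in page_numbers:
        try:
            parsed[page_no] = preprocess(page_no)
        except ValueError as exc:
            # The page number is part of the message, not only the run ID.
            logger.error(
                "Stage preprocess failed for run %d, page %d: %s",
                run_id, page_no, exc,
            )
            failed.append(page_no)
    return parsed, failed


parsed, failed = run_stage(run_id=1, page_numbers=[1, 2, 3, 4, 5])
print(failed)
```

Returning the list of failed pages alongside the parsed content also gives the exporter what it needs to keep page positions consistent.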

Actual Behavior

1. Log output shows only run ID, not page number:

ERROR - Stage preprocess failed for run 1: Invalid code point

2. Missing <page_break> for failed pages:
If pages 1, 2, 4, 5 succeed but page 3 fails, the output currently looks like this:

... content ...      1page
<page_break>
... content ...      2page
<page_break>
... content ...      4page
<page_break>
... content ...      5page

Note: This breaks any downstream logic that relies on <page_break> tags to track page positions (e.g., page counting, content alignment).
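
To make the impact concrete, here is a minimal sketch of a downstream consumer that derives page numbers from <page_break> positions. The `split_pages` helper is hypothetical, and the sample string mirrors the output shown above.

```python
def split_pages(doctags: str) -> dict[int, str]:
    """Map 1-based page positions to content by splitting on <page_break>.

    Hypothetical consumer: it assumes the i-th chunk belongs to page i.
    """
    chunks = doctags.split("<page_break>")
    return {i: chunk.strip() for i, chunk in enumerate(chunks, start=1)}


# Pages 1, 2, 4, 5 parsed; page 3 failed and left no <page_break> behind.
output = (
    "page 1 text\n<page_break>\n"
    "page 2 text\n<page_break>\n"
    "page 4 text\n<page_break>\n"
    "page 5 text"
)
pages = split_pages(output)

print(len(pages))  # one fewer than the real page count
print(pages[3])    # content of page 4 is attributed to page 3
```

Every page after the failed one shifts down by one position, so content alignment is wrong from that point to the end of the document.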

Environment

Component             Version / Details
docling version       2.31.1
docling-core version  2.31.0
Python                3.11
OS                    macOS

Attachments

test.pdf

  • test.pdf - A minimal reproduction file derived from a larger document. The content has been intentionally corrupted for confidentiality, but the parsing error still reproduces.
    • Structure: Consists of 4 pages (Pages 78, 79, 83, 84).
    • Behavior: Parsing succeeds for pages 78 and 84, but fails for pages 79 and 83.
    • Current Output:
<doctag>
... content ...      78page
<page_break>
... content ...      84page
</doctag>
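
For comparison, the sketch below shows how an export could keep page positions stable by emitting an empty slot for each failed page. `join_pages_with_breaks` is a hypothetical helper and not part of docling-core; the page labels are the ones listed for test.pdf.

```python
PAGE_BREAK = "<page_break>"


def join_pages_with_breaks(page_order: list[int], parsed: dict[int, str]) -> str:
    """Join page contents, keeping an empty slot for every page missing from `parsed`."""
    slots = [parsed.get(page_no, "") for page_no in page_order]
    return f"\n{PAGE_BREAK}\n".join(slots)


page_order = [78, 79, 83, 84]
parsed = {
    78: "... content ...      78page",
    84: "... content ...      84page",
}

doctags_body = join_pages_with_breaks(page_order, parsed)
print(doctags_body)
print(doctags_body.count(PAGE_BREAK))
```

With four pages in the input, the body contains three <page_break> tags regardless of how many pages failed, so position-based page tracking stays correct.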

Labels: enhancement (New feature or request)