PDF Layout Refactor — Recursive Zone Segmentation + Table Detection

Context

The current PDF preprocessor (src/storage/preprocess/pdf.rs::extract_pdf) treats every page as a single bag of (text|image) elements, sorts them by (-y_center, x_left), then groups consecutive same-Y elements into "lines" and lines into paragraphs. This works for single-column prose but produces unreadable word-salad on:

  1. Multi-column PDFs: the sort key ignores columns, so a left-column line at y=750 sorts before a right-column line at y=700, and line grouping then merges runs from both columns whenever they share the same Y center. Real example: a Czech aviation regulation (L14) with two-column body text came out with sentences from both columns interleaved word-by-word.
  2. Tables: currently flattened into the same word-salad pattern; cell boundaries are lost.
  3. Page banners: repeating page headers/footers ("PŘEDPIS L14 HLAVA 5" on every page) are interleaved into body text.
  4. Justified paragraphs: the current break threshold (y_gap > line_height * 1.5) merges adjacent paragraphs whose separating gap does not exceed 1.5× the line height.

The fix is to do recursive XY-cut zone segmentation before line grouping: split each page along vertical/horizontal whitespace gutters into a zone tree (leaves = atomic content blocks like a column, a table cell, a heading region), traverse the tree in reading order, and run the existing line/paragraph logic inside each leaf only. Grid-shaped subtrees become GFM markdown tables.

Goal: feed the same Czech regulation PDF and get readable, sectioned, column-respecting markdown — with tables preserved.

Design

New module — src/storage/preprocess/zone.rs

Pure-geometry, no pdfium dependency. Owns the layout types and the XY-cut algorithm.

pub struct BBox { pub left: f32, pub right: f32, pub top: f32, pub bottom: f32 }
// PDF-native y-up: top > bottom (confirmed against current pdf.rs y-math).

pub trait Bounded { fn bbox(&self) -> BBox; }

pub enum Zone<T: Bounded> {
    Leaf  { bbox: BBox, items: Vec<T> },
    Split { dir: SplitDir, bbox: BBox, children: Vec<Zone<T>> },
    Table { bbox: BBox, rows: Vec<Vec<Zone<T>>> },  // populated only after `classify_tables`
}
pub enum SplitDir { Vertical /* col gutter */, Horizontal /* row gutter */ }

pub struct SegmentParams {
    pub min_v_gap: f32,   // ≈ median_space_width * 3
    pub min_h_gap: f32,   // ≈ median_line_height * 1.5
    pub min_zone_items: usize, // stop recursing when a zone has fewer items than this
}

pub fn segment<T: Bounded>(items: Vec<T>, p: &SegmentParams) -> Zone<T>;
pub fn classify_tables<T: Bounded>(zone: Zone<T>) -> Zone<T>;
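
A minimal sketch of how a synthetic test item could implement Bounded, and how step 1 of segment (the bounding box of all items) could be computed. Rect, BBox::union, BBox::x_center, bounding_box and the derives are hypothetical additions for illustration, not part of the planned API.

```rust
/// Same fields as the planned BBox; the derives are an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub left: f32,
    pub right: f32,
    pub top: f32,
    pub bottom: f32,
}

impl BBox {
    /// Smallest box covering both inputs (y-up: top > bottom).
    pub fn union(self, other: BBox) -> BBox {
        BBox {
            left: self.left.min(other.left),
            right: self.right.max(other.right),
            top: self.top.max(other.top),
            bottom: self.bottom.min(other.bottom),
        }
    }

    /// Horizontal midpoint, the key used to assign items to columns.
    pub fn x_center(self) -> f32 {
        (self.left + self.right) / 2.0
    }
}

pub trait Bounded {
    fn bbox(&self) -> BBox;
}

/// Hypothetical synthetic item for unit tests: just a rectangle.
pub struct Rect(pub BBox);

impl Bounded for Rect {
    fn bbox(&self) -> BBox {
        self.0
    }
}

/// Bounding box of all items, or None for an empty slice.
pub fn bounding_box<T: Bounded>(items: &[T]) -> Option<BBox> {
    items.iter().map(T::bbox).reduce(BBox::union)
}
```

Keeping the trait this small means TextElement and ImageElement only need a one-line impl each, and zone.rs tests can run on Rect values without pdfium.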

XY-cut algorithm (segment):

  1. Compute bounding box of items.
  2. Try vertical cut first (column gutter detection):
    • Sweep an X-coverage histogram; find runs of zero coverage wider than min_v_gap.
    • If found → partition items into N child groups by their x_center, recurse on each, return Split { Vertical, .. }.
  3. Else try horizontal cut (row gutter): same idea on Y axis with min_h_gap.
  4. Else return Leaf.
  5. Cut order matters: vertical first prevents columnar text from being chopped into per-paragraph rows.
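
The gutter search in steps 2 and 3 can be sketched as follows. This version sweeps sorted intervals rather than building a literal coverage histogram, which finds the same zero-coverage runs without choosing a bin size. find_gutters and group_index are hypothetical helper names; the input is each item's (left, right) extent for a vertical cut, or its (bottom, top) extent for a horizontal cut.

```rust
/// Returns (start, end) for every uncovered gap wider than `min_gap`
/// between the given 1-D intervals. Hypothetical helper.
pub fn find_gutters(spans: &[(f32, f32)], min_gap: f32) -> Vec<(f32, f32)> {
    let mut sorted: Vec<(f32, f32)> = spans.to_vec();
    // Sort by interval start; total_cmp gives a total order on f32.
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut gutters = Vec::new();
    let mut iter = sorted.into_iter();
    let Some((_, mut covered_to)) = iter.next() else {
        return gutters; // no items, no gutters
    };
    for (start, end) in iter {
        // A gap exists only if this interval starts past everything seen so far.
        if start - covered_to > min_gap {
            gutters.push((covered_to, start));
        }
        covered_to = covered_to.max(end);
    }
    gutters
}

/// Child-group index for an item: the number of gutters whose midpoint
/// lies before the item's center along the cut axis.
pub fn group_index(center: f32, gutters: &[(f32, f32)]) -> usize {
    gutters
        .iter()
        .filter(|g| center > (g.0 + g.1) / 2.0)
        .count()
}
```

For example, extents (0, 100), (50, 200), (260, 400), (405, 500) with min_gap = 30 give one gutter (200, 260): the 5-unit gap at 400 is below the threshold, so the page splits into two groups.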

Table classification (classify_tables, conservative — answered by user): Walk the zone tree. A Split is reclassified as Table iff all of:

  • It is a Split { Horizontal } (rows) and every child is Split { Vertical } (columns) — or symmetric (top-level Vertical of same-shape Horizontals, less common).
  • ≥ 2 rows and ≥ 2 columns.
  • Column gutters align across rows within tolerance = median_char_width * 2.
  • Every cell is a Leaf containing only short text (no images, no nested splits, total text ≤ ~200 chars per cell).
  • Rows have similar height.

Anything that fails any test stays a Split. False-negative tables become column-respecting prose, which is still a vast improvement.
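
A sketch of the first three tests (shape, minimum size, gutter alignment), assuming each row has already been reduced to the X positions of its column gutters. gutters_align is a hypothetical name; the short-text and row-height tests are left out.

```rust
/// True when there are >= 2 rows, every row has the same number of gutters
/// (>= 1 gutter, i.e. >= 2 columns), and each gutter lies within `tol`
/// of the matching gutter in the first row. Hypothetical helper.
pub fn gutters_align(rows: &[Vec<f32>], tol: f32) -> bool {
    let Some(first) = rows.first() else {
        return false;
    };
    if rows.len() < 2 || first.is_empty() {
        return false;
    }
    rows.iter().all(|row| {
        row.len() == first.len()
            && row.iter().zip(first).all(|(a, b)| (a - b).abs() <= tol)
    })
}
```

Comparing every row against the first keeps the check conservative: one drifting gutter rejects the whole table, and the subtree stays a Split.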

Refactored src/storage/preprocess/pdf.rs

Keep the existing high-level structure (extract_pdf returns ExtractedDoc, traces every phase, public API unchanged). Replace the inner pipeline:

pdfium load
  → font histogram (existing, pass 1, build HeadingClassifier — UNCHANGED)
  → for each page:
      collect Vec<PageElement>             (existing extraction logic)
      strip headers/footers                (NEW — see below)
      let zone = zone::segment(elements, &params)   (NEW; params is a SegmentParams)
      let zone = zone::classify_tables(zone) (NEW)
      let page_md = emit_zone(&zone, &classifier, &mut max_heading_level) (NEW recursive emitter)
  → join pages, return ExtractedDoc

emit_zone (lives in pdf.rs, not zone.rs — needs HeadingClassifier):

  • Leaf: sort items by (-y_center, x_left), then run the existing emit_page_markdown body (line grouping → paragraph grouping → render), with the X-gap guard added to line merging (see below).
  • Split { Horizontal }: emit children top-to-bottom, blank line between.
  • Split { Vertical }: emit children left-to-right, blank line between (so that columns become sequential paragraphs, preserving order).
  • Table: emit GFM:
    | cell | cell |
    | --- | --- |
    | cell | cell |
    
    Cell text = inline-rendered concatenation of the cell leaf's text runs (existing Line::render_inline). First row is treated as header if its dominant font signature is bolder/larger than other rows' (use HeadingClassifier::classify heuristic).
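
The traversal above can be sketched on a simplified tree. Assumptions: this Zone is a stand-in whose leaves and cells already hold rendered text (the real emitter needs HeadingClassifier), children are stored in reading order, the first row is always emitted as the header because GFM requires one, and escape_cell / render_gfm_table are hypothetical names.

```rust
pub enum SplitDir {
    Vertical,
    Horizontal,
}

/// Simplified stand-in for zone::Zone<T>: text is already rendered.
pub enum Zone {
    Leaf(String),
    Split { dir: SplitDir, children: Vec<Zone> },
    Table { rows: Vec<Vec<String>> },
}

/// Collapse whitespace (including newlines) and escape pipes so the
/// cell cannot break the table row.
fn escape_cell(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .replace('|', "\\|")
}

/// Render rows as a GFM table; short rows are padded with empty cells.
pub fn render_gfm_table(rows: &[Vec<String>]) -> String {
    let cols = rows.iter().map(Vec::len).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }
    let mut lines = Vec::new();
    for (i, row) in rows.iter().enumerate() {
        let cells: Vec<String> = (0..cols)
            .map(|c| escape_cell(row.get(c).map(String::as_str).unwrap_or("")))
            .collect();
        lines.push(format!("| {} |", cells.join(" | ")));
        if i == 0 {
            // Delimiter row directly after the header row.
            lines.push(format!("| {} |", vec!["---"; cols].join(" | ")));
        }
    }
    lines.join("\n")
}

/// Depth-first emit: both split directions emit children in stored order
/// with a blank line between them; empty children are skipped.
pub fn emit_zone(zone: &Zone) -> String {
    match zone {
        Zone::Leaf(text) => text.clone(),
        Zone::Split { children, .. } => children
            .iter()
            .map(emit_zone)
            .filter(|s| !s.is_empty())
            .collect::<Vec<_>>()
            .join("\n\n"),
        Zone::Table { rows } => render_gfm_table(rows),
    }
}
```

Because both split directions reduce to "children in order, blank line between", the reading order is decided entirely by how segment orders the children, not by the emitter.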

Header/footer stripping (NEW, in pdf.rs)

Implemented as a pre-pass run once after pass 1, before per-page extraction enters its zone-segmentation phase. Approach (repetition-based, answered by user):

  1. For each page, collect candidate banner text from items in the top 10% and bottom 10% of page bbox.
  2. Normalize each candidate (collapse whitespace, replace runs of digits with \d+ so "Page 1" and "Page 12" match).
  3. Across all pages, count occurrences. Any normalized banner appearing on ≥ 3 pages (or ≥ 50% of pages for very short docs) is marked as a banner.
  4. During per-page extraction, drop matching items from the elements list before calling zone::segment.
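
Steps 2 and 3 can be sketched as below. normalize_banner and detect_banners are hypothetical names. The plan does not define "very short docs", so the threshold here is an assumption: half the pages rounded up, clamped to the range 2..=3, which means a banner never qualifies from a single page.

```rust
use std::collections::{HashMap, HashSet};

/// Collapse whitespace and replace each run of ASCII digits with the
/// literal marker `\d+`, so "Page 1" and "Page 12" share one key.
pub fn normalize_banner(text: &str) -> String {
    let collapsed = text.split_whitespace().collect::<Vec<_>>().join(" ");
    let mut out = String::with_capacity(collapsed.len());
    let mut in_digits = false;
    for c in collapsed.chars() {
        if c.is_ascii_digit() {
            if !in_digits {
                out.push_str("\\d+");
                in_digits = true;
            }
        } else {
            in_digits = false;
            out.push(c);
        }
    }
    out
}

/// `pages[i]` holds the candidate strings from the top and bottom bands
/// of page i. Returns the normalized keys that repeat on enough pages.
pub fn detect_banners(pages: &[Vec<String>]) -> HashSet<String> {
    // Assumed threshold: ceil(50% of pages), but at least 2 and at most 3.
    let threshold = ((pages.len() + 1) / 2).clamp(2, 3);
    let mut counts: HashMap<String, usize> = HashMap::new();
    for candidates in pages {
        // Count each key once per page, even if it appears in both bands.
        let unique: HashSet<String> =
            candidates.iter().map(|c| normalize_banner(c)).collect();
        for key in unique {
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|(_, pages_seen)| *pages_seen >= threshold)
        .map(|(key, _)| key)
        .collect()
}
```

During per-page extraction, an item would then be dropped when normalize_banner of its text is in the returned set and the item sits in the top or bottom band.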

Line-grouping tweaks inside Leaf zones

Inside emit_page_markdown (now emit_leaf):

  1. X-gap guard (high-leverage even with zone segmentation, defensive): When merging a text element into the current line (pdf.rs:302-315), require:

    te.left - last_run.x_right <= median_space_width * 4.0
    

    median_space_width is computed once per page from the inter-run X gaps. Rejection forces a new line, preventing residual cell-bleed if zone detection missed a separator.

  2. Paragraph break (group_into_paragraphs):

    • Lower threshold from 1.5× to 1.2× line height.
    • Add: start a new paragraph if line.x_start - prev.x_start > median_indent (catches first-line-indent paragraph styles).
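
Both tweaks reduce to small predicates, sketched here under assumptions: LineGeom, median, can_join_line and starts_new_paragraph are hypothetical names, "line height" is taken to be the previous line's height, and Y is PDF-native y-up as in the rest of the plan.

```rust
/// Median of a slice, or None when it is empty. Used for
/// median_space_width (inter-run X gaps) and median_indent.
pub fn median(values: &[f32]) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(f32::total_cmp);
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    })
}

/// X-gap guard: a run may join the current line only if the horizontal
/// gap to the previous run is at most 4x the median space width.
pub fn can_join_line(last_right: f32, next_left: f32, median_space_width: f32) -> bool {
    next_left - last_right <= median_space_width * 4.0
}

/// Hypothetical minimal line geometry for the paragraph check.
pub struct LineGeom {
    pub x_start: f32,
    pub y_center: f32,
    pub height: f32,
}

/// Paragraph break: vertical gap above 1.2x line height, or the new line
/// starts further right than the previous one by more than median_indent.
pub fn starts_new_paragraph(prev: &LineGeom, line: &LineGeom, median_indent: f32) -> bool {
    let y_gap = prev.y_center - line.y_center; // y-up: lower lines have smaller y
    y_gap > prev.height * 1.2 || line.x_start - prev.x_start > median_indent
}
```

With 12 pt lines, a 14 pt gap stays inside the paragraph (14 ≤ 14.4) while a 20 pt gap breaks it, and an 18 pt first-line indent breaks it when median_indent is 10.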

Files to modify / create

| File | Action |
| --- | --- |
| src/storage/preprocess/zone.rs | CREATE: geometry types + segment + classify_tables |
| src/storage/preprocess/mod.rs | Add mod zone; (kept private) |
| src/storage/preprocess/pdf.rs | Replace extract_pdf body; replace emit_page_markdown with emit_zone; add header/footer detector + X-gap guard + paragraph indent check |
| src/storage/preprocess/headings.rs | NO CHANGE: keep HeadingClassifier::build / classify and FontSignature::new exactly as-is |
| src/storage/preprocess/error.rs | NO CHANGE: reuse PreprocessorError::PdfParse for layout-detection failures |
| src/storage/preprocess/orchestrator.rs, slice_orchestrator.rs | NO CHANGE: ExtractedDoc shape stays the same |
| tests/preprocess_pdf.rs | Extend smoke test, add a 2-column fixture and a table fixture (best-effort; fixtures may need to be hand-crafted since fixture dir is mostly empty) |

Reused symbols (do not redefine)

  • headings::HeadingClassifier (build, classify, body, levels) — drives heading levels exactly as today
  • headings::FontSignature (new, size_bucket, is_bold, is_italic)
  • pdf::TextElement, pdf::ImageElement — implement new zone::Bounded trait on these instead of duplicating geometry
  • pdf::Line, pdf::Paragraph, pdf::TextRun, pdf::extract_image_figure, pdf::font_signature_from_text_object — keep as-is, called from emit_leaf
  • The existing tracing breadcrumbs added in the previous turn — extend with new debug!("zone segmented", splits=..., leaves=...) and debug!("table detected", rows=..., cols=...) events

Verification

  1. Compile: cargo check --bin storage_server --bin infra_server — must pass with no new warnings.
  2. Existing test: cargo test --test preprocess_pdf; pdf_preprocess_smoke must still pass on tests/fixtures/preprocess/sample.pdf.
  3. Unit tests for zone.rs: add #[cfg(test)] mod tests covering:
    • Single-column page → 1 leaf zone.
    • Two-column page (synthetic items) → top-level Split { Vertical } with 2 leaves.
    • 3×3 grid (synthetic items) → Table after classification.
    • Heading-then-body page → Split { Horizontal } with heading leaf + body leaf.
    • Pure noise (no gaps wider than thresholds) → 1 leaf.
  4. End-to-end smoke (manual):
    • Drop the L14 PDF that produced the original word-salad output into tests/fixtures/preprocess/.
    • Add an #[ignore] test that loads it, calls extract_pdf, and writes the markdown to a temp file. Run with cargo test -- --ignored l14. Inspect the markdown manually: verify columns are sequential, the banner is gone, and body sentences read naturally.
  5. Live upload:
    • cargo run --bin infra_server
    • Upload the same PDF via the file_add endpoint or storage MCP file_add.
    • With the existing breadcrumb logging (RUST_LOG=info,infrastructure=debug,... from the previous turn), the server log should show PDF preprocess: page N lines, zone segmented debug entries, optional table detected lines, and the final slice orchestrator: done with pages_created > 0.
    • Open the resulting KB pages in the SPA and read them.

Out of scope (follow-up work)

  • OCR / scanned PDFs (no text objects → empty markdown today, same after refactor).
  • Right-to-left scripts (sort key for vertical splits flips).
  • Table-cell merging across rows/cols.
  • Footnote detection.