PDF Layout Refactor — Recursive Zone Segmentation + Table Detection

Context

The current PDF preprocessor (src/storage/preprocess/pdf.rs::extract_pdf) treats every page as a single bag of (text|image) elements, sorts them by (-y_center, x_left), then groups consecutive same-Y elements into "lines" and lines into paragraphs. This works for single-column prose but produces unreadable word-salad on:

Multi-column PDFs: the sort key ignores columns, so a left-column line at y=750 sorts before a right-column line at y=700 and the two columns interleave; the line grouping then merges runs from both columns whenever they share the same Y center. Real example: a Czech aviation regulation (L14) with two-column body text came out with sentences from both columns interleaved word-by-word.
Tables: currently flattened into the same word-salad pattern; cell boundaries are lost.
Page banners: repeating page headers/footers ("PŘEDPIS L14 HLAVA 5" on every page) are interleaved into body text.
Justified paragraphs: the current break threshold (y_gap > line_height * 1.5) is too loose and merges adjacent paragraphs in text set with larger line spacing.

The fix is to do recursive XY-cut zone segmentation before line grouping: split each page along vertical/horizontal whitespace gutters into a zone tree (leaves = atomic content blocks like a column, a table cell, a heading region), traverse the tree in reading order, and run the existing line/paragraph logic inside each leaf only. Grid-shaped subtrees become GFM markdown tables.

Goal: feed the same Czech regulation PDF and get readable, sectioned, column-respecting markdown — with tables preserved.

Design

New module — `src/storage/preprocess/zone.rs`

Pure-geometry, no pdfium dependency. Owns the layout types and the XY-cut algorithm.

pub struct BBox { pub left: f32, pub right: f32, pub top: f32, pub bottom: f32 }
// PDF-native y-up: top > bottom (confirmed against current pdf.rs y-math).

pub trait Bounded { fn bbox(&self) -> BBox; }

pub enum Zone<T: Bounded> {
    Leaf  { bbox: BBox, items: Vec<T> },
    Split { dir: SplitDir, bbox: BBox, children: Vec<Zone<T>> },
    Table { bbox: BBox, rows: Vec<Vec<Zone<T>>> },  // populated only after `classify_tables`
}
pub enum SplitDir { Vertical /* col gutter */, Horizontal /* row gutter */ }

pub struct SegmentParams {
    pub min_v_gap: f32,   // ≈ median_space_width * 3
    pub min_h_gap: f32,   // ≈ median_line_height * 1.5
    pub min_zone_items: usize, // stop recursing once a zone holds fewer items than this
}

pub fn segment<T: Bounded>(items: Vec<T>, p: &SegmentParams) -> Zone<T>;
pub fn classify_tables<T: Bounded>(zone: Zone<T>) -> Zone<T>;

XY-cut algorithm (segment):

Compute bounding box of items.
Try vertical cut first (column gutter detection):
- Sweep an X-coverage histogram; find runs of zero coverage wider than min_v_gap.
- If found → partition items into N child groups by their x_center, recurse on each, return Split { Vertical, .. }.
Else try horizontal cut (row gutter): same idea on Y axis with min_h_gap.
Else return Leaf.
Cut order matters: vertical first prevents columnar text from being chopped into per-paragraph rows.
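
The cut search above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned zone.rs: it works on bare `BBox` values instead of the generic `Zone<T: Bounded>`, replaces the coverage histogram with an equivalent sweep over sorted intervals, and leaves out the `min_zone_items` stop. The names `find_gaps`, `split_at` and `xy_cut` are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct BBox { pub left: f32, pub right: f32, pub top: f32, pub bottom: f32 }

#[derive(Debug, PartialEq)]
pub enum Zone {
    Leaf(Vec<BBox>),
    Split { vertical: bool, children: Vec<Zone> },
}

/// Midpoints of uncovered runs wider than `min_gap` between (lo, hi) intervals.
fn find_gaps(mut spans: Vec<(f32, f32)>, min_gap: f32) -> Vec<f32> {
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut cuts = Vec::new();
    let mut covered_to = f32::NEG_INFINITY;
    for (lo, hi) in spans {
        if covered_to.is_finite() && lo - covered_to > min_gap {
            cuts.push((lo + covered_to) / 2.0);
        }
        covered_to = covered_to.max(hi);
    }
    cuts
}

/// Partition items into `cuts.len() + 1` groups by a center coordinate.
fn split_at(items: Vec<BBox>, cuts: &[f32], center: impl Fn(&BBox) -> f32) -> Vec<Vec<BBox>> {
    let mut groups: Vec<Vec<BBox>> = vec![Vec::new(); cuts.len() + 1];
    for it in items {
        let idx = cuts.iter().filter(|&&c| center(&it) > c).count();
        groups[idx].push(it);
    }
    groups
}

pub fn xy_cut(items: Vec<BBox>, min_v_gap: f32, min_h_gap: f32) -> Zone {
    // Vertical cut first: look for column gutters along the X axis.
    let x_cuts = find_gaps(items.iter().map(|b| (b.left, b.right)).collect(), min_v_gap);
    if !x_cuts.is_empty() {
        let children = split_at(items, &x_cuts, |b| (b.left + b.right) / 2.0)
            .into_iter()
            .map(|g| xy_cut(g, min_v_gap, min_h_gap))
            .collect();
        return Zone::Split { vertical: true, children };
    }
    // Otherwise look for row gutters along the Y axis.
    let y_cuts = find_gaps(items.iter().map(|b| (b.bottom, b.top)).collect(), min_h_gap);
    if !y_cuts.is_empty() {
        let mut groups = split_at(items, &y_cuts, |b| (b.top + b.bottom) / 2.0);
        groups.reverse(); // y-up coordinates: the highest group reads first
        let children = groups
            .into_iter()
            .map(|g| xy_cut(g, min_v_gap, min_h_gap))
            .collect();
        return Zone::Split { vertical: false, children };
    }
    Zone::Leaf(items)
}
```

Each cut lies strictly between two items, so every child group is non-empty and strictly smaller than its parent, which is what guarantees the recursion terminates.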

Table classification (classify_tables, conservative — answered by user): Walk the zone tree. A Split is reclassified as Table iff all of:

It is a Split { Horizontal } (rows) and every child is Split { Vertical } (columns) — or symmetric (top-level Vertical of same-shape Horizontals, less common).
≥ 2 rows and ≥ 2 columns.
Column gutters align across rows within tolerance = median_char_width * 2.
Every cell is a Leaf containing only short text (no images, no nested splits, total text ≤ ~200 chars per cell).
Rows have similar height.

Anything that fails any test stays a Split. False-negative tables become column-respecting prose, which is still a vast improvement.
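
A rough sketch of the shape, cell-length and gutter-alignment tests, assuming the rows have already been reduced to flat cells. `Cell`, `gutters` and `looks_like_table` are hypothetical names, and the image, nested-split and row-height checks are omitted because the plan does not fix their tolerances.

```rust
/// Simplified stand-in for a leaf cell: horizontal extent plus text length.
pub struct Cell { pub left: f32, pub right: f32, pub chars: usize }

/// X positions of the gutters between consecutive cells of one row.
fn gutters(row: &[Cell]) -> Vec<f32> {
    row.windows(2).map(|w| (w[0].right + w[1].left) / 2.0).collect()
}

pub fn looks_like_table(rows: &[Vec<Cell>], tolerance: f32, max_cell_chars: usize) -> bool {
    // At least 2 rows and 2 columns, with the same column count in every row.
    let cols = match rows.first() {
        Some(r) => r.len(),
        None => return false,
    };
    if rows.len() < 2 || cols < 2 || rows.iter().any(|r| r.len() != cols) {
        return false;
    }
    // Every cell holds only short text.
    if rows.iter().flatten().any(|c| c.chars > max_cell_chars) {
        return false;
    }
    // Column gutters must line up with the first row's gutters within tolerance.
    let reference = gutters(&rows[0]);
    rows[1..].iter().all(|row| {
        gutters(row)
            .iter()
            .zip(&reference)
            .all(|(g, r)| (g - r).abs() <= tolerance)
    })
}
```

Comparing every row against the first row keeps the check simple; comparing against per-column medians would be more robust to one ragged header row, at the cost of a second pass.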

Refactored `src/storage/preprocess/pdf.rs`

Keep the existing high-level structure (extract_pdf returns ExtractedDoc, traces every phase, public API unchanged). Replace the inner pipeline:

pdfium load
  → font histogram (existing, pass 1, build HeadingClassifier — UNCHANGED)
  → for each page:
      collect Vec<PageElement>             (existing extraction logic)
      strip headers/footers                (NEW — see below)
      let zone = zone::segment(elements, &params)   (NEW)
      let zone = zone::classify_tables(zone) (NEW)
      let page_md = emit_zone(&zone, &classifier, &mut max_heading_level) (NEW recursive emitter)
  → join pages, return ExtractedDoc

emit_zone (lives in pdf.rs, not zone.rs — needs HeadingClassifier):

Leaf: sort items by (-y_center, x_left), run existing emit_page_markdown body (line grouping → paragraph grouping → render). With X-gap guard added to line merging (see below).
Split { Horizontal }: emit children top-to-bottom, blank line between.
Split { Vertical }: emit children left-to-right, blank line between (so that columns become sequential paragraphs, preserving order).
Table: emit GFM:
```
| cell | cell |
| --- | --- |
| cell | cell |
```
Cell text = inline-rendered concatenation of the cell leaf's text runs (existing Line::render_inline). First row is treated as header if its dominant font signature is bolder/larger than other rows' (use HeadingClassifier::classify heuristic).
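
The table branch could be rendered along these lines. This is a sketch only: it takes already-rendered cell strings rather than zone leaves, and it always emits the first row as the header row, since a GFM table requires one. `escape_cell` and `render_gfm_table` are hypothetical names.

```rust
/// Escape characters that would break a GFM table cell.
fn escape_cell(text: &str) -> String {
    text.replace('|', "\\|").replace('\n', " ").trim().to_string()
}

pub fn render_gfm_table(rows: &[Vec<String>]) -> String {
    let cols = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        out.push('|');
        for c in 0..cols {
            // Pad short rows with empty cells so every row has `cols` cells.
            let cell = row.get(c).map(|s| escape_cell(s)).unwrap_or_default();
            out.push_str(&format!(" {} |", cell));
        }
        out.push('\n');
        if i == 0 {
            // Delimiter row directly after the header row.
            out.push_str(&format!("|{}\n", " --- |".repeat(cols)));
        }
    }
    out
}
```

Escaping `|` matters here because cell text comes straight from PDF runs, and an unescaped pipe would silently add a column.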

Header/footer stripping (NEW, in pdf.rs)

Implemented as a pre-pass run once after pass 1, before per-page extraction enters its zone-segmentation phase. Approach (repetition-based, answered by user):

For each page, collect candidate banner text from items in the top 10% and bottom 10% of page bbox.
Normalize each candidate (collapse whitespace, replace runs of digits with \d+ so "Page 1" and "Page 12" match).
Across all pages, count on how many pages each normalized candidate occurs. Any candidate appearing on ≥ 3 pages (or ≥ 50% of pages for very short docs) is marked as a banner.
During per-page extraction, drop matching items from the elements list before calling zone::segment.
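
The normalization and counting steps might look like the sketch below. Two points are assumptions rather than part of the plan: the short-document rule is read as "threshold = min(3, half the pages, rounded up)", and the threshold is floored at 2 pages so that a string seen on a single page is never treated as a banner. `normalize_banner` and `detect_banners` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Collapse whitespace and replace each run of ASCII digits with the placeholder `\d+`.
pub fn normalize_banner(text: &str) -> String {
    let collapsed = text.split_whitespace().collect::<Vec<_>>().join(" ");
    let mut out = String::with_capacity(collapsed.len());
    let mut in_digits = false;
    for ch in collapsed.chars() {
        if ch.is_ascii_digit() {
            if !in_digits {
                out.push_str("\\d+");
            }
            in_digits = true;
        } else {
            out.push(ch);
            in_digits = false;
        }
    }
    out
}

/// `pages[i]` holds the candidate strings from page i's top and bottom bands.
pub fn detect_banners(pages: &[Vec<String>]) -> HashSet<String> {
    let mut page_counts: HashMap<String, usize> = HashMap::new();
    for candidates in pages {
        // Count each normalized candidate at most once per page.
        let unique: HashSet<String> = candidates.iter().map(|c| normalize_banner(c)).collect();
        for key in unique {
            *page_counts.entry(key).or_insert(0) += 1;
        }
    }
    // Assumed threshold: 3 pages, or half the pages for short docs, never below 2.
    let threshold = usize::min(3, (pages.len() + 1) / 2).max(2);
    page_counts
        .into_iter()
        .filter(|(_, n)| *n >= threshold)
        .map(|(key, _)| key)
        .collect()
}
```

During per-page extraction, an item would then be dropped when `normalize_banner` of its text is in the returned set and the item sits in the top or bottom band.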

Line-grouping tweaks inside `Leaf` zones

Inside emit_page_markdown (now emit_leaf):

X-gap guard (high-leverage even with zone segmentation, defensive): When merging a text element into the current line (pdf.rs:302-315), require:
```
te.left - last_run.x_right <= median_space_width * 4.0
```
median_space_width is computed once per page from the inter-run X gaps. Rejection forces a new line, preventing residual cell-bleed if zone detection missed a separator.
Paragraph break (group_into_paragraphs):
- Lower threshold from 1.5× to 1.2× line height.
- Add: start a new paragraph if line.x_start - prev.x_start > median_indent (catches first-line-indent paragraph styles).
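
The two predicates above reduce to a few lines. The sketch below uses hypothetical free functions (`median`, `can_join_line`, `starts_new_paragraph`) with plain `f32` arguments rather than the real `TextElement` and `Line` types, and it assumes `y_gap` is measured the same way the existing `group_into_paragraphs` measures it.

```rust
/// Median of a list of gaps (upper median for even lengths); None when empty.
pub fn median(mut values: Vec<f32>) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    Some(values[values.len() / 2])
}

/// X-gap guard: may an element starting at `next_left` join the line ending at `last_right`?
pub fn can_join_line(last_right: f32, next_left: f32, median_space_width: f32) -> bool {
    next_left - last_right <= median_space_width * 4.0
}

/// Paragraph break: a large vertical gap, or a first-line indent relative to the previous line.
pub fn starts_new_paragraph(
    y_gap: f32,
    line_height: f32,
    x_start: f32,
    prev_x_start: f32,
    median_indent: f32,
) -> bool {
    y_gap > line_height * 1.2 || x_start - prev_x_start > median_indent
}
```

For instance, with a median space width of 3.0, a run starting 8 units after the previous one still joins the line, while one starting 60 units later is forced onto a new line.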

Files to modify / create

| File | Action |
| --- | --- |
| `src/storage/preprocess/zone.rs` | CREATE: geometry types + `segment` + `classify_tables` |
| `src/storage/preprocess/mod.rs` | Add `mod zone;` (kept private) |
| `src/storage/preprocess/pdf.rs` | Replace `extract_pdf` body; replace `emit_page_markdown` with `emit_zone`; add header/footer detector + X-gap guard + paragraph indent check |
| `src/storage/preprocess/headings.rs` | NO CHANGE: keep `HeadingClassifier::build / classify` and `FontSignature::new` exactly as-is |
| `src/storage/preprocess/error.rs` | NO CHANGE: reuse `PreprocessorError::PdfParse` for layout-detection failures |
| `src/storage/preprocess/orchestrator.rs`, `slice_orchestrator.rs` | NO CHANGE: `ExtractedDoc` shape stays the same |
| `tests/preprocess_pdf.rs` | Extend smoke test, add a 2-column fixture and a table fixture (best-effort; fixtures may need to be hand-crafted since fixture dir is mostly empty) |

Reused symbols (do not redefine)

headings::HeadingClassifier (build, classify, body, levels) — drives heading levels exactly as today
headings::FontSignature (new, size_bucket, is_bold, is_italic)
pdf::TextElement, pdf::ImageElement — implement new zone::Bounded trait on these instead of duplicating geometry
pdf::Line, pdf::Paragraph, pdf::TextRun, pdf::extract_image_figure, pdf::font_signature_from_text_object — keep as-is, called from emit_leaf
The existing tracing breadcrumbs added in the previous turn: extend with new debug!(splits = ..., leaves = ..., "zone segmented") and debug!(rows = ..., cols = ..., "table detected") events

Verification

Compile: cargo check --bin storage_server --bin infra_server — must pass with no new warnings.
Existing test: cargo test --test preprocess_pdf — pdf_preprocess_smoke must still pass on tests/fixtures/preprocess/sample.pdf.
Unit tests for zone.rs: add #[cfg(test)] mod tests covering:
- Single-column page → 1 leaf zone.
- Two-column page (synthetic items) → top-level Split { Vertical } with 2 leaves.
- 3×3 grid (synthetic items) → Table after classification.
- Heading-then-body page → Split { Horizontal } with heading leaf + body leaf.
- Pure noise (no gaps wider than thresholds) → 1 leaf.
End-to-end smoke (manual):
- Drop the L14 PDF that produced the original word-salad output into tests/fixtures/preprocess/.
- Add an #[ignore] test that loads it, calls extract_pdf, and writes the markdown to a temp file. Run with cargo test -- --ignored l14. Inspect the markdown manually: verify that columns are sequential, the banner is gone, and body sentences read naturally.
Live upload:
- cargo run --bin infra_server
- Upload the same PDF via the file_add endpoint or storage MCP file_add.
- With the existing breadcrumb logging (RUST_LOG=info,infrastructure=debug,... from the previous turn), the server log should show PDF preprocess: page N lines, zone segmented debug entries, optional table detected lines, and the final slice orchestrator: done with pages_created > 0.
- Open the resulting KB pages in the SPA and read them.

Out of scope (follow-up work)

OCR / scanned PDFs (no text objects → empty markdown today, same after refactor).
Right-to-left scripts (sort key for vertical splits flips).
Table-cell merging across rows/cols.
Footnote detection.
