PDF Layout Refactor — Recursive Zone Segmentation + Table Detection
Context
The current PDF preprocessor (src/storage/preprocess/pdf.rs::extract_pdf) treats every page as a single bag of (text|image) elements, sorts them by (-y_center, x_left), then groups consecutive same-Y elements into "lines" and lines into paragraphs. This works for single-column prose but produces unreadable word-salad on:
- Multi-column PDFs — left-column line at y=750 sorts before right-column line at y=700, but the line-grouping then merges runs from both columns whenever they share the same Y center. Real example: a Czech aviation regulation (L14) with two-column body text came out with sentences from both columns interleaved word-by-word.
- Tables — currently flattened into the same word-salad pattern; cell boundaries are lost.
- Page banners — repeating page headers/footers ("PŘEDPIS L14 HLAVA 5" on every page) are interleaved into body text.
- Justified paragraphs: the current break threshold (`y_gap > line_height * 1.5`) merges paragraphs that have higher line spacing.
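The multi-column failure can be reproduced with a few synthetic runs. The sketch below is a minimal illustration, not the code in `pdf.rs`: the `Run` struct, the function names, and the coordinates are all hypothetical, and only the sort key `(-y_center, x_left)` is taken from the description above.

```rust
/// A text run reduced to the two coordinates the page-wide sort key uses.
/// (Hypothetical type for illustration; not the real `pdf::TextElement`.)
#[derive(Debug, Clone)]
pub struct Run {
    pub text: &'static str,
    pub x_left: f32,
    pub y_center: f32,
}

/// Order runs by (-y_center, x_left): top of the page first
/// (PDF y grows upward), then left to right.
pub fn page_wide_order(mut runs: Vec<Run>) -> Vec<&'static str> {
    runs.sort_by(|a, b| {
        b.y_center
            .total_cmp(&a.y_center)
            .then(a.x_left.total_cmp(&b.x_left))
    });
    runs.into_iter().map(|r| r.text).collect()
}

/// Two columns whose lines share Y centers (made-up coordinates).
pub fn two_column_sample() -> Vec<Run> {
    vec![
        Run { text: "left-1", x_left: 50.0, y_center: 750.0 },
        Run { text: "left-2", x_left: 50.0, y_center: 738.0 },
        Run { text: "right-1", x_left: 300.0, y_center: 750.0 },
        Run { text: "right-2", x_left: 300.0, y_center: 738.0 },
    ]
}
```

Sorting the sample page-wide yields `left-1, right-1, left-2, right-2`: each visual row pulls in text from both columns, which is the interleaving described for the two-column regulation.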
The fix is to do recursive XY-cut zone segmentation before line grouping: split each page along vertical/horizontal whitespace gutters into a zone tree (leaves = atomic content blocks like a column, a table cell, a heading region), traverse the tree in reading order, and run the existing line/paragraph logic inside each leaf only. Grid-shaped subtrees become GFM markdown tables.
Goal: feed the same Czech regulation PDF and get readable, sectioned, column-respecting markdown — with tables preserved.
Design
New module — src/storage/preprocess/zone.rs
Pure-geometry, no pdfium dependency. Owns the layout types and the XY-cut algorithm.
pub struct BBox { pub left: f32, pub right: f32, pub top: f32, pub bottom: f32 }
// PDF-native y-up: top > bottom (confirmed against current pdf.rs y-math).
pub trait Bounded { fn bbox(&self) -> BBox; }
pub enum Zone<T: Bounded> {
    Leaf { bbox: BBox, items: Vec<T> },
    Split { dir: SplitDir, bbox: BBox, children: Vec<Zone<T>> },
    Table { bbox: BBox, rows: Vec<Vec<Zone<T>>> }, // populated only after `classify_tables`
}
pub enum SplitDir { Vertical /* col gutter */, Horizontal /* row gutter */ }
pub struct SegmentParams {
    pub min_v_gap: f32,        // ≈ median_space_width * 3
    pub min_h_gap: f32,        // ≈ median_line_height * 1.5
    pub min_zone_items: usize, // stop recursing if leaf has fewer items
}
pub fn segment<T: Bounded>(items: Vec<T>, p: &SegmentParams) -> Zone<T>;
pub fn classify_tables<T: Bounded>(zone: Zone<T>) -> Zone<T>;
XY-cut algorithm (segment):
- Compute bounding box of items.
- Try vertical cut first (column gutter detection):
  - Sweep an X-coverage histogram; find runs of zero coverage wider than `min_v_gap`.
  - If found → partition items into N child groups by their `x_center`, recurse on each, return `Split { Vertical, .. }`.
- Else try horizontal cut (row gutter): same idea on Y axis with `min_h_gap`.
- Else return `Leaf`.
- Cut order matters: vertical first prevents columnar text from being chopped into per-paragraph rows.
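The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the final `zone.rs`: the `Zone` here drops the `bbox` fields and the `Table` variant, the helper names `span` and `cut` are hypothetical, and a sorted interval sweep stands in for the X-coverage histogram (both locate the same zero-coverage runs).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct BBox { pub left: f32, pub right: f32, pub top: f32, pub bottom: f32 }

pub trait Bounded { fn bbox(&self) -> BBox; }
impl Bounded for BBox { fn bbox(&self) -> BBox { *self } }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SplitDir { Vertical, Horizontal }

// Simplified: no bbox fields and no Table variant.
#[derive(Debug)]
pub enum Zone<T: Bounded> {
    Leaf { items: Vec<T> },
    Split { dir: SplitDir, children: Vec<Zone<T>> },
}

pub struct SegmentParams { pub min_v_gap: f32, pub min_h_gap: f32, pub min_zone_items: usize }

/// Project a box onto the axis a cut runs across: X for a vertical
/// (column) cut, Y for a horizontal (row) cut. PDF y is up, so bottom < top.
fn span(b: BBox, dir: SplitDir) -> (f32, f32) {
    match dir {
        SplitDir::Vertical => (b.left, b.right),
        SplitDir::Horizontal => (b.bottom, b.top),
    }
}

/// Group items separated by uncovered gaps wider than `min_gap`.
/// Groups come back in ascending coordinate order.
fn cut<T: Bounded>(mut items: Vec<T>, dir: SplitDir, min_gap: f32) -> Vec<Vec<T>> {
    items.sort_by(|a, b| span(a.bbox(), dir).0.total_cmp(&span(b.bbox(), dir).0));
    let mut groups: Vec<Vec<T>> = Vec::new();
    let mut covered_to = f32::NEG_INFINITY;
    for item in items {
        let (lo, hi) = span(item.bbox(), dir);
        if groups.is_empty() || lo - covered_to > min_gap {
            groups.push(Vec::new());
        }
        covered_to = covered_to.max(hi);
        groups.last_mut().unwrap().push(item);
    }
    groups
}

pub fn segment<T: Bounded>(mut items: Vec<T>, p: &SegmentParams) -> Zone<T> {
    if items.len() < p.min_zone_items {
        return Zone::Leaf { items };
    }
    // Vertical (column) cut is tried before horizontal (row) cut.
    let attempts = [
        (SplitDir::Vertical, p.min_v_gap),
        (SplitDir::Horizontal, p.min_h_gap),
    ];
    for (dir, min_gap) in attempts {
        let mut groups = cut(std::mem::take(&mut items), dir, min_gap);
        if groups.len() > 1 {
            if dir == SplitDir::Horizontal {
                groups.reverse(); // y-up: highest group is first in reading order
            }
            let children = groups.into_iter().map(|g| segment(g, p)).collect();
            return Zone::Split { dir, children };
        }
        // No gutter on this axis: take the single group back and try the next one.
        items = groups.pop().unwrap_or_default();
    }
    Zone::Leaf { items }
}
```

Children are stored in reading order (left to right, top to bottom), so an emitter can walk them as-is. Item order inside a leaf is not meaningful in this sketch; the leaf emitter is expected to re-sort.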
Table classification (classify_tables, conservative — answered by user):
Walk the zone tree. A Split is reclassified as Table iff all of:
- It is a `Split { Horizontal }` (rows) and every child is `Split { Vertical }` (columns), or symmetric (top-level Vertical of same-shape Horizontals, less common).
- ≥ 2 rows and ≥ 2 columns.
- Column gutters align across rows within `tolerance = median_char_width * 2`.
- Every cell is a `Leaf` containing only short text (no images, no nested splits, total text ≤ ~200 chars per cell).
- Rows have similar height.
Anything that fails any test stays a Split. False-negative tables become column-respecting prose, which is still a vast improvement.
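The geometric part of these tests can be sketched as below. This is a hedged illustration that works on plain cell boxes rather than the real `Zone` tree; the function names, the use of gutter midpoints as the alignment measure, and the `max_height_ratio` parameter (standing in for "similar height") are assumptions. The cell-content checks (short text, no images, no nested splits) are omitted.

```rust
#[derive(Clone, Copy, Debug)]
pub struct BBox { pub left: f32, pub right: f32, pub top: f32, pub bottom: f32 }

/// Midpoints of the gutters between horizontally adjacent cells in one row.
fn gutter_centers(row: &[BBox]) -> Vec<f32> {
    row.windows(2).map(|w| (w[0].right + w[1].left) / 2.0).collect()
}

/// Conservative grid test: at least 2 rows and 2 columns, the same column
/// count in every row, gutters aligned with the first row within `tolerance`,
/// and row heights within `max_height_ratio` of each other.
pub fn looks_like_grid(rows: &[Vec<BBox>], tolerance: f32, max_height_ratio: f32) -> bool {
    let cols = match rows.first() {
        Some(row) => row.len(),
        None => return false,
    };
    if rows.len() < 2 || cols < 2 || rows.iter().any(|r| r.len() != cols) {
        return false;
    }

    let reference = gutter_centers(&rows[0]);
    let aligned = rows.iter().all(|row| {
        gutter_centers(row)
            .iter()
            .zip(&reference)
            .all(|(g, r)| (g - r).abs() <= tolerance)
    });

    // Row height = height of the tallest cell in the row.
    let heights: Vec<f32> = rows
        .iter()
        .map(|r| r.iter().map(|c| c.top - c.bottom).fold(0.0, f32::max))
        .collect();
    let tallest = heights.iter().copied().fold(f32::MIN, f32::max);
    let shortest = heights.iter().copied().fold(f32::MAX, f32::min);

    aligned && shortest > 0.0 && tallest / shortest <= max_height_ratio
}
```

Any failed check returns `false`, which matches the conservative stance above: the subtree stays a `Split` and is emitted as column-respecting prose.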
Refactored src/storage/preprocess/pdf.rs
Keep the existing high-level structure (extract_pdf returns ExtractedDoc, traces every phase, public API unchanged). Replace the inner pipeline:
pdfium load
  → font histogram (existing, pass 1, builds HeadingClassifier; UNCHANGED)
  → banner detection pre-pass over all pages (NEW, see below)
  → for each page:
      collect Vec<PageElement> (existing extraction logic)
      strip headers/footers (NEW, see below)
      let zone = zone::segment(elements, &params)   (NEW; params: SegmentParams)
      let zone = zone::classify_tables(zone)        (NEW)
      let page_md = emit_zone(&zone, &classifier, &mut max_heading_level)   (NEW recursive emitter)
  → join pages, return ExtractedDoc
emit_zone (lives in pdf.rs, not zone.rs — needs HeadingClassifier):
- `Leaf`: sort items by `(-y_center, x_left)`, run the existing `emit_page_markdown` body (line grouping → paragraph grouping → render), with the X-gap guard added to line merging (see below).
- `Split { Horizontal }`: emit children top-to-bottom, blank line between.
- `Split { Vertical }`: emit children left-to-right, blank line between (so that columns become sequential paragraphs, preserving order).
- `Table`: emit GFM:

      | cell | cell |
      | --- | --- |
      | cell | cell |

  Cell text = inline-rendered concatenation of the cell leaf's text runs (existing `Line::render_inline`). First row is treated as header if its dominant font signature is bolder/larger than other rows' (use `HeadingClassifier::classify` heuristic).
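The traversal order and the GFM output can be sketched with a stripped-down zone type. This is an illustration only: leaves and cells are pre-rendered strings (the real emitter takes the `HeadingClassifier` and renders items itself), children are assumed to be stored in reading order, the first row is always used as the header row, and `render_gfm_table` is a hypothetical helper name.

```rust
pub enum SplitDir { Vertical, Horizontal }

/// Simplified zone tree: leaves and table cells hold already-rendered text.
pub enum Zone {
    Leaf(String),
    Split { dir: SplitDir, children: Vec<Zone> },
    Table { rows: Vec<Vec<String>> },
}

/// Render rows as a GFM table, using the first row as the header row.
/// Pipes inside cells are escaped and newlines flattened so each cell
/// stays on one table line.
pub fn render_gfm_table(rows: &[Vec<String>]) -> String {
    let mut lines = Vec::new();
    for (i, row) in rows.iter().enumerate() {
        let cells: Vec<String> = row
            .iter()
            .map(|c| c.replace('|', "\\|").replace('\n', " "))
            .collect();
        lines.push(format!("| {} |", cells.join(" | ")));
        if i == 0 {
            // Delimiter row required by GFM directly after the header.
            lines.push(format!("| {} |", vec!["---"; row.len()].join(" | ")));
        }
    }
    lines.join("\n")
}

/// Emit a zone tree in reading order. Both split directions emit their
/// children sequentially with a blank line between them; the direction only
/// determined how the children were ordered when the tree was built.
pub fn emit_zone(zone: &Zone) -> String {
    match zone {
        Zone::Leaf(markdown) => markdown.clone(),
        Zone::Split { children, .. } => children
            .iter()
            .map(emit_zone)
            .collect::<Vec<_>>()
            .join("\n\n"),
        Zone::Table { rows } => render_gfm_table(rows),
    }
}
```

A page with a heading, a two-column body, and a 2×2 table therefore comes out as the heading, the left column, the right column, and then the table, each separated by a blank line.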
Header/footer stripping (NEW, in pdf.rs)
Implemented as a pre-pass run once after pass 1, before per-page extraction enters its zone-segmentation phase. Approach (repetition-based, answered by user):
- For each page, collect candidate banner text from items in the top 10% and bottom 10% of page bbox.
- Normalize each candidate (collapse whitespace, replace runs of digits with `\d+` so "Page 1" and "Page 12" match).
- Across all pages, count occurrences. Any normalized banner appearing on ≥ 3 pages (or ≥ 50% of pages for very short docs) is marked as a banner.
- During per-page extraction, drop matching items from the elements list before calling `zone::segment`.
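A minimal sketch of the repetition-based detector follows. The function names are hypothetical, candidates are passed in as plain strings per page, and one detail is an assumption rather than something stated above: the "≥ 50% of pages" clause additionally requires at least 2 pages, so that a one-page document does not have everything in its top and bottom bands stripped.

```rust
use std::collections::{HashMap, HashSet};

/// True if `y_center` lies in the top or bottom 10% of the page (PDF y is up).
pub fn in_banner_band(y_center: f32, page_bottom: f32, page_top: f32) -> bool {
    let band = (page_top - page_bottom) * 0.10;
    y_center >= page_top - band || y_center <= page_bottom + band
}

/// Collapse whitespace and replace each run of ASCII digits with the
/// literal placeholder `\d+`, so "Page 1" and "Page 12" share one key.
pub fn normalize_banner(text: &str) -> String {
    let collapsed = text.split_whitespace().collect::<Vec<_>>().join(" ");
    let mut out = String::with_capacity(collapsed.len());
    let mut in_digits = false;
    for ch in collapsed.chars() {
        if ch.is_ascii_digit() {
            if !in_digits {
                out.push_str("\\d+");
            }
            in_digits = true;
        } else {
            out.push(ch);
            in_digits = false;
        }
    }
    out
}

/// `pages[i]` holds the candidate banner strings from page i's bands.
/// Returns the normalized keys that repeat often enough to be banners.
pub fn detect_banners(pages: &[Vec<String>]) -> HashSet<String> {
    let mut page_counts: HashMap<String, usize> = HashMap::new();
    for candidates in pages {
        // Count each key at most once per page.
        let unique: HashSet<String> =
            candidates.iter().map(|c| normalize_banner(c)).collect();
        for key in unique {
            *page_counts.entry(key).or_insert(0) += 1;
        }
    }
    page_counts
        .into_iter()
        // >= 3 pages, or (assumption) >= 2 pages covering at least half the doc.
        .filter(|(_, n)| *n >= 3 || (*n >= 2 && *n * 2 >= pages.len()))
        .map(|(key, _)| key)
        .collect()
}
```

During per-page extraction, an item would be dropped when it sits in a band and its normalized text is in the returned set. The `\d+` here is only a placeholder token inside the key; no regex matching is involved.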
Line-grouping tweaks inside Leaf zones
Inside emit_page_markdown (now emit_leaf):
- X-gap guard (high-leverage even with zone segmentation, defensive): when merging a text element into the current line (`pdf.rs:302-315`), require `te.left - last_run.x_right <= median_space_width * 4.0`. `median_space_width` is computed once per page from the inter-run X gaps. Rejection forces a new line, preventing residual cell-bleed if zone detection missed a separator.
- Paragraph break (`group_into_paragraphs`):
  - Lower threshold from `1.5×` to `1.2×` line height.
  - Add: start a new paragraph if `line.x_start - prev.x_start > median_indent` (catches first-line-indent paragraph styles).
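The two predicates can be sketched in isolation. The names `median`, `may_join_line`, `LineGeom`, and `starts_new_paragraph` are hypothetical, and two definitions are assumptions because the exact ones live in `pdf.rs`: `y_gap` is taken as the distance between consecutive line centers, and "line height" as the previous line's height.

```rust
/// Upper median of a list of values; `None` for an empty list.
pub fn median(mut values: Vec<f32>) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    Some(values[values.len() / 2])
}

/// X-gap guard: a run may join the current line only if the horizontal gap
/// to the previous run is at most 4x the page's median space width.
pub fn may_join_line(last_run_x_right: f32, next_left: f32, median_space_width: f32) -> bool {
    next_left - last_run_x_right <= median_space_width * 4.0
}

/// Minimal line geometry for the paragraph-break test (PDF y is up).
pub struct LineGeom {
    pub x_start: f32,
    pub y_center: f32,
    pub height: f32,
}

/// Break when the vertical gap exceeds 1.2x the line height, or when the
/// new line is indented past `median_indent` relative to the previous one.
pub fn starts_new_paragraph(prev: &LineGeom, line: &LineGeom, median_indent: f32) -> bool {
    let y_gap = prev.y_center - line.y_center;
    y_gap > prev.height * 1.2 || line.x_start - prev.x_start > median_indent
}
```

With inter-run gaps of `[3.0, 2.5, 40.0, 3.5]` the median is 3.5, so a 10-unit gap still merges while the 40-unit gap (a likely missed column or cell separator) forces a new line.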
Files to modify / create
| File | Action |
|---|---|
| `src/storage/preprocess/zone.rs` | CREATE: geometry types + `segment` + `classify_tables` |
| `src/storage/preprocess/mod.rs` | Add `mod zone;` (kept private) |
| `src/storage/preprocess/pdf.rs` | Replace `extract_pdf` body; replace `emit_page_markdown` with `emit_zone`; add header/footer detector + X-gap guard + paragraph indent check |
| `src/storage/preprocess/headings.rs` | NO CHANGE: keep `HeadingClassifier::build` / `classify` and `FontSignature::new` exactly as-is |
| `src/storage/preprocess/error.rs` | NO CHANGE: reuse `PreprocessorError::PdfParse` for layout-detection failures |
| `src/storage/preprocess/orchestrator.rs`, `slice_orchestrator.rs` | NO CHANGE: `ExtractedDoc` shape stays the same |
| `tests/preprocess_pdf.rs` | Extend smoke test, add a 2-column fixture and a table fixture (best-effort; fixtures may need to be hand-crafted since the fixture dir is mostly empty) |
Reused symbols (do not redefine)
- `headings::HeadingClassifier` (`build`, `classify`, `body`, `levels`): drives heading levels exactly as today
- `headings::FontSignature` (`new`, `size_bucket`, `is_bold`, `is_italic`)
- `pdf::TextElement`, `pdf::ImageElement`: implement the new `zone::Bounded` trait on these instead of duplicating geometry
- `pdf::Line`, `pdf::Paragraph`, `pdf::TextRun`, `pdf::extract_image_figure`, `pdf::font_signature_from_text_object`: keep as-is, called from `emit_leaf`
- The existing tracing breadcrumbs added in the previous turn: extend with new `debug!("zone segmented", splits=..., leaves=...)` and `debug!("table detected", rows=..., cols=...)` events
Verification
- Compile: `cargo check --bin storage_server --bin infra_server`; it must pass with no new warnings.
- Existing test: `cargo test --test preprocess_pdf`; `pdf_preprocess_smoke` must still pass on `tests/fixtures/preprocess/sample.pdf`.
- Unit tests for `zone.rs`: add `#[cfg(test)] mod tests` covering:
  - Single-column page → 1 leaf zone.
  - Two-column page (synthetic items) → top-level `Split { Vertical }` with 2 leaves.
  - 3×3 grid (synthetic items) → `Table` after classification.
  - Heading-then-body page → `Split { Horizontal }` with heading leaf + body leaf.
  - Pure noise (no gaps wider than thresholds) → 1 leaf.
- End-to-end smoke (manual):
  - Drop the L14 PDF that produced the original word-salad output into `tests/fixtures/preprocess/`.
  - Add a `#[ignore]` test that loads it, calls `extract_pdf`, and writes the markdown to a temp file. Run with `cargo test -- --ignored l14`. Inspect the markdown manually: verify that columns are sequential, the banner is gone, and body sentences read naturally.
- Live upload:
  - `cargo run --bin infra_server`
  - Upload the same PDF via the file_add endpoint or storage MCP `file_add`.
  - With the existing breadcrumb logging (`RUST_LOG=info,infrastructure=debug,...` from the previous turn), the server log should show `PDF preprocess: page N` lines, `zone segmented` debug entries, optional `table detected` lines, and the final `slice orchestrator: done` with `pages_created` > 0.
  - Open the resulting KB pages in the SPA and read them.
Out of scope (follow-up work)
- OCR / scanned PDFs (no text objects → empty markdown today, same after refactor).
- Right-to-left scripts (sort key for vertical splits flips).
- Table-cell merging across rows/cols.
- Footnote detection.