{"feature_extractor_name":"document_graph_extractor","version":"v1","feature_extractor_id":"document_graph_extractor_v1","description":"Extracts spatial blocks from PDFs with layout classification and confidence scoring. Decomposes documents into paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content. Includes optional VLM correction for low-confidence blocks. Best for archival documents, scanned files, and documents requiring spatial understanding.\n\n**Pipeline Steps:**\n1. Filter dataset to collection (if collection_id provided)\n2. Find and resolve PDF URL from row data\n3. **Layout Detection Mode Fork:**\n   - **If use_layout_detection=true (NEW - ML-based):**\n     a. PaddleOCR layout detection (finds ALL elements: text, images, tables)\n     b. Skip to Step 4 (object_type already set by detector)\n   - **If use_layout_detection=false (LEGACY - Text-only):**\n     a. PyMuPDF span extraction (text with bounding boxes)\n     b. Spatial clustering (group nearby spans into logical blocks)\n     c. Layout classification (rule-based: paragraph, table, form, etc.)\n4. Confidence scoring (A/B/C/D tags based on extraction quality)\n5. Text cleaning (remove OCR artifacts, normalize whitespace)\n6. **Conditional:** Page rendering (if generate_thumbnails=true OR use_vlm_correction=true)\n   - Full page and segment thumbnails at configured DPI\n7. **Conditional:** VLM correction (if use_vlm_correction=true AND not fast_mode AND confidence C/D)\n   - Gemini/OpenAI/Anthropic vision models correct low-confidence text\n8. **Conditional:** Text embedding (if run_text_embedding=true)\n   - E5-Large embeddings (1024D) for semantic search\n9. 
**Output:** Block-level documents with text, layout type, bbox, confidence, and embeddings\n\n**Use for:** Archival documents, scanned PDFs, forms processing, structured extraction, document understanding.\n\n**Not for:** Simple text extraction (use text_extractor), images (use image_extractor).","icon":"file-scan","source":"builtin","input_schema":{"description":"Input schema for the document graph extractor.\n\nDefines the PDF file that will be processed into spatial blocks.","examples":[{"description":"Archival document for block extraction","pdf":"s3://archive-docs/hoffman-file-part-35.pdf"}],"properties":{"pdf":{"description":"REQUIRED. URL or S3 path to PDF file for processing. PDF will be decomposed into spatial blocks with layout classification. Supports any PDF version, both digital and scanned documents. Examples: 's3://bucket/document.pdf', 'https://example.com/report.pdf'","examples":["s3://my-bucket/documents/fbi-file-part-35.pdf","https://storage.googleapis.com/my-docs/archival-record.pdf"],"title":"Pdf","type":"string"}},"required":["pdf"],"title":"DocumentGraphExtractorInput","type":"object"},"output_schema":{"$defs":{"BoundingBox":{"description":"Bounding box coordinates for a block.","properties":{"x0":{"description":"Left edge x-coordinate","title":"X0","type":"number"},"y0":{"description":"Top edge y-coordinate","title":"Y0","type":"number"},"x1":{"description":"Right edge x-coordinate","title":"X1","type":"number"},"y1":{"description":"Bottom edge y-coordinate","title":"Y1","type":"number"}},"required":["x0","y0","x1","y1"],"title":"BoundingBox","type":"object"},"ConfidenceTag":{"description":"Confidence tags for extraction quality.","enum":["A","B","C","D"],"title":"ConfidenceTag","type":"string"},"ObjectType":{"description":"Block/object types produced by document graph extractor.","enum":["paragraph","table","form","list","header","footer","figure","handwritten"],"title":"ObjectType","type":"string"}},"description":"Output schema for a 
single block produced by the document graph extractor.\n\nEach block represents a spatially-clustered region of the document with\nlayout classification and confidence scoring.","examples":[{"bbox":{"x0":14.0,"x1":551.0,"y0":272.0,"y1":375.0},"block_index":2,"confidence_tag":"A","description":"High-confidence paragraph block","object_type":"paragraph","overall_confidence":0.85,"page_number":1,"source_file":"abbie-hoffman-part-35.pdf","text_corrected":"HOFFMAN has been a participant in conversations...","text_raw":"HOFFMAN has been a participant in conversations...","total_pages":15}],"properties":{"page_number":{"description":"Page number in original PDF (1-indexed)","title":"Page Number","type":"integer"},"object_type":{"$ref":"#/$defs/ObjectType","description":"Classified type of this block. PARAGRAPH: Regular text. TABLE: Tabular data. FORM: Form fields. LIST: Bulleted/numbered lists. HEADER/FOOTER: Page headers/footers. FIGURE: Images/diagrams. HANDWRITTEN: Handwritten content."},"block_index":{"description":"Block index within the page (0-indexed)","title":"Block Index","type":"integer"},"bbox":{"$ref":"#/$defs/BoundingBox","description":"Bounding box coordinates for this block on the page"},"text_raw":{"description":"Original extracted text from the block (before cleaning)","title":"Text Raw","type":"string"},"text_corrected":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Cleaned and/or VLM-corrected text. Contains cleaned text for high-confidence blocks, VLM-corrected text for low-confidence blocks (if enabled).","title":"Text Corrected"},"overall_confidence":{"description":"Extraction confidence score (0.0-1.0)","maximum":1.0,"minimum":0.0,"title":"Overall Confidence","type":"number"},"confidence_tag":{"$ref":"#/$defs/ConfidenceTag","description":"Confidence category. A: >=0.85 (high). B: >=0.70 (medium). C: >=0.50 (low, may need verification). 
D: <0.50 (very low, needs VLM)."},"document_graph_extractor_v1_text_embedding":{"anyOf":[{"items":{"type":"number"},"type":"array"},{"type":"null"}],"default":null,"description":"Dense vector embedding for text content (1024-dim E5)","title":"Document Graph Extractor V1 Text Embedding"},"thumbnail_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL to full page thumbnail (low-res image of entire page)","title":"Thumbnail Url"},"segment_thumbnail_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL to segment thumbnail (cropped to block's bounding box)","title":"Segment Thumbnail Url"},"total_pages":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Total pages in source PDF","title":"Total Pages"},"source_file":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Original source file name","title":"Source File"}},"required":["page_number","object_type","block_index","bbox","text_raw","overall_confidence","confidence_tag"],"title":"DocumentGraphExtractorOutput","type":"object"},"parameter_schema":{"description":"Parameters for the document graph extractor.\n\nThis extractor decomposes PDFs into spatial blocks with layout classification,\nconfidence scoring, and optional VLM correction for degraded documents.\n\n**When to Use**:\n    - Historical/archival document processing (FBI files, old records)\n    - Scanned documents with mixed quality\n    - Documents requiring spatial understanding (forms, tables, multi-column)\n    - When you need block-level granularity with bounding boxes\n    - When confidence scoring is needed for downstream filtering\n\n**When NOT to Use**:\n    - Simple text-only documents -> Use text_extractor instead\n    - When page-level granularity is sufficient -> Use pdf_extractor instead\n    - Real-time processing requirements -> VLM correction adds latency","examples":[{"description":"Fast processing mode (no VLM, maximum 
throughput)","extractor_type":"document_graph_extractor","fast_mode":true,"generate_thumbnails":true,"layout_detector":"pymupdf","run_text_embedding":true,"use_case":"High-volume document ingestion where speed matters more than perfect accuracy","use_layout_detection":true},{"description":"Archival documents with VLM correction (recommended for old scans)","extractor_type":"document_graph_extractor","layout_detector":"pymupdf","min_confidence_for_vlm":0.6,"render_dpi":150,"run_text_embedding":true,"use_case":"Historical archives, FBI files, old scanned documents with degraded quality","use_layout_detection":true,"use_vlm_correction":true,"vlm_model":"gemini-2.5-flash","vlm_provider":"google"},{"description":"SOTA accuracy mode with Docling (best for tables/figures)","extractor_type":"document_graph_extractor","fast_mode":true,"generate_thumbnails":true,"layout_detector":"docling","run_text_embedding":true,"use_case":"Documents with complex tables, figures, or requiring accurate semantic typing","use_layout_detection":true}],"properties":{"extractor_type":{"const":"document_graph_extractor","default":"document_graph_extractor","description":"Discriminator field for parameter type identification. Must be 'document_graph_extractor'.","title":"Extractor Type","type":"string"},"use_layout_detection":{"default":true,"description":"Enable ML-based layout detection to find ALL document elements (text, images, tables, figures). When enabled, uses the configured layout_detector to detect and extract both text regions AND non-text elements (scanned images, figures, charts) as separate documents. **Recommended for**: Scanned documents, image-heavy PDFs, mixed content documents. **When disabled**: Falls back to text-only extraction (faster but misses images). Default: True (detects all elements including images).","title":"Use Layout Detection","type":"boolean"},"layout_detector":{"default":"pymupdf","description":"Layout detection engine to use when use_layout_detection=True. 
'pymupdf': Fast, rule-based detection using PyMuPDF heuristics (~15 pages/sec). 'docling': SOTA ML-based detection using IBM Docling with DiT model (~3-8 sec/doc). **Docling advantages**: Better semantic type detection (section_header vs paragraph), true table structure extraction (rows/cols), more accurate figure detection. **PyMuPDF advantages**: Much faster, lower memory usage, simpler dependencies. Default: 'pymupdf' for speed. Use 'docling' for accuracy-critical applications.","enum":["pymupdf","docling"],"title":"Layout Detector","type":"string"},"vertical_threshold":{"default":15.0,"description":"Maximum vertical gap (in points) between lines to be grouped in same block. Increase for looser grouping, decrease for tighter blocks. Default 15pt works well for standard documents.","maximum":100.0,"minimum":1.0,"title":"Vertical Threshold","type":"number"},"horizontal_threshold":{"default":50.0,"description":"Maximum horizontal distance (in points) for overlap detection. Affects column detection and block merging. Increase for wider columns, decrease for narrow layouts.","maximum":200.0,"minimum":1.0,"title":"Horizontal Threshold","type":"number"},"min_text_length":{"default":20,"description":"Minimum text length (characters) to keep a block. Blocks with less text are filtered out. Helps remove noise and tiny fragments.","maximum":500,"minimum":1,"title":"Min Text Length","type":"integer"},"base_confidence":{"default":0.85,"description":"Base confidence score for embedded (native) text. Penalties are subtracted for OCR artifacts, encoding issues, etc.","maximum":1.0,"minimum":0.0,"title":"Base Confidence","type":"number"},"min_confidence_for_vlm":{"default":0.6,"description":"Confidence threshold below which VLM correction is triggered. Blocks with confidence < this value get sent to VLM for correction. 
Only applies when use_vlm_correction=True.","maximum":1.0,"minimum":0.0,"title":"Min Confidence For Vlm","type":"number"},"use_vlm_correction":{"default":true,"description":"Enable VLM (Vision Language Model) correction for low-confidence blocks. Uses Gemini/GPT-4V to correct OCR errors by analyzing the page image. Significantly slower (~1 page/sec) but improves accuracy for degraded docs.","title":"Use Vlm Correction","type":"boolean"},"fast_mode":{"default":false,"description":"Skip VLM correction entirely for maximum throughput (~15 pages/sec). Overrides use_vlm_correction. Use when speed is more important than accuracy.","title":"Fast Mode","type":"boolean"},"vlm_provider":{"default":"google","description":"LLM provider for VLM correction. Options: 'google' (Gemini), 'openai' (GPT-4V), 'anthropic' (Claude). Google recommended for best vision quality.","title":"Vlm Provider","type":"string"},"vlm_model":{"default":"gemini-2.5-flash","description":"Specific model for VLM correction. Examples: 'gemini-2.5-flash', 'gpt-4o', 'claude-3-5-sonnet'.","title":"Vlm Model","type":"string"},"llm_api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"API key for VLM correction (BYOK - Bring Your Own Key). Supports:\n- Direct key: 'sk-proj-abc123...'\n- Secret reference: '{{SECRET.openai_api_key}}'\n\nWhen using secret reference, the key is loaded from your organization's secrets vault at runtime. Store secrets via POST /v1/organizations/secrets.\n\nIf not provided, uses Mixpeek's default API keys.","title":"Llm Api Key"},"run_text_embedding":{"default":true,"description":"Generate text embeddings for semantic search over block content. Uses E5-Large (1024-dim) for multilingual support.","title":"Run Text Embedding","type":"boolean"},"render_dpi":{"default":150,"description":"DPI for page rendering (used for VLM correction). 72: Fast, lower quality. 150: Balanced (recommended). 
300: High quality, slower.","maximum":300,"minimum":72,"title":"Render Dpi","type":"integer"},"generate_thumbnails":{"default":true,"description":"Generate thumbnail images for blocks. Useful for visual previews and UI display.","title":"Generate Thumbnails","type":"boolean"},"thumbnail_mode":{"default":"both","description":"Thumbnail generation mode. 'full_page': Low-res thumbnail of entire page. 'segment': Cropped thumbnail of just the block's bounding box. 'both': Generate both types (recommended for flexibility).","title":"Thumbnail Mode","type":"string"},"thumbnail_dpi":{"default":72,"description":"DPI for thumbnail generation. Lower DPI = smaller files. 72: Standard web quality. 36: Very small thumbnails.","maximum":150,"minimum":36,"title":"Thumbnail Dpi","type":"integer"}},"title":"DocumentGraphExtractorParams","type":"object"},"supported_input_types":["pdf"],"max_inputs":{"pdf":1},"default_parameters":{},"costs":{"tier":2,"tier_label":"MODERATE","rates":[{"unit":"page","credits_per_unit":5,"description":"Document page processing with layout analysis"},{"unit":"extraction","credits_per_unit":20,"description":"VLM correction per low-confidence block"}]},"required_vector_indexes":[{"feature_uri":"mixpeek://document_graph_extractor@v1/intfloat__multilingual_e5_large_instruct","name":"intfloat__multilingual_e5_large_instruct","description":"Vector index for document graph text embeddings","type":"single","index":{"name":"document_graph_extractor_v1_text_embedding","description":"Dense vector embedding for block text content","dimensions":1024,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["text","string"],"inference_name":"intfloat__multilingual_e5_large_instruct","inference_service_id":"intfloat/multilingual-e5-large-instruct","purpose":null,"vector_name_override":null,"supports_multi_query":false}}],"required_payload_indexes":[],"position_fields":["page_number","object_type","block_index"]}
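The confidence thresholds (`A >= 0.85`, `B >= 0.70`, `C >= 0.50`, `D < 0.50`) and the tier-2 credit rates (5 credits per page, 20 credits per VLM-corrected block) documented above can be sketched as small helpers. This is an illustrative sketch for consumers of the spec, not part of the extractor API; the function names are hypothetical.

```python
def confidence_tag(score: float) -> str:
    """Map overall_confidence (0.0-1.0) to the A/B/C/D tag per the schema thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("overall_confidence must be in [0.0, 1.0]")
    if score >= 0.85:
        return "A"  # high
    if score >= 0.70:
        return "B"  # medium
    if score >= 0.50:
        return "C"  # low, may need verification
    return "D"      # very low, needs VLM correction


def estimate_credits(pages: int, vlm_corrected_blocks: int = 0) -> int:
    """Tier-2 (MODERATE) cost model from the spec's rates table:
    5 credits per page processed, plus 20 credits per block sent to VLM correction."""
    return pages * 5 + vlm_corrected_blocks * 20
```

For example, a 15-page scanned document with 4 low-confidence blocks routed to the VLM would cost `estimate_credits(15, 4)`, i.e. 15 * 5 + 4 * 20 = 155 credits. Note that a block with confidence 0.55 is tagged "C" but, under the default `min_confidence_for_vlm=0.6`, would still be sent for VLM correction when that feature is enabled.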