{"feature_extractor_name":"web_scraper","version":"v1","feature_extractor_id":"web_scraper_v1","description":"Crawls websites and extracts content with multimodal embeddings. Supports documentation sites, job boards, news sites, and SPAs.\n\n**Embedding Types:**\n- Text (E5-Large 1024D): Semantic search over page content\n- Code (Jina Code 768D): Code similarity and API pattern matching\n- Images (SigLIP 768D): Semantic visual search (what is shown)\n- Images (DINOv2 768D): Visual structure comparison (how it looks)\n\n**Use for:** Documentation freshness detection, knowledge base building, job board ingestion, API example indexing, curriculum validation.","icon":"globe","source":"builtin","input_schema":{"description":"Input schema for the web scraper extractor.\n\nAccepts a URL to crawl. The extractor will recursively follow links\nand extract content from all discovered pages.","examples":[{"description":"AWS Boto3 documentation","url":"https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html"},{"description":"Job board","url":"https://boards.greenhouse.io/anthropic"}],"properties":{"url":{"description":"REQUIRED. Seed URL to start crawling from. Example: 'https://docs.example.com/api/'","examples":["https://boto3.amazonaws.com/v1/documentation/api/latest/","https://boards.greenhouse.io/anthropic","https://github.com/anthropics/anthropic-sdk-python"],"title":"Url","type":"string"}},"required":["url"],"title":"WebScraperExtractorInput","type":"object"},"output_schema":{"$defs":{"AssetLink":{"description":"A downloadable asset link discovered during crawling.\n\nDESIGN RATIONALE:\n----------------\nDuring web crawling, we encounter links to downloadable files (PDFs, documents,\narchives, etc.) that cannot be directly embedded as text. Rather than:\n1. Following these links and storing unusable binary data, OR\n2. Silently ignoring them via exclude_patterns\n\nWe capture them in a structured array. 
This enables downstream processing:\n- A separate PDF extractor collection can process these assets\n- Analytics on what documentation assets exist\n- Completeness tracking for documentation coverage\n\nUSAGE:\n------\nAsset links are captured during HTML crawling but NOT followed. They are stored\nas metadata on the parent page document. A downstream pipeline can then:\n1. Query pages with asset_links\n2. Send asset URLs to a dedicated document processing collection\n3. Link the extracted content back to the source page\n\nExample downstream workflow:\n    Page A (HTML) -> asset_links: [{url: \"guide.pdf\", ...}]\n                          |\n                          v\n    PDF Collection (with pdf_extractor) -> processes guide.pdf\n                          |\n                          v\n    Linked documents with parent_url reference","properties":{"url":{"description":"Full URL of the downloadable asset","title":"Url","type":"string"},"file_type":{"description":"Asset type detected from extension/content-type (pdf, doc, zip, etc.)","title":"File Type","type":"string"},"link_text":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Anchor text of the link (provides context about the asset)","title":"Link Text"},"link_title":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Title attribute of the link element","title":"Link Title"},"file_extension":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"File extension extracted from URL (e.g., '.pdf', '.docx')","title":"File Extension"}},"required":["url","file_type"],"title":"AssetLink","type":"object"},"CodeBlock":{"description":"A code block extracted from a web page.","properties":{"language":{"description":"Programming language (python, javascript, etc.)","title":"Language","type":"string"},"code":{"description":"The code 
content","title":"Code","type":"string"},"line_start":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Starting line in source","title":"Line Start"},"line_end":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Ending line in source","title":"Line End"}},"required":["language","code"],"title":"CodeBlock","type":"object"},"ExtractedImage":{"description":"An image extracted from a web page.","properties":{"src":{"description":"Image source URL","title":"Src","type":"string"},"alt":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Alt text","title":"Alt"},"title":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Title attribute","title":"Title"},"width":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Image width in pixels","title":"Width"},"height":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Image height in pixels","title":"Height"}},"required":["src"],"title":"ExtractedImage","type":"object"}},"description":"Output schema for a single document produced by the web scraper.\n\nEach crawled page (or chunk) produces one document with:\n- Text content + E5 embedding (1024D)\n- Code blocks + Jina Code embeddings (768D each)\n- Images + SigLIP and DINOv2 embeddings (768D each)\n- Page metadata (URL, title, depth, etc.)","examples":[{"code_blocks":[{"code":"import boto3\ns3 = boto3.client('s3')","language":"python"}],"content":"Getting Started with S3...","description":"Documentation page with code","intfloat__multilingual_e5_large_instruct":[0.01,-0.02],"jinaai__jina_embeddings_v2_base_code":[0.03,-0.04],"page_url":"https://boto3.amazonaws.com/.../quickstart.html","title":"Quickstart - Boto3 Docs"}],"properties":{"content":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Extracted text content from the page or 
chunk.","title":"Content"},"title":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Page title extracted from HTML.","title":"Title"},"page_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL of the source page.","title":"Page Url"},"code_blocks":{"anyOf":[{"items":{"$ref":"#/$defs/CodeBlock"},"type":"array"},{"type":"null"}],"default":null,"description":"Code blocks extracted from the page.","title":"Code Blocks"},"images":{"anyOf":[{"items":{"$ref":"#/$defs/ExtractedImage"},"type":"array"},{"type":"null"}],"default":null,"description":"Images extracted from the page.","title":"Images"},"asset_links":{"anyOf":[{"items":{"$ref":"#/$defs/AssetLink"},"type":"array"},{"type":"null"}],"default":null,"description":"Downloadable assets discovered on this page (PDFs, docs, archives). These links are captured for downstream processing by specialized extractors (e.g., PDF collection) but are NOT followed during crawling. Use this to build complete documentation coverage including non-HTML assets.","title":"Asset Links"},"intfloat__multilingual_e5_large_instruct":{"anyOf":[{"items":{"type":"number"},"maxItems":1024,"minItems":1024,"type":"array"},{"type":"null"}],"default":null,"description":"E5 embedding for text content (1024D). Derived from intfloat/multilingual-e5-large-instruct.","title":"Intfloat  Multilingual E5 Large Instruct"},"jinaai__jina_embeddings_v2_base_code":{"anyOf":[{"items":{"type":"number"},"maxItems":768,"minItems":768,"type":"array"},{"type":"null"}],"default":null,"description":"Jina code embedding for code blocks (768D). Derived from jinaai/jina-embeddings-v2-base-code.","title":"Jinaai  Jina Embeddings V2 Base Code"},"google__siglip_base_patch16_224":{"anyOf":[{"items":{"type":"number"},"maxItems":768,"minItems":768,"type":"array"},{"type":"null"}],"default":null,"description":"SigLIP embedding for images (768D). 
Derived from google/siglip-base-patch16-224.","title":"Google  Siglip Base Patch16 224"},"facebook__dinov2_base":{"anyOf":[{"items":{"type":"number"},"maxItems":768,"minItems":768,"type":"array"},{"type":"null"}],"default":null,"description":"DINOv2 visual structure embedding (768D). Derived from facebook/dinov2-base.","title":"Facebook  Dinov2 Base"},"chunk_index":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Index of this chunk within the page.","title":"Chunk Index"},"total_chunks":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Total chunks from this page.","title":"Total Chunks"},"crawl_depth":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Depth from seed URL (0=seed page).","title":"Crawl Depth"},"parent_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL of the page that linked to this one.","title":"Parent Url"}},"title":"WebScraperExtractorOutput","type":"object"},"parameter_schema":{"$defs":{"ChunkStrategy":{"description":"Strategy for splitting page content into chunks.","enum":["none","sentences","paragraphs","words","characters"],"title":"ChunkStrategy","type":"string"},"CrawlMode":{"description":"Mode for crawling web pages.\n\nValues:\n    DETERMINISTIC: BFS crawl following all links up to max_depth\n    SEMANTIC: LLM-guided crawl prioritizing pages relevant to crawl_goal","enum":["deterministic","semantic"],"title":"CrawlMode","type":"string"},"DocumentIdStrategy":{"description":"Strategy for generating deterministic document IDs.\n\nValues:\n    URL: hash(page_url + chunk_index) - stable across re-crawls\n    POSITION: hash(seed_url + page_index + chunk_index) - order-based\n    CONTENT: hash(content) - deduplicates identical content","enum":["url","position","content"],"title":"DocumentIdStrategy","type":"string"},"RenderStrategy":{"description":"Strategy for rendering web pages.\n\nValues:\n    STATIC: Fast HTTP fetch, works for 
most sites\n    JAVASCRIPT: Browser rendering via Playwright for SPAs\n    AUTO: Try static first, fall back to JS if content too short","enum":["static","javascript","auto"],"title":"RenderStrategy","type":"string"}},"description":"Parameters for the web scraper extractor.\n\nThe web scraper extractor crawls websites and extracts content with four types\nof embeddings for comprehensive multimodal search:\n\n**Embedding Types:**\n- Text (E5-Large): 1024D embeddings for page content\n- Code (Jina Code): 768D embeddings for code blocks\n- Images (SigLIP): 768D semantic embeddings for figures/screenshots\n- Images (DINOv2): 768D structure embeddings for visual layout comparison\n\n**Crawl Modes:**\n- DETERMINISTIC: BFS following all links (default, predictable)\n- SEMANTIC: LLM-guided, prioritizes pages matching crawl_goal\n\n**Rendering Strategies:**\n- STATIC: Fast HTTP fetch (works for most sites)\n- JAVASCRIPT: Playwright browser for SPAs (React/Vue/Angular)\n- AUTO: Tries static, falls back to JS if content too short (default)\n\n**Use Cases:**\n- Documentation freshness: Crawl docs, compare against course content\n- Job board ingestion: Extract job listings with structured data\n- Knowledge base building: Convert websites to searchable collections\n- Code example indexing: Find API usage patterns across docs","examples":[{"chunk_size":3,"chunk_strategy":"paragraphs","description":"Documentation site crawl","extractor_type":"web_scraper","max_depth":3,"max_pages":100},{"description":"Job board extraction","extractor_type":"web_scraper","max_depth":1,"max_pages":50,"render_strategy":"auto","response_shape":"Extract job title, department, location, and requirements"},{"crawl_goal":"Find all S3 upload examples and API documentation","crawl_mode":"semantic","description":"Semantic crawl for API docs","extractor_type":"web_scraper","generate_code_embeddings":true,"max_pages":200},{"delay_between_requests":0.5,"description":"Large-scale catalogue with 
resilience","extractor_type":"web_scraper","max_depth":5,"max_pages":10000,"max_retries":5,"respect_retry_after":true},{"description":"Protected site with proxy rotation","extractor_type":"web_scraper","max_pages":5000,"persist_cookies":true,"proxies":["http://proxy1.example.com:8080","http://proxy2.example.com:8080"],"rotate_proxy_every_n_requests":50,"rotate_proxy_on_error":true}],"properties":{"extractor_type":{"const":"web_scraper","default":"web_scraper","description":"Discriminator field for parameter type identification.","title":"Extractor Type","type":"string"},"max_depth":{"default":2,"description":"Maximum link depth to crawl. 0=seed page only, 1=seed+direct links, etc. Default: 2. Max: 10.","maximum":10,"minimum":0,"title":"Max Depth","type":"integer"},"max_pages":{"default":50,"description":"Maximum pages to crawl. Default: 50. Max: 500.","maximum":500,"minimum":1,"title":"Max Pages","type":"integer"},"crawl_timeout":{"default":300,"description":"Maximum total time for crawling in seconds. Default: 300 (5 minutes). Increase for large sites with many pages. Max: 3600 (1 hour).","maximum":3600,"minimum":10,"title":"Crawl Timeout","type":"integer"},"crawl_mode":{"$ref":"#/$defs/CrawlMode","default":"deterministic","description":"Crawl strategy. DETERMINISTIC: BFS all links (predictable). SEMANTIC: LLM-guided, prioritizes relevant pages (requires crawl_goal)."},"crawl_goal":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Goal for semantic crawling. Only used when crawl_mode=SEMANTIC. Example: 'Find all S3 API documentation and examples'","title":"Crawl Goal"},"render_strategy":{"$ref":"#/$defs/RenderStrategy","default":"auto","description":"How to render pages. AUTO (default): tries static, falls back to JS. STATIC: fast HTTP fetch. JAVASCRIPT: Playwright browser for SPAs."},"include_patterns":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Regex patterns for URLs to include. 
Example: ['/docs/', '/api/']","title":"Include Patterns"},"exclude_patterns":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Regex patterns for URLs to exclude. Example: ['/blog/', '\\.pdf$']","title":"Exclude Patterns"},"chunk_strategy":{"$ref":"#/$defs/ChunkStrategy","default":"none","description":"How to split page content. NONE: one chunk per page. SENTENCES/PARAGRAPHS: semantic boundaries. WORDS/CHARACTERS: fixed size chunks."},"chunk_size":{"default":500,"description":"Target size for each chunk (in units of chunk_strategy).","maximum":10000,"minimum":1,"title":"Chunk Size","type":"integer"},"chunk_overlap":{"default":50,"description":"Overlap between chunks to preserve context.","maximum":5000,"minimum":0,"title":"Chunk Overlap","type":"integer"},"document_id_strategy":{"$ref":"#/$defs/DocumentIdStrategy","default":"url","description":"How to generate document IDs. URL (default): stable across re-crawls. POSITION: order-based. CONTENT: deduplicates identical content."},"generate_text_embeddings":{"default":true,"description":"Generate E5 embeddings for text content.","title":"Generate Text Embeddings","type":"boolean"},"generate_code_embeddings":{"default":true,"description":"Generate Jina code embeddings for code blocks.","title":"Generate Code Embeddings","type":"boolean"},"generate_image_embeddings":{"default":true,"description":"Generate SigLIP embeddings for images/figures.","title":"Generate Image Embeddings","type":"boolean"},"generate_structure_embeddings":{"default":true,"description":"Generate DINOv2 visual structure embeddings for layout comparison.","title":"Generate Structure Embeddings","type":"boolean"},"response_shape":{"anyOf":[{"type":"string"},{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Optional structured extraction schema. Natural language or JSON schema. 
Example: 'Extract API version, deprecated methods, and example code'","title":"Response Shape"},"llm_provider":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"LLM provider for structured extraction: openai, google, anthropic","title":"Llm Provider"},"llm_model":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"LLM model for structured extraction.","title":"Llm Model"},"llm_api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"API key for LLM operations (BYOK - Bring Your Own Key). Supports:\n- Direct key: 'sk-proj-abc123...'\n- Secret reference: '{{SECRET.openai_api_key}}'\n\nWhen using secret reference, the key is loaded from your organization's secrets vault at runtime. Store secrets via POST /v1/organizations/secrets.\n\nIf not provided, uses Mixpeek's default API keys.","title":"Llm Api Key"},"max_retries":{"default":3,"description":"Maximum retry attempts for failed HTTP requests. Uses exponential backoff with jitter. Default: 3.","maximum":10,"minimum":0,"title":"Max Retries","type":"integer"},"retry_base_delay":{"default":1.0,"description":"Base delay in seconds for retry backoff. Actual delay = base * 2^attempt + jitter. Default: 1.0.","maximum":30.0,"minimum":0.1,"title":"Retry Base Delay","type":"number"},"retry_max_delay":{"default":30.0,"description":"Maximum delay in seconds between retries. Default: 30.","maximum":300.0,"minimum":1.0,"title":"Retry Max Delay","type":"number"},"respect_retry_after":{"default":true,"description":"Respect Retry-After header from 429/503 responses. If False, uses exponential backoff instead. Default: True.","title":"Respect Retry After","type":"boolean"},"proxies":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"List of proxy URLs for rotation. Supports formats: 'http://host:port', 'http://user:pass@host:port', 'socks5://host:port'. 
Proxies rotate on errors or every N requests.","title":"Proxies"},"rotate_proxy_on_error":{"default":true,"description":"Rotate to next proxy when request fails. Default: True.","title":"Rotate Proxy On Error","type":"boolean"},"rotate_proxy_every_n_requests":{"default":0,"description":"Rotate proxy every N requests (0 = disabled). Useful for avoiding IP-based rate limits. Default: 0 (disabled).","maximum":1000,"minimum":0,"title":"Rotate Proxy Every N Requests","type":"integer"},"captcha_service_provider":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Captcha solving service provider: '2captcha', 'anti-captcha', 'capsolver'. If not set, captcha pages are skipped gracefully.","title":"Captcha Service Provider"},"captcha_service_api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"API key for captcha solving service. Supports secret reference: '{{SECRET.captcha_api_key}}'. Required if captcha_service_provider is set.","title":"Captcha Service Api Key"},"detect_captcha":{"default":true,"description":"Detect captcha challenges (Cloudflare, reCAPTCHA, hCaptcha). If detected and no solver configured, page is skipped. Default: True.","title":"Detect Captcha","type":"boolean"},"persist_cookies":{"default":true,"description":"Persist cookies across requests within a crawl session. Useful for sites requiring authentication. Default: True.","title":"Persist Cookies","type":"boolean"},"custom_headers":{"anyOf":[{"additionalProperties":{"type":"string"},"type":"object"},{"type":"null"}],"default":null,"description":"Custom HTTP headers to include in all requests. Example: {'Authorization': 'Bearer token', 'X-Custom': 'value'}","title":"Custom Headers"},"delay_between_requests":{"default":0.0,"description":"Delay in seconds between consecutive requests. Useful for polite crawling and avoiding rate limits. 
Default: 0 (no delay).","maximum":60.0,"minimum":0.0,"title":"Delay Between Requests","type":"number"}},"title":"WebScraperExtractorParams","type":"object"},"supported_input_types":["string","text"],"max_inputs":{"string":1},"default_parameters":{},"costs":{"tier":3,"tier_label":"COMPLEX","rates":[{"unit":"page","credits_per_unit":5,"description":"Web page crawl and text extraction"},{"unit":"extraction","credits_per_unit":1,"description":"Code block embedding with Jina Code"},{"unit":"image","credits_per_unit":2,"description":"Image embedding with SigLIP"}]},"required_vector_indexes":[{"feature_uri":"mixpeek://web_scraper@v1/intfloat__multilingual_e5_large_instruct","name":"intfloat__multilingual_e5_large_instruct","description":"Vector index for text content embeddings.","type":"single","index":{"name":"intfloat__multilingual_e5_large_instruct","description":"E5 embedding for text content.","dimensions":1024,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["string","text"],"inference_name":"intfloat__multilingual_e5_large_instruct","inference_service_id":"intfloat/multilingual-e5-large-instruct","purpose":"text","vector_name_override":null}},{"feature_uri":"mixpeek://web_scraper@v1/jinaai__jina_embeddings_v2_base_code","name":"jinaai__jina_embeddings_v2_base_code","description":"Vector index for code block embeddings.","type":"single","index":{"name":"jinaai__jina_embeddings_v2_base_code","description":"Jina code embedding for code blocks.","dimensions":768,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["string","text"],"inference_name":"jinaai__jina_embeddings_v2_base_code","inference_service_id":"jinaai/jina-embeddings-v2-base-code","purpose":"code","vector_name_override":null}},{"feature_uri":"mixpeek://web_scraper@v1/google__siglip_base_patch16_224","name":"google__siglip_base_patch16_224","description":"Vector index for semantic image 
embeddings.","type":"single","index":{"name":"google__siglip_base_patch16_224","description":"SigLIP embedding for semantic visual content.","dimensions":768,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["image"],"inference_name":"google__siglip_base_patch16_224","inference_service_id":"google/siglip-base-patch16-224","purpose":"image","vector_name_override":null}},{"feature_uri":"mixpeek://web_scraper@v1/facebook__dinov2_base","name":"facebook__dinov2_base","description":"Vector index for visual structure embeddings.","type":"single","index":{"name":"facebook__dinov2_base","description":"DINOv2 embedding for fine-grained visual structure comparison.","dimensions":768,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["image"],"inference_name":"facebook__dinov2_base","inference_service_id":"facebook/dinov2-base","purpose":"image","vector_name_override":null}}],"required_payload_indexes":[],"position_fields":["page_url","doc_type","code_index","image_index","chunk_index"]}