[{"stage_id":"api_call","description":"Enrich documents with external API calls","category":"apply","icon":"external-link","parameter_schema":{"$defs":{"AuthConfig":{"description":"Authentication configuration for API calls.\n\nDefines how to authenticate with external APIs using credentials stored\nin the organization secrets vault. Credentials are NEVER stored in the\nstage configuration - only references to vault secrets.\n\n**⚠️ CRITICAL SECURITY WARNINGS**:\n- NEVER store actual credentials in configuration\n- ALWAYS use secret_ref to reference vault secrets\n- Credentials are encrypted at rest in vault\n- Decrypted values never appear in logs or API responses\n\n**How It Works**:\n1. Store secret via: POST /v1/organizations/secrets\n2. Reference secret via: auth.secret_ref in stage config\n3. At runtime: Secret is retrieved, decrypted, and injected into request\n4. Security: Original secret value never exposed or logged\n\n**Requirements**:\n- type: REQUIRED, authentication method (none, api_key, bearer, basic, custom_header)\n- secret_ref: REQUIRED (except for type=none), name of secret in vault\n- key: REQUIRED (for api_key and custom_header types), header/query param name\n- location: OPTIONAL (for api_key type), 'header' or 'query' (default: header)\n\n**Supported Authentication Types**:\n- Bearer tokens (OAuth 2.0, JWT): Most modern APIs\n- API keys: Weather APIs, Maps, etc.\n- Basic auth: Legacy systems\n- Custom headers: Non-standard auth schemes","examples":[{"description":"No authentication for public APIs","type":"none"},{"description":"Bearer token for GitHub API (OAuth 2.0)","secret_ref":"github_pat","type":"bearer"},{"description":"API key in header for Stripe","key":"Authorization","location":"header","secret_ref":"stripe_api_key","type":"api_key"},{"description":"API key in query parameter (less secure)","key":"apikey","location":"query","secret_ref":"weather_api_key","type":"api_key"},{"description":"Basic authentication (username:password)","secret_ref":"basic_auth_credentials","type":"basic"},{"description":"Custom header for non-standard APIs","key":"X-Custom-Auth","secret_ref":"custom_token","type":"custom_header"}],"properties":{"type":{"$ref":"#/$defs/AuthType","default":"none","description":"REQUIRED. Authentication method to use. Options: none (public API), api_key (API keys), bearer (OAuth/JWT), basic (HTTP Basic Auth), custom_header (non-standard headers). See AuthType enum for detailed description of each type.","examples":["bearer","api_key","basic"]},"secret_ref":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"REQUIRED (except for type=none). Name of secret in organization vault. The secret must be created first via POST /v1/organizations/secrets. Format: Use the exact secret_name from vault (e.g., 'stripe_api_key'). At runtime, the secret value is securely retrieved and decrypted. The decrypted value is then injected into the request per auth type. SECURITY: NEVER store actual credentials here - only the reference name. Examples: 'stripe_api_key', 'github_pat', 'weather_api_key'","examples":["stripe_api_key","github_pat","openai_api_key","weather_api_key"],"title":"Secret Ref"},"location":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL (for api_key type only). Where to inject the API key. Options: 'header' (recommended, more secure) or 'query' (less secure). Default: 'header' if not specified. Query parameters appear in URLs and logs - use headers when possible. Ignored for other auth types (bearer, basic, custom_header).","examples":["header","query"],"title":"Location"},"key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"REQUIRED (for api_key and custom_header types). Header name or query parameter name for authentication. For api_key with location=header: Header name like 'X-API-Key', 'Authorization'. For api_key with location=query: Query param like 'apikey', 'api_key', 'key'. For custom_header: Any custom header name like 'X-Custom-Auth', 'X-Token'. Ignored for bearer and basic types (use standard headers). Common patterns: 'X-API-Key', 'Authorization', 'X-Auth-Token'","examples":["X-API-Key","Authorization","apikey","X-Custom-Auth"],"title":"Key"}},"title":"AuthConfig","type":"object"},"AuthType":{"description":"Authentication type for API calls.\n\nDefines how credentials from the organization secrets vault should be\ninjected into HTTP requests for authentication with external APIs.\n\nValues:\n    NONE: No authentication (public APIs)\n\n    API_KEY: API key in header or query parameter\n        - Use for APIs that require API keys (Weather API, Maps, etc.)\n        - Header location recommended for security\n        - Requires: secret_ref, key, location\n        - Example: X-API-Key: abc123\n\n    BEARER: Bearer token authentication (OAuth 2.0, JWT)\n        - Use for APIs using Bearer tokens (GitHub, OpenAI, most modern APIs)\n        - Adds: Authorization: Bearer {secret_value}\n        - Requires: secret_ref\n        - Example: Authorization: Bearer ghp_abc123\n\n    BASIC: HTTP Basic authentication\n        - Use for APIs using Basic Auth (legacy systems)\n        - Secret format: \"username:password\"\n        - Adds: Authorization: Basic {base64(username:password)}\n        - Requires: secret_ref\n        - Example: Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=\n\n    CUSTOM_HEADER: Custom header with arbitrary name\n        - Use for APIs with non-standard auth headers\n        - Adds: {key}: {secret_value}\n        - Requires: secret_ref, key\n        - Example: X-Custom-Auth: token123\n\nExamples:\n    - Bearer token for GitHub API: type=\"bearer\"\n    - API key for Weather API: type=\"api_key\", location=\"query\"\n    - Basic auth for legacy API: type=\"basic\"\n    - Custom header: type=\"custom_header\", key=\"X-Custom-Token\"","enum":["none","api_key","bearer","basic","custom_header"],"title":"AuthType","type":"string"},"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"ErrorHandling":{"description":"Error handling strategy for API call failures.\n\nDefines what happens when an API call fails (network error, timeout,\nauthentication failure, etc.). Choose based on whether failed enrichment\nshould be fatal or gracefully handled.\n\nValues:\n    SKIP: Skip document enrichment, keep document in results\n        - Document remains unchanged in the pipeline\n        - Best for optional enrichment\n        - Use when: Enrichment is nice-to-have but not critical\n        - Example: Adding weather data to locations (keep doc if API fails)\n\n    REMOVE: Remove document from results entirely\n        - Document is filtered out of pipeline\n        - Best when enrichment is mandatory\n        - Use when: Document is useless without enrichment\n        - Example: Must have Stripe billing data to proceed\n\n    RAISE: Raise exception and fail entire pipeline\n        - Stops pipeline execution immediately\n        - Best for debugging or critical failures\n        - Use when: Want to catch and fix configuration issues\n        - Example: Development/testing to catch errors early\n\nExamples:\n    - Optional weather enrichment: SKIP\n    - Mandatory Stripe billing: REMOVE\n    - Development/debugging: RAISE","enum":["skip","remove","raise"],"title":"ErrorHandling","type":"string"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"},"RateLimitConfig":{"description":"Rate limiting configuration.","properties":{"requests_per_minute":{"anyOf":[{"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"Maximum requests per minute per domain","title":"Requests Per Minute"},"requests_per_hour":{"anyOf":[{"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"Maximum requests per hour per domain","title":"Requests Per Hour"}},"title":"RateLimitConfig","type":"object"}},"description":"Configuration for API call enrichment stage.\n\n**Stage Category**: ENRICH (1-1 Enrichment)\n\n**⚠️ CRITICAL SECURITY WARNINGS ⚠️**:\n\n1. **SSRF Risk**: This stage makes external HTTP requests which can be exploited\n   for Server-Side Request Forgery attacks. ALWAYS use `allowed_domains` allowlist.\n\n2. **Data Exfiltration**: Malicious configurations could send internal data to\n   external endpoints. Audit all configurations before deployment.\n\n3. **Credential Safety**: NEVER store credentials directly in configuration.\n   Always use `auth.secret_ref` to reference vault-stored credentials.\n\n4. **Rate Limiting**: Set rate limits to prevent abuse and excessive costs.\n\n5. **Domain Allowlist**: REQUIRED. Explicitly list allowed domains. Never use \"*\".\n\n**Transformation**: N documents → N documents (same count, expanded schema)\n\n**Purpose**: Enriches documents by calling external HTTP APIs. Enables integration\nwith third-party services (Stripe, GitHub, weather APIs, etc.) to augment documents\nwith real-time data. Due to security risks, this stage implements strict controls\nincluding domain allowlisting, SSRF protection, rate limiting, and secure credential\nmanagement.\n\n**When to Use**:\n    - Enrich documents with data from external APIs\n    - Integrate third-party services (Stripe, GitHub, Salesforce)\n    - Fetch real-time data (weather, stocks, currency rates)\n    - Validate data against external systems\n    - Lookup additional context from APIs\n\n**When NOT to Use**:\n    - For untrusted/user-provided URLs (major security risk)\n    - When API credentials can't be securely stored\n    - For high-volume enrichment (rate limits apply)\n    - When response time is critical (network latency)\n    - For internal-only APIs behind firewalls\n\nRequirements:\n    - url: REQUIRED, API endpoint URL (supports templates)\n    - allowed_domains: REQUIRED, domain allowlist (NEVER use \"*\")\n    - method: OPTIONAL, HTTP method (default: GET)\n    - auth: OPTIONAL, authentication configuration\n    - headers: OPTIONAL, additional headers\n    - body: OPTIONAL, request body (for POST/PUT)\n    - output_field: REQUIRED, where to store response\n    - timeout: OPTIONAL, request timeout (default: 10s)\n    - max_response_size: OPTIONAL, max response size (default: 10MB)\n    - when: OPTIONAL, conditional enrichment filter\n    - on_error: OPTIONAL, error handling (skip/remove/raise)\n\nUse Cases:\n    - Stripe customer lookup: Enrich with billing data\n    - GitHub repo info: Fetch commit stats\n    - Weather API: Add location-based weather\n    - Currency conversion: Real-time exchange rates\n    - Address validation: Verify and standardize addresses","examples":[{"allowed_domains":["httpbin.org"],"description":"Simple GET request to public API (no auth required)","method":"GET","output_field":"metadata.api_response","url":"https://httpbin.org/get"},{"allowed_domains":["httpbin.org"],"body":{"document_id":"{DOC.document_id}","query":"{INPUT.query}"},"description":"POST request to httpbin with JSON body","headers":{"Content-Type":"application/json"},"method":"POST","output_field":"metadata.api_response","url":"https://httpbin.org/post"},{"allowed_domains":["api.stripe.com"],"auth":{"secret_ref":"stripe_api_key","type":"bearer"},"description":"Stripe customer lookup with bearer auth","method":"GET","output_field":"metadata.stripe_data","timeout":10,"url":"https://api.stripe.com/v1/customers/{DOC.metadata.stripe_id}"},{"allowed_domains":["api.github.com"],"description":"GitHub repo info (public API)","method":"GET","output_field":"metadata.github_info","response_path":"$.stargazers_count","url":"https://api.github.com/repos/{INPUT.owner}/{INPUT.repo}"}],"properties":{"url":{"default":"https://httpbin.org/get","description":"API endpoint URL to call. Supports template variables: {INPUT.field}, {DOC.field}. Must be HTTP/HTTPS. Domain must be in allowed_domains list. Default uses httpbin.org for testing. Examples: 'https://api.stripe.com/v1/customers/{DOC.metadata.customer_id}'","examples":["https://httpbin.org/get","https://httpbin.org/post","https://api.stripe.com/v1/customers/{DOC.metadata.customer_id}","https://api.github.com/repos/{INPUT.owner}/{INPUT.repo}"],"title":"Url","type":"string"},"allowed_domains":{"default":["httpbin.org"],"description":"Allowlist of domains that can be called. CRITICAL FOR SECURITY - prevents SSRF attacks. Supports wildcards: '*.example.com' matches subdomains. NEVER use '*' (all domains) in production. Default allows httpbin.org for testing. Examples: ['api.stripe.com', '*.github.com', 'api.weatherapi.com']","examples":[["httpbin.org"],["api.stripe.com"],["api.github.com","raw.githubusercontent.com"]],"items":{"type":"string"},"minItems":1,"title":"Allowed Domains","type":"array"},"method":{"default":"GET","description":"HTTP method: GET, POST, PUT, PATCH, DELETE. Default: GET.","examples":["GET","POST","PUT"],"title":"Method","type":"string"},"auth":{"anyOf":[{"$ref":"#/$defs/AuthConfig"},{"type":"null"}],"default":null,"description":"OPTIONAL. Authentication configuration. Uses organization vault for credential storage. See AuthConfig for details."},"headers":{"additionalProperties":{"type":"string"},"description":"OPTIONAL. Additional HTTP headers to include. Do NOT include authentication headers here - use 'auth' field. Supports template variables in values. Example: {'Content-Type': 'application/json', 'X-Custom': '{{INPUT.value}}'}","examples":[{"Content-Type":"application/json"},{"Accept":"application/json","X-Request-ID":"{{DOC.document_id}}"}],"title":"Headers","type":"object"},"body":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"OPTIONAL. Request body (for POST/PUT/PATCH). Serialized as JSON. Supports template variables in values. Only used for non-GET requests.","examples":[{"limit":10,"query":"{{INPUT.search}}"},{"action":"update","customer_id":"{{DOC.metadata.id}}"}],"title":"Body"},"output_field":{"default":"api_response","description":"Dot-path where API response should be stored. Creates nested structure if needed. Response stored as-is (JSON object/array/primitive). Example: 'metadata.api_data'","examples":["metadata.api_response","enrichment.external_data","api_result"],"title":"Output Field","type":"string"},"timeout":{"default":10,"description":"Request timeout in seconds. Range: 1-60. Default: 10.","maximum":60,"minimum":1,"title":"Timeout","type":"integer"},"max_response_size":{"default":10485760,"description":"Maximum response size in bytes. Prevents memory exhaustion. Default: 10MB.","minimum":1024,"title":"Max Response Size","type":"integer"},"rate_limit":{"anyOf":[{"$ref":"#/$defs/RateLimitConfig"},{"type":"null"}],"default":null,"description":"OPTIONAL. Rate limiting configuration per domain."},"response_path":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. JSONPath expression to extract specific field from response. If not specified, stores entire response. Examples: '$.data', '$.results[0]', '$.customer.email'","examples":["$.data","$.results[*]","$.customer.email"],"title":"Response Path"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"OPTIONAL. Conditional filter for selective enrichment. Only documents matching condition will call API. RECOMMENDED for cost/performance optimization."},"on_error":{"$ref":"#/$defs/ErrorHandling","default":"skip","description":"Error handling strategy: 'skip': Pass document through unchanged. 'remove': Remove failed documents. 'raise': Halt pipeline on error. Default: 'skip'."}},"title":"APICallConfig","type":"object"}},{"stage_id":"cross_compare","description":"","category":"apply","icon":"box","parameter_schema":null},{"stage_id":"json_transform","description":"","category":"apply","icon":"code","parameter_schema":{"additionalProperties":true,"description":"Configuration for JSON Jinja template transformation stage.\n\nStage Category: APPLY (1-1 transformation)\n\nTransformation: N documents → N documents (or fewer with fail_on_error=False)\n\nPurpose: Applies a Jinja2 template to each document in the retrieval pipeline,\nrendering the template with full document context and replacing the document\nwith the parsed JSON output. Useful for reformatting documents to match\nexternal API schemas or restructuring data for downstream consumers.\n\nPerformance: Template rendering is fast (<1ms per document). No caching is\nimplemented as re-rendering is faster than cache overhead for typical\ndocument sizes. Processes documents sequentially with error handling.\n\nWhen to Use:\n    - Reformat documents for external API calls (webhooks, workflows)\n    - Rename or reorganize document fields for client consumption\n    - Drop unnecessary properties to reduce response size\n    - Expand or flatten nested arrays/objects\n    - Apply conditional field inclusion based on document values\n    - Create custom response schemas from standard document format\n\nWhen NOT to Use:\n    - For filtering documents (use FILTER stages: structured_filter, llm_filter)\n    - For sorting documents (use SORT stages: sort_relevance, rerank)\n    - For enriching with new data (use APPLY stages: document_enrich)\n    - For joining external data (use APPLY 1-N stages: taxonomy_enrich)\n\nTemplate Context:\n    Templates have access to the full retriever template context:\n    - DOC (or doc): Current document fields and metadata\n    - INPUT (or input/inputs): Original query inputs from the search request\n    - CONTEXT (or context): Execution context (namespace_id, internal_id, etc.)\n    - STAGE (or stage): Current stage execution data and metadata\n\nTemplate Features:\n    - Standard Jinja2 syntax with all built-in filters (tojson, length, etc.)\n    - Conditional logic ({% if %}, {% elif %}, {% else %})\n    - Loops and iteration ({% for item in items %})\n    - Variable access with dot notation (DOC.metadata.field)\n    - JSON filters for proper escaping ({{ value | tojson }})\n\nError Handling:\n    Documents that fail template rendering or JSON parsing can either\n    skip with a warning (default) or fail the entire retrieval pipeline.\n    Failed documents are tracked in stage metadata for observability.\n\nOperational Behavior:\n    - Fast stage: runs in API layer (no Engine delegation)\n    - Sequential processing: documents transformed one at a time\n    - Error isolation: one document failure doesn't affect others (unless fail_on_error=True)\n    - Schema replacement: output schema completely defined by template\n    - Reports metrics to ClickHouse for performance monitoring\n\nCommon Pipeline Position: FILTER → SORT → APPLY (this stage)\n\nRequirements:\n    - template: REQUIRED\n    - fail_on_error: OPTIONAL (defaults to False)\n\nUse Cases:\n    - External API formatting: Format documents for webhook payloads\n    - Response optimization: Remove unused fields to reduce bandwidth\n    - Schema adaptation: Convert internal format to client-specific format\n    - Conditional outputs: Include fields based on document properties\n    - Array flattening: Transform nested structures to flat arrays\n\nExamples:\n    Basic field selection and renaming:\n    >>> template = '{\"id\": \"{{ DOC.document_id }}\", \"text\": \"{{ DOC.content }}\"}'\n\n    Conditional field inclusion for external API:\n    >>> template = '''\n    ... {\n    ...   \"workflow\": \"{{ DOC.workflow_name }}\",\n    ...   \"inputs\": [\n    ...     {\"name\": \"variant_id\", \"value\": \"{{ DOC.variant_id }}\"}\n    ...     {% if DOC.asset_type == \"VIDEO\" %},\n    ...     {\"name\": \"video\", \"value\": {\"src\": \"{{ DOC.asset_url }}\"}}\n    ...     {% endif %}\n    ...   ]\n    ... }\n    ... '''\n\n    Array expansion with iteration:\n    >>> template = '''\n    ... {\n    ...   \"items\": [\n    ...     {% for item in DOC.tags %}\n    ...     \"{{ item }}\"{% if not loop.last %},{% endif %}\n    ...     {% endfor %}\n    ...   ]\n    ... }\n    ... '''\n\n    Nested field access and JSON escaping:\n    >>> template = '{\"user\": \"{{ DOC.metadata.user_id }}\", \"data\": {{ DOC.raw_data | tojson }}}'","examples":[{"description":"Simple field selection and renaming for API response","fail_on_error":false,"template":"{\"id\": \"{{ DOC.document_id }}\", \"content\": \"{{ DOC.text }}\", \"score\": {{ DOC.score }}}"},{"description":"Conditional field inclusion for external workflow API","fail_on_error":false,"template":"{\"workflow_name\": \"process-asset\", \"inputs\": [{\"name\": \"id\", \"value\": \"{{ DOC.id }}\"}{% if DOC.asset_type == \"VIDEO\" %}, {\"name\": \"video\", \"value\": {\"src\": \"{{ DOC.url }}\"}}{% endif %}]}"},{"description":"Array expansion with iteration and comma handling","fail_on_error":false,"template":"{\"title\": \"{{ DOC.title }}\", \"tags\": [{% for tag in DOC.tags %}\"{{ tag }}\"{% if not loop.last %}, {% endif %}{% endfor %}]}"},{"description":"Nested field access with JSON escaping","fail_on_error":false,"template":"{\"user_id\": \"{{ DOC.metadata.user_id }}\", \"category\": \"{{ DOC.metadata.category }}\", \"raw_data\": {{ DOC.metadata.raw | tojson }}}"},{"description":"Strict transformation for critical API integration (fail on any error)","fail_on_error":true,"template":"{\"required_field\": \"{{ DOC.must_exist }}\", \"value\": {{ DOC.number }}, \"timestamp\": \"{{ DOC.created_at }}\"}"},{"description":"Drop unnecessary fields to reduce response size","fail_on_error":false,"template":"{\"id\": \"{{ DOC.document_id }}\", \"title\": \"{{ DOC.title }}\", \"url\": \"{{ DOC.url }}\"}"},{"description":"Flatten nested metadata structure","fail_on_error":false,"template":"{\"doc_id\": \"{{ DOC.document_id }}\", \"user_id\": \"{{ DOC.metadata.user_id }}\", \"category\": \"{{ DOC.metadata.category }}\", \"score\": {{ DOC.score }}}"}],"properties":{"template":{"default":"{\"id\": \"{{ DOC.document_id }}\", \"content\": {{ DOC.content | tojson }}, \"score\": {{ DOC.score }}}","description":"Jinja2 template string that must render to valid JSON. The template has access to full retriever context: - DOC (or doc): Current document fields and metadata - INPUT (or input/inputs): Original query inputs from search request - CONTEXT (or context): Execution context (namespace_id, internal_id) - STAGE (or stage): Current stage execution data Both uppercase and lowercase namespace formats work identically (DOC == doc). The template must produce valid JSON syntax when rendered - invalid JSON will cause document to be skipped (unless fail_on_error=True). Supports all Jinja2 features: conditionals ({% if %}), loops ({% for %}), filters (| tojson, | length), variable access (DOC.metadata.field). Common patterns: - Field selection: {'id': '{{ DOC.document_id }}'} - Conditional inclusion: {% if DOC.type == 'video' %}...{% endif %} - Array iteration: {% for item in DOC.tags %}...{% endfor %} - JSON escaping: {{ DOC.data | tojson }} Use cases: API formatting, field renaming, property filtering, structure flattening.","examples":["{\"id\": \"{{ DOC.document_id }}\", \"content\": \"{{ DOC.text }}\", \"score\": {{ DOC.score }}}","{\"query\": \"{{ INPUT.query }}\", \"results\": [{% for item in DOC.items %}\"{{ item }}\"{% if not loop.last %}, {% endif %}{% endfor %}]}","{\"status\": \"{{ DOC.status }}\"{% if DOC.metadata %}, \"meta\": {{ DOC.metadata | tojson }}{% endif %}}","{\"workflow\": \"{{ DOC.workflow_name }}\", \"inputs\": [{\"name\": \"id\", \"value\": \"{{ DOC.id }}\"}]}"],"minLength":1,"title":"Template","type":"string"},"fail_on_error":{"default":false,"description":"OPTIONAL. Whether to fail the entire retrieval pipeline if any document transformation fails. Default: False. False (default): Skip failed documents with warning logged, continue processing remaining documents. Failed documents are tracked in stage metadata for observability and debugging. Use for lenient pipelines where partial results are acceptable (e.g., best-effort reformatting). True: Fail entire retrieval on first transformation error. Use for strict pipelines where all documents must transform successfully (e.g., critical API integrations where incomplete data would cause downstream failures). Failure causes: invalid template syntax, template rendering errors (missing fields), invalid JSON output from template, document missing required fields. Typical values: False for public APIs (resilient), True for internal workflows (data integrity critical).","examples":[false,true],"title":"Fail On Error","type":"boolean"}},"title":"JsonTransformParameters","type":"object"}},{"stage_id":"rag_prepare","description":"Prepare documents for LLM context windows with token management","category":"apply","icon":"file-text","parameter_schema":{"$defs":{"CitationConfig":{"description":"Configuration for citation/source tracking in RAG output.\n\nCitations help users trace information back to source documents\nand are essential for attribution in RAG applications.","properties":{"style":{"default":"numbered","description":"Citation style to use:\n- 'numbered': [1], [2], [3]\n- 'bracketed': [doc_id]\n- 'footnote': Superscript numbers\n- 'none': No citations","enum":["numbered","bracketed","footnote","none"],"title":"Style","type":"string"},"include_title":{"default":true,"description":"Include document title in citation metadata.","title":"Include Title","type":"boolean"},"include_url":{"default":false,"description":"Include source URL in citation metadata if available.","title":"Include Url","type":"boolean"}},"title":"CitationConfig","type":"object"}},"description":"Configuration for RAG context preparation.\n\n**Stage Category**: APPLY\n\n**Transformation**: N documents → 1 context document (single_context mode)\n                   OR N documents → N formatted documents (formatted_list mode)\n\n**Purpose**: Prepare search results for LLM consumption by formatting documents,\nmanaging token budgets, and adding citations. This is a preparation stage that\ndoes NOT call an LLM - it prepares content for downstream LLM stages.\n\n**When to Use**:\n    - Before passing search results to an LLM for RAG\n    - When you need to fit multiple documents into a token budget\n    - When you need citation tracking for source attribution\n    - When you need consistent document formatting\n\n**When NOT to Use**:\n    - When you want the LLM to generate a summary (use summarize stage)\n    - When you don't need token management\n    - For simple pass-through of documents\n\n**Output Modes**:\n    - `single_context`: Combines all documents into one context string\n    - `formatted_list`: Returns individually formatted documents\n\n**Common Pipeline Position**: feature_search → rerank → rag_prepare → (external LLM call)\n\nExamples:\n    Basic context preparation:\n        ```json\n        {\n            \"max_tokens\": 8000,\n            \"output_mode\": \"single_context\"\n        }\n        ```\n\n    Custom document formatting with citations:\n        ```json\n        {\n            \"max_tokens\": 4000,\n            \"document_template\": \"[{{CONTEXT.INDEX}}] {{DOC.metadata.title}}\\n{{DOC.content}}\\n\",\n            \"citation\": {\"style\": \"numbered\", \"include_title\": true}\n        }\n        ```\n\n    Formatted list for custom processing:\n        ```json\n        {\n            \"output_mode\": \"formatted_list\",\n            \"document_template\": \"Source: {{DOC.metadata.source}}\\n{{DOC.content}}\"\n        }\n        ```","examples":[{"description":"Basic RAG context preparation (simplest usage)","max_tokens":8000,"output_mode":"single_context"},{"citation":{"include_title":true,"style":"numbered"},"description":"Custom formatting with numbered citations","document_template":"[{{CONTEXT.INDEX}}] {{DOC.metadata.title}}\n{{DOC.content}}\n\n","max_tokens":4000,"truncation_strategy":"priority_truncate"},{"description":"Formatted list for custom downstream processing","document_template":"Source: {{DOC.metadata.source}}\n{{DOC.content}}","max_tokens":16000,"output_mode":"formatted_list"},{"description":"High token budget for long context models","max_tokens":32000,"tokenizer":"cl100k_base","truncation_strategy":"proportional"}],"properties":{"max_tokens":{"default":8000,"description":"OPTIONAL. Maximum tokens for the combined context output. Documents exceeding this limit are handled by truncation_strategy. Default: 8000 (safe for most models).","examples":[4000,8000,16000,32000],"maximum":128000,"minimum":100,"title":"Max Tokens","type":"integer"},"tokenizer":{"default":"cl100k_base","description":"OPTIONAL. Tokenizer to use for token counting. Default: 'cl100k_base' (GPT-4/GPT-3.5 tokenizer). Options: 'cl100k_base', 'p50k_base', 'r50k_base', 'gpt2'","examples":["cl100k_base","p50k_base","gpt2"],"title":"Tokenizer","type":"string"},"truncation_strategy":{"default":"priority_truncate","description":"OPTIONAL. How to handle documents exceeding max_tokens:\n- 'priority_truncate': Include docs in score order, truncate last to fit\n- 'proportional': Give each doc proportional token budget based on count\n- 'drop_last': Include complete docs until limit, drop remaining","enum":["priority_truncate","proportional","drop_last"],"examples":["priority_truncate","proportional","drop_last"],"title":"Truncation Strategy","type":"string"},"output_mode":{"default":"single_context","description":"OPTIONAL. Output format:\n- 'single_context': One document with combined 'context' string + 'citations'\n- 'formatted_list': N documents with 'formatted_content' field each","enum":["single_context","formatted_list"],"examples":["single_context","formatted_list"],"title":"Output Mode","type":"string"},"document_template":{"default":"[{{CONTEXT.INDEX}}] {{DOC.content}}\n\n","description":"OPTIONAL. Template for formatting each document. Available placeholders:\n- {{CONTEXT.INDEX}}: 1-based position in result set (1, 2, 3...)\n- {{CONTEXT.CITATION}}: Citation marker based on citation.style\n- {{DOC.*}}: Any document field (e.g., {{DOC.content}}, {{DOC.metadata.title}})","examples":["[{{CONTEXT.INDEX}}] {{DOC.content}}\n\n","{{CONTEXT.CITATION}} {{DOC.metadata.title}}\n{{DOC.content}}\n---\n","Document {{CONTEXT.INDEX}}:\nTitle: {{DOC.metadata.title}}\n{{DOC.content}}\n\n"],"title":"Document Template","type":"string"},"content_field":{"default":"content","description":"Primary field to extract content from each document.","examples":["content","text","body","metadata.description"],"title":"Content Field","type":"string"},"separator":{"default":"\n","description":"Separator between documents in single_context mode.","title":"Separator","type":"string"},"citation":{"$ref":"#/$defs/CitationConfig","description":"Citation configuration for source tracking."},"context_field":{"default":"rag_context","description":"Field name for combined context (single_context mode).","title":"Context Field","type":"string"},"citations_field":{"default":"citations","description":"Field name for citation metadata.","title":"Citations Field","type":"string"},"formatted_content_field":{"default":"formatted_content","description":"Field name for formatted content (formatted_list mode).","title":"Formatted Content Field","type":"string"}},"title":"RAGPrepareStageConfig","type":"object"}},{"stage_id":"external_reverse_image","description":"Reverse-image search the open web via SerpAPI Google Lens","category":"apply","icon":"image","parameter_schema":{"description":"Reverse-image search the OPEN WEB via an external provider (SerpAPI\nGoogle Lens), returning external visual matches alongside internal results.\n\n**Stage Category**: APPLY (adds external web matches to the working set)\n\n**Transformation**: N documents → N + M documents (M external web matches\nappended, each flagged ``metadata.source = \"web\"`` so they are always\ndistinguishable from internal MVS hits)\n\n**Purpose**: Mixpeek has no open-web image index, so \"does this image appear\nanywhere on the public web / stock libraries?\" cannot be answered by an\ninternal ``feature_search``. This stage takes the QUERY image URL, runs a\nreverse-image lookup against an external provider, and appends the matches\nto the pipeline results. The canonical use is an originality / plagiarism\ncheck: run ``feature_search`` (internal archive + competitor SigLIP\nmatches) then this stage (external web matches) so one execute returns both.\n\n**When to Use**:\n    - Plagiarism / originality: has this design leaked to the open web?\n    - Stock-library / marketplace reuse detection\n    - Any retriever that needs external reverse-image matches next to\n      internal similarity hits\n\n**When NOT to Use**:\n    - Reverse-image against your OWN collections (use ``feature_search``\n      with ``input_mode: content``)\n    - Text web search (use the ``external_web_search`` / Exa stage)\n\n**Operational Behavior**:\n    - PRESERVES all incoming documents (internal matches) and APPENDS the\n      external web matches — never replaces the internal results.\n    - Makes ONE external HTTP call for the query image (not per document).\n    - A no-results lookup is a SUCCESS with zero external matches.\n    - Recommended to wire with ``on_error: \"skip\"`` so a provider outage\n      degrades to internal-matches-only (the optional-stage circuit breaker\n      then also stops calling a down provider) rather than failing the query.\n\n**Output Schema**: appends ``DocumentResult`` items with:\n    - ``metadata.source``: ``\"web\"`` (the internal/external discriminator)\n    - ``metadata.provider``: ``\"serpapi\"`` · ``metadata.engine``: ``\"google_lens\"``\n    - ``metadata.title`` / ``link`` / ``source_name`` / ``source_icon`` /\n      ``thumbnail`` / ``rank``\n  plus stage ``metadata``: ``web_matches`` (count), ``query_image``, ``engine``.\n\nRequirements:\n    - image_url: REQUIRED (templated), the query image URL to reverse-search\n    - provider: OPTIONAL, only ``\"serpapi\"`` today (default)\n    - engine: OPTIONAL, only ``\"google_lens\"`` today (default)\n    - top_k: OPTIONAL, max external matches (default 20, max 100)","examples":[{"description":"Reverse-image the query image on the open web","engine":"google_lens","image_url":"{{INPUT.image_url}}","provider":"serpapi","top_k":20}],"properties":{"image_url":{"default":"{{INPUT.image_url}}","description":"Template for the query image URL to reverse-search on the open web. Supports template variables: {{INPUT.field}} for query inputs (e.g. {{INPUT.image_url}}), {{DOC.field}} for document fields. Must resolve to a publicly-fetchable http(s) URL the provider can load (a private/localhost URL will yield no matches).","examples":["{{INPUT.image_url}}","{{INPUT.query}}"],"title":"Image Url","type":"string"},"provider":{"const":"serpapi","default":"serpapi","description":"External reverse-image provider. Only 'serpapi' today.","title":"Provider","type":"string"},"engine":{"const":"google_lens","default":"google_lens","description":"Provider engine. Only 'google_lens' today.","title":"Engine","type":"string"},"top_k":{"default":20,"description":"OPTIONAL. Maximum number of external web matches to append. Between 1 and 100 (default 20).","examples":[10,20,50],"maximum":100,"minimum":1,"title":"Top K","type":"integer"}},"title":"ReverseImageSearchConfig","type":"object"}},{"stage_id":"sql_lookup","description":"Enrich documents with SQL query results from external databases","category":"apply","icon":"database","parameter_schema":{"$defs":{"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"ErrorHandling":{"description":"Error handling strategy for SQL lookup failures.\n\nValues:\n    SKIP: Skip document enrichment, keep document unchanged\n        - Document remains in pipeline without enrichment\n        - Best for optional lookups\n\n    REMOVE: Remove document from results\n        - Document filtered out of pipeline\n        - Best when enrichment is mandatory\n\n    RAISE: Raise exception and fail pipeline\n        - Stops execution immediately\n        - Best for debugging or critical failures","enum":["skip","remove","raise"],"title":"ErrorHandling","type":"string"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"},"OnNoResults":{"description":"Behavior when SQL query returns no rows.\n\nValues:\n    SKIP: Keep document unchanged, do not set output_field\n        - Document passes through without enrichment\n        - Best for optional lookups\n\n    NULL: Set output_field to null\n        - Explicitly marks no result found\n        - Useful for downstream conditional logic\n\n    ERROR: Raise error and fail enrichment\n        - Strict mode for mandatory data\n        - Use with on_error to control pipeline behavior","enum":["skip","null","error"],"title":"OnNoResults","type":"string"},"ResultHandling":{"description":"How to handle multiple result rows from SQL query.\n\nValues:\n    FIRST: Return only the first row (most common for lookups)\n        - Use when expecting single result (lookup by primary key)\n        - Result stored as object\n\n    ALL: Return all rows as array\n        - Use when query may return multiple matches\n        - Result stored as array of objects\n\n    ERROR_IF_EMPTY: Raise error if no rows returned\n        - Use when result is mandatory\n        - Fails enrichment if query returns nothing\n\n    ERROR_IF_MULTIPLE: Raise error if more than one row returned\n        - Use for strict single-result lookups\n        - Validates uniqueness constraint at runtime","enum":["first","all","error_if_empty","error_if_multiple"],"title":"ResultHandling","type":"string"}},"description":"Configuration for SQL Lookup enrichment stage.\n\n**Stage Category**: APPLY (1-1 Enrichment or 0-M Document Creation)\n\n**Transformation**:\n- With documents: N documents -> N documents (expanded schema)\n- Without documents: 0 documents -> M documents (from SQL results)\n\n**Purpose**: Enriches documents by running SQL queries against organization\nconnections (PostgreSQL, Snowflake). Enables joining structured database data\nwith document pipelines.\n\n**Security**:\n- Only SELECT queries allowed (no INSERT/UPDATE/DELETE/DROP)\n- Uses parameterized queries to prevent SQL injection\n- Connection credentials managed via organization connections\n- Query timeout enforced\n\n**Supported Providers**:\n- PostgreSQL: Full SQL support with $1, $2 placeholders\n- Snowflake: Full SQL support (coming soon)\n\n**When to Use**:\n- Lookup customer data from database by ID\n- Enrich products with inventory information\n- Join document metadata with relational data\n- Create documents from SQL query results\n\n**When NOT to Use**:\n- Complex analytics queries (use dedicated analytics tools)\n- High-volume batch operations (use ETL pipelines)\n- Real-time streaming (use event systems)\n\nRequirements:\n    - connection_id: REQUIRED, organization connection ID or name\n    - query: REQUIRED, SQL SELECT query\n    - parameters: OPTIONAL, named parameters for query\n    - output_field: OPTIONAL, where to store results\n    - result_handling: OPTIONAL, how to handle multiple rows\n    - timeout: OPTIONAL, query timeout (default: 30s)\n    - when: OPTIONAL, conditional enrichment filter\n    - on_error: OPTIONAL, error handling strategy","examples":[{"connection_id":"my_postgres","description":"Simple product lookup by SKU","output_field":"metadata.product_info","parameters":{"sku":"{{DOC.metadata.sku}}"},"query":"SELECT name, price, stock_quantity FROM products WHERE sku = $1","result_handling":"first"},{"connection_id":"conn_abc123","description":"Lookup customer orders (multiple results)","output_field":"metadata.recent_orders","parameters":{"customer_id":"{{DOC.customer_id}}"},"query":"SELECT order_id, total, status FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 10","result_handling":"all"},{"connection_id":"warehouse","description":"Create documents from SQL query (0->M mode)","parameters":{"region":"{{INPUT.target_region}}"},"query":"SELECT id, name, email, tier FROM customers WHERE region = $1","result_handling":"all"},{"connection_id":"my_postgres","description":"Conditional lookup with timeout","on_error":"skip","output_field":"metadata.preferences","parameters":{"user_id":"{{DOC.user_id}}"},"query":"SELECT * FROM user_preferences WHERE user_id = $1","timeout":10,"when":{"field":"metadata.needs_preferences","operator":"eq","value":true}}],"properties":{"connection_id":{"description":"REQUIRED. Organization connection ID (conn_...) or name. Connection must be SQL-capable (PostgreSQL or Snowflake). Use POST /v1/organizations/connections to create connections. Example: 'conn_abc123' or 'my_postgres_db'","examples":["conn_abc123","my_postgres_db","production_warehouse"],"title":"Connection Id","type":"string"},"query":{"description":"REQUIRED. SQL SELECT query to execute. Only SELECT queries allowed - mutations are blocked. Use $1, $2, etc. for PostgreSQL parameter placeholders. Supports template variables in 'parameters' field values. Example: 'SELECT * FROM products WHERE sku = $1'","examples":["SELECT * FROM products WHERE sku = $1","SELECT name, price, stock FROM inventory WHERE product_id = $1","SELECT * FROM customers WHERE region = $1 AND status = $2"],"title":"Query","type":"string"},"parameters":{"additionalProperties":true,"description":"OPTIONAL. Named parameters for SQL query. Keys are parameter names, values support template syntax. Parameters are converted to positional args ($1, $2) in order. Template variables: {{DOC.field}}, {{INPUT.field}}, {{SECRET.name}}. Example: {'sku': '{{DOC.metadata.sku}}'}","examples":[{"sku":"{{DOC.metadata.sku}}"},{"customer_id":"{{DOC.customer_id}}","status":"active"},{"region":"{{INPUT.target_region}}"}],"title":"Parameters","type":"object"},"output_field":{"default":"sql_result","description":"OPTIONAL. Dot-path where SQL results should be stored. Creates nested structure if needed. For FIRST mode: single object. For ALL mode: array of objects. Default: 'sql_result'. Note: Avoid using 'metadata.*' paths as the metadata field is excluded from API responses.","examples":["sql_result","product_info","enrichment.db_data"],"title":"Output Field","type":"string"},"result_handling":{"$ref":"#/$defs/ResultHandling","default":"first","description":"OPTIONAL. How to handle multiple result rows. 'first': Return only first row (default). 'all': Return all rows as array. 'error_if_empty': Error if no results. 'error_if_multiple': Error if more than one result. "},"on_no_results":{"$ref":"#/$defs/OnNoResults","default":"null","description":"OPTIONAL. Behavior when query returns no rows. 'skip': Keep document unchanged. 'null': Set output_field to null (default). 'error': Raise error."},"timeout":{"default":30,"description":"OPTIONAL. Query timeout in seconds. Range: 1-300. Default: 30. Increase for complex queries.","maximum":300,"minimum":1,"title":"Timeout","type":"integer"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"OPTIONAL. Conditional filter for selective enrichment. Only documents matching condition will execute SQL lookup. RECOMMENDED for performance optimization."},"on_error":{"$ref":"#/$defs/ErrorHandling","default":"skip","description":"OPTIONAL. Error handling strategy. 'skip': Pass document through unchanged (default). 'remove': Remove failed documents. 'raise': Halt pipeline on error."}},"required":["connection_id","query"],"title":"SQLLookupConfig","type":"object"}},{"stage_id":"traverse_edge","description":"Follow object edges (typed relationships) from each result to the linked documents — e.g. from an ad, traverse `uses_footage` edges to its raw footage, carrying edge attributes like clip_order and start_ticks.","category":"apply","icon":"git-branch","parameter_schema":{"additionalProperties":true,"description":"Parameters for the `traverse_edge` stage.\n\nFrom each document in the working set, follow its root-level ``edges`` of the\nconfigured type(s)/direction and return the LINKED documents as the new\nworking set (e.g. from an ad, follow `uses_footage` edges to its raw footage).","properties":{"edge_type":{"anyOf":[{"type":"string"},{"items":{"type":"string"},"type":"array"}],"description":"Edge type or types to follow (the customer's `edges[].type` vocabulary), e.g. 'used_in_ad' or ['uses_footage', 'b_roll_of'].","title":"Edge Type"},"direction":{"default":"any","description":"Which edge directions to follow: 'out', 'in', or 'any'. Reciprocal edges are stored on both endpoints, so 'any' (default) is usually correct; use 'out'/'in' to disambiguate same-type reciprocal edges.","title":"Direction","type":"string"},"target_collection_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Optional: only return traversed documents from this collection (else all documents derived from the linked objects). Overridden by an edge's own `target_collection_id` when that is set.","title":"Target Collection Id"},"include_source":{"default":false,"description":"If true, keep the originating (source) documents in the output alongside the traversed documents; else return only the linked docs.","title":"Include Source","type":"boolean"},"max_per_source":{"default":50,"description":"Max edges to follow per source document (guards fan-out).","minimum":1,"title":"Max Per Source","type":"integer"},"limit":{"default":200,"description":"Max total traversed documents to fetch across all sources.","minimum":1,"title":"Limit","type":"integer"}},"required":["edge_type"],"title":"TraverseEdgeConfig","type":"object"}},{"stage_id":"unwind","description":"Decompose array fields into separate documents (1→N expansion)","category":"apply","icon":"unfold-vertical","parameter_schema":{"description":"Configuration for the unwind (array decomposition) stage.\n\n**Stage Category**: APPLY (1-N transformation)\n\n**Transformation**: N documents → M documents (where M ≥ N, one per array element)\n\n**Purpose**: Decomposes an array field in each document into separate documents,\none per array element. The original document's non-array fields are preserved\nin each resulting document. Similar to MongoDB's $unwind, Snowflake's\nLATERAL FLATTEN, and Spark's explode().\n\n**When to Use**:\n    - Decompose multi-value fields (tags, categories, authors) into individual docs\n    - Flatten nested arrays (chapters, segments, frames) for per-element retrieval\n    - Expand grouped results back into individual items\n    - Prepare array data for per-element scoring, filtering, or enrichment\n\n**When NOT to Use**:\n    - For filtering documents (use FILTER stages)\n    - For sorting documents (use SORT stages)\n    - For restructuring document schema without expansion (use json_transform)\n    - When array fields don't exist (documents will be dropped or preserved based on config)\n\n**Operational Behavior**:\n    - Operates on in-memory document results (no database queries)\n    - Each input document with an array of length K produces K output documents\n    - Documents with null/empty arrays are handled by preserve_null_and_empty flag\n    - Original document fields are preserved alongside the unwound element\n    - Fast operation (simple in-memory expansion)\n\n**Common Pipeline Position**: FILTER → APPLY (this stage) → SORT/FILTER\n\nRequirements:\n    - field: REQUIRED, dot-notation path to the array field to unwind\n    - preserve_null_and_empty: OPTIONAL, keep docs with null/empty arrays\n    - include_array_index: OPTIONAL, add index field to output\n    - output_field: OPTIONAL, rename the unwound value field\n\nUse Cases:\n    - Tag expansion: Unwind tags array for per-tag frequency analysis\n    - Segment decomposition: Unwind video segments for individual scoring\n    - Author expansion: Unwind author list for per-author attribution\n    - Chunk flattening: Unwind text chunks for granular retrieval","examples":[{"description":"Unwind tags array into per-tag documents","field":"metadata.tags","preserve_null_and_empty":false},{"description":"Unwind segments with index tracking","field":"content.segments","include_array_index":"segment_index","preserve_null_and_empty":true},{"description":"Unwind authors into separate field","field":"metadata.authors","output_field":"current_author","preserve_null_and_empty":false},{"description":"Unwind chunks for granular processing","field":"chunks","include_array_index":"chunk_position","preserve_null_and_empty":false}],"properties":{"field":{"description":"REQUIRED. Dot-notation path to the array field to unwind. Each element of this array becomes a separate document. Supports nested paths (e.g., 'metadata.tags', 'content.segments'). If the field is not an array, the document is passed through unchanged. If the field doesn't exist, behavior depends on preserve_null_and_empty.","examples":["metadata.tags","content.segments","metadata.authors","chunks","metadata.categories"],"title":"Field","type":"string"},"preserve_null_and_empty":{"default":false,"description":"OPTIONAL. Whether to preserve documents where the array field is null, missing, or empty. False (default): Documents with null/missing/empty arrays are dropped. True: Documents with null/missing/empty arrays are kept with the field set to null. Similar to MongoDB's preserveNullAndEmptyArrays option.","examples":[false,true],"title":"Preserve Null And Empty","type":"boolean"},"include_array_index":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. If provided, adds a field with this name containing the array index (0-based) of the unwound element. Useful for maintaining order information after unwinding. Example: 'array_index' adds a field showing position in original array.","examples":["array_index","position","segment_index",null],"title":"Include Array Index"},"output_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. If provided, the unwound element is placed in this field instead of replacing the original array field. Useful when you want to keep the original field path and add the unwound value under a different name. If None, the unwound element replaces the array field at the same path.","examples":["unwound_value","current_tag","segment",null],"title":"Output Field"}},"required":["field"],"title":"UnwindStageConfig","type":"object"}},{"stage_id":"external_web_search","description":"Enrich documents with web search results using Exa AI","category":"apply","icon":"globe","parameter_schema":{"description":"Configuration for web search enrichment using Exa AI-native search.\n\n**Stage Category**: APPLY (Document Enrichment - enriches existing documents)\n\n**Transformation**: N documents → N documents (each enriched with web search results)\n\n**Purpose**: Enriches each document in the pipeline by adding AI-native web search\nresults from Exa's neural ranking system to the document's metadata. The query can\nbe templated per-document using {{DOC.*}}, {{INPUT.*}}, etc., enabling dynamic\nsearches based on document content. Results are automatically cached by query to\nminimize redundant API calls.\n\n**When to Use**:\n    - Enrich documents with real-time web context\n    - Add related articles/research to each document\n    - Augment internal data with external web sources\n    - Find competitive intelligence per product/company in documents\n    - Add news/updates related to document entities\n    - Research and discovery based on document content\n    - Context augmentation for RAG pipelines\n\n**When NOT to Use**:\n    - Searching within your own collections (use feature_search instead)\n    - Need full web page content (use web_scrape stage for that)\n    - Want to create NEW documents from web search (this enriches existing ones)\n    - No existing documents to enrich (pipeline must have documents first)\n\n**Operational Behavior**:\n    - ENRICHES existing documents (N → N operation, preserves all docs)\n    - Each document gets web search results added to metadata.web_search\n    - Query is templated per-document with {{DOC.*}} support\n    - Smart caching: identical queries share results (only 1 API call)\n    - Preserves all original document data (ID, collection, score, etc.)\n    - Makes external HTTP request to Exa API (cached per unique query)\n    - Fast operation: 100-500ms per unique query (not per document)\n\n**Common Pipeline Position**:\n    - feature_search → web_search (enrich search results with web context)\n    - feature_search → web_search → llm_filter (search, enrich, then filter)\n    - feature_search → web_search → web_scrape (enrich with URLs, then scrape)\n\n**Cost & Performance**:\n    - Moderate Cost: Exa API charges per unique query (caching reduces costs)\n    - Fast: 100-500ms per unique query, cached queries are instant\n    - Network dependent: requires external API call\n    - Static queries: 1 API call for all documents (highly efficient)\n    - Dynamic queries: 1 API call per unique rendered query\n\n**Output Schema**: Adds to each DocumentResult:\n    - metadata.web_search.query: Rendered query used for this document\n    - metadata.web_search.results: Array of web search results\n    - metadata.web_search.results[].url: Web page URL\n    - metadata.web_search.results[].title: Page title\n    - metadata.web_search.results[].text: Text snippet (if include_text=True)\n    - metadata.web_search.results[].published_date: Publication date\n    - metadata.web_search.results[].author: Author name\n    - metadata.web_search.results[].score: Exa relevance score\n    - metadata.web_search.results[].position: Result position (0-indexed)\n    - metadata.web_search.num_results: Count of results\n    - metadata.web_search.autoprompt_used: Whether autoprompt was enabled\n\nRequirements:\n    - query: REQUIRED, search query text (supports templates like {INPUT.query})\n    - num_results: OPTIONAL, number of results (default 10, max 100)\n    - use_autoprompt: OPTIONAL, use Exa's query enhancement (default True)\n    - start_published_date: OPTIONAL, filter by publication date\n    - category: OPTIONAL, filter by content type\n    - include_text: OPTIONAL, include text snippets (default True)\n\nUse Cases:\n    - RAG enhancement: Enrich documents with current web context before LLM\n    - Product research: Add competitor info to each product document\n    - News enrichment: Add latest news to company/entity documents\n    - Academic research: Add related papers to each research document\n    - Documentation augmentation: Add official docs/guides to each result\n    - Competitive intelligence: Enrich results with competitor mentions\n    - Fact verification: Add source citations from web to each claim\n\nExamples:\n    Static query enrichment (all documents get same web results):\n        ```json\n        {\n            \"query\": \"latest AI developments 2024\",\n            \"num_results\": 10,\n            \"include_text\": true\n        }\n        ```\n        Result: 1 API call total, all documents enriched with same 10 web results\n\n    Dynamic per-document enrichment (query varies by document):\n        ```json\n        {\n            \"query\": \"{{DOC.metadata.product_name}} reviews and comparisons\",\n            \"num_results\": 5,\n            \"include_text\": true\n        }\n        ```\n        Result: 1 API call per unique product name (automatically cached)\n\n    Hybrid query (combines input + document fields):\n        ```json\n        {\n            \"query\": \"{{INPUT.topic}} {{DOC.metadata.category}}\",\n            \"num_results\": 3,\n            \"start_published_date\": \"2024-01-01\"\n        }\n        ```\n        Result: Caching optimizes for documents with same topic+category combo\n\n    News enrichment with date filter:\n        ```json\n        {\n            \"query\": \"{{DOC.metadata.company_name}} latest news\",\n            \"num_results\": 5,\n            \"category\": \"news\",\n            \"start_published_date\": \"2024-11-01\"\n        }\n        ```\n        Result: Recent news added to each company document's metadata","examples":[{"description":"Basic web search with template variable","include_text":true,"num_results":10,"query":"{{INPUT.query}}","use_autoprompt":true},{"description":"News search with date filter","include_text":true,"num_results":5,"query":"AI developments","start_published_date":"2024-11-01","use_autoprompt":true},{"category":"research paper","description":"Research paper search","include_text":true,"num_results":20,"query":"neural network architectures","use_autoprompt":true},{"category":"github","description":"GitHub repository search","include_text":false,"num_results":15,"query":"python web scraping libraries","use_autoprompt":false},{"category":"news","description":"Company research with minimal results","include_text":true,"num_results":3,"query":"{{INPUT.company_name}} latest product launches","start_published_date":"2024-10-01"}],"properties":{"query":{"default":"{{INPUT.query}}","description":"Search query text for Exa AI search. Supports template variables: {{INPUT.field}} for query inputs, {{DOC.field}} for document fields in enrichment context. Exa uses neural ranking for semantic search, so natural language queries work well. Examples: 'machine learning tutorials', 'latest AI developments', '{{INPUT.user_query}}', 'news about {{DOC.metadata.company_name}}'","examples":["machine learning tutorials","latest developments in AI","{{INPUT.query}}","news about {{DOC.metadata.company}}"],"title":"Query","type":"string"},"num_results":{"default":10,"description":"OPTIONAL. Number of search results to return. Must be between 1 and 100. Default is 10. More results = higher API costs. Consider using lower values for faster responses and cost control.","examples":[5,10,20],"maximum":100,"minimum":1,"title":"Num Results","type":"integer"},"use_autoprompt":{"default":true,"description":"OPTIONAL. Enable Exa's autoprompt feature for query enhancement. When True, Exa optimizes the query for better search results. Default is True. Recommended for most use cases. Disable if you want exact query matching without enhancement.","title":"Use Autoprompt","type":"boolean"},"start_published_date":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Filter results to content published after this date. Format: YYYY-MM-DD (e.g., '2024-01-01'). When NOT specified, returns results from all dates. Useful for finding recent content, news, or time-sensitive information.","examples":["2024-01-01","2024-11-01","2023-06-15"],"title":"Start Published Date"},"category":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Filter results by content category. When NOT specified, searches across all categories. Common categories: 'research paper', 'news', 'github', 'tweet', 'company', 'pdf', 'personal site', 'blog'. Case-insensitive. Use for focused domain search.","examples":["research paper","news","github","tweet","blog","company"],"title":"Category"},"include_text":{"default":true,"description":"OPTIONAL. Include text snippets in search results. When True, each result includes a text preview (~200 words). Default is True. Disable to reduce API costs and response size. Text snippets are stored in metadata.text field of DocumentResult.","title":"Include Text","type":"boolean"}},"title":"WebSearchConfig","type":"object"}},{"stage_id":"agentic_enrich","description":"Enrich documents using a multi-turn reasoning agent with tool access for taxonomy lookup, example search, and content analysis","category":"enrich","icon":"brain-circuit","parameter_schema":{"$defs":{"AnalysisProviderConfig":{"description":"Configuration for the secondary LLM used for content analysis (e.g., Gemini for video).","properties":{"provider":{"$ref":"#/$defs/LLMProvider","default":"google","description":"LLM provider for content analysis. Defaults to Google (Gemini) for multimodal perception."},"model_name":{"default":"gemini-2.5-flash-lite","description":"Model name for content analysis.","title":"Model Name","type":"string"},"api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"BYOK API key for the analysis provider. Supports {{secrets.*}} template syntax.","title":"Api Key"}},"title":"AnalysisProviderConfig","type":"object"},"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"EscalationConfig":{"description":"Configuration for pro model escalation on borderline results.\n\nWhen self-consistency voting produces low agreement, re-evaluates with\na more capable model for a definitive result.","properties":{"provider":{"anyOf":[{"$ref":"#/$defs/LLMProvider"},{"type":"null"}],"default":null,"description":"LLM provider for escalation model. Defaults to same as main provider."},"model":{"default":"gemini-2.5-pro","description":"Model name for escalation. Typically a larger/more capable model.","title":"Model","type":"string"},"api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"BYOK API key for escalation model. Supports {{secrets.*}} template syntax.","title":"Api Key"},"trigger_agreement_below":{"default":0.7,"description":"Escalate when average agreement across criteria falls below this threshold.","maximum":1.0,"minimum":0.0,"title":"Trigger Agreement Below","type":"number"},"trigger_disagreements_above":{"default":3,"description":"Escalate when more than this many fields have disagreement.","minimum":1,"title":"Trigger Disagreements Above","type":"integer"}},"title":"EscalationConfig","type":"object"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LLMProvider":{"description":"Supported LLM providers.","enum":["openai","google","anthropic"],"title":"LLMProvider","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"},"SelfConsistencyConfig":{"description":"Configuration for self-consistency voting (N parallel LLM evaluations).\n\nRuns N independent LLM calls with varying temperatures and merges results\nvia majority voting per output field. Improves reliability by eliminating\nstochastic single-call failures (+10-15% accuracy).","properties":{"n":{"default":3,"description":"Number of parallel evaluations per document.","maximum":5,"minimum":2,"title":"N","type":"integer"},"temperatures":{"default":[0.0,0.3,0.3],"description":"Temperature for each evaluation. First call is deterministic (0.0), subsequent calls use slight variation for diversity. Length must match n.","items":{"type":"number"},"title":"Temperatures","type":"array"},"agreement_threshold":{"default":0.7,"description":"Minimum average agreement across all output fields. Below this threshold, triggers escalation (if configured).","maximum":1.0,"minimum":0.0,"title":"Agreement Threshold","type":"number"},"disagreement_max":{"default":3,"description":"Maximum number of fields with disagreement before triggering escalation. A disagreement is any field where not all N evaluators agree.","minimum":1,"title":"Disagreement Max","type":"integer"}},"title":"SelfConsistencyConfig","type":"object"}},"description":"Configuration for multi-turn agentic document enrichment.\n\nUses a reasoning agent (default: Claude) that can call tools to research\ntaxonomy definitions, query classified examples, and delegate perceptual\nanalysis to a secondary LLM (default: Gemini) before producing a structured\nclassification.\n\nTransformation Pattern (ENRICH category):\n- Input: N documents\n- Output: N documents (1-to-1 enrichment with agent-produced fields)","properties":{"provider":{"anyOf":[{"$ref":"#/$defs/LLMProvider"},{"type":"null"}],"default":null,"description":"LLM provider for the reasoning agent. Defaults to Anthropic (Claude)."},"model_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Model name for the reasoning agent. Defaults to claude-sonnet-4-5-20250929.","title":"Model Name"},"api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"BYOK API key for the reasoning agent. Supports {{secrets.*}} template syntax.","title":"Api Key"},"system_prompt":{"description":"System prompt for the reasoning agent. Supports {{INPUT.*}}, {{DOC.*}}, {{CONTEXT.*}} template variables.","title":"System Prompt","type":"string"},"output_schema":{"additionalProperties":true,"description":"JSON schema describing the structured output the agent must produce.","title":"Output Schema","type":"object"},"output_field":{"default":"classification","description":"Dot-path where the agent's structured output should be stored on each document.","title":"Output Field","type":"string"},"taxonomy_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Taxonomy ID to load via the get_taxonomy_categories tool. Enables the tool when set.","title":"Taxonomy Id"},"example_collection_ids":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Collection IDs to search for already-classified examples via the query_examples tool. Enables the tool when set.","title":"Example Collection Ids"},"analysis_provider":{"$ref":"#/$defs/AnalysisProviderConfig","description":"Configuration for the secondary LLM used by the analyze_content tool."},"enabled_tools":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Explicit list of tool names to enable. When None, tools are auto-enabled based on config (e.g., get_taxonomy_categories enabled when taxonomy_id is set).","title":"Enabled Tools"},"max_turns":{"default":8,"description":"Maximum agent reasoning turns per document.","maximum":20,"minimum":1,"title":"Max Turns","type":"integer"},"timeout_seconds":{"default":60.0,"description":"Maximum wall-clock seconds for the agent loop per document.","maximum":300.0,"minimum":5.0,"title":"Timeout Seconds","type":"number"},"temperature":{"default":0.0,"description":"Sampling temperature for the reasoning agent.","maximum":1.0,"minimum":0.0,"title":"Temperature","type":"number"},"self_consistency":{"anyOf":[{"$ref":"#/$defs/SelfConsistencyConfig"},{"type":"null"}],"default":null,"description":"Enable self-consistency voting: run N parallel LLM evaluations per document and merge via majority voting. Improves reliability at the cost of N× LLM calls."},"escalation":{"anyOf":[{"$ref":"#/$defs/EscalationConfig"},{"type":"null"}],"default":null,"description":"Enable pro model escalation for borderline results. When self-consistency agreement is low, re-evaluate with a more capable model."},"score_threshold":{"anyOf":[{"maximum":1.0,"minimum":0.0,"type":"number"},{"type":"null"}],"default":null,"description":"Fast rejection threshold. Documents with a similarity score below this value skip VLM evaluation and are auto-tagged as failed. Saves ~80%% of VLM calls for obviously non-compliant results. The score is read from the document's 'score' field (set by prior stages).","title":"Score Threshold"},"batch_size":{"default":1,"description":"Documents to process per batch (kept low due to multi-turn cost).","maximum":5,"minimum":1,"title":"Batch Size","type":"integer"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"Conditional filter: only enrich documents matching this condition. Documents not matching are passed through unchanged. CRITICAL FOR COST SAVINGS - agentic enrichment is 3-15 LLM calls per document."}},"required":["system_prompt","output_schema"],"title":"AgenticEnrichmentConfig","type":"object"}},{"stage_id":"classify","description":"Classify documents using a built-in task or a custom model","category":"apply","icon":"tags","parameter_schema":{"description":"Configuration for classifying documents.\n\nTwo modes of operation, mutually exclusive — set exactly one:\n\n1. Built-in task (``task``): use a Mixpeek-managed classifier. Currently\n   only ``\"nsfw\"`` (content-safety over text/image/video via the\n   ``mixpeek/content-classifier-v1`` model). This is the dropdown of\n   available classification tasks shown in Studio.\n2. Custom classifier (``feature_uri``): point at your own classifier\n   plugin. The plugin must accept {text: str} or {document: dict} and\n   return {labels: [{label, confidence}]}.\n\nCommon Pipeline:\n    feature_search (retrieve) → classify (label) → sort_relevance","examples":[{"description":"Classify documents with custom model","document_field":"content","feature_uri":"mixpeek://my_classifier@1.0.0/classify","min_confidence":0.5,"output_field":"classification","top_k_labels":3},{"description":"Flag/annotate NSFW content","document_field":"content","nsfw_threshold":0.7,"task":"nsfw"},{"description":"Drop NSFW results","drop_if_unsafe":true,"task":"nsfw"}],"properties":{"task":{"anyOf":[{"const":"nsfw","type":"string"},{"type":"null"}],"default":null,"description":"Built-in classification task. Available tasks: 'nsfw' — content-safety (text/image/video) via the mixpeek/content-classifier-v1 model. Leave unset to use a custom classifier via feature_uri.","examples":["nsfw"],"title":"Task"},"feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Feature URI of a custom classifier plugin. Required unless `task` is set. Format: mixpeek://plugin_name@version/feature_name","examples":["mixpeek://my_classifier@1.0.0/classify"],"title":"Feature Uri"},"document_field":{"default":"content","description":"Document field path containing text to classify","examples":["content","metadata.description"],"title":"Document Field","type":"string"},"image_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"task='nsfw' only: doc field path with an image URL to classify instead of text.","examples":["metadata.image_url"],"title":"Image Field"},"video_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"task='nsfw' only: doc field path with a video URL to classify instead of text.","examples":["metadata.video_url"],"title":"Video Field"},"output_field":{"default":"classification","description":"Document field path to store classification results","title":"Output Field","type":"string"},"nsfw_threshold":{"default":0.7,"description":"task='nsfw': classification threshold passed to the content-classifier model.","maximum":1,"minimum":0,"title":"Nsfw Threshold","type":"number"},"drop_if_unsafe":{"default":false,"description":"task='nsfw': drop documents flagged is_nsfw instead of just annotating.","title":"Drop If Unsafe","type":"boolean"},"max_document_chars":{"default":5000,"description":"Maximum characters of document text to send for classification","maximum":50000,"minimum":100,"title":"Max Document Chars","type":"integer"},"top_k_labels":{"anyOf":[{"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"Custom classifier only: keep only the top-k labels by confidence. If None, returns all.","title":"Top K Labels"},"min_confidence":{"anyOf":[{"maximum":1.0,"minimum":0.0,"type":"number"},{"type":"null"}],"default":null,"description":"Custom classifier only: minimum confidence threshold. Labels below this are dropped.","title":"Min Confidence"},"batch_size":{"default":10,"description":"Number of documents to classify per inference call","maximum":100,"minimum":1,"title":"Batch Size","type":"integer"},"max_concurrency":{"default":5,"description":"Maximum concurrent inference requests","maximum":20,"minimum":1,"title":"Max Concurrency","type":"integer"}},"title":"ClassifyConfig","type":"object"}},{"stage_id":"code_execution","description":"Execute custom code in secure isolated sandboxes","category":"enrich","icon":"code","parameter_schema":{"$defs":{"ErrorHandling":{"description":"How to handle code execution errors.","enum":["skip","raise"],"title":"ErrorHandling","type":"string"}},"description":"Configuration for executing custom code in secure isolated sandboxes.\n\n**Stage Category**: ENRICH (1-1 Enrichment)\n\n**Transformation**: N documents → N documents (same count, expanded schema)\n\n**Purpose**: Executes user-provided code in isolated sandboxes to compute\ncustom enrichments for each document. The code receives all documents as\na JSON array and must return a list of results matching the input length.\n\n**When to Use**:\n    - Custom transformations not covered by built-in stages\n    - Data extraction (regex, parsing)\n    - Unit conversions and normalization\n    - Cross-document computations (relative scores, rankings)\n    - Prototyping custom enrichment logic\n    - Complex string manipulations\n\n**When NOT to Use**:\n    - Simple field transformations (use json_transform)\n    - LLM-based enrichment (use llm_enrich)\n    - External API calls (use api_call stage)\n    - When deterministic built-in stages suffice\n\n**Operational Behavior**:\n    - Creates ONE sandbox per stage execution (~200ms startup)\n    - Passes ALL documents as JSON array to user code\n    - User code must set `result` to list matching input length\n    - Merges results back into documents at output_field\n    - Supports Python, TypeScript, and JavaScript\n\n**Template Support**:\n    - {{INPUT.*}}: Pipeline input parameters (evaluated before execution)\n    - {{CONTEXT.*}}: Execution context (namespace_id, internal_id)\n    - {{SECRET.*}}: Organization vault secrets (e.g., {{SECRET.api_key}})\n    - Documents are passed as runtime `docs` variable (NOT a template)\n\n**Using Secrets**:\n    Secrets stored in your organization vault can be referenced in code and env:\n    - In env vars: {\"API_KEY\": \"{{SECRET.stripe_api_key}}\"}\n    - In code: \"api_key = '{{SECRET.my_key}}'\"  (less common, prefer env vars)\n    Secrets are automatically loaded from the vault and redacted in error messages.\n\nRequirements:\n    - code: REQUIRED, code to execute (receives `docs`, must set `result`)\n    - output_field: REQUIRED, where to store computed results\n    - language: OPTIONAL, execution language (default: python)\n    - timeout_ms: OPTIONAL, execution timeout (default: 5000ms)\n\nExamples:\n    Word count enrichment:\n        ```json\n        {\n            \"code\": \"result = [{'word_count': len(d.get('content', '').split())} for d in docs]\",\n            \"output_field\": \"text_stats\"\n        }\n        ```\n\n    Cross-document relative scores:\n        ```json\n        {\n            \"code\": \"avg = sum(d.get('score', 0) for d in docs) / len(docs)\\nresult = [{'relative': d.get('score', 0) / avg} for d in docs]\",\n            \"output_field\": \"score_analysis\"\n        }\n        ```","examples":[{"code":"result = [{'word_count': len(d.get('content', '').split())} for d in docs]","description":"Word count enrichment","output_field":"text_stats"},{"code":"import re\nresult = [{'emails': re.findall(r'[\\w.+-]+@[\\w-]+\\.[\\w.-]+', d.get('content', ''))} for d in docs]","description":"Email extraction","output_field":"extracted_emails","timeout_ms":3000},{"code":"avg = sum(d.get('score', 0) for d in docs) / len(docs) if docs else 1\nresult = [{'relative_score': d.get('score', 0) / avg, 'rank': i+1} for i, d in enumerate(docs)]","description":"Cross-document relative scoring","output_field":"score_analysis"},{"code":"const result = docs.map((d: any) => ({ wordCount: (d.content || '').split(/\\s+/).length }));","description":"TypeScript execution","language":"typescript","output_field":"ts_stats"},{"code":"import os\nimport urllib.request\napi_key = os.getenv('API_KEY')\nresult = [{'enriched': True} for d in docs]","description":"External API call using vault secret","env":{"API_KEY":"{{SECRET.external_api_key}}"},"output_field":"api_enrichment"}],"properties":{"code":{"default":"result = [{'word_count': len(d.get('content', '').split())} for d in docs]","description":"Code to execute in the sandbox. Receives 'docs' variable (list of document dicts). Must set 'result' variable to a list matching input length. Supports {{INPUT.*}}, {{CONTEXT.*}}, and {{SECRET.*}} templates.","examples":["result = [{'word_count': len(d.get('content', '').split())} for d in docs]","avg = sum(d.get('score', 0) for d in docs) / len(docs)\nresult = [{'relative_score': d.get('score', 0) / avg} for d in docs]","import re\nresult = [{'emails': re.findall(r'[\\w.+-]+@[\\w-]+\\.[\\w.-]+', d.get('content', ''))} for d in docs]"],"title":"Code","type":"string"},"language":{"default":"python","description":"Execution language. Python recommended for most use cases.","enum":["python","typescript","javascript"],"title":"Language","type":"string"},"output_field":{"default":"computed","description":"Document field path where results are merged. Dot notation supported: 'enrichment.computed'","examples":["computed","metadata.analysis","enrichment.custom"],"title":"Output Field","type":"string"},"result_variable":{"default":"result","description":"Variable name containing the output list in your code. Must be a JSON-serializable list with length == len(docs).","title":"Result Variable","type":"string"},"timeout_ms":{"default":5000,"description":"Execution timeout in milliseconds (100ms-30s, default 5s)","maximum":30000,"minimum":100,"title":"Timeout Ms","type":"integer"},"max_output_size":{"default":100000,"description":"Max output size in bytes (default 100KB, max 1MB)","maximum":1000000,"minimum":1024,"title":"Max Output Size","type":"integer"},"env":{"additionalProperties":{"type":"string"},"description":"Environment variables available during execution. Supports INPUT and SECRET templates: {'API_KEY': '{{SECRET.stripe_key}}', 'USER_ID': '{{INPUT.user_id}}'}","examples":[{"API_KEY":"{{SECRET.openai_api_key}}"},{"ENV":"production","STRIPE_KEY":"{{SECRET.stripe_key}}"}],"title":"Env","type":"object"},"on_error":{"$ref":"#/$defs/ErrorHandling","default":"skip","description":"'skip': On error, return input documents unchanged. 'raise': Fail entire pipeline on any error."}},"title":"CodeExecutionConfig","type":"object"}},{"stage_id":"document_enrich","description":"Join and enrich documents with data from another collection","category":"apply","icon":"link","parameter_schema":{"$defs":{"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"},"StageCacheBehavior":{"description":"Cache behavior modes for retriever stages.\n\nControls internal caching of stage operations for performance optimization.\nAll modes are safe and automatic with LRU eviction - no manual cache management needed.\n\nValues:\n    AUTO: Smart automatic caching (default, recommended)\n    DISABLED: Skip internal caching completely\n    AGGRESSIVE: Cache even non-deterministic operations (use with caution)\n\nCache Architecture:\n    - Redis with LRU eviction policy (memory-bounded)\n    - Namespace-isolated per organization (multi-tenant safe)\n    - Stage-specific keyspaces prevent conflicts\n    - Cache keys hash (stage_name, inputs, parameters)\n    - Automatic invalidation on parameter changes\n\nPerformance Impact:\n    - AUTO: 50-90% latency reduction for repeated operations\n    - Cache lookup overhead: <5ms\n    - Hit rates: Typically 60-80% in production\n\nWhen to Use Each Mode:\n    AUTO (default):\n        - Deterministic transformations (parsing, formatting, reshaping)\n        - Stable external API calls (embeddings, standard inference)\n        - Operations without side effects\n        - Most use cases - this is the recommended default\n\n    DISABLED:\n        - Templates with now(), random(), or time-sensitive functions\n        - External APIs that must be called every time (real-time data)\n        - Operations with side effects\n        - Rapidly changing data where caching would serve stale results\n\n    AGGRESSIVE:\n        - When you fully understand caching implications\n        - For debugging or testing cache behavior\n        - Only use if you know cache invalidation is handled elsewhere\n        - Generally not recommended for production\n\nExamples:\n    Basic usage (auto mode, no config needed):\n        {\"cache_behavior\": \"auto\"}  # or omit - this is the default\n\n    Disable for time-sensitive operations:\n        {\"cache_behavior\": \"disabled\"}  # Template has {{now()}}\n\n    With custom TTL:\n        {\"cache_behavior\": \"auto\", \"cache_ttl_seconds\": 300}","enum":["auto","disabled","aggressive"],"title":"StageCacheBehavior","type":"string"}},"description":"Configuration for enriching documents with data from another collection.\n\n**Stage Category**: APPLY (1-1 Inner Join/Enrichment)\n\n**Transformation**: N documents → N documents (same count, expanded schema)\n\n**Purpose**: Applies each input document to a lookup operation in another collection,\nmerging matching data back. This performs JOIN-like operations similar to SQL INNER JOIN.\nEach input document produces exactly one output document with added fields.\n\n**When to Use**:\n    - After FILTER/SORT to add related reference data\n    - To combine data from multiple collections (e.g., products + catalog info)\n    - When documents need contextual information from other sources\n    - For denormalizing data at query time instead of storage time\n    - To attach user profiles, metadata, or related entities\n\n**When NOT to Use**:\n    - For initial document retrieval (use FILTER stages: hybrid_search)\n    - For removing documents (use FILTER stages)\n    - For reordering results (use SORT stages)\n    - When the target collection is very large (performance impact)\n    - For 1-N joins that expand document count (use taxonomy with multi-match)\n\n**Operational Behavior**:\n    - Applies each input document to a collection lookup (1-1 operation)\n    - Performs database lookups for each document (MongoDB queries)\n    - Maintains document count: N in → N out\n    - Expands schema: adds fields from target collection\n    - Moderate performance (depends on target collection size and indexes)\n    - Left join semantics: missing matches result in null/absent fields\n\n**Common Pipeline Position**: FILTER → SORT → APPLY (this stage)\n\n**Join Operation**: This is a LEFT JOIN - all source documents are kept,\nenrichment fields are added when matches are found. Missing matches result\nin null/absent fields rather than document removal.\n\nRequirements:\n    - target_collection_id: REQUIRED, collection to join with\n    - source_field: REQUIRED, field in current documents to match\n    - target_field: REQUIRED, field in target collection to match against\n    - fields_to_merge: OPTIONAL, specific fields to merge (or entire document)\n    - output_field: OPTIONAL, where to place enrichment (root or nested path)\n\nUse Cases:\n    - Enrich product search results with full catalog data\n    - Add user profile information to activity logs\n    - Join cluster assignments with detailed metadata\n    - Attach reference data (categories, taxonomies) to documents\n    - Combine fragmented data across collections\n\nExamples:\n    Basic field-based join:\n        ```json\n        {\n            \"target_collection_id\": \"col_products\",\n            \"source_field\": \"metadata.product_id\",\n            \"target_field\": \"product_id\",\n            \"fields_to_merge\": [\"name\", \"price\", \"category\"]\n        }\n        ```\n\n    Nested field join with custom output:\n        ```json\n        {\n            \"target_collection_id\": \"col_users\",\n            \"source_field\": \"lineage.source_object_id\",\n            \"target_field\": \"user_id\",\n            \"output_field\": \"enrichments.user_profile\",\n            \"fields_to_merge\": [\"name\", \"email\", \"role\"]\n        }\n        ```\n\n    Conditional enrichment (only for specific categories):\n        ```json\n        {\n            \"target_collection_id\": \"col_catalog\",\n            \"source_field\": \"metadata.sku\",\n            \"target_field\": \"sku\",\n            \"fields_to_merge\": [\"description\", \"specs\"],\n            \"when\": {\n                \"field\": \"metadata.category\",\n                \"operator\": \"eq\",\n                \"value\": \"electronics\"\n            }\n        }\n        ```","examples":[{"description":"RETRIEVER-BASED: Simple field join using anonymous attribute filter","fields_to_merge":["primary_text","transcription"],"output_field":"text_context","retriever_config":{"stages":[{"config":{"parameters":{"field":"root_object_id","operator":"eq","value":"{{DOC.root_object_id}}"},"stage_id":"attribute_filter"},"stage_type":"filter"}]},"target_collection_id":"col_text_chunks"},{"description":"RETRIEVER-BASED: Semantic join using existing retriever","fields_to_merge":["name","price","image_url"],"output_field":"similar_products","retriever_id":"ret_find_similar_products","retriever_inputs":{"category":"{{DOC.metadata.category}}","query":"{{DOC.description}}"},"strategy":"append"},{"description":"RETRIEVER-BASED: Complex multi-stage anonymous retriever","fields_to_merge":["title","summary","tags"],"output_field":"enrichments.related_docs","retriever_config":{"stages":[{"config":{"parameters":{"final_top_k":5,"fusion":"weighted","searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","text":"{{DOC.title}}"},"top_k":50,"weight":0.7},{"feature_uri":"mixpeek://sparse_extractor@v1/lexical","query":{"input_mode":"text","text":"{{DOC.title}}"},"top_k":50,"weight":0.3}]},"stage_id":"feature_search"},"stage_type":"filter"},{"config":{"parameters":{"field":"metadata.category","operator":"eq","value":"{{DOC.category}}"},"stage_id":"attribute_filter"},"stage_type":"filter"}]},"target_collection_id":"col_enrichment_data"},{"description":"DIRECT JOIN (LEGACY): Basic product enrichment","source_field":"metadata.product_id","strategy":"enrich","target_collection_id":"col_products_v1","target_field":"product_id"},{"description":"Selective field merge - only specific fields from catalog","fields_to_merge":["name","price","category","description"],"output_field":"enrichments.product_data","source_field":"metadata.sku","strategy":"enrich","target_collection_id":"col_catalog","target_field":"sku"},{"allow_missing":true,"cache_behavior":"auto","cache_ttl_seconds":3600,"description":"User profile enrichment with nested output and caching","fields_to_merge":["name","email","tier","preferences"],"output_field":"enrichments.user_profile","source_field":"user_id","strategy":"enrich","target_collection_id":"col_users","target_field":"id"},{"allow_missing":true,"cache_behavior":"auto","description":"Conditional enrichment - only for electronics category (auto-optimized)","fields_to_merge":["specifications","warranty"],"output_field":"metadata.technical_details","source_field":"metadata.sku","target_collection_id":"col_specs","target_field":"sku","when":{"field":"metadata.category","operator":"eq","value":"electronics"}},{"allow_missing":false,"description":"Strict join - filter out documents without matches","fields_to_merge":["metadata.title","metadata.tags"],"source_field":"lineage.source_object_id","target_collection_id":"col_metadata","target_field":"object_id"},{"cache_behavior":"auto","description":"Complex conditional join with optimizer integration","output_field":"enrichments.extended","source_field":"document_id","strategy":"enrich","target_collection_id":"col_extended_data","target_field":"ref_id","when":{"AND":[{"field":"metadata.verified","operator":"eq","value":true},{"field":"metadata.priority","operator":"gte","value":5}]}},{"cache_behavior":"auto","cache_ttl_seconds":300,"description":"High-frequency join with aggressive caching and custom TTL","fields_to_merge":["name","price","stock"],"source_field":"sku","target_collection_id":"col_catalog","target_field":"product_sku"},{"cache_behavior":"disabled","description":"Real-time join with cache disabled for fresh data","fields_to_merge":["current_price","availability"],"output_field":"live_data","source_field":"product_id","target_collection_id":"col_live_prices","target_field":"id"}],"properties":{"cache_behavior":{"$ref":"#/$defs/StageCacheBehavior","default":"auto","description":"Controls internal caching behavior for this stage. OPTIONAL - defaults to 'auto' for transparent performance. \n\n'auto' (default): Automatic caching for deterministic operations. Stage intelligently caches results based on inputs and parameters. Use for transformations, parsing, formatting, stable API calls. Cache invalidates automatically when parameters change. Recommended for 95% of use cases. \n\n'disabled': Skip all internal caching. Every execution runs fresh without cache lookup. Use for templates with now(), random(), or external APIs that must be called every time (real-time data). No performance benefit but guarantees fresh execution. \n\n'aggressive': Cache even non-deterministic operations. Use ONLY when you fully understand caching implications. May cache time-sensitive or random data. Generally not recommended - prefer 'auto' or 'disabled'. \n\nNote: This controls internal stage caching. Retriever-level caching (cache_config.cache_stage_names) is separate and caches complete stage outputs.","examples":["auto","disabled","aggressive"]},"cache_ttl_seconds":{"anyOf":[{"minimum":0,"type":"integer"},{"type":"null"}],"default":null,"description":"Time-to-live for cache entries in seconds. OPTIONAL - defaults to None (LRU eviction only). \n\nWhen None (default, recommended): Cache uses Redis LRU eviction policy. Most frequently used items stay cached automatically. No manual TTL management needed. Memory bounded by Redis maxmemory setting. \n\nWhen specified: Cache entries expire after this duration regardless of usage. Useful for data that becomes stale after specific time periods. Lower values for frequently changing external data. Higher values for stable transformations. \n\nExamples:\n- None: LRU-based eviction (recommended for most cases)\n- 300: 5 minutes (for semi-static external data)\n- 3600: 1 hour (for stable transformations)\n- 86400: 24 hours (for rarely changing operations)\n\n\nPerformance Note: TTL adds minimal overhead (<1ms) but forces eviction even for frequently accessed items. Use None unless you have specific staleness requirements.","examples":[null,300,3600,86400],"title":"Cache Ttl Seconds"},"retriever_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"ID of an existing retriever to use for finding enrichment data. When provided, uses the full retriever pipeline (semantic search, filters, etc.) instead of simple field matching. Mutually exclusive with retriever_config.","examples":["ret_find_similar_products","ret_user_lookup"],"title":"Retriever Id"},"retriever_config":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Anonymous retriever definition for finding enrichment data. Allows defining a custom retriever inline without creating it separately. Mutually exclusive with retriever_id. Must have 'stages' array with at least one stage.","examples":[{"stages":[{"config":{"parameters":{"field":"root_object_id","operator":"eq","value":"{{DOC.root_object_id}}"},"stage_id":"attribute_filter"},"stage_type":"filter"}]}],"title":"Retriever Config"},"retriever_inputs":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Template mapping from source document fields to retriever inputs. Supports template syntax: {{DOC.field_name}} to reference source document fields. Used when retriever_id or retriever_config is specified.","examples":[{"category":"{{DOC.category}}","query":"{{DOC.description}}"},{"product_id":"{{DOC.metadata.product_id}}"}],"title":"Retriever Inputs"},"target_collection_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":"{{COLLECTION_ID}}","description":"Collection ID to fetch enrichment data from. REQUIRED for direct joins (when retriever_id/retriever_config not provided). Also used to scope retriever queries when retriever-based join is used. NOTE: You must replace the default placeholder with your actual collection ID.","examples":["col_products_v1","col_users"],"title":"Target Collection Id"},"source_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":"source_object_id","description":"Dot-path to field in current document to match on. REQUIRED for direct joins (when retriever_id/retriever_config not provided). For retriever-based joins, use retriever_inputs instead.","examples":["metadata.product_id","lineage.source_object_id","cluster_id"],"title":"Source Field"},"target_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":"document_id","description":"Field in target collection to match against. REQUIRED for direct joins (when retriever_id/retriever_config not provided). Ignored for retriever-based joins.","examples":["product_id","user_id","id"],"title":"Target Field"},"fields_to_merge":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Specific fields from target document to merge. If None, merges entire document. Supports dot-notation for nested fields.","examples":[["name","price","category"],["metadata.title","metadata.description"],null],"title":"Fields To Merge"},"output_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Dot-path where enrichment data should be placed. If None, merges directly into document root. Use 'enrichments.{name}' to namespace enrichments.","examples":["enrichments.product_data","metadata.user_profile",null],"title":"Output Field"},"strategy":{"default":"enrich","description":"How to handle the merge: 'enrich' = add fields to existing document, 'replace' = replace document with enriched version, 'append' = add as array item","examples":["enrich","replace","append"],"title":"Strategy","type":"string"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"Conditional filter to determine which documents should be enriched. Documents not matching the condition pass through unchanged."},"allow_missing":{"default":true,"description":"If True, documents without matching enrichment data pass through unchanged. If False, documents without matches are filtered out.","title":"Allow Missing","type":"boolean"}},"title":"DocumentEnrichmentConfig","type":"object"}},{"stage_id":"llm_enrich","description":"Enrich documents with LLM-generated fields using natural language prompts","category":"apply","icon":"sparkles","parameter_schema":{"$defs":{"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LLMProvider":{"description":"Supported LLM providers for content generation.\n\nEach provider has different strengths, pricing, and multimodal capabilities.\nChoose based on your use case, performance requirements, and budget.\n\nValues:\n    OPENAI: OpenAI GPT models (GPT-4o, GPT-4.1, O3-mini)\n        - Best for: General purpose, vision tasks, structured outputs\n        - Multimodal: Text, images\n        - Performance: Fast (100-500ms), reliable\n        - Cost: Moderate to high ($0.15-$10 per 1M tokens)\n        - Use when: Need high-quality generation with vision support\n\n    GOOGLE: Google Gemini models (Gemini 3.1 Flash Lite, Gemini 2.5 Pro)\n        - Best for: Fast generation, video understanding, cost-efficiency\n        - Multimodal: Text, images, video, audio, PDFs\n        - Performance: Very fast (50-200ms)\n        - Cost: Low to moderate ($0.075-$0.40 per 1M tokens)\n        - Use when: Need video/audio/PDF support or cost-efficiency\n\n    ANTHROPIC: Anthropic Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)\n        - Best for: Long context, complex reasoning, safety\n        - Multimodal: Text, images\n        - Performance: Moderate (200-800ms)\n        - Cost: Moderate to high ($0.25-$15 per 1M tokens)\n        - Use when: Need long context or complex reasoning\n\nExamples:\n    - Use OPENAI for production with structured JSON outputs\n    - Use GOOGLE for video summarization and cost-sensitive workloads\n    - Use ANTHROPIC for complex reasoning with long documents","enum":["openai","google","anthropic"],"title":"LLMProvider","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"}},"description":"Configuration for augmenting documents with LLM generated fields.\n\n**Stage Category**: APPLY (1-1 Enrichment/Generation)\n\n**Transformation**: N documents → N documents (same count, expanded schema)\n\n**Purpose**: Applies LLM generation to each input document, creating new fields with\ngenerated content. Use this for summaries, insights, descriptions, or transformations.\nEach input document produces exactly one output document with added generated fields.\n\n**When to Use**:\n    - After FILTER/SORT to enhance final results with generated content\n    - For summarization of long content\n    - To extract structured data (entities, insights, key points)\n    - For content transformation (translation, rephrasing, formatting)\n    - To generate descriptions, titles, or metadata\n    - For creative augmentation (suggestions, recommendations)\n\n**When NOT to Use**:\n    - For removing documents (use FILTER: llm_filter instead)\n    - For simple field transformations (use direct field mapping)\n    - For initial document retrieval (use FILTER: hybrid_search)\n    - For reordering (use SORT stages)\n    - When fast response time is critical (LLM generation is slow, 200ms-5s)\n    - When cost is a major concern (LLM generation is very expensive)\n    - For large batch processing (consider async batch jobs instead)\n\n**Operational Behavior**:\n    - Applies LLM generation to each input document (1-1 operation)\n    - Maintains document count: N in → N out\n    - Expands schema: adds new generated fields to each document\n    - Makes HTTP requests to Engine service for LLM inference\n    - Very slow operation (LLM generation, 200ms-5s per document batch)\n    - Processes documents in batches to optimize throughput\n    - Supports concurrent batching for parallel LLM calls\n\n**Common Pipeline Position**: FILTER → SORT → APPLY (this stage)\n\n**Cost & Performance**:\n    - Very Expensive: LLM generation costs per document (10-100x vs embeddings)\n    - Very Slow: 200ms-5s per batch depending on LLM and generation length\n    - CRITICAL: Use `when` parameter for selective enrichment (massive cost savings)\n    - Consider enriching only top-ranked results after RANK stage\n    - Smaller batch sizes often better for latency\n\n**Conditional Enrichment**: Supports `when` parameter to only enrich\nspecific documents. CRITICAL FOR COST SAVINGS - LLM generation is expensive!\n    - Only summarize long documents (word_count > 500)\n    - Only process high-priority items\n    - Only enrich specific content types (articles, not images)\n\nRequirements:\n    - provider: OPTIONAL, LLM provider (openai, google, anthropic). Auto-inferred if not specified.\n    - model_name: OPTIONAL, specific model name. Uses provider default if not specified.\n    - prompt: REQUIRED, LLM prompt template (supports {DOC.field}, {INPUT.field})\n    - output_field: REQUIRED, where to store generated content\n    - batch_size: OPTIONAL, documents per batch (default 5)\n    - schema: OPTIONAL, JSON schema for structured output\n    - when: OPTIONAL but RECOMMENDED for cost control\n\nUse Cases:\n    - Summarization: Generate 3-sentence summaries of articles\n    - Insight extraction: Extract key takeaways and insights\n    - Description generation: Create product descriptions from specs\n    - Translation: Translate content to other languages\n    - Entity extraction: Extract people, places, organizations\n    - Recommendation generation: Create personalized suggestions\n\nExamples:\n    Unconditional enrichment:\n        ```json\n        {\n            \"provider\": \"openai\",\n            \"model_name\": \"gpt-4o-mini\",\n            \"prompt\": \"Summarize the document\",\n            \"output_field\": \"metadata.summary\"\n        }\n        ```\n\n    Conditional enrichment (only summarize long documents):\n        ```json\n        {\n            \"provider\": \"google\",\n            \"model_name\": \"gemini-2.5-flash-lite\",\n            \"prompt\": \"Summarize the document\",\n            \"output_field\": \"metadata.summary\",\n            \"when\": {\n                \"field\": \"metadata.word_count\",\n                \"operator\": \"gt\",\n                \"value\": 500\n            }\n        }\n        ```","examples":[{"batch_size":4,"description":"Unconditional summary enrichment (recommended format)","model_name":"gemini-2.5-flash-lite","output_field":"metadata.summary","prompt":"Summarise the document in three bullet points","provider":"google"},{"description":"Extract risks with JSON schema","model_name":"gemini-2.5-flash-lite","output_field":"metadata.risks","prompt":"Extract key risks from the document","provider":"google","schema":{"items":{"type":"string"},"type":"array"}},{"batch_size":3,"description":"Conditional enrichment - only summarize long English articles (COST SAVINGS!)","model_name":"gpt-4o-mini","output_field":"metadata.detailed_summary","prompt":"Provide a detailed summary of this article","provider":"openai","when":{"AND":[{"field":"metadata.word_count","operator":"gt","value":1000},{"field":"metadata.category","operator":"in","value":["article","blog"]},{"field":"metadata.language","operator":"eq","value":"en"}]}},{"description":"Provider-only (uses default model for provider)","output_field":"metadata.insights","prompt":"Extract key insights from {{DOC.text}}","provider":"openai"},{"description":"OR condition - process urgent items OR high value customers","model_name":"claude-sonnet-4-5-20250929","output_field":"metadata.recommendations","prompt":"Provide personalized recommendations for this customer","provider":"anthropic","temperature":0.7,"when":{"OR":[{"field":"metadata.customer_tier","operator":"eq","value":"premium"},{"field":"metadata.lifetime_value","operator":"gte","value":10000}]}},{"description":"Legacy format (deprecated but supported for backward compatibility)","inference_name":"openai:gpt-4o-mini","output_field":"metadata.summary","prompt":"Summarize the document"}],"properties":{"feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Feature URI of a custom LLM/generation plugin. When set, overrides provider and model_name. The plugin must accept {prompt: str, document: dict} and return {text: str}. Format: mixpeek://plugin_name@version/feature_name","examples":["mixpeek://my_enricher@1.0.0/enrich"],"title":"Feature Uri"},"provider":{"anyOf":[{"$ref":"#/$defs/LLMProvider"},{"type":"null"}],"default":null,"description":"LLM provider to use. Supported providers:\n- openai: GPT models (GPT-4o, GPT-4o-mini)\n- google: Gemini models (Gemini 2.5 Flash)\n- anthropic: Claude models (Claude 3.5 Sonnet/Haiku)\n\nIf not specified, defaults to 'google'. Can be auto-inferred from model_name. Ignored when feature_uri is set.","examples":["openai","google","anthropic"]},"model_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Specific LLM model to use. If not specified, uses provider default.\nExamples: gemini-2.5-flash-lite, gpt-4o-mini, claude-3-5-haiku-20241022","examples":["gemini-2.5-flash-lite","gpt-4o-mini","claude-sonnet-4-5-20250929"],"title":"Model Name"},"inference_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"deprecated":true,"description":"DEPRECATED: Use 'provider' and 'model_name' instead.\nLegacy format: 'provider:model' (e.g., 'gemini:gemini-2.5-flash-lite').\nKept for backward compatibility only.","title":"Inference Name"},"prompt":{"default":"Summarize the following content in 2-3 sentences: {{DOC.content}}","description":"Prompt template for the LLM (supports doc/input templates).","title":"Prompt","type":"string"},"output_field":{"default":"summary","description":"Dot-path where the enrichment result should be stored.","title":"Output Field","type":"string"},"batch_size":{"default":5,"description":"Number of documents to enrich per LLM request batch.","maximum":25,"minimum":1,"title":"Batch Size","type":"integer"},"schema":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Optional JSON schema instructions for the LLM output.","title":"Schema"},"max_tokens":{"anyOf":[{"maximum":16000,"minimum":100,"type":"integer"},{"type":"null"}],"default":null,"description":"Maximum output tokens for the LLM response. If not specified, uses the provider default (4000). Increase this if output is being truncated. Note: Gemini counts tokens more aggressively than Claude/GPT — a 1000-token limit may produce only ~350 characters with Gemini.","title":"Max Tokens"},"temperature":{"default":0.2,"description":"Sampling temperature passed to the LLM.","maximum":1.0,"minimum":0.0,"title":"Temperature","type":"number"},"api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Bring Your Own Key (BYOK) - use your own LLM API key instead of Mixpeek's.\n\n**How to use:**\n1. Store your API key as an organization secret via POST /v1/organizations/secrets\n   Example: {\"secret_name\": \"openai_api_key\", \"secret_value\": \"sk-proj-...\"}\n\n2. Reference it here using template syntax: {{secrets.openai_api_key}}\n\n**Benefits:**\n- Use your own API credits and rate limits\n- Keep your API keys secure in Mixpeek's encrypted vault\n- No changes needed to your retriever when rotating keys\n\nIf not provided, uses Mixpeek's default API keys (usage charged to your account).","examples":["{{secrets.openai_api_key}}","{{secrets.anthropic_key}}"],"title":"Api Key"},"multimodal_inputs":{"anyOf":[{"additionalProperties":{"type":"string"},"type":"object"},{"type":"null"}],"default":null,"description":"OPTIONAL. Declare INPUT fields that carry multimodal content (images, videos).\n\nMaps INPUT field names to content types: 'image' or 'video'.\nWhen set, the stage extracts those INPUT values and sends them as multimodal\ncontent alongside the text prompt — enabling image-vs-document comparison,\nvisual similarity scoring, and other cross-modal LLM tasks.\n\nWithout this, {{INPUT.query_image}} resolves to a raw base64/URL string in the\nprompt text, which the LLM cannot interpret as an actual image.\n\nExample: {\"query_image\": \"image\"} — the value of INPUT.query_image (a URL or\nbase64 string) is sent as an image part in the multimodal LLM request.","examples":[{"query_image":"image"},{"query_video":"video","reference_image":"image"}],"title":"Multimodal Inputs"},"use_vcache":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":null,"description":"Whether to use semantic caching (vCache) for LLM calls in this stage. When True, semantically similar prompts return cached responses, reducing cost. When False, every call goes directly to the LLM provider, reducing latency. When None (default), falls back to the global VCACHE_ENABLED setting.","title":"Use Vcache"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"OPTIONAL. Conditional filter that documents must satisfy to be enriched with LLM. Uses LogicalOperator (AND/OR/NOT) for complex boolean logic, or simple field/operator/value for single conditions. Documents NOT matching this condition will SKIP enrichment (pass-through unchanged). CRITICAL FOR COST SAVINGS - LLM calls are expensive! Only enrich documents that need it. When NOT specified, ALL documents are enriched unconditionally (may incur high costs). \n\nUse cases:\n- Only summarize documents with word_count > 500\n- Only enrich English articles/blogs\n- Only process high-priority items\n\n\nSimple condition example: {\"field\": \"metadata.word_count\", \"operator\": \"gt\", \"value\": 500}\nBoolean AND example: {\"AND\": [{\"field\": \"category\", \"operator\": \"in\", \"value\": [\"article\"]}, ...]}\n","examples":[{"field":"metadata.word_count","operator":"gt","value":500},{"field":"metadata.should_summarize","operator":"eq","value":true},{"AND":[{"field":"metadata.category","operator":"in","value":["article","blog"]},{"field":"metadata.language","operator":"eq","value":"en"},{"field":"metadata.word_count","operator":"gt","value":1000}]},{"OR":[{"field":"metadata.urgent","operator":"eq","value":true},{"field":"metadata.priority","operator":"gte","value":8}]}]}},"title":"LLMEnrichmentConfig","type":"object"}},{"stage_id":"taxonomy_enrich","description":"Enrich documents with taxonomy data via vector similarity matching","category":"apply","icon":"git-merge","parameter_schema":{"$defs":{"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"EnrichmentField":{"description":"Field-level enrichment behaviour specification.\n\nDefines how to copy fields from taxonomy source nodes to enriched documents.\nSupports field renaming via target_field parameter.\n\nExamples:\n    - Copy field as-is: {\"field_path\": \"category\", \"merge_mode\": \"replace\"}\n    - Rename field: {\"field_path\": \"label\", \"target_field\": \"visual_style\", \"merge_mode\": \"replace\"}\n    - Append to array: {\"field_path\": \"tags\", \"merge_mode\": \"append\"}","examples":[{"field_path":"metadata.tags","merge_mode":"append"},{"field_path":"category","merge_mode":"replace"},{"field_path":"label","merge_mode":"replace","target_field":"visual_style"},{"field_path":"description","merge_mode":"replace","target_field":"style_description"}],"properties":{"field_path":{"description":"Dot-notation path of the field to copy from the taxonomy node.","title":"Field Path","type":"string"},"target_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Optional target field name in the enriched document. If specified, the source field will be renamed to this name. If not specified, the field_path is used as the target name. Use this to rename fields during enrichment (e.g., label → visual_style).","examples":["visual_style","style_description","brand_name"],"title":"Target Field"},"merge_mode":{"$ref":"#/$defs/EnrichmentMergeMode","default":"replace","description":"Whether to overwrite the target's value or append (for arrays)."}},"required":["field_path"],"title":"EnrichmentField","type":"object"},"EnrichmentMergeMode":{"description":"How a field from the taxonomy node should be merged into the target doc.","enum":["replace","append"],"title":"EnrichmentMergeMode","type":"string"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"}},"description":"Configuration for enriching documents with taxonomy assignments.\n\n**Stage Category**: APPLY (1-1 or 1-N depending on configuration)\n\n**Transformation**:\n    - 1-1 mode (top_k=1): N documents → N documents (same count, expanded schema)\n    - 1-N mode (top_k>1): N documents → N*M documents (outer join/tagging)\n\n**Purpose**: Applies each document to a taxonomy search, matching against predefined\ntaxonomy nodes using vector similarity. Can operate as 1-1 enrichment (single best match)\nor 1-N expansion (multiple matching tags).\n\n**When to Use**:\n    - After FILTER/SORT to classify and tag retrieved documents\n    - For automatic content categorization (topics, genres, entities)\n    - When you have labeled reference data (people, products, categories)\n    - For face recognition (matching faces against enrolled identities)\n    - To apply hierarchical categorization (parent/child relationships)\n    - For entity linking (matching content to knowledge base entities)\n    - **1-1 mode** (top_k=1): Single best match enrichment\n    - **1-N mode** (top_k>1): Multi-tag expansion (document multiplication)\n\n**When NOT to Use**:\n    - For initial document retrieval from collections (use FILTER: hybrid_search)\n    - For removing documents (use FILTER stages)\n    - For reordering results (use SORT stages)\n    - For general field-based JOINs (use document_enrich instead)\n    - When you don't have a predefined taxonomy collection\n\n**Operational Behavior**:\n    - Applies each input document to taxonomy vector search\n    - Performs vector similarity search against taxonomy collection (MVS)\n    - Document count: N in → N out (top_k=1) or N*M out (top_k>1)\n    - Expands or maintains schema depending on mode\n    - Moderate performance (vector similarity searches per document)\n    - Supports conditional enrichment (via `when` parameter for cost savings)\n\n**Common Pipeline Position**: FILTER → SORT → APPLY (this stage)\n\n**Conditional Enrichment**: Supports `when` parameter to only enrich\ndocuments matching specific criteria. Critical for:\n    - Cost savings (vector searches are compute-intensive)\n    - Selective enrichment based on document properties\n    - Applying different taxonomies to different document types\n\nRequirements:\n    - taxonomy_id: REQUIRED - ID of the taxonomy to use for enrichment\n    - fields: OPTIONAL, which taxonomy fields to merge into documents\n    - top_k: OPTIONAL, max taxonomy matches per document (default 3)\n    - min_score: OPTIONAL, minimum similarity threshold (default 0.0)\n    - when: OPTIONAL, condition for selective enrichment\n\nUse Cases:\n    - Face recognition: Match detected faces to employee directory\n    - Content classification: Tag articles with topic categories\n    - Product categorization: Assign products to taxonomy of categories\n    - Entity linking: Link mentions to knowledge base entities\n    - Brand detection: Identify brand logos in images\n\nExamples:\n    Basic taxonomy enrichment:\n        ```json\n        {\n            \"taxonomy_id\": \"tax_abc123\",\n            \"top_k\": 3\n        }\n        ```\n\n    Conditional enrichment (only enrich if category=product):\n        ```json\n        {\n            \"taxonomy_id\": \"tax_product_classifier\",\n            \"top_k\": 3,\n            \"when\": {\n                \"AND\": [\n                    {\"field\": \"metadata.category\", \"operator\": \"eq\", \"value\": \"product\"},\n                    {\"field\": \"metadata.has_image\", \"operator\": \"eq\", \"value\": true}\n                ]\n            }\n        }\n        ```","examples":[{"description":"Basic taxonomy enrichment","min_score":0.6,"taxonomy_id":"tax_abc123def456","top_k":5},{"description":"With explicit enrichment fields","fields":[{"field_path":"metadata.tags","merge_mode":"append"}],"min_score":0.6,"taxonomy_id":"tax_product_categories","top_k":5},{"description":"Conditional enrichment - only enrich products","min_score":0.7,"taxonomy_id":"tax_product_classifier","top_k":3,"when":{"AND":[{"field":"metadata.category","operator":"eq","value":"product"},{"field":"metadata.has_image","operator":"eq","value":true}]}},{"description":"Simple conditional enrichment - only verified documents","taxonomy_id":"tax_content_classifier","top_k":5,"when":{"field":"metadata.verified","operator":"eq","value":true}},{"description":"OR condition - enrich urgent OR high priority items","taxonomy_id":"tax_priority_classifier","top_k":3,"when":{"OR":[{"field":"metadata.urgent","operator":"eq","value":true},{"field":"metadata.priority","operator":"gte","value":8}]}}],"properties":{"taxonomy_id":{"default":"{{TAXONOMY_ID}}","description":"ID of the taxonomy to use for enrichment. The taxonomy's configured input_mappings determine which vector field from source documents to use for similarity matching. NOTE: You must replace the default placeholder with your actual taxonomy ID.","examples":["tax_abc123def456","tax_product_categories"],"title":"Taxonomy Id","type":"string"},"fields":{"description":"Fields from the taxonomy node to merge into the document.","items":{"$ref":"#/$defs/EnrichmentField"},"title":"Fields","type":"array"},"top_k":{"default":3,"description":"Maximum taxonomy assignments to attach per document.","maximum":50,"minimum":1,"title":"Top K","type":"integer"},"min_score":{"default":0.0,"description":"Minimum similarity score required to keep an assignment.","maximum":1.0,"minimum":0.0,"title":"Min Score","type":"number"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"OPTIONAL. Conditional filter that documents must satisfy to be enriched. Uses LogicalOperator (AND/OR/NOT) for complex boolean logic, or simple field/operator/value for single conditions. Documents NOT matching this condition will SKIP enrichment (pass-through unchanged). Useful for cost savings (only enrich relevant documents) and conditional processing. When NOT specified, ALL documents are enriched unconditionally. \n\nSimple condition example: {\"field\": \"metadata.category\", \"operator\": \"eq\", \"value\": \"product\"}\nBoolean AND example: {\"AND\": [{\"field\": \"x\", \"operator\": \"eq\", \"value\": \"y\"}, ...]}\nBoolean OR example: {\"OR\": [{\"field\": \"x\", \"operator\": \"eq\", \"value\": \"y\"}, ...]}\n","examples":[{"field":"metadata.category","operator":"eq","value":"product"},{"AND":[{"field":"metadata.category","operator":"eq","value":"product"},{"field":"metadata.verified","operator":"eq","value":true}]},{"OR":[{"field":"metadata.priority","operator":"gte","value":5},{"field":"metadata.urgent","operator":"eq","value":true}]}]}},"title":"TaxonomyEnrichmentConfig","type":"object"}},{"stage_id":"web_scrape","description":"Fetch full web page content from URLs","category":"apply","icon":"file-text","parameter_schema":{"$defs":{"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"ErrorHandling":{"description":"Error handling strategy for web scrape failures.\n\nSKIP: Skip documents that fail to fetch, continue with others.\n    - Failed documents are passed through unchanged\n    - Best for optional enrichment where failures are acceptable\n    - Example: Adding extra context that isn't critical\n\nREMOVE: Remove documents that fail to fetch from results.\n    - Failed documents are filtered out completely\n    - Best when enriched content is required\n    - Example: Must have full content to proceed\n\nRAISE: Raise exception on first failure, halt pipeline.\n    - Pipeline stops immediately on any failure\n    - Best for critical enrichment where failures indicate problems\n    - Example: Required content for compliance/audit","enum":["skip","remove","raise"],"title":"ErrorHandling","type":"string"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"},"URLPatterns":{"description":"URL pattern filters for crawling.\n\nUsed to include or exclude specific URL patterns when crawling.\nPatterns are Python regular expressions.","properties":{"include":{"description":"Regex patterns for URLs to include. If provided, URLs must match at least one pattern to be crawled. If empty, all URLs (subject to other filters) are included. Example: ['/blog/.*', '/docs/.*']","examples":[["/blog/.*","/docs/.*"],["/products/\\d+"]],"items":{"type":"string"},"title":"Include","type":"array"},"exclude":{"description":"Regex patterns for URLs to exclude. URLs matching any pattern are skipped. Applied after include patterns. Example: ['/login', '/admin/.*', '\\\\?.*']","examples":[["/login","/admin/.*"],["/user/.*","\\\\?.*"]],"items":{"type":"string"},"title":"Exclude","type":"array"}},"title":"URLPatterns","type":"object"}},"description":"Configuration for web scrape (content fetching) stage.\n\n**Stage Category**: ENRICH (1-1 Enrichment)\n\n**Transformation**: N documents → N documents (same count, expanded schema)\n\n**Purpose**: Fetches full web page content from URLs found in document fields.\nUses the Engine's Playwright service to extract clean text, metadata, and HTML\nfrom web pages. Enriches documents with complete content beyond the snippets\nprovided by search APIs like Exa.\n\n**When to Use**:\n    - After web_search stage to fetch full content from search results\n    - When documents contain URLs that need content extraction\n    - For article/blog content aggregation\n    - To get full page text beyond API snippets\n    - For content analysis requiring complete text\n    - Web scraping as part of retrieval pipeline\n\n**When NOT to Use**:\n    - When URL snippets/titles are sufficient (web_search provides these)\n    - For very large numbers of URLs (slow, browser-intensive)\n    - When content is behind authentication (Playwright can't authenticate)\n    - For dynamic content requiring complex interactions\n    - When speed is critical (browser rendering is slow: 2-5s per page)\n\n**Operational Behavior**:\n    - Enriches each input document (1-1 operation)\n    - Maintains document count: N in → N out (or fewer with REMOVE error handling)\n    - Expands schema: adds web content fields to each document\n    - Makes HTTP requests to Engine Playwright service\n    - Slow operation: 100ms-5s per document depending on strategy\n    - Supports concurrent requests with configurable batch size\n    - Browser-intensive for JavaScript strategy\n\n**Common Pipeline Position**:\n    - web_search → web_scrape (get snippets then full content)\n    - semantic_search → web_scrape (if docs have URLs)\n    - filter → web_scrape → llm_enrich (fetch then analyze)\n\n**Cost & Performance**:\n    - Expensive: Browser rendering is resource-intensive\n    - Slow: 2-5s per page for JavaScript, 100-500ms for static\n    - Use STATIC strategy when possible for speed\n    - Use conditional enrichment (when parameter) to reduce load\n    - Consider batch_size for balancing speed vs resource usage\n\n**Conditional Enrichment**: Supports `when` parameter to only fetch content\nfor specific documents. CRITICAL FOR PERFORMANCE - web scraping is slow!\n    - Only fetch for high-scoring results\n    - Only fetch specific content types\n    - Only fetch when URL field is present\n\nRequirements:\n    - url_field: REQUIRED, document field containing URL to fetch\n    - output_field: REQUIRED, where to store fetched content\n    - strategy: OPTIONAL, scraping strategy (static/javascript/auto, default: auto)\n    - timeout: OPTIONAL, request timeout in seconds (default: 10)\n    - include_html: OPTIONAL, include raw HTML in response (default: False)\n    - min_content_length: OPTIONAL, minimum content length for auto fallback (default: 500)\n    - when: OPTIONAL but RECOMMENDED for performance\n    - on_error: OPTIONAL, error handling strategy (skip/remove/raise, default: skip)\n    - batch_size: OPTIONAL, concurrent requests (default: 5)\n\nUse Cases:\n    - Content aggregation: Fetch full articles after web search\n    - Research pipelines: Get complete documents for analysis\n    - Content monitoring: Scrape pages for change detection\n    - Data enrichment: Add full page text to search results\n    - Competitive analysis: Fetch competitor content\n    - News aggregation: Get full articles from headlines\n\nExamples:\n    Basic web scrape after search:\n        ```json\n        {\n            \"url_field\": \"metadata.url\",\n            \"output_field\": \"metadata.full_content\",\n            \"strategy\": \"auto\"\n        }\n        ```\n\n    Fast static scraping:\n        ```json\n        {\n            \"url_field\": \"metadata.url\",\n            \"output_field\": \"metadata.content\",\n            \"strategy\": \"static\",\n            \"timeout\": 5,\n            \"on_error\": \"skip\"\n        }\n        ```\n\n    JavaScript SPA scraping:\n        ```json\n        {\n            \"url_field\": \"metadata.url\",\n            \"output_field\": \"metadata.rendered_content\",\n            \"strategy\": \"javascript\",\n            \"timeout\": 30,\n            \"include_html\": true\n        }\n        ```\n\n    Conditional scrape (only high scores):\n        ```json\n        {\n            \"url_field\": \"metadata.url\",\n            \"output_field\": \"metadata.full_text\",\n            \"strategy\": \"auto\",\n            \"when\": {\n                \"field\": \"score\",\n                \"operator\": \"gte\",\n                \"value\": 0.8\n            },\n            \"batch_size\": 3\n        }\n        ```","examples":[{"description":"Basic web scrape after search","on_error":"skip","output_field":"metadata.full_content","strategy":"auto","timeout":10,"url_field":"metadata.url"},{"batch_size":10,"description":"Fast static scraping for articles","include_html":false,"on_error":"skip","output_field":"metadata.article_text","strategy":"static","timeout":5,"url_field":"metadata.url"},{"batch_size":2,"description":"JavaScript rendering for SPAs","include_html":true,"on_error":"remove","output_field":"metadata.rendered_content","strategy":"javascript","timeout":30,"url_field":"metadata.app_url"},{"batch_size":5,"description":"Conditional scrape - only high-scoring results (PERFORMANCE OPTIMIZATION!)","on_error":"skip","output_field":"metadata.full_text","strategy":"auto","timeout":10,"url_field":"metadata.url","when":{"field":"score","operator":"gte","value":0.8}},{"batch_size":5,"description":"Conditional with AND - only articles above threshold","output_field":"metadata.content","strategy":"static","timeout":10,"url_field":"metadata.url","when":{"AND":[{"field":"score","operator":"gte","value":0.7},{"field":"metadata.type","operator":"in","value":["article","blog"]}]}},{"description":"Direct URL mode - scrape URL from inputs (no existing documents needed)","output_field":"web_content","strategy":"auto","timeout":15,"url":"{{inputs.url}}"},{"batch_size":5,"description":"Crawl mode - crawl documentation site","max_depth":2,"max_pages":30,"output_field":"page_content","same_domain_only":true,"strategy":"auto","url":"{{inputs.docs_url}}","url_patterns":{"include":["/docs/.*","/guide/.*"]}},{"delay_between_requests":1.0,"description":"Crawl mode - shallow crawl with URL filtering","max_depth":1,"max_pages":20,"output_field":"page_content","strategy":"static","url":"{{inputs.blog_url}}","url_patterns":{"exclude":["/tag/.*","/author/.*","/login"],"include":["/posts/.*","/articles/.*"]}},{"description":"Crawl mode - seed page only (depth 0)","max_depth":0,"max_pages":1,"output_field":"content","strategy":"auto","url":"https://example.com"}],"properties":{"url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Direct URL to fetch (supports templates like '{{inputs.url}}'). Use this when you want to scrape a URL from inputs without needing existing documents. Creates a new document with the scraped content. Either 'url' or 'url_field' must be provided, but not both.","examples":["{{inputs.url}}","https://example.com"],"title":"Url"},"url_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":"metadata.url","description":"Dot-path to document field containing URL to fetch. Use this when enriching existing documents that have URL fields. Supports nested paths like 'metadata.url' or 'data.source.link'. The field must contain a valid HTTP/HTTPS URL. If field is missing or URL is invalid, behavior depends on on_error setting. Either 'url' or 'url_field' must be provided, but not both.","examples":["metadata.url","url","metadata.source_url","data.link"],"title":"Url Field"},"output_field":{"default":"web_content","description":"Dot-path where fetched content should be stored. Creates nested structure if path doesn't exist. Stores object with: text, title, url, final_url, content_length, metadata, strategy_used. Example: 'web_content' creates doc.web_content.text, doc.web_content.title, etc.","examples":["enrichment.full_content","metadata.web_page","enrichment.scraped_data"],"title":"Output Field","type":"string"},"strategy":{"default":"auto","description":"OPTIONAL. Scraping strategy to use. 'static': Fast HTML parsing, no JavaScript (best for articles/blogs). 'javascript': Full browser rendering (slow, required for SPAs). 'auto': Try static first, fallback to JavaScript if needed. Default is 'auto' for best balance of speed and completeness.","examples":["auto","static","javascript"],"title":"Strategy","type":"string"},"timeout":{"default":10,"description":"OPTIONAL. Request timeout in seconds. Range: 1-60 seconds. Default is 10. JavaScript rendering may need higher timeouts (20-30s). Shorter timeouts fail faster but may miss slow-loading pages.","maximum":60,"minimum":1,"title":"Timeout","type":"integer"},"include_html":{"default":false,"description":"OPTIONAL. Include raw HTML in output. Default is False. Set to True if you need HTML for custom parsing. Warning: Significantly increases response size and memory usage.","title":"Include Html","type":"boolean"},"min_content_length":{"default":500,"description":"OPTIONAL. Minimum content length (characters) for auto strategy. If static parsing yields less than this, fallback to JavaScript. Default is 500 characters. Higher values = more likely to use JavaScript. Range: 100+.","minimum":100,"title":"Min Content Length","type":"integer"},"when":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"OPTIONAL. Conditional filter that documents must satisfy to be enriched with web content. Uses LogicalOperator (AND/OR/NOT) for complex boolean logic, or simple field/operator/value for single conditions. Documents NOT matching this condition will SKIP enrichment (pass-through unchanged). CRITICAL FOR PERFORMANCE - web scraping is slow! Only fetch content for documents that need it. When NOT specified, ALL documents are enriched unconditionally (may be very slow). \n\nUse cases:\n- Only fetch for high-scoring results (score >= 0.8)\n- Only fetch specific content types (category = 'article')\n- Only fetch when URL field is present\n\n\nSimple condition example: {\"field\": \"score\", \"operator\": \"gte\", \"value\": 0.8}\nBoolean AND example: {\"AND\": [{\"field\": \"metadata.category\", \"operator\": \"eq\", \"value\": \"article\"}, ...]}\n","examples":[{"field":"score","operator":"gte","value":0.8},{"field":"metadata.url","operator":"exists","value":true},{"AND":[{"field":"score","operator":"gte","value":0.7},{"field":"metadata.category","operator":"in","value":["article","blog"]}]}]},"on_error":{"$ref":"#/$defs/ErrorHandling","default":"skip","description":"OPTIONAL. How to handle fetch errors. 'skip': Skip failed documents, pass them through unchanged (default). 'remove': Remove failed documents from results entirely. 'raise': Raise exception on first failure, halt pipeline. Default is 'skip' for fault tolerance."},"batch_size":{"default":5,"description":"OPTIONAL. Number of concurrent fetch requests. Range: 1-20. Default is 5. Higher values = faster but more resource intensive. Lower values = slower but safer for rate limits. Consider Engine resource capacity when setting.","maximum":20,"minimum":1,"title":"Batch Size","type":"integer"},"max_depth":{"anyOf":[{"maximum":5,"minimum":0,"type":"integer"},{"type":"null"}],"default":null,"description":"OPTIONAL. Maximum link depth to follow from seed URL. None = single URL mode (default behavior). 0 = only seed page. 1 = seed + direct links from seed. 2 = seed + links + links from those pages. Higher values exponentially increase crawl time. Recommended: 1-2 for most use cases.","examples":[null,0,1,2],"title":"Max Depth"},"max_pages":{"anyOf":[{"maximum":100,"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"OPTIONAL. Maximum total pages to crawl. None = single URL mode (default behavior). Crawling stops when this limit is reached regardless of depth. Acts as a safety limit to prevent runaway crawls. Range: 1-100.","examples":[null,10,20,50],"title":"Max Pages"},"same_domain_only":{"default":true,"description":"OPTIONAL. Only follow links to the same domain as seed URL. Default is True (strongly recommended). Set to False only for intentional cross-domain crawling. Prevents crawl explosion to external sites.","title":"Same Domain Only","type":"boolean"},"url_patterns":{"anyOf":[{"$ref":"#/$defs/URLPatterns"},{"type":"null"}],"default":null,"description":"OPTIONAL. Include/exclude regex patterns for filtering discovered URLs. Include patterns: URLs must match at least one (if provided). Exclude patterns: URLs matching any are skipped. Applied after same_domain_only filter. Example: {'include': ['/docs/.*'], 'exclude': ['/login', '/admin/.*']}","examples":[{"include":["/docs/.*","/guide/.*"]},{"exclude":["/login","/admin/.*"]},{"exclude":["/tag/.*","/author/.*"],"include":["/blog/.*"]}]},"delay_between_requests":{"default":0.5,"description":"OPTIONAL. Delay in seconds between starting new requests (politeness). Helps avoid overwhelming target servers. 0 = no delay (use with caution). Default is 0.5 seconds. Recommended: 0.5-1.0 for most sites.","maximum":5.0,"minimum":0,"title":"Delay Between Requests","type":"number"}},"title":"WebScrapeConfig","type":"object"}},{"stage_id":"agent_search","description":"","category":"filter","icon":"brain-circuit","parameter_schema":{"$defs":{"ContextSourceConfig":{"description":"Where to load structured context (tree index, knowledge graph) from.\n\nAllows the agent to access hierarchical or structured data\nstored in document payloads or collection-level fields.\n\nAttributes:\n    type: Source type - 'document_payload' reads from a document's payload field,\n          'collection_field' reads from a collection-level metadata field.\n    field: Dot-notation path to the field containing context data.\n    document_id: Optional specific document to load context from.\n                 If not set, context is loaded from the first document in working set.","properties":{"type":{"default":"document_payload","description":"Source type for context data.\n- 'document_payload': Read from a specific document's payload field\n- 'collection_field': Read from collection-level metadata","examples":["document_payload","collection_field"],"title":"Type","type":"string"},"field":{"default":"_internal.metadata.tree_index","description":"Dot-notation path to the field containing structured context.\nExamples: '_internal.metadata.tree_index', 'hierarchy', 'graph_data'","examples":["_internal.metadata.tree_index","hierarchy"],"title":"Field","type":"string"},"document_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Specific document_id to load context from. If not set, uses the first document in the current working set.","title":"Document Id"}},"title":"ContextSourceConfig","type":"object"},"LLMProvider":{"description":"Supported LLM providers for content generation.\n\nEach provider has different strengths, pricing, and multimodal capabilities.\nChoose based on your use case, performance requirements, and budget.\n\nValues:\n    OPENAI: OpenAI GPT models (GPT-4o, GPT-4.1, O3-mini)\n        - Best for: General purpose, vision tasks, structured outputs\n        - Multimodal: Text, images\n        - Performance: Fast (100-500ms), reliable\n        - Cost: Moderate to high ($0.15-$10 per 1M tokens)\n        - Use when: Need high-quality generation with vision support\n\n    GOOGLE: Google Gemini models (Gemini 3.1 Flash Lite, Gemini 2.5 Pro)\n        - Best for: Fast generation, video understanding, cost-efficiency\n        - Multimodal: Text, images, video, audio, PDFs\n        - Performance: Very fast (50-200ms)\n        - Cost: Low to moderate ($0.075-$0.40 per 1M tokens)\n        - Use when: Need video/audio/PDF support or cost-efficiency\n\n    ANTHROPIC: Anthropic Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)\n        - Best for: Long context, complex reasoning, safety\n        - Multimodal: Text, images\n        - Performance: Moderate (200-800ms)\n        - Cost: Moderate to high ($0.25-$15 per 1M tokens)\n        - Use when: Need long context or complex reasoning\n\nExamples:\n    - Use OPENAI for production with structured JSON outputs\n    - Use GOOGLE for video summarization and cost-sensitive workloads\n    - Use ANTHROPIC for complex reasoning with long documents","enum":["openai","google","anthropic"],"title":"LLMProvider","type":"string"}},"additionalProperties":true,"description":"Execution parameters resolved at runtime.\n\nExtends AgentSearchConfig with any additional runtime parameters\nthat may be injected during execution.","examples":[{"description":"Iterative refinement (default strategy)","max_iterations":5,"strategy":"iterative_refinement","timeout_seconds":60},{"context_source":{"field":"_internal.metadata.tree_index","type":"document_payload"},"description":"Tree navigation with custom context source","max_iterations":7,"strategy":"tree_navigation"},{"api_key":"{{secrets.openai_api_key}}","description":"Multi-hop with BYOK and custom model","max_iterations":5,"model_name":"gpt-4o","provider":"openai","strategy":"multi_hop","timeout_seconds":90},{"description":"Custom strategy with specific stages","max_iterations":3,"stages":["feature_search","attribute_filter","rerank"],"strategy":"custom","system_prompt":"Search and refine results about {{INPUT.query}}."}],"properties":{"provider":{"anyOf":[{"$ref":"#/$defs/LLMProvider"},{"type":"null"}],"default":null,"description":"LLM provider for agent reasoning. Supported providers:\n- openai: GPT models (GPT-4o, GPT-4o-mini)\n- google: Gemini models (Gemini 3.1 Flash Lite) — cheapest, recommended\n- anthropic: Claude models (Claude 3.5 Haiku/Sonnet)\n\nIf not specified, defaults to 'google'.","examples":["google","openai","anthropic"]},"model_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Specific LLM model for agent reasoning. If not specified, uses provider default.\nModels with tool calling support required.","examples":["gemini-2.5-flash-lite","gpt-4o-mini","claude-haiku-4-5-20251001"],"title":"Model Name"},"api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Bring Your Own Key (BYOK) - use your own LLM API key.\n\n**How to use:**\n1. Store your API key as an organization secret via POST /v1/organizations/secrets\n   Example: {\"secret_name\": \"openai_api_key\", \"secret_value\": \"sk-proj-...\"}\n\n2. Reference it here using template syntax: {{secrets.openai_api_key}}\n\nIf not provided, uses Mixpeek's default API keys.","examples":["{{secrets.openai_api_key}}","{{secrets.anthropic_key}}"],"title":"Api Key"},"strategy":{"default":"iterative_refinement","description":"Reasoning strategy determining default tools and system prompt.\n\n- 'iterative_refinement': Start broad, analyze, narrow iteratively (default)\n- 'tree_navigation': Navigate hierarchical document indexes top-down\n- 'multi_hop': Follow references across documents for complex queries\n- 'full_catalog': Expose every registered retriever stage (search, filter, sort, rerank, enrich, reduce) as tools and let the model compose freely\n- 'custom': User-provided stages and system_prompt (full control)","examples":["iterative_refinement","tree_navigation","multi_hop","full_catalog","custom"],"title":"Strategy","type":"string"},"stages":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Override which retriever stages are available as agent tools.\nEach stage name maps to a registered retriever stage that the LLM can invoke.\nIf not specified, uses the strategy's default stage set.\n\nAvailable stages: feature_search, attribute_filter, llm_filter, rerank, llm_enrich, taxonomy_enrich","examples":[["feature_search","attribute_filter"],["feature_search","attribute_filter","llm_filter"]],"title":"Stages"},"system_prompt":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Override the strategy's default system prompt for the LLM agent.\nSupports template variables: {{INPUT.field}} for pipeline inputs.\nIf not specified, uses the strategy's built-in system prompt.","examples":["You are navigating a document tree to find relevant sections about {{INPUT.query}}.","Search iteratively to find documents matching the user's complex query."],"title":"System Prompt"},"max_iterations":{"default":5,"description":"Maximum number of LLM reasoning turns. Each turn may invoke one or more tools.\nHigher values allow deeper reasoning but increase cost and latency.\nThe agent may stop earlier if it determines results are sufficient.","examples":[3,5,10],"maximum":20,"minimum":1,"title":"Max Iterations","type":"integer"},"timeout_seconds":{"default":60.0,"description":"Maximum wall-clock time in seconds for the entire agent reasoning loop.\nIncludes all LLM calls and sub-stage executions.\nThe agent returns partial results if timeout is hit.","examples":[30.0,60.0,120.0],"maximum":300.0,"minimum":5.0,"title":"Timeout Seconds","type":"number"},"context_source":{"anyOf":[{"$ref":"#/$defs/ContextSourceConfig"},{"type":"null"}],"default":null,"description":"Configuration for loading structured context (tree index, knowledge graph).\nThe context is included in the initial prompt to guide agent reasoning.\nUseful for tree_navigation strategy where the agent needs the tree structure."},"feedback":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"User feedback from a prior execution of this retriever to inject into the agent's system prompt. Use this to correct or refine the agent's behavior based on a previous result that was unsatisfactory.\n\nExample: 'Previous results included marketing documents — focus only on technical specs.'\n\nThe feedback is prepended to the system prompt so the agent adjusts its strategy before the first tool call.","examples":["Previous results were too broad — narrow to documents from the last 6 months.","The agent searched the wrong collection last time — focus on collection 'products'."],"title":"Feedback"},"min_confidence":{"default":0.0,"description":"Minimum confidence score (0.0–1.0) required to accept the agent's results. When the agent calls finish_search with a confidence below this threshold, the loop continues rather than stopping. Set to 0.0 (default) to disable — the agent's own done=True signal is sufficient to stop.\n\nUse 0.7–0.9 for high-stakes queries where low-confidence results should trigger more searching. The hard ceiling is still max_iterations.","examples":[0.0,0.7,0.85],"maximum":1.0,"minimum":0.0,"title":"Min Confidence","type":"number"},"auto_strategy":{"default":false,"description":"When True, perform a lightweight LLM routing call before the main loop to automatically select the best strategy (iterative_refinement, tree_navigation, multi_hop) based on the query. Overrides the 'strategy' field.\n\nAdds ~0.5–1s latency but improves strategy fit for complex or ambiguous queries. Recommended when queries vary significantly in structure.","examples":[false,true],"title":"Auto Strategy","type":"boolean"}},"title":"AgentSearchParameters","type":"object"}},{"stage_id":"attribute_filter","description":"Filter documents by attribute values (fetches from Qdrant if first stage)","category":"filter","icon":"list-filter","parameter_schema":{"$defs":{"DynamicValue":{"description":"A value that should be dynamically resolved from the query request.","properties":{"type":{"const":"dynamic","default":"dynamic","title":"Type","type":"string"},"field":{"description":"The dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'","examples":["inputs.query_text","filters.AND[0].value"],"title":"Field","type":"string"}},"required":["field"],"title":"DynamicValue","type":"object"},"FilterCondition":{"description":"Represents a single filter condition.\n\nAttributes:\n    field: The field to filter on\n    operator: The comparison operator\n    value: The value to compare against","properties":{"field":{"description":"Field name to filter on","title":"Field","type":"string"},"operator":{"$ref":"#/$defs/FilterOperator","default":"eq","description":"Comparison operator"},"value":{"anyOf":[{"$ref":"#/$defs/DynamicValue"},{}],"description":"Value to compare against","title":"Value"}},"required":["field","value"],"title":"FilterCondition","type":"object"},"FilterOperator":{"description":"Supported filter operators across database implementations.","enum":["eq","ne","gt","lt","gte","lte","in","nin","contains","starts_with","ends_with","regex","exists","is_null","text","phrase","geo_radius","geo_bounding_box","geo_polygon"],"title":"FilterOperator","type":"string"},"LogicalOperator":{"additionalProperties":true,"description":"Represents a logical operation (AND, OR, NOT) on filter conditions.\n\nAllows nesting with a defined depth limit.\n\nAlso supports shorthand syntax where field names can be passed directly\nas key-value pairs for equality filtering (e.g., {\"metadata.title\": \"value\"}).","properties":{"AND":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical AND operation - all conditions must be true","example":[{"field":"name","operator":"eq","value":"John"},{"field":"age","operator":"gte","value":30}],"title":"And"},"OR":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical OR operation - at least one condition must be true","example":[{"field":"status","operator":"eq","value":"active"},{"field":"role","operator":"eq","value":"admin"}],"title":"Or"},"NOT":{"anyOf":[{"items":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"$ref":"#/$defs/FilterCondition"}]},"type":"array"},{"type":"null"}],"default":null,"description":"Logical NOT operation - all conditions must be false","example":[{"field":"department","operator":"eq","value":"HR"},{"field":"location","operator":"eq","value":"remote"}],"title":"Not"},"case_sensitive":{"anyOf":[{"type":"boolean"},{"type":"null"}],"default":false,"description":"Whether to perform case-sensitive matching","example":true,"title":"Case Sensitive"}},"title":"LogicalOperator","type":"object"}},"description":"Configuration for filtering documents by attribute conditions.\n\n**Stage Category**: FILTER\n\n**Transformation**: N documents → ≤N documents (subset, same schema)\n\n**Purpose**: Produces a subset of input documents by removing those that don't match\nattribute conditions. Output documents have identical schema to input.\n\n**When to Use**:\n    - **As First Stage**: To retrieve and filter documents by attributes without semantic search\n      (e.g., \"get all active products with priority >= 5\"). Fetches up to 1000 documents\n      per collection from MVS, then applies filter conditions.\n    - **As Subsequent Stage**: To narrow results from previous stages by specific attributes\n      (status, date range, category, tags). Operates purely on in-memory results.\n    - When you need to apply business logic filtering (active items, published content)\n    - Before expensive stages (SORT, APPLY) to reduce processing overhead\n    - For structured/fast filtering based on document properties\n\n**When NOT to Use**:\n    - For reordering results (use SORT stages: sort_relevance, sort_attribute)\n    - For complex semantic filtering (use llm_filter instead)\n    - For enriching documents with additional data (use APPLY stages)\n    - For aggregating to single document (use REDUCE stages)\n\n**Operational Behavior**:\n    - **As First Stage**: Fetches documents directly from MVS (up to 1000 per collection)\n      using scroll API. Supports pre_filters which leverage MVS's native filtering\n      (including full-text search with TEXT operator). Results are then filtered in-memory\n      using the stage's filter conditions. This allows attribute_filter to be used as an\n      initial retrieval stage without requiring a prior search/embedding stage.\n    - **As Subsequent Stage**: Operates purely on in-memory results from previous stages\n      (no database queries). This is the typical use case for post-filtering.\n    - Produces subset of documents (removes non-matching)\n    - Fast operation (simple condition evaluation)\n    - Processes documents in batches for memory efficiency\n    - Supports complex boolean logic (AND/OR/NOT)\n    - Output schema = Input schema (no schema changes)\n\n**Common Pipeline Position**: FILTER (this stage) → SORT → APPLY\n\n**Important Limitations**:\n    - When used as first stage, maximum 1000 documents per collection are fetched from MVS\n    - For large collections, consider using semantic_search or other retrieval stages first\n    - Vectors are not fetched from MVS (only payloads) to optimize performance\n\n**Two modes of operation:**\n\n1. **Simple mode** (single condition): Specify `field`, `operator`, and `value`\n2. **Boolean mode** (multiple conditions): Specify `conditions` with AND/OR/NOT logic\n\nBoth modes support template variables that are evaluated for every document.\n\nExamples:\n    Simple single condition:\n        ```json\n        {\n            \"field\": \"metadata.status\",\n            \"operator\": \"eq\",\n            \"value\": \"active\"\n        }\n        ```\n\n    Boolean AND:\n        ```json\n        {\n            \"conditions\": {\n                \"AND\": [\n                    {\"field\": \"metadata.status\", \"operator\": \"eq\", \"value\": \"active\"},\n                    {\"field\": \"metadata.priority\", \"operator\": \"gte\", \"value\": 5}\n                ]\n            }\n        }\n        ```\n\n    Boolean OR:\n        ```json\n        {\n            \"conditions\": {\n                \"OR\": [\n                    {\"field\": \"metadata.urgent\", \"operator\": \"eq\", \"value\": true},\n                    {\"field\": \"metadata.priority\", \"operator\": \"gte\", \"value\": 8}\n                ]\n            }\n        }\n        ```\n\n    Nested boolean logic:\n        ```json\n        {\n            \"conditions\": {\n                \"AND\": [\n                    {\"field\": \"metadata.status\", \"operator\": \"eq\", \"value\": \"active\"},\n                    {\n                        \"OR\": [\n                            {\"field\": \"metadata.category\", \"operator\": \"eq\", \"value\": \"urgent\"},\n                            {\"field\": \"metadata.category\", \"operator\": \"eq\", \"value\": \"critical\"}\n                        ]\n                    }\n                ]\n            }\n        }\n        ```","examples":[{"batch_size":100,"description":"As first stage: fetch and filter all active products from MVS","field":"metadata.status","operator":"eq","value":"active"},{"batch_size":50,"case_insensitive":true,"description":"Simple single condition filter on existing results","field":"metadata.category","operator":"eq","value":"announcements"},{"description":"Simple numeric comparison","field":"metadata.release_year","operator":"gte","value":2023},{"conditions":{"AND":[{"field":"metadata.status","operator":"eq","value":"active"},{"field":"metadata.priority","operator":"gte","value":5}]},"description":"Boolean AND - multiple conditions must all be true"},{"conditions":{"OR":[{"field":"metadata.urgent","operator":"eq","value":true},{"field":"metadata.priority","operator":"gte","value":8}]},"description":"Boolean OR - at least one condition must be true"},{"batch_size":200,"conditions":{"AND":[{"field":"metadata.status","operator":"eq","value":"published"},{"OR":[{"field":"metadata.category","operator":"eq","value":"blog"},{"field":"metadata.category","operator":"eq","value":"news"}]}]},"description":"Nested boolean logic - complex conditions"}],"properties":{"field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":"metadata.status","description":"Dot-delimited field path to evaluate on each document. Supports template variables (e.g. '{{DOC.metadata.category}}'). REQUIRED for simple mode. NOT USED when 'conditions' is specified.","examples":["metadata.category","metadata.release_year","metadata.tags","{{INPUT.dynamic_field}}"],"title":"Field"},"operator":{"anyOf":[{"$ref":"#/$defs/FilterOperator"},{"type":"null"}],"default":"eq","description":"Comparison operator to apply. Supported operators: eq, ne, gt, gte, lt, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null. REQUIRED for simple mode. NOT USED when 'conditions' is specified.","examples":["eq","gte","in","contains"]},"value":{"anyOf":[{},{"type":"null"}],"default":"{{INPUT.filter_value}}","description":"Comparison value. Can be a literal or template expression. For `in`/`nin` operators supply a list. For `exists`/`is_null` use a boolean. REQUIRED for simple mode. NOT USED when 'conditions' is specified.","examples":["enterprise",2024,["beta","ga"],"{{INPUT.minimum_version}}"],"title":"Value"},"natural_language":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Plain-English filter description. When set, an LLM converts it to a structured `conditions` tree at runtime via tool-use. Takes precedence over simple-mode fields (field/operator/value). Ignored if `conditions` is also set.","examples":["active premium customers with priority >= 5","published blog posts from 2024 that are not archived"],"title":"Natural Language"},"field_hints":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"Optional list of available field paths to constrain the LLM when converting `natural_language` to conditions. If omitted, the model infers field names from the NL query.","examples":[["metadata.status","metadata.priority","metadata.category"]],"title":"Field Hints"},"conditions":{"anyOf":[{"$ref":"#/$defs/LogicalOperator"},{"type":"null"}],"default":null,"description":"Complex filter conditions using boolean logic (AND/OR/NOT). Use this for combining multiple filter conditions. REQUIRED for boolean mode. Cannot be used with 'field'/'operator'/'value'.","examples":[{"AND":[{"field":"metadata.status","operator":"eq","value":"active"},{"field":"metadata.priority","operator":"gte","value":5}]},{"OR":[{"field":"metadata.urgent","operator":"eq","value":true},{"field":"metadata.category","operator":"in","value":["critical","high"]}]}]},"batch_size":{"default":100,"description":"Number of documents to evaluate per batch. The executor streams documents through the filter in chunks to avoid large in-memory spikes.","examples":[25,50,200],"maximum":1000,"minimum":1,"title":"Batch Size","type":"integer"},"case_insensitive":{"default":false,"description":"When true, string comparisons are performed case-insensitively where the operator supports it. Applies to both simple and boolean modes.","title":"Case Insensitive","type":"boolean"}},"title":"AttributeFilterConfig","type":"object"}},{"stage_id":"feature_search","description":"Filter documents by vector similarity using feature embeddings","category":"filter","icon":"filter","parameter_schema":{"$defs":{"ContentInput":{"description":"Generic content input for automatic content-type detection.\n\nUsed for URL or base64 inputs where the content type is not known upfront.\nThe system will automatically detect the content type (image, video, text, etc.)\nand route to the appropriate feature extractor.\n\n**IMPORTANT**: Exactly one of `url` or `base64` must be provided (mutually exclusive).\n\nUse Cases:\n    - User provides a URL without specifying content type\n    - Client sends base64-encoded content\n    - Generic search where query can be any modality\n\nRequirements:\n    - Provide exactly ONE: url OR base64 (mutually exclusive)\n    - System performs automatic content type detection\n    - Supported content types: images, videos, audio, documents\n\nExamples:\n    URL input:\n        ```json\n        {\"url\": \"https://example.com/image.jpg\"}\n        ```\n\n    Base64 image input:\n        ```json\n        {\"base64\": \"data:image/jpeg;base64,/9j/4AAQSkZJRg...\"}\n        ```\n\n    Base64 video input:\n        ```json\n        {\"base64\": \"data:video/mp4;base64,AAAAIGZ0eXBpc2...\"}\n        ```","examples":[{"description":"Image URL for visual search","url":"https://example.com/product-image.jpg"},{"description":"Video URL for multimodal search","url":"https://cdn.example.com/ad-creative.mp4"},{"description":"S3 object in Mixpeek bucket","url":"s3://mixpeek-server-dev/int_xxx/ns_xxx/video.mp4"},{"base64":"data:image/jpeg;base64,/9j/4AAQSkZJRg...","description":"Base64-encoded image"},{"base64":"data:video/mp4;base64,AAAAIGZ0eXBpc2...","description":"Base64-encoded video"}],"properties":{"url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. URL to content for embedding generation. Mutually exclusive with base64 - provide exactly one. System will automatically detect content type (image, video, text, etc.) via HTTP HEAD request and/or file extension analysis. Supported protocols: HTTP, HTTPS, S3 (for Mixpeek-managed buckets). S3 URLs must point to the configured AWS_BUCKET. NOTE: For multimodal embeddings, URL-based inputs may fail if the upstream embedding provider cannot fetch the URL (e.g., geo-restrictions, rate limiting, redirects). Consider using the base64 field instead for more reliable results.","examples":["https://example.com/video.mp4","https://cdn.example.com/image.jpg","https://storage.example.com/document.pdf","s3://mixpeek-server-dev/int_xxx/ns_xxx/video.mp4"],"title":"Url"},"base64":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Base64-encoded content for embedding generation (RECOMMENDED for multimodal embeddings). Mutually exclusive with url - provide exactly one. Must include data URI scheme (data:mime/type;base64,...) for reliable MIME detection. System will automatically detect content type from data URI prefix. Supported types: images (image/jpeg, image/png), videos (video/mp4), audio, documents. Maximum size: Limited by your namespace configuration.","examples":["data:image/jpeg;base64,/9j/4AAQSkZJRg...","data:video/mp4;base64,AAAAIGZ0eXBpc2..."],"title":"Base64"}},"title":"ContentInput","type":"object"},"ContentQueryInput":{"description":"Content-based query input with automatic format detection.\n\nSystem auto-detects content format:\n- data:mime/type;base64,... → decoded as base64 data URI (RECOMMENDED)\n- http://, https://, s3:// → fetched as URL\n- Otherwise → treated as raw base64 string\n\n**IMPORTANT - Base64 Data URIs Are Recommended for Reliability:**\nWhen using multimodal embeddings (e.g., Vertex AI multimodal), base64 data URIs\n(``data:image/jpeg;base64,...``) are the most reliable input format. Direct HTTP\nURLs may fail if the upstream embedding provider cannot fetch them (e.g., due to\ngeo-restrictions, rate limiting, authentication requirements, or redirects).\nFor best results, fetch the image/video client-side and send a base64 data URI.\n\nUse Cases:\n    - Visual search with image/video content\n    - Reverse image search\n    - Multimodal search with content\n    - Template-based content queries\n\nExamples:\n    Base64 data URI (RECOMMENDED - most reliable):\n        ```json\n        {\"input_mode\": \"content\", \"value\": \"data:image/jpeg;base64,/9j/...\"}\n        ```\n\n    Image URL (may fail if provider cannot fetch the URL):\n        ```json\n        {\"input_mode\": \"content\", \"value\": \"https://example.com/image.jpg\"}\n        ```\n\n    Template-based:\n        ```json\n        {\"input_mode\": \"content\", \"value\": \"{{INPUT.image_url}}\"}\n        ```\n\n    Legacy syntax (backward compatible):\n        ```json\n        {\"input_mode\": \"content\", \"content\": {\"url\": \"https://...\"}}\n        ```","examples":[{"description":"Base64 data URI (RECOMMENDED - most reliable for multimodal embeddings)","input_mode":"content","value":"data:image/jpeg;base64,/9j/4AAQSkZJRg..."},{"description":"Image URL (may fail if embedding provider cannot fetch the URL)","input_mode":"content","value":"https://example.com/reference-image.jpg"},{"description":"Template-based content query","input_mode":"content","value":"{{INPUT.image_url}}"}],"properties":{"input_mode":{"const":"content","default":"content","description":"Discriminator field. Always 'content' for content-based queries.","title":"Input Mode","type":"string"},"value":{"anyOf":[{"type":"string"},{"items":{},"type":"array"},{"type":"null"}],"default":null,"description":"Content input with auto-detection. System auto-detects format: - list of floats → used as raw embedding vector (no inference) - data:mime/type;base64,... → decoded as base64 data URI (RECOMMENDED for reliability) - http://, https://, s3:// → fetched as URL - Otherwise → treated as raw base64 string. NOTE: Base64 data URIs are recommended over URLs for multimodal embeddings because some URLs may fail if the embedding provider cannot fetch them directly. Supports template variables: {{INPUT.field_name}}.","examples":["data:image/jpeg;base64,/9j/4AAQSkZJRg...","https://example.com/image.jpg","{{INPUT.image_url}}"],"title":"Value"},"content":{"anyOf":[{"$ref":"#/$defs/ContentInput"},{"type":"null"}],"default":null,"description":"Legacy content input (DEPRECATED - use 'value' instead). Provide URL or base64 separately via ContentInput model."}},"title":"ContentQueryInput","type":"object"},"DocumentQueryInput":{"description":"Document reference query input for similarity search.\n\nUse existing document's pre-computed features without re-processing.\nPerfect for \"find similar documents\" functionality. No inference is performed.\n\nUse Cases:\n    - Find similar documents\n    - Reverse image search using indexed images\n    - Document-to-document similarity\n    - Multi-hop similarity chains\n\nExamples:\n    Simple document reference:\n        ```json\n        {\n            \"input_mode\": \"document\",\n            \"document_ref\": {\n                \"collection_id\": \"col_products\",\n                \"document_id\": \"doc_item_123\"\n            }\n        }\n        ```","examples":[{"description":"Find similar products","document_ref":{"collection_id":"col_products","document_id":"doc_product_123"},"input_mode":"document"},{"description":"Reverse image search","document_ref":{"collection_id":"col_media","document_id":"doc_img_sunset"},"input_mode":"document"}],"properties":{"input_mode":{"const":"document","default":"document","description":"Discriminator field. Always 'document' for document reference queries.","title":"Input Mode","type":"string"},"document_ref":{"$ref":"#/$defs/DocumentReference","description":"Reference to existing document's pre-computed features. The system fetches the document's feature vectors for the specified feature_uri and uses them directly without re-processing. Document must exist and have features for the specified feature_uri."}},"required":["document_ref"],"title":"DocumentQueryInput","type":"object"},"DocumentReference":{"description":"Reference to an existing document to use its pre-computed features.\n\nUse this to perform similarity search using a document's existing embeddings\nwithout re-processing the document. The system will fetch the document's\nfeature vectors and use them directly for the search.\n\nUse Cases:\n    - \"Find documents similar to this one\"\n    - Reverse image search using indexed images\n    - Document-to-document similarity\n    - Multi-hop similarity chains\n\nExamples:\n    Find similar documents:\n        ```json\n        {\n            \"collection_id\": \"col_abc123\",\n            \"document_id\": \"doc_xyz789\"\n        }\n        ```","examples":[{"collection_id":"col_products","description":"Reference to a product document","document_id":"doc_product_123"},{"collection_id":"col_media","description":"Reference to an image in media collection","document_id":"doc_img_sunset"}],"properties":{"collection_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Collection ID containing the reference document. Can be the same or different from the target search collection. Must be accessible within the current namespace. None values are handled by on_empty behavior (skip/random/error).","examples":["col_abc123","col_products"],"title":"Collection Id"},"document_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Document ID to use as similarity reference. The document must exist and have feature vectors for the specified feature_uri. If the document doesn't have the required feature, the search will fail. None values are handled by on_empty behavior (skip/random/error).","examples":["doc_xyz789","doc_image_001"],"title":"Document Id"}},"title":"DocumentReference","type":"object"},"FacetFieldConfig":{"description":"Configuration for a single facet field.\n\nFaceting counts unique values for a field, enabling faceted search interfaces\n(e.g., \"Show 45 results in 'Sports', 23 in 'Music'\").\n\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ IMPORTANT: INDEX REQUIREMENT                                                │\n│                                                                             │\n│ The field MUST have a keyword index in MVS for faceting to work.         │\n│ Fields are automatically indexed during collection creation for common      │\n│ metadata fields. For custom fields, ensure indexing is configured.          │\n│                                                                             │\n│ Without an index, faceting will fail with an error.                         │\n└─────────────────────────────────────────────────────────────────────────────┘\n\nExample:\n    ```json\n    {\n        \"key\": \"metadata.category\",\n        \"limit\": 10,\n        \"exact\": false\n    }\n    ```","examples":[{"description":"Category facet with default settings","key":"metadata.category"},{"description":"File type facet with more results","key":"metadata.file_type","limit":20},{"description":"Author facet with exact counts","exact":true,"key":"metadata.author","limit":50}],"properties":{"key":{"description":"REQUIRED. Field path to facet on (e.g., 'metadata.category', 'status'). Supports nested fields using dot notation. The field MUST have a keyword index in MVS - faceting will fail without it. Common indexed fields: metadata.*, status, collection_id, source_object_id.","examples":["metadata.category","metadata.file_type","status","metadata.author"],"title":"Key","type":"string"},"limit":{"default":10,"description":"OPTIONAL. Maximum number of unique values to return (default: 10). Results are sorted by count descending, then by value ascending. Higher limits increase response size but provide more comprehensive facets. Common values: 5 (compact), 10 (standard), 20-50 (detailed).","examples":[5,10,20,50],"maximum":1000,"minimum":1,"title":"Limit","type":"integer"},"exact":{"default":false,"description":"OPTIONAL. Use exact counts instead of approximate (default: False). - False (default): Fast approximate counts, suitable for most UIs.   Approximate counts are accurate enough for display purposes. - True: Slower but precise counts, useful for debugging or analytics.   Use when exact numbers matter (e.g., reports, dashboards).","examples":[false,true],"title":"Exact","type":"boolean"}},"required":["key"],"title":"FacetFieldConfig","type":"object"},"FeatureSearchConfig":{"description":"Configuration for a single feature search within the feature_filter stage.\n\nEach feature search specifies:\n- Which feature URI to search (embedding index)\n- What input to search with (text, URL, or base64)\n- Search parameters (top_k, score threshold)\n- Optional weight for fusion\n\nMultiple feature searches are combined using the stage's fusion strategy.\n\nExamples:\n    Text semantic search:\n        ```json\n        {\n            \"feature_uri\": \"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1\",\n            \"query\": {\"input_mode\": \"text\", \"value\": \"Hello world!\"},\n            \"top_k\": 100\n        }\n        ```\n\n    Image visual search with URL (auto-detected):\n        ```json\n        {\n            \"feature_uri\": \"mixpeek://clip_extractor@v1/image_embedding\",\n            \"query\": {\"input_mode\": \"content\", \"value\": \"{{INPUT.image_url}}\"},\n            \"top_k\": 50\n        }\n        ```\n\n    Video search with query preprocessing:\n        ```json\n        {\n            \"feature_uri\": \"mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding\",\n            \"query\": {\"input_mode\": \"content\", \"value\": \"{{INPUT.video}}\"},\n            \"query_preprocessing\": {\n                \"params\": {\"split_method\": \"time\", \"time_split_interval\": 10},\n                \"max_chunks\": 20,\n                \"aggregation\": \"rrf\"\n            },\n            \"top_k\": 100\n        }\n        ```","examples":[{"description":"Text embedding search with score threshold","feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","min_score":0.7,"query":{"input_mode":"text","value":"{{INPUT.user_query}}"},"top_k":100,"weight":1.0},{"description":"Image embedding search from URL","feature_uri":"mixpeek://clip_extractor@v1/image_embedding","query":{"input_mode":"content","value":"{{INPUT.image_url}}"},"top_k":50,"weight":0.6},{"description":"Video embedding search from base64","feature_uri":"mixpeek://multimodal_extractor@v1/video_embedding","min_score":0.6,"query":{"input_mode":"content","value":"{{INPUT.video_base64}}"},"top_k":75},{"description":"Optional text search (skip if empty, for multi-modal)","feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","on_empty":"skip","query":{"input_mode":"text","value":"{{INPUT.text_query}}"},"top_k":100},{"description":"Optional image search (skip if empty, for multi-modal)","feature_uri":"mixpeek://clip_extractor@v1/image_embedding","on_empty":"skip","query":{"input_mode":"content","value":"{{INPUT.image_url}}"},"top_k":100},{"description":"Single search with random fallback (always returns results)","feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","on_empty":"random","query":{"input_mode":"text","value":"{{INPUT.optional_query}}"},"top_k":50},{"collection_identifiers":["col_products","col_reviews"],"description":"Per-search collection targeting (hybrid search across different collections)","feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":100,"weight":0.6}],"properties":{"feature_uri":{"description":"REQUIRED. Feature URI specifying which embedding index to search. Format: 'mixpeek://extractor@version/output' or 'namespace://collection/feature'. Must reference a valid dense vector index in the collection. The feature must exist and be indexed for all documents in the collection.","examples":["mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","mixpeek://clip@v1/image_embedding","mixpeek://multimodal@v1/embedding","namespace://products/description_embedding"],"minLength":1,"title":"Feature Uri","type":"string"},"query":{"description":"REQUIRED. Input for this feature search. Can be plain text, URL with auto-detection, or base64 content. For text features: Use text mode. For image/video features: Use content mode with URL or base64. Supports template variables: {{INPUT.field_name}}, {{DOC.field_name}}, etc.","discriminator":{"mapping":{"content":"#/$defs/ContentQueryInput","document":"#/$defs/DocumentQueryInput","multi_content":"#/$defs/MultiContentQueryInput","text":"#/$defs/TextQueryInput","vector":"#/$defs/VectorQueryInput"},"propertyName":"input_mode"},"oneOf":[{"$ref":"#/$defs/TextQueryInput"},{"$ref":"#/$defs/ContentQueryInput"},{"$ref":"#/$defs/DocumentQueryInput"},{"$ref":"#/$defs/VectorQueryInput"},{"$ref":"#/$defs/MultiContentQueryInput"}],"title":"Query"},"top_k":{"default":100,"description":"OPTIONAL. Number of results to fetch for this specific feature search. Defaults to 100. This is the per-feature top_k, independent of the final_top_k parameter. Higher values: More comprehensive but slower. Lower values: Faster but may miss relevant results. MVS will fetch this many results before fusion.","examples":[50,100,200],"maximum":1000,"minimum":1,"title":"Top K","type":"integer"},"min_score":{"anyOf":[{"maximum":1.0,"minimum":0.0,"type":"number"},{"type":"null"}],"default":null,"description":"OPTIONAL. Minimum similarity score threshold for this feature search. NOT REQUIRED - if not specified, no score filtering is applied. Filters out results below this threshold BEFORE fusion. Typical values: 0.5-0.8 depending on model and use case. Lower threshold: More recall, less precision. Higher threshold: More precision, less recall.","examples":[0.5,0.7,0.8],"title":"Min Score"},"weight":{"default":1.0,"description":"OPTIONAL. Weight for this feature search when using 'weighted' fusion. Defaults to 1.0 (equal weight). Ignored for 'rrf' and 'max' fusion strategies. Sum of all feature weights should typically equal 1.0 for normalized scores. Higher weight: More influence on final ranking. Example: Text=0.7, Image=0.3 for text-heavy search.","examples":[0.3,0.5,0.7,1.0],"maximum":1.0,"minimum":0.0,"title":"Weight","type":"number"},"lexical":{"default":false,"description":"OPTIONAL. If true, this search is a LEXICAL (BM25) search: the query text is matched against the full-text index instead of being embedded into a vector. Combine with a dense (vector) search under 'rrf' fusion for dense+lexical hybrid retrieval — BM25 catches exact tokens (brand names, prices like $9.99, promo codes, CTAs) that dense embeddings miss. The query must be text (input_mode='text'); feature_uri is used only for collection scoping (not a vector index). NOTE: BM25 matches across ALL indexed string payload fields, not a single field.","title":"Lexical","type":"boolean"},"optional":{"default":false,"description":"OPTIONAL. If true, this search is skipped (excluded from fusion) when its target feature's vector index is known to be empty (0 vectors indexed), allowing the other searches to carry the results — and a warning is emitted explaining the skip. This is distinct from `on_empty`, which governs an empty QUERY INPUT; `optional` governs an empty TARGET INDEX. Defaults to false (today's behavior: a required search is never skipped, even if its index is empty — it simply returns 0 candidates). If ALL searches are skipped, the request still errors. Use for graceful degradation in hybrid/multi-feature search where one index may still be building.","title":"Optional","type":"boolean"},"on_empty":{"$ref":"#/$defs/OnEmptyBehavior","default":"error","description":"OPTIONAL. Behavior when input is empty after template resolution. Defaults to 'error' (fail if input missing). \n\n┌─────────┬────────────────────────────────────────────────────────────┐\n│ Value   │ Behavior                                                   │\n├─────────┼────────────────────────────────────────────────────────────┤\n│ error   │ Fail with error (input is required) - DEFAULT              │\n│ skip    │ Exclude from fusion (let other searches drive results)     │\n│ random  │ Use random vector (always return results)                  │\n└─────────┴────────────────────────────────────────────────────────────┘\n\n\n'error' (default): Strict mode - fail fast if input is missing. Use when input is required and missing input indicates a bug. \n\n'skip': Graceful degradation - exclude this search from fusion. Use for multi-modal search where user may provide text OR image OR both. If all searches skip (all inputs empty), returns error. \n\n'random': Always return results - use random vector as fallback. Use for single-feature optional search where you always want results.","examples":["error","skip","random"]},"query_preprocessing":{"anyOf":[{"$ref":"#/$defs/QueryPreprocessing"},{"type":"null"}],"default":null,"description":"OPTIONAL. Enable query preprocessing for large file inputs. When set, the query input (video, PDF, long text) is decomposed into chunks using the same extractor pipeline that indexed the data, N parallel searches are run, and results are fused. Only applicable for content-mode queries (URLs or base64)."},"collection_identifiers":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"OPTIONAL. Collection identifiers to search for this specific feature search. Can be collection IDs or names. Enables per-search collection targeting for hybrid/multi-feature searches. \n\nFallback Priority (most specific wins):\n1. This field (per-search targeting) - most specific\n2. Stage-level collection_identifiers\n3. Retriever-level collection_identifiers\n\n\nUse Cases:\n- Hybrid search where different features exist in different collections\n- Text embeddings in col_products, image embeddings in col_media\n- Fine-grained collection targeting per feature URI\n\n\nNote: All collections (across all searches) must be declared in the retriever's collection_identifiers at creation time for validation.","examples":[["col_products"],["col_media","col_archive"]],"title":"Collection Identifiers"}},"required":["feature_uri","query"],"title":"FeatureSearchConfig","type":"object"},"FeatureSearchGroupBy":{"description":"Database-level grouping for feature search (uses MVS query_points_groups).\n\nEnables efficient grouping at the database level rather than in-memory post-processing.\nPerfect for decompose/recompose patterns where you search chunks but want to return\nparent documents.\n\nThis mirrors the output_mode behavior of the group_by REDUCE stage for API consistency,\nbut executes at the database level for better performance on large result sets.\n\nStage Category: FILTER (grouping is part of the MVS query)\n\nPerformance: Database-level grouping is significantly faster than fetching all results\nand grouping in memory. MVS handles the grouping natively.\n\nUse Cases:\n    - Decompose/recompose: Search 500 text chunks, return top 25 unique documents\n    - Deduplication: One best result per product_id\n    - Scene → Video grouping: Search video frames, return parent videos\n\nExamples:\n    Deduplication (one result per video):\n        ```json\n        {\"field\": \"video_id\", \"max_per_group\": 1, \"output_mode\": \"first\"}\n        ```\n\n    Top 3 chunks per document:\n        ```json\n        {\"field\": \"source_object_id\", \"max_per_group\": 3, \"output_mode\": \"all\"}\n        ```\n\n    Flatten grouped results:\n        ```json\n        {\"field\": \"category\", \"max_per_group\": 5, \"output_mode\": \"flatten\"}\n        ```","examples":[{"description":"Deduplication: one best result per video","field":"video_id","max_per_group":1,"output_mode":"first"},{"description":"Top 3 chunks per parent document","field":"source_object_id","limit":25,"max_per_group":3,"output_mode":"all"},{"description":"Flatten results grouped by category","field":"metadata.category","max_per_group":5,"output_mode":"flatten"}],"properties":{"field":{"description":"REQUIRED. Field path to group documents by using dot notation. Documents with the same field value are grouped together. Common fields: 'source_object_id' (parent object from decomposition), 'video_id' (media grouping), 'product_id' (e-commerce), 'metadata.category' (nested categorical field). The field must exist in the document payload. IMPORTANT: This field MUST have a keyword payload index configured on the namespace. Without an index, grouping will fail with 'Index required but not found'. Create indexes via PATCH /v1/namespaces/{id} with payload_indexes.","examples":["source_object_id","video_id","product_id","metadata.category","metadata.author_id"],"title":"Field","type":"string"},"max_per_group":{"default":1,"description":"OPTIONAL. Maximum number of documents to keep per group (MVS's group_size). Documents are sorted by score (highest first) before limiting. Default: 1 (deduplication - keeps only highest scoring doc per group). Use 1 for deduplication, 3-5 for preview results, 10+ for comprehensive results. Note: This is a best-effort parameter in MVS.","examples":[1,3,5,10],"maximum":100,"minimum":1,"title":"Max Per Group","type":"integer"},"limit":{"default":25,"description":"OPTIONAL. Maximum number of groups to return. This overrides final_top_k when grouping is enabled. Default: 25 groups.","examples":[10,25,50,100],"maximum":500,"minimum":1,"title":"Limit","type":"integer"},"output_mode":{"default":"all","description":"OPTIONAL. Controls what documents are returned per group. Mirrors the group_by REDUCE stage for API consistency. 'first': Return only the top document per group (deduplication, fastest).          Use for: unique results per group (e.g., one video per brand). 'all': Return all documents grouped by field (default, shows full context).        Use for: showing chunks within each parent object. 'flatten': Return all documents as flat list (loses group structure).            Use for: need all docs but don't care about grouping metadata. Default: 'all'.","enum":["first","all","flatten"],"examples":["first","all","flatten"],"title":"Output Mode","type":"string"}},"required":["field"],"title":"FeatureSearchGroupBy","type":"object"},"FusionStrategy":{"description":"Score fusion strategies for combining multiple feature searches.\n\n┌──────────┬──────────────┬─────────────────────────────────────────────────────┐\n│ Strategy │ MVS Native│ Description                                         │\n├──────────┼──────────────┼─────────────────────────────────────────────────────┤\n│ rrf      │ ✅ Yes       │ Rank-based fusion, robust default (recommended)     │\n│ dbsf     │ ✅ Yes       │ Score-based with statistical normalization          │\n│ weighted │ ❌ No        │ Manual score-weighted fusion with custom weights    │\n│ max      │ ❌ No        │ Take maximum score across features                  │\n│ learned  │ ❌ No        │ Bandit-learned weights, enables personalization     │\n└──────────┴──────────────┴─────────────────────────────────────────────────────┘\n\nPerformance Note:\n    - rrf/dbsf: Single MVS call (fastest)\n    - weighted/max/learned: Separate queries per feature, merged client-side\n\n═══════════════════════════════════════════════════════════════════════════════\nRRF (Reciprocal Rank Fusion) - RECOMMENDED DEFAULT\n═══════════════════════════════════════════════════════════════════════════════\n    Formula: score = Σ(1 / (k + rank_i))\n    - Rank-based: ignores raw scores, uses position only\n    - Robust to score scale differences between embedding models\n    - No tuning required, works well out-of-the-box\n    - MVS default k=2 (configurable)\n    - Best for: General-purpose multi-vector search\n\n═══════════════════════════════════════════════════════════════════════════════\nDBSF (Distribution-Based Score Fusion)\n═══════════════════════════════════════════════════════════════════════════════\n    Formula: Normalizes scores using mean ± 3σ, then sums\n    - Score-based with statistical normalization\n    - Handles different score distributions across features\n    - Available since MVS v1.11\n    - Best for: When score magnitudes matter but scales differ\n\n═══════════════════════════════════════════════════════════════════════════════\nWEIGHTED (Manual Weight Fusion)\n═══════════════════════════════════════════════════════════════════════════════\n    Formula: score = Σ(weight_i * score_i)\n    - User specifies weight per feature search\n    - Weights defined in each FeatureSearchConfig.weight field\n    - Runs N separate queries, merges results client-side\n    - Best for: When you know relative feature importance upfront\n\n═══════════════════════════════════════════════════════════════════════════════\nMAX (Maximum Score Fusion)\n═══════════════════════════════════════════════════════════════════════════════\n    Formula: score = max(score_1, score_2, ..., score_n)\n    - Takes best match from any single feature\n    - Good when any modality match is sufficient\n    - Runs N separate queries, merges results client-side\n    - Best for: \"OR\" semantics (match text OR image OR audio)\n\n═══════════════════════════════════════════════════════════════════════════════\nLEARNED (Bandit-Learned Fusion)\n═══════════════════════════════════════════════════════════════════════════════\n    Formula: score = Σ(learned_weight_i * score_i)\n    - Learns optimal weights from user feedback (clicks, purchases, etc.)\n    - Uses Thompson Sampling (Beta-Bernoulli bandit) by default\n    - Supports personalization via context (user_id, segment, etc.)\n    - Cold start handling with hierarchical fallback\n    - REQUIRES learning_config to be specified\n    - Runs N separate queries to get per-feature scores for learning\n    - Best for: Personalized search, learning feature importance over time\n\nExample Usage:\n    ```python\n    # Fast, no tuning needed (recommended)\n    fusion = FusionStrategy.RRF\n\n    # Score-aware with normalization\n    fusion = FusionStrategy.DBSF\n\n    # Manual weights (0.7 text, 0.3 image)\n    fusion = FusionStrategy.WEIGHTED\n    searches = [\n        {\"feature_uri\": \"...\", \"weight\": 0.7, ...},\n        {\"feature_uri\": \"...\", \"weight\": 0.3, ...},\n    ]\n\n    # Best match from any feature\n    fusion = FusionStrategy.MAX\n\n    # Personalized, learns from feedback\n    fusion = FusionStrategy.LEARNED\n    learning_config = LearnedFusionConfig(\n        context_features=[\"INPUT.user_id\"],\n        reward_signal=\"click\"\n    )\n    ```","enum":["rrf","dbsf","weighted","max","learned"],"title":"FusionStrategy","type":"string"},"LearnedFusionConfig":{"description":"Configuration for learned fusion with bandit-based weight optimization.\n\nEnables personalized feature weighting by learning from user interactions.\nThe system learns which features (text, image, audio, etc.) matter most\nfor different users or contexts.\n\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ HOW LEARNED FUSION WORKS                                                     │\n├─────────────────────────────────────────────────────────────────────────────┤\n│ 1. Query arrives with context (user_id, segment, etc.)                      │\n│ 2. Look up bandit state for this context                                    │\n│ 3. Sample feature weights from Beta distributions                           │\n│ 4. Execute separate queries per feature with learned weights                │\n│ 5. Fuse results using sampled weights                                       │\n│ 6. On feedback (click), update bandit for relevant features                 │\n└─────────────────────────────────────────────────────────────────────────────┘\n\nCold Start Handling:\n    - NEW users: Uses demographic context (user_segment, device_type)\n    - Returning users: Uses personal context after min_interactions\n    - Fallback: Global context as ultimate fallback\n\nRequirements:\n    - Interactions API must be called with feedback (clicks, etc.)\n    - Context features should be passed in query inputs\n\nExample:\n    ```python\n    LearnedFusionConfig(\n        algorithm=LearningAlgorithm.THOMPSON_SAMPLING,\n        context_features=[\"INPUT.user_id\"],\n        demographic_features=[\"INPUT.user_segment\", \"INPUT.device_type\"],\n        reward_signal=\"click\",\n        min_interactions=5,\n    )\n    ```","examples":[{"description":"No config needed - uses sensible defaults. Per-user personalization, click signal.","title":"Zero-Config (Recommended Start)","value":{}},{"description":"Learn optimal feature weights per user. After 5 interactions, uses personal weights.","title":"Per-User Personalization","value":{"context_features":["INPUT.user_id"],"reward_signal":"click"}},{"description":"Optimize for purchases. New users get segment-based weights, returning users get personal.","title":"E-Commerce (Purchase-Optimized)","value":{"context_features":["INPUT.user_id"],"demographic_features":["INPUT.user_segment","INPUT.device_type"],"min_interactions":3,"reward_map":{"add_to_cart":2.0,"purchase":3.0}}},{"description":"Learn per-session preferences. Higher exploration for content diversity.","title":"Content Discovery (Session-Based)","value":{"context_features":["INPUT.session_id"],"exploration_bonus":1.5,"reward_map":{"click":1.0,"long_view":1.5}}},{"description":"Stick with what works. Lower exploration, trust known winners.","title":"Enterprise Search (Conservative)","value":{"context_features":["INPUT.user_id","INPUT.department"],"exploration_bonus":0.5,"min_interactions":10,"reward_map":{"click":1.0,"view":0.5}}},{"description":"No personalization - same weights for all users. Good for establishing baseline.","title":"Global Learning (A/B Test Baseline)","value":{"context_features":[],"fallback_strategy":"global","reward_signal":"click"}}],"properties":{"algorithm":{"$ref":"#/$defs/LearningAlgorithm","default":"thompson_sampling","description":"Learning algorithm for weight optimization. THOMPSON_SAMPLING (default): Beta-Bernoulli bandit with natural exploration. Works immediately, no tuning needed, best for most use cases."},"context_features":{"default":["INPUT.user_id"],"description":"Template variables for personal context (e.g., ['INPUT.user_id']). Default is ['INPUT.user_id'] for per-user personalization. Set to empty list for global learning (same weights for all users). Personal context is used after user has min_interactions. Supports: INPUT.*, CONTEXT.*, etc.","examples":[["INPUT.user_id"],["INPUT.user_id","INPUT.session_type"]],"items":{"type":"string"},"title":"Context Features","type":"array"},"demographic_features":{"description":"Template variables for demographic fallback context. Used for NEW users with < min_interactions. Enables cold start by learning from similar user segments. Examples: INPUT.user_segment, INPUT.device_type, INPUT.country.","examples":[["INPUT.user_segment","INPUT.device_type"],["INPUT.country","INPUT.language"]],"items":{"type":"string"},"title":"Demographic Features","type":"array"},"fallback_strategy":{"default":"hierarchical","description":"Strategy when user lacks sufficient history. 'hierarchical': Try personal → demographic → global (recommended). 'global': Skip demographic, fall back directly to global.","enum":["hierarchical","global"],"title":"Fallback Strategy","type":"string"},"min_interactions":{"default":5,"description":"Minimum interactions before using personal context. Below this threshold, uses demographic or global context. Prevents overfitting to small samples. Typical range: 3-10.","maximum":100,"minimum":1,"title":"Min Interactions","type":"integer"},"reward_signal":{"default":"click","description":"SINGLE interaction type that counts as positive reward (legacy / backward-compat fallback). To weight MULTIPLE interaction types — or give them different magnitudes — use `reward_map`, which takes precedence whenever an interaction has a computed reward_value (the normal path). Comma-separated values are NOT supported here.","examples":["click","purchase","positive_feedback"],"title":"Reward Signal","type":"string"},"exploration_bonus":{"default":1.0,"description":"Controls exploration vs exploitation balance. 1.0 = balanced (default). Higher = more exploration (try diverse weights). Lower = more exploitation (use known winners). Typical range: 0.5-2.0.","maximum":10.0,"minimum":0.1,"title":"Exploration Bonus","type":"number"},"prior_alpha":{"default":1.0,"description":"Beta distribution prior for positive feedback (α). 1.0 = uniform prior (no initial bias). Higher = initial belief that features are effective.","minimum":0.1,"title":"Prior Alpha","type":"number"},"prior_beta":{"default":1.0,"description":"Beta distribution prior for negative feedback (β). 1.0 = uniform prior (no initial bias). Higher = initial belief that features are ineffective.","minimum":0.1,"title":"Prior Beta","type":"number"},"reward_map":{"additionalProperties":{"type":"number"},"description":"Maps interaction types to reward magnitudes. Positive = reward, negative = penalty. Takes precedence over reward_signal when provided.","title":"Reward Map","type":"object"},"decay_factor":{"default":0.995,"description":"Per-day exponential decay for older interactions. 1.0 = no decay.","maximum":1.0,"minimum":0.9,"title":"Decay Factor","type":"number"},"decay_window_days":{"default":365,"description":"Interactions older than this many days are ignored entirely.","maximum":3650,"minimum":1,"title":"Decay Window Days","type":"integer"},"min_weight":{"default":0.05,"description":"Floor weight — no feature can drop below this after sampling.","maximum":0.5,"minimum":0.0,"title":"Min Weight","type":"number"},"max_weight":{"default":0.95,"description":"Ceiling weight — no feature can exceed this after sampling.","maximum":1.0,"minimum":0.5,"title":"Max Weight","type":"number"},"exploration_decay":{"default":0.99,"description":"Per-interaction decay applied to exploration_bonus. Reduces exploration as confidence grows.","maximum":1.0,"minimum":0.9,"title":"Exploration Decay","type":"number"},"exploration_floor":{"default":0.1,"description":"Exploration bonus will not decay below this floor.","maximum":5.0,"minimum":0.0,"title":"Exploration Floor","type":"number"},"max_reward_per_interaction":{"default":5.0,"description":"Maximum absolute reward value from a single interaction. Prevents outlier domination.","maximum":100.0,"minimum":0.1,"title":"Max Reward Per Interaction","type":"number"},"rollout_pct":{"default":100.0,"description":"Percentage of requests that use learned weights. Others get static fusion. Deterministic bucketing by user_id.","maximum":100.0,"minimum":0.0,"title":"Rollout Pct","type":"number"},"shadow_mode":{"default":false,"description":"When true, compute learned fusion weights for logging but serve static fusion results.","title":"Shadow Mode","type":"boolean"},"circuit_breaker_timeout_ms":{"default":1000,"description":"Max time (ms) to resolve learned weights before the circuit breaker trips and the request falls back to static fusion. Default 1000ms. Lower it to fail fast under latency pressure; raise it if cold ClickHouse queries legitimately need more time.","maximum":30000,"minimum":1,"title":"Circuit Breaker Timeout Ms","type":"integer"}},"title":"LearnedFusionConfig","type":"object"},"LearningAlgorithm":{"description":"Algorithms for learning feature fusion weights from feedback.\n\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ ALGORITHM COMPARISON                                                         │\n├───────────────────┬─────────────────────────────────────────────────────────┤\n│ Algorithm         │ Description                                             │\n├───────────────────┼─────────────────────────────────────────────────────────┤\n│ thompson_sampling │ Beta-Bernoulli bandit, natural exploration, no tuning  │\n│ epsilon_greedy    │ (Future) Simple ε-greedy exploration                   │\n│ ucb               │ (Future) Upper Confidence Bound with guarantees        │\n└───────────────────┴─────────────────────────────────────────────────────────┘\n\nTHOMPSON_SAMPLING (Recommended):\n    - Beta-Bernoulli bandit with probabilistic exploration\n    - Works immediately with uniform priors (α=1, β=1)\n    - No hyperparameter tuning required\n    - Natural exploration/exploitation balance via sampling\n    - Converges to optimal weights as feedback accumulates\n    - Best for: Binary feedback (click/no-click), most use cases\n\nHow it works:\n    1. Each feature has Beta(α, β) distribution\n    2. Sample weight from each distribution before search\n    3. Apply sampled weights to fusion\n    4. On positive feedback (click): increment α for strong features\n    5. On no feedback: increment β for features in shown docs\n    6. Distribution shifts toward effective features over time","enum":["thompson_sampling","epsilon_greedy","ucb"],"title":"LearningAlgorithm","type":"string"},"MultiContentQueryInput":{"description":"Multi-file query input for feature indexes that support multi-content embedding.\n\nAccepts a list of URLs and/or text strings and embeds them together in a single\nAPI call to produce ONE query vector. Only valid when the feature URI's vector\nindex has ``supports_multi_query=True`` (e.g., gemini_multifile_extractor).\n\nAt execution time the system validates that the target feature URI supports\nmulti-content queries and raises an error if it does not.\n\nUse Cases:\n    - Object-level search: query with multiple files that describe the same item\n      (e.g., product image + spec sheet + description)\n    - Cross-modal similarity: find objects similar to a combination of inputs\n    - Reverse lookup: find objects like \"this image AND this text together\"\n\nExamples:\n    URLs and text together:\n        ```json\n        {\n            \"input_mode\": \"multi_content\",\n            \"values\": [\n                \"https://example.com/product.jpg\",\n                \"s3://bucket/spec.pdf\",\n                \"Lightweight carbon-fiber trail running shoe\"\n            ]\n        }\n        ```\n\n    Template-based (values resolved from retriever inputs):\n        ```json\n        {\n            \"input_mode\": \"multi_content\",\n            \"values\": [\"{{INPUT.image_url}}\", \"{{INPUT.description}}\"]\n        }\n        ```","examples":[{"input_mode":"multi_content","values":["https://example.com/product.jpg","s3://bucket/spec.pdf","Lightweight carbon-fiber trail running shoe"]},{"input_mode":"multi_content","values":["{{INPUT.image_url}}","{{INPUT.description}}"]}],"properties":{"input_mode":{"const":"multi_content","default":"multi_content","description":"Discriminator field. Always 'multi_content' for multi-file queries.","title":"Input Mode","type":"string"},"values":{"description":"List of content items to embed together. Each item is either: (1) a URL (http://, https://, s3://) — fetched and embedded as a file, or (2) a plain text string — embedded as text. Supports template variables: '{{INPUT.field_name}}'. All items are passed to the underlying model in one API call, producing a single query embedding. Only valid for feature URIs whose vector index has supports_multi_query=True.","items":{"type":"string"},"minItems":1,"title":"Values","type":"array"}},"required":["values"],"title":"MultiContentQueryInput","type":"object"},"OnEmptyBehavior":{"description":"Behavior when a feature search input is empty or missing.\n\nControls what happens when the resolved input (text, URL, or base64) is empty,\nNone, or evaluates to an empty string after template resolution.\n\n┌─────────┬────────────────────────────────────────────────────────────────────┐\n│ Value   │ Behavior                                                           │\n├─────────┼────────────────────────────────────────────────────────────────────┤\n│ error   │ Fail with validation error (input is required)                     │\n│ skip    │ Exclude this search from fusion (let other searches drive results) │\n│ random  │ Use random vector to return results (for optional single-search)   │\n└─────────┴────────────────────────────────────────────────────────────────────┘\n\nUse Cases:\n    - error (default): Strict mode - fail fast if required input is missing.\n      Best for: Production pipelines where missing input indicates a bug.\n\n    - skip: Graceful degradation - exclude from fusion if input missing.\n      Best for: Multi-modal search where user may provide text OR image OR both.\n      If text is empty but image is provided, only image search runs.\n\n    - random: Always return results - use random vector as fallback.\n      Best for: Single-feature search where you always want results returned,\n      even if the input is empty (e.g., \"show me anything\" behavior).\n\nExamples:\n    Multi-modal with optional inputs (skip empty):\n        ```json\n        {\n            \"searches\": [\n                {\"feature_uri\": \"text-embed\", \"on_empty\": \"skip\", ...},\n                {\"feature_uri\": \"image-embed\", \"on_empty\": \"skip\", ...}\n            ]\n        }\n        ```\n        - Text only provided → text search runs\n        - Image only provided → image search runs\n        - Both provided → fusion of both\n        - Neither provided → error (no searches to run)\n\n    Single search, always return results:\n        ```json\n        {\n            \"searches\": [\n                {\"feature_uri\": \"text-embed\", \"on_empty\": \"random\", ...}\n            ]\n        }\n        ```\n        - Input provided → normal search\n        - Input empty → random results from collection","enum":["error","skip","random"],"title":"OnEmptyBehavior","type":"string"},"QueryPreprocessing":{"description":"Configuration for query preprocessing — large file decomposition at query time.\n\nWhen a query input is a large file (video, PDF, long text), preprocessing\ndecomposes it using the same extractor pipeline that indexed the data,\ngenerates N embeddings (one per chunk), runs N parallel searches, and\nfuses the results into a single ranked list.\n\nThis is \"ingestion applied to the query\" — same decomposition and embedding,\nbut vectors are used for search instead of storage.","properties":{"feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Feature URI for the extractor pipeline to use for decomposition. If None, inherits from the parent search's feature_uri.","title":"Feature Uri"},"params":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Extractor-specific parameter overrides. Same params as ingestion: split_method, time_split_interval, chunk_size, chunk_overlap, etc.","title":"Params"},"max_chunks":{"default":20,"description":"Maximum number of chunks to search with. Caps parallel queries and embedding calls to control cost. Chunks are evenly sampled across the file if the extractor produces more than max_chunks.","maximum":500,"minimum":1,"title":"Max Chunks","type":"integer"},"aggregation":{"default":"rrf","description":"Fusion strategy for combining results from N chunk queries. 'rrf': Reciprocal Rank Fusion (balanced, recommended). 'max': Keep highest score per document (best for 'find this exact moment'). 'avg': Average scores (best for 'find similar overall content').","title":"Aggregation","type":"string"},"dedup_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Optional payload field to deduplicate results by. E.g., '_internal.document_id' to collapse chunks from the same parent document.","title":"Dedup Field"}},"title":"QueryPreprocessing","type":"object"},"StageCacheBehavior":{"description":"Cache behavior modes for retriever stages.\n\nControls internal caching of stage operations for performance optimization.\nAll modes are safe and automatic with LRU eviction - no manual cache management needed.\n\nValues:\n    AUTO: Smart automatic caching (default, recommended)\n    DISABLED: Skip internal caching completely\n    AGGRESSIVE: Cache even non-deterministic operations (use with caution)\n\nCache Architecture:\n    - Redis with LRU eviction policy (memory-bounded)\n    - Namespace-isolated per organization (multi-tenant safe)\n    - Stage-specific keyspaces prevent conflicts\n    - Cache keys hash (stage_name, inputs, parameters)\n    - Automatic invalidation on parameter changes\n\nPerformance Impact:\n    - AUTO: 50-90% latency reduction for repeated operations\n    - Cache lookup overhead: <5ms\n    - Hit rates: Typically 60-80% in production\n\nWhen to Use Each Mode:\n    AUTO (default):\n        - Deterministic transformations (parsing, formatting, reshaping)\n        - Stable external API calls (embeddings, standard inference)\n        - Operations without side effects\n        - Most use cases - this is the recommended default\n\n    DISABLED:\n        - Templates with now(), random(), or time-sensitive functions\n        - External APIs that must be called every time (real-time data)\n        - Operations with side effects\n        - Rapidly changing data where caching would serve stale results\n\n    AGGRESSIVE:\n        - When you fully understand caching implications\n        - For debugging or testing cache behavior\n        - Only use if you know cache invalidation is handled elsewhere\n        - Generally not recommended for production\n\nExamples:\n    Basic usage (auto mode, no config needed):\n        {\"cache_behavior\": \"auto\"}  # or omit - this is the default\n\n    Disable for time-sensitive operations:\n        {\"cache_behavior\": \"disabled\"}  # Template has {{now()}}\n\n    With custom TTL:\n        {\"cache_behavior\": \"auto\", \"cache_ttl_seconds\": 300}","enum":["auto","disabled","aggressive"],"title":"StageCacheBehavior","type":"string"},"TextQueryInput":{"description":"Text-based query input for text embedding search.\n\nPlain text is always treated as literal text, even if it looks like a URL.\nPerfect for searching text that happens to contain URLs or special characters.\n\nUse Cases:\n    - Semantic text search\n    - Question answering\n    - Document search by description\n    - Template-based text queries\n\nExamples:\n    Simple text:\n        ```json\n        {\"input_mode\": \"text\", \"value\": \"machine learning best practices\"}\n        ```\n\n    Template-based:\n        ```json\n        {\"input_mode\": \"text\", \"value\": \"{{INPUT.user_query}}\"}\n        ```\n\n    Legacy syntax (backward compatible):\n        ```json\n        {\"input_mode\": \"text\", \"text\": \"machine learning\"}\n        ```","examples":[{"description":"Simple text query","input_mode":"text","value":"machine learning best practices"},{"description":"Template-based text query","input_mode":"text","value":"{{INPUT.user_query}}"}],"properties":{"input_mode":{"const":"text","default":"text","description":"Discriminator field. Always 'text' for text-based queries.","title":"Input Mode","type":"string"},"value":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Plain text query string (RECOMMENDED). Always treated as literal text for embedding, even if it looks like a URL. Supports template variables: {{INPUT.field_name}}. Empty strings are allowed (creates zero vector for optional inputs).","examples":["machine learning","red sports car","{{INPUT.user_query}}"],"title":"Value"},"text":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Legacy text field (DEPRECATED - use 'value' instead). Supports template variables: {{INPUT.field_name}}.","examples":["machine learning","{{INPUT.query}}"],"title":"Text"}},"title":"TextQueryInput","type":"object"},"VectorQueryInput":{"description":"Raw embedding vector query input for direct vector similarity search.\n\nAccepts a pre-computed embedding vector and uses it directly for similarity\nsearch without any inference. Useful for programmatic use cases such as\ntaxonomy enrichment where embeddings are already available.\n\nUse Cases:\n    - Taxonomy enrichment (passing pre-computed document embeddings)\n    - Programmatic similarity search with known vectors\n    - Cross-collection matching with pre-extracted features\n    - ColBERT/multi-vector search with pre-computed token embeddings\n\nExamples:\n    Single dense vector:\n        ```json\n        {\"input_mode\": \"vector\", \"value\": [0.1, 0.2, 0.3, ...]}\n        ```\n\n    Multi-dense vector (ColBERT — list of token embeddings):\n        ```json\n        {\"input_mode\": \"vector\", \"value\": [[0.1, 0.2], [0.3, 0.4], ...]}\n        ```\n\n    Template-based (from taxonomy input mapping):\n        ```json\n        {\"input_mode\": \"vector\", \"value\": \"{{INPUT.query_image}}\"}\n        ```","examples":[{"description":"Raw embedding vector","input_mode":"vector","value":[0.1,0.2,0.3]},{"description":"ColBERT multi-vector (2 tokens × 3 dims)","input_mode":"vector","value":[[0.1,0.2,0.3],[0.4,0.5,0.6]]}],"properties":{"input_mode":{"const":"vector","default":"vector","description":"Discriminator field. Always 'vector' for raw embedding queries.","title":"Input Mode","type":"string"},"value":{"anyOf":[{"items":{"type":"number"},"type":"array"},{"items":{"items":{"type":"number"},"type":"array"},"type":"array"},{"type":"string"},{"type":"null"}],"default":null,"description":"Pre-computed embedding vector. Accepts: (1) list[float] for single dense vectors, (2) list[list[float]] for multi-dense vectors (ColBERT token embeddings), (3) a template string (e.g., '{{INPUT.query}}') that resolves to either format. No inference is performed; the vector is used directly for similarity search.","title":"Value"}},"title":"VectorQueryInput","type":"object"}},"description":"User configuration for the feature_filter stage.\n\n**Stage Category**: FILTER\n\n**Transformation**: 0 documents → ≤final_top_k documents\n\n**Purpose**: Unified stage for semantic and hybrid search across N feature URIs.\nPerforms vector similarity search on one or more embedding features and fuses\nresults using configurable strategies. Leverages MVS's native multi-vector\nsearch and fusion capabilities for optimal performance.\n\n**When to Use**:\n    - As the initial stage to find candidate documents from collections\n    - Single feature URI (N=1): Pure semantic/KNN search\n    - Multiple feature URIs (N>1): Hybrid/multimodal search with fusion\n    - Multimodal search: Text + image + video embeddings combined\n    - Lexical + semantic: Sparse + dense vectors for best-of-both\n    - Any combination of dense vector features\n    - **Filter-only mode (N=0)**: Pure attribute/text filtering without embeddings\n      by providing only pre_filters (uses MVS's native full-text search)\n\n**When NOT to Use**:\n    - For filtering in-memory results (use attribute_filter or llm_filter)\n    - For reordering results (use SORT stages)\n    - For enriching documents (use APPLY stages)\n\n**Operational Behavior**:\n    - **With searches**: Queries vector databases (MVS) for each feature URI\n      in parallel, generates embeddings via inference service, performs multi-vector\n      search with MVS native fusion, returns fused and scored results\n    - **Filter-only mode**: Uses MVS's scroll API with native filtering (no embeddings,\n      no vector search). Supports full-text search via TEXT operator and all other\n      filter operators. Returns documents matching filters without relevance scores.\n    - Moderate performance (depends on number of features and top_k)\n    - Output schema = Collection document schema (no schema changes)\n\n**Common Pipeline Position**: FILTER (this stage) → ENRICH → SORT → REDUCE\n\n**Configuration vs Execution**:\n    Configuration time (defining the retriever):\n    - Specify searches: List of features to search with parameters\n    - Use TEMPLATES for query inputs: {{INPUT.field_name}}\n    - Configure fusion strategy, weights, final_top_k\n\n    Execution time (running the retriever):\n    - Only pass input values (e.g., {\"user_query\": \"search text\"})\n    - System resolves templates using your input values\n    - Generates embeddings and performs searches automatically\n\nRequirements:\n    - searches: REQUIRED, at least one feature search configuration\n    - final_top_k: OPTIONAL, defaults to 25 results after fusion\n    - fusion: OPTIONAL, defaults to RRF for multi-feature searches\n    - cache_behavior: OPTIONAL, defaults to 'auto' for caching (inherited)\n\nUse Cases:\n    - Single-modal search: One feature URI (text OR image OR video)\n    - Multimodal search: Text + image + video features combined\n    - Hybrid search: Dense + sparse vectors for best-of-both\n    - Multi-index search: Search across multiple feature types simultaneously\n\nExample Stage Configuration (Multimodal):\n    ```json\n    {\n        \"stage_name\": \"multimodal_search\",\n        \"stage_type\": \"filter\",\n        \"config\": {\n            \"stage_id\": \"feature_search\",\n            \"parameters\": {\n                \"searches\": [\n                    {\n                        \"feature_uri\": \"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1\",\n                        \"query\": {\"input_mode\": \"text\", \"value\": \"{{INPUT.user_query}}\"},\n                        \"top_k\": 100,\n                        \"weight\": 0.6\n                    },\n                    {\n                        \"feature_uri\": \"mixpeek://clip_extractor@v1/image_embedding\",\n                        \"query\": {\"input_mode\": \"text\", \"value\": \"{{INPUT.user_query}}\"},\n                        \"top_k\": 50,\n                        \"weight\": 0.4\n                    }\n                ],\n                \"final_top_k\": 25,\n                \"fusion\": \"weighted\"\n            }\n        }\n    }\n    ```\n\nExample Execution Request:\n    ```json\n    {\n        \"inputs\": {\n            \"user_query\": \"red sports car with spoiler\"\n        }\n    }\n    ```","examples":[{"cache_behavior":"auto","description":"Single-modal text search (pure semantic/KNN)","final_top_k":25,"searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.user_query}}"},"top_k":100}]},{"description":"Image search by URL with score threshold","final_top_k":20,"searches":[{"feature_uri":"mixpeek://clip_extractor@v1/image_embedding","min_score":0.6,"query":{"input_mode":"content","value":"{{INPUT.reference_image}}"},"top_k":100}]},{"description":"Optional search with random fallback - always returns results","final_top_k":25,"searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","on_empty":"random","query":{"input_mode":"text","value":"{{INPUT.optional_query}}"},"top_k":100}],"use_case":"Browse/explore mode where empty query returns random content"},{"description":"Multi-modal with optional inputs (text OR image OR both)","final_top_k":25,"fusion":"rrf","searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","on_empty":"skip","query":{"input_mode":"text","value":"{{INPUT.text_query}}"},"top_k":100},{"feature_uri":"mixpeek://clip_extractor@v1/image_embedding","on_empty":"skip","query":{"input_mode":"content","value":"{{INPUT.image_url}}"},"top_k":100}],"use_case":"User can provide text, image, or both - search adapts automatically"},{"description":"Required text + optional image refinement","final_top_k":25,"fusion":"weighted","searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","on_empty":"error","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":100,"weight":0.6},{"feature_uri":"mixpeek://clip_extractor@v1/image_embedding","on_empty":"skip","query":{"input_mode":"content","value":"{{INPUT.optional_image}}"},"top_k":50,"weight":0.4}],"use_case":"Text search is required, image is optional enhancement"},{"description":"Multimodal search with RRF fusion (recommended default)","final_top_k":25,"fusion":"rrf","searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":100},{"feature_uri":"mixpeek://clip_extractor@v1/image_embedding","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":50}]},{"description":"Hybrid search with weighted fusion (text-heavy)","final_top_k":50,"fusion":"weighted","searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.search_text}}"},"top_k":100,"weight":0.7},{"feature_uri":"mixpeek://sparse_extractor@v1/lexical","query":{"input_mode":"text","value":"{{INPUT.search_text}}"},"top_k":50,"weight":0.3}]},{"description":"Learned fusion with personalization","final_top_k":25,"fusion":"learned","learning_config":{"context_features":["INPUT.user_id"],"demographic_features":["INPUT.user_segment"],"reward_signal":"click"},"searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":100},{"feature_uri":"mixpeek://clip_extractor@v1/image_embedding","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":50}]},{"description":"Search with database-level grouping (decompose/recompose)","final_top_k":50,"group_by":{"field":"source_object_id","limit":25,"max_per_group":3,"output_mode":"all"},"searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","min_score":0.5,"query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":200}],"use_case":"Search chunks but return parent documents"},{"description":"Search with faceted results for filter UI","facets":[{"key":"metadata.category","limit":10},{"key":"metadata.file_type","limit":5}],"final_top_k":25,"searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":100}]},{"description":"Filter-only mode: full-text search without embeddings","final_top_k":50,"searches":[],"use_case":"Pure attribute/text filtering via MVS's native filtering"},{"cache_behavior":"auto","collection_identifiers":["products","catalog"],"description":"Complete example with all key parameters","facets":[{"key":"metadata.category","limit":10}],"final_top_k":30,"fusion":"weighted","group_by":{"field":"product_id","max_per_group":1,"output_mode":"first"},"searches":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","min_score":0.5,"on_empty":"error","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":150,"weight":0.7},{"feature_uri":"mixpeek://clip_extractor@v1/image_embedding","min_score":0.4,"on_empty":"skip","query":{"input_mode":"content","value":"{{INPUT.image_url}}"},"top_k":100,"weight":0.3}]}],"properties":{"cache_behavior":{"$ref":"#/$defs/StageCacheBehavior","default":"auto","description":"Controls internal caching behavior for this stage. OPTIONAL - defaults to 'auto' for transparent performance. \n\n'auto' (default): Automatic caching for deterministic operations. Stage intelligently caches results based on inputs and parameters. Use for transformations, parsing, formatting, stable API calls. Cache invalidates automatically when parameters change. Recommended for 95% of use cases. \n\n'disabled': Skip all internal caching. Every execution runs fresh without cache lookup. Use for templates with now(), random(), or external APIs that must be called every time (real-time data). No performance benefit but guarantees fresh execution. \n\n'aggressive': Cache even non-deterministic operations. Use ONLY when you fully understand caching implications. May cache time-sensitive or random data. Generally not recommended - prefer 'auto' or 'disabled'. \n\nNote: This controls internal stage caching. Retriever-level caching (cache_config.cache_stage_names) is separate and caches complete stage outputs.","examples":["auto","disabled","aggressive"]},"cache_ttl_seconds":{"anyOf":[{"minimum":0,"type":"integer"},{"type":"null"}],"default":null,"description":"Time-to-live for cache entries in seconds. OPTIONAL - defaults to None (LRU eviction only). \n\nWhen None (default, recommended): Cache uses Redis LRU eviction policy. Most frequently used items stay cached automatically. No manual TTL management needed. Memory bounded by Redis maxmemory setting. \n\nWhen specified: Cache entries expire after this duration regardless of usage. Useful for data that becomes stale after specific time periods. Lower values for frequently changing external data. Higher values for stable transformations. \n\nExamples:\n- None: LRU-based eviction (recommended for most cases)\n- 300: 5 minutes (for semi-static external data)\n- 3600: 1 hour (for stable transformations)\n- 86400: 24 hours (for rarely changing operations)\n\n\nPerformance Note: TTL adds minimal overhead (<1ms) but forces eviction even for frequently accessed items. Use None unless you have specific staleness requirements.","examples":[null,300,3600,86400],"title":"Cache Ttl Seconds"},"searches":{"description":"List of feature searches to perform and fuse. Can be empty for filter-only mode (requires pre_filters). For single-modal search: Provide 1 feature search (pure KNN/semantic). For hybrid/multimodal: Provide 2+ feature searches (fusion applied). Each feature search specifies: feature URI, query input, top_k, score threshold. Searches execute in parallel and results are fused using fusion strategy. \n\n**Filter-only mode**: Leave empty and provide pre_filters to use MVS's native filtering (including full-text search) without vector embeddings. This enables pure attribute/text search without semantic similarity.","examples":[[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","query":{"input_mode":"text","value":"{{INPUT.query}}"},"top_k":100}]],"items":{"$ref":"#/$defs/FeatureSearchConfig"},"maxItems":10,"title":"Searches","type":"array"},"final_top_k":{"default":25,"description":"OPTIONAL. Maximum number of documents to return after fusion. Defaults to 25. This is applied AFTER all feature searches are fused together. Must be ≤ minimum top_k across all searches.Higher values: More comprehensive results but slower. Common values: 10 (fast), 25 (balanced), 50-100 (comprehensive).","examples":[10,25,50,100],"maximum":500,"minimum":1,"title":"Final Top K","type":"integer"},"hard_timeout_ms":{"anyOf":[{"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"OPTIONAL. Per-retriever override for this stage's hard cancellation ceiling (ms) — the wall-clock point at which the executor cancels the stage and degrades gracefully. When None (default) the global stage-latency-budget ceiling applies (feature_search = 5000ms). Set this only when a specific retriever legitimately needs longer (e.g. a sparse collection-id-filtered search whose nprobe widens), so the stage returns slow-but-complete results instead of n=0. Clamped to the global gateway-safe maximum; never raises the SLA budget, only the cancellation point.","examples":[8000],"title":"Hard Timeout Ms"},"fusion":{"$ref":"#/$defs/FusionStrategy","default":"rrf","description":"OPTIONAL. Score fusion strategy for combining multiple feature searches. Defaults to 'rrf' (Reciprocal Rank Fusion). Only relevant when searches has 2+ entries. Ignored for single feature search (no fusion needed). \n\n┌──────────┬──────────────┬────────────────────────────────────────┐\n│ Strategy │ MVS Native│ Description                            │\n├──────────┼──────────────┼────────────────────────────────────────┤\n│ rrf      │ ✅ Yes       │ Rank-based, robust default             │\n│ dbsf     │ ✅ Yes       │ Score-based with normalization         │\n│ weighted │ ❌ No        │ Manual score-weighted fusion           │\n│ max      │ ❌ No        │ Best match from any feature            │\n│ learned  │ ❌ No        │ Bandit-learned, personalized weights   │\n└──────────┴──────────────┴────────────────────────────────────────┘\n\n\n'rrf' (default): Rank-based fusion, robust and simple. Single MVS call, no tuning needed. \n\n'dbsf': Score-based fusion with statistical normalization. Single MVS call, handles different score scales. \n\n'weighted': Manual weight fusion using FeatureSearchConfig.weight. N separate queries, merged client-side. \n\n'max': Maximum score across features (OR semantics). N separate queries, merged client-side. \n\n'learned': Bandit-learned weights from user feedback. Requires learning_config. Enables personalization. N separate queries for per-feature score tracking.","examples":["rrf","dbsf","weighted","max","learned"]},"learning_config":{"anyOf":[{"$ref":"#/$defs/LearnedFusionConfig"},{"type":"null"}],"default":null,"description":"OPTIONAL. Configuration for learned fusion. REQUIRED when fusion='learned', ignored otherwise. \n\nEnables personalized feature weighting by learning from user interactions. The system learns which features (text, image, audio) matter most for different users or contexts via Thompson Sampling bandit. \n\nSee LearnedFusionConfig for full documentation.","examples":[{"context_features":["INPUT.user_id"],"reward_signal":"click"},{"context_features":["INPUT.user_id"],"demographic_features":["INPUT.user_segment"],"fallback_strategy":"hierarchical","min_interactions":5}]},"query_preprocessing":{"anyOf":[{"$ref":"#/$defs/QueryPreprocessing"},{"type":"null"}],"default":null,"description":"OPTIONAL. Default preprocessing config for all searches in this stage. When set, all searches without their own query_preprocessing will inherit this config. Per-search query_preprocessing overrides this stage default. Use for stage-wide preprocessing when all searches target large file inputs."},"auto_preprocess":{"default":true,"description":"OPTIONAL. Enable smart auto-detection of when to apply query preprocessing. Defaults to True. When enabled, content-mode queries (URLs/base64) are automatically preprocessed based on content type:\n- Video/Audio: Auto-preprocessed (decompose into segments)\n- PDF: Auto-preprocessed (decompose into pages)\n- Image: NOT preprocessed (single embedding is sufficient)\n- Text mode: NOT preprocessed (unless explicitly configured)\n\nSet to False to disable auto-detection and require explicit query_preprocessing config on each search. Per-search query_preprocessing always takes priority over auto-detection.","title":"Auto Preprocess","type":"boolean"},"collection_identifiers":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"OPTIONAL. Collection identifiers to search within this stage. Can be collection IDs or names. If NOT provided, uses retriever's default collections. Use this to target specific collections independent of retriever defaults. \n\nNote: These identifiers are validated at retriever creation time to ensure the stage's feature_uris can be resolved against these collections. \n\nUse cases:\n- Multi-tier pipelines where different stages search different collections\n- Stage-specific collection targeting\n- Override retriever defaults for this stage only","examples":[["col_marketing_2024"],["products","col_archived"]],"title":"Collection Identifiers"},"facets":{"anyOf":[{"items":{"$ref":"#/$defs/FacetFieldConfig"},"type":"array"},{"type":"null"}],"default":null,"description":"OPTIONAL. Fields to compute facet counts for during search. Facets run in PARALLEL with the search query using the same filters, providing value counts for the entire filtered result set (not just paginated results). \n\nUse cases:\n- Build faceted search UIs (e.g., 'Filter by Category: Sports (45), Music (23)')\n- Show available filter options with document counts\n- Enable drill-down navigation in search results\n\n\nRequirements:\n- Each faceted field MUST have a keyword index in MVS\n- Common auto-indexed fields: metadata.*, status, collection_id\n- For custom fields, ensure indexing is configured\n\n\nPerformance:\n- Facets execute in parallel with search (minimal latency impact)\n- Use approximate counts (exact=False) for fast UI responses\n- Limit facet count per field to reduce response size","examples":[[{"key":"metadata.category","limit":10}],[{"key":"metadata.category","limit":10},{"key":"metadata.file_type","limit":5}],[{"key":"metadata.category","limit":20},{"exact":true,"key":"metadata.author","limit":50}]],"title":"Facets"},"group_by":{"anyOf":[{"$ref":"#/$defs/FeatureSearchGroupBy"},{"type":"null"}],"default":null,"description":"OPTIONAL. Database-level grouping configuration (uses MVS query_points_groups). When enabled, results are grouped by the specified field at the database level, which is more efficient than in-memory grouping for large result sets. \n\nUse cases:\n- Decompose/recompose: Search chunks, return parent documents\n- Deduplication: One best result per product_id\n- Scene→Video grouping: Search frames, return parent videos\n\n\nOutput modes (mirrors group_by REDUCE stage for consistency):\n- 'first': Top doc per group (deduplication)\n- 'all': All docs with group structure preserved\n- 'flatten': All docs as flat list\n\n\nPerformance:\n- Grouping happens in MVS (database-level)\n- Much faster than fetching all results and grouping in memory\n- Use for decompose/recompose patterns at scale","examples":[{"field":"source_object_id","max_per_group":3,"output_mode":"all"},{"field":"video_id","max_per_group":1,"output_mode":"first"}]}},"title":"FeatureSearchStageConfig","type":"object"}},{"stage_id":"llm_filter","description":"Use LLM criteria to discard documents","category":"filter","icon":"bot","parameter_schema":{"$defs":{"LLMProvider":{"description":"Supported LLM providers for content generation.\n\nEach provider has different strengths, pricing, and multimodal capabilities.\nChoose based on your use case, performance requirements, and budget.\n\nValues:\n    OPENAI: OpenAI GPT models (GPT-4o, GPT-4.1, O3-mini)\n        - Best for: General purpose, vision tasks, structured outputs\n        - Multimodal: Text, images\n        - Performance: Fast (100-500ms), reliable\n        - Cost: Moderate to high ($0.15-$10 per 1M tokens)\n        - Use when: Need high-quality generation with vision support\n\n    GOOGLE: Google Gemini models (Gemini 3.1 Flash Lite, Gemini 2.5 Pro)\n        - Best for: Fast generation, video understanding, cost-efficiency\n        - Multimodal: Text, images, video, audio, PDFs\n        - Performance: Very fast (50-200ms)\n        - Cost: Low to moderate ($0.075-$0.40 per 1M tokens)\n        - Use when: Need video/audio/PDF support or cost-efficiency\n\n    ANTHROPIC: Anthropic Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)\n        - Best for: Long context, complex reasoning, safety\n        - Multimodal: Text, images\n        - Performance: Moderate (200-800ms)\n        - Cost: Moderate to high ($0.25-$15 per 1M tokens)\n        - Use when: Need long context or complex reasoning\n\nExamples:\n    - Use OPENAI for production with structured JSON outputs\n    - Use GOOGLE for video summarization and cost-sensitive workloads\n    - Use ANTHROPIC for complex reasoning with long documents","enum":["openai","google","anthropic"],"title":"LLMProvider","type":"string"}},"description":"Configuration for delegating filtering decisions to an LLM.\n\n**Stage Category**: FILTER\n\n**Transformation**: N documents → ≤N documents (subset, same schema)\n\n**Purpose**: Produces a subset of input documents using LLM-based reasoning.\nUse this when filtering criteria are too complex for simple attribute conditions\nand require semantic understanding, content analysis, or subjective judgment.\nOutput documents have identical schema to input.\n\n**When to Use**:\n    - After initial FILTER stages for intelligent content filtering\n    - When filtering criteria involve content understanding (sentiment, topic relevance)\n    - For subjective filtering (quality, appropriateness, brand alignment)\n    - When simple attribute filters aren't sufficient (complex policies, nuanced rules)\n    - To filter multimodal content (images, videos) based on visual criteria\n    - For dynamic filtering based on natural language criteria\n\n**When NOT to Use**:\n    - For simple metadata filtering (use attribute_filter - much faster and cheaper)\n    - For reordering results (use SORT stages)\n    - For enriching documents (use APPLY stages)\n    - For aggregating documents (use REDUCE stages)\n    - When fast response time is critical (LLM calls are slow, 100ms-2s per batch)\n    - When cost is a major concern (LLM inference costs per document)\n\n**Operational Behavior**:\n    - Operates on in-memory document results (no database queries)\n    - Produces subset of documents (removes those not meeting LLM criteria)\n    - Slow operation (LLM API calls, network latency)\n    - Processes documents in batches to optimize LLM calls\n    - Makes HTTP requests to Engine service for LLM inference\n    - Supports concurrent batching for throughput\n    - Output schema = Input schema (no schema changes)\n\n**Common Pipeline Position**: FILTER (attribute_filter) → FILTER (this stage) → SORT\n\n**Cost & Performance**:\n    - Expensive: LLM API costs per document evaluated\n    - Slow: 100ms-2s per batch depending on LLM and batch size\n    - Use batch_size to balance throughput vs latency\n    - Consider filtering with attribute_filter first to reduce LLM calls\n\nRequirements:\n    - provider: OPTIONAL, LLM provider (openai, google, anthropic). Auto-inferred if not specified.\n    - model_name: OPTIONAL, specific model name. Uses provider default if not specified.\n    - criteria: OPTIONAL, natural language filtering description (defaults to empty string)\n      * If criteria is empty/null, stage is SKIPPED (all documents pass through)\n      * This saves 100ms-30s when no filtering is needed\n    - batch_size: OPTIONAL, defaults to 10 documents per batch\n    - max_concurrency: OPTIONAL, defaults to 5 concurrent requests\n\nUse Cases:\n    - Content quality filtering: \"Keep only well-written, professional articles\"\n    - Sentiment filtering: \"Discard negative or controversial content\"\n    - Topic relevance: \"Keep only documents about enterprise SaaS\"\n    - Visual filtering: \"Keep only images with people smiling\"\n    - Policy compliance: \"Filter out any content mentioning competitors\"","examples":[{"criteria":"Keep only documents tagged as GA releases","description":"Simple filtering with reasoning (recommended format)","include_reasoning":true,"model_name":"gemini-2.5-flash-lite","provider":"google"},{"criteria":"Discard anything with PII or sensitive information","description":"Fast PII filtering","model_name":"gemini-2.5-flash-lite","provider":"google"},{"criteria":"Keep documents about {INPUT.topic} published after {INPUT.min_date}","description":"Dynamic filtering with template variables","include_reasoning":false,"model_name":"gpt-4o-mini","provider":"openai"},{"criteria":"Keep only professional content suitable for enterprise clients.","description":"Provider-only (uses default model)","provider":"anthropic"},{"criteria":"Keep relevant documents","description":"Legacy format (deprecated but supported)","inference_name":"openai:gpt-4o-mini"}],"properties":{"feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Feature URI of a custom LLM/generation plugin. When set, overrides provider and model_name. The plugin must accept {prompt: str, document: dict} and return {keep: bool, reason: str}. Format: mixpeek://plugin_name@version/feature_name","examples":["mixpeek://my_filter_model@1.0.0/filter"],"title":"Feature Uri"},"provider":{"anyOf":[{"$ref":"#/$defs/LLMProvider"},{"type":"null"}],"default":null,"description":"LLM provider to use. Supported providers:\n- openai: GPT models (GPT-4o, GPT-4o-mini)\n- google: Gemini models (Gemini 3.1 Flash Lite)\n- anthropic: Claude models (Claude 3.5 Sonnet/Haiku)\n\nIf not specified, defaults to 'google'. Can be auto-inferred from model_name. Ignored when feature_uri is set.","examples":["openai","google","anthropic"]},"model_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Specific LLM model to use. If not specified, uses provider default.\nFaster models recommended for filtering (gemini-2.5-flash-lite, gpt-4o-mini).","examples":["gemini-2.5-flash-lite","gpt-4o-mini","claude-haiku-4-5-20251001"],"title":"Model Name"},"inference_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"deprecated":true,"description":"DEPRECATED: Use 'provider' and 'model_name' instead.\nLegacy format: 'provider:model' (e.g., 'gemini:gemini-2.5-flash-lite').\nKept for backward compatibility only.","title":"Inference Name"},"criteria":{"anyOf":[{"type":"string"},{"type":"null"}],"default":"Keep only documents relevant to {{INPUT.query}}","description":"Natural language description of filtering criteria. The LLM will evaluate each document against this criteria. Be specific and clear about what to keep vs discard. If empty or null, the stage is skipped (all documents pass through). Supports template variables: - {INPUT.field}: From pipeline inputs - {DOC.field}: From current document Template expressions are evaluated per-document for dynamic filtering. Examples: 'Keep only...', 'Discard if...', 'Filter out...'","examples":["Keep only documents mentioning enterprise pricing or B2B features","Discard anything with negative sentiment or controversial topics","Keep documents related to {INPUT.target_topic}","Filter out content older than {INPUT.cutoff_date} unless marked urgent","Keep only images showing people in professional settings","Discard documents where {DOC.metadata.status} is 'draft' or 'archived'"],"title":"Criteria"},"include_reasoning":{"default":false,"description":"Whether to include LLM reasoning strings in stage metadata.","title":"Include Reasoning","type":"boolean"},"api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Bring Your Own Key (BYOK) - use your own LLM API key instead of Mixpeek's.\n\n**How to use:**\n1. Store your API key as an organization secret via POST /v1/organizations/secrets\n   Example: {\"secret_name\": \"openai_api_key\", \"secret_value\": \"sk-proj-...\"}\n\n2. Reference it here using template syntax: {{secrets.openai_api_key}}\n\n**Benefits:**\n- Use your own API credits and rate limits\n- Keep your API keys secure in Mixpeek's encrypted vault\n- No changes needed to your retriever when rotating keys\n\nIf not provided, uses Mixpeek's default API keys (usage charged to your account).","examples":["{{secrets.openai_api_key}}","{{secrets.anthropic_key}}"],"title":"Api Key"}},"title":"LLMFilterConfig","type":"object"}},{"stage_id":"query_expand","description":"Generate query variations with LLM and fuse search results","category":"filter","icon":"expand","parameter_schema":{"$defs":{"QueryExpandParameters":{"description":"Parameters for query expansion stage.\n\nThis stage:\n1. Takes your original query\n2. Uses an LLM to generate semantically similar query variations\n3. Executes feature_search for each variation (original + expansions)\n4. Fuses results using Reciprocal Rank Fusion (RRF)\n\nExample - Basic query expansion (copy and run):\n    ```python\n    {\n        \"stages\": [\n            {\n                \"stage\": \"query_expand\",\n                \"parameters\": {\n                    \"num_expansions\": 3,\n                    \"feature_search_config\": {\n                        \"query\": {\"text\": \"{{INPUT.query}}\"},\n                        \"feature_extractors\": [\n                            {\"field_name\": \"content.text\", \"embedding_model\": \"text\"}\n                        ],\n                        \"top_k\": 10\n                    }\n                }\n            }\n        ]\n    }\n    ```\n\nExample - With custom expansion prompt:\n    ```python\n    {\n        \"stages\": [\n            {\n                \"stage\": \"query_expand\",\n                \"parameters\": {\n                    \"num_expansions\": 5,\n                    \"expansion_prompt\": \"Generate {{NUM_EXPANSIONS}} alternative search queries for: {{QUERY}}. Focus on synonyms and related concepts. Return one query per line.\",\n                    \"feature_search_config\": {\n                        \"query\": {\"text\": \"{{INPUT.query}}\"},\n                        \"feature_extractors\": [\n                            {\"field_name\": \"content.text\", \"embedding_model\": \"text\"}\n                        ],\n                        \"top_k\": 20\n                    },\n                    \"rrf_k\": 60\n                }\n            }\n        ]\n    }\n\nExample - Multimodal query expansion:\n    ```python\n    {\n        \"stages\": [\n            {\n                \"stage\": \"query_expand\",\n                \"parameters\": {\n                    \"num_expansions\": 3,\n                    \"feature_search_config\": {\n                        \"query\": {\"text\": \"{{INPUT.query}}\", \"image\": \"{{INPUT.image_url}}\"},\n                        \"feature_extractors\": [\n                            {\"field_name\": \"content.text\", \"embedding_model\": \"text\"},\n                            {\"field_name\": \"content.image\", \"embedding_model\": \"multimodal\"}\n                        ],\n                        \"top_k\": 15\n                    },\n                    \"include_original\": true\n                }\n            }\n        ]\n    }\n\nHow it works:\n    1. The original query (from feature_search_config.query.text) is sent to an LLM\n    2. LLM generates `num_expansions` alternative queries\n    3. feature_search runs for original query + each expansion\n    4. Results are fused using RRF: score = sum(1 / (k + rank)) across all queries\n    5. Documents appearing in multiple result sets get boosted\n\nWhy use query expansion:\n    - Handles vocabulary mismatch (user says \"car\", docs say \"vehicle\")\n    - Captures related concepts the user might not have thought of\n    - Improves recall without sacrificing precision (RRF handles fusion)\n    - Works with any feature_search configuration (text, image, multimodal)","properties":{"num_expansions":{"default":3,"description":"Number of query variations to generate. More expansions = better recall but slower.","maximum":10,"minimum":1,"title":"Num Expansions","type":"integer"},"expansion_prompt":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Custom prompt for query expansion. Use {{QUERY}} for the original query and {{NUM_EXPANSIONS}} for the count. If not provided, uses a default prompt.","title":"Expansion Prompt"},"expansion_model":{"default":"gpt-4o-mini","description":"LLM model to use for generating query expansions.","title":"Expansion Model","type":"string"},"feature_search_config":{"additionalProperties":true,"description":"Full feature_search configuration. This is the same config you would pass to a standalone feature_search stage. The query.text field will be replaced with each expanded query.","title":"Feature Search Config","type":"object"},"include_original":{"default":true,"description":"Whether to include the original query in addition to expansions.","title":"Include Original","type":"boolean"},"rrf_k":{"default":60,"description":"RRF constant k. Higher values give more weight to lower-ranked results. Default of 60 is standard. Use lower (20-40) for precision, higher (80-100) for recall.","maximum":1000,"minimum":1,"title":"Rrf K","type":"integer"},"fusion_strategy":{"default":"rrf","description":"How to fuse results from multiple queries. 'rrf' = Reciprocal Rank Fusion (recommended), 'linear' = simple score averaging.","enum":["rrf","linear"],"title":"Fusion Strategy","type":"string"},"deduplicate":{"default":true,"description":"Whether to deduplicate results by document_id before returning.","title":"Deduplicate","type":"boolean"}},"required":["feature_search_config"],"title":"QueryExpandParameters","type":"object"}},"description":"Configuration wrapper for query expansion stage.","properties":{"stage_id":{"default":"query_expand","description":"Stage identifier","title":"Stage Id","type":"string"},"parameters":{"$ref":"#/$defs/QueryExpandParameters","description":"Query expansion parameters"}},"required":["parameters"],"title":"QueryExpandConfig","type":"object"}},{"stage_id":"aggregate","description":"Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results","category":"reduce","icon":"calculator","parameter_schema":{"$defs":{"AggregationFunction":{"description":"Supported aggregation functions.\n\nThese functions operate on document fields to produce aggregate metrics.\nAll functions operate on the current pipeline results (in-memory).","enum":["count","count_distinct","sum","avg","min","max","first","last","collect","collect_distinct","percentile","stddev","variance","frequency","co_occurrence","correlation"],"title":"AggregationFunction","type":"string"},"AggregationOperation":{"description":"Configuration for a single aggregation operation.\n\nDefines what function to apply and on which field(s).\n\nExamples:\n    Count documents:\n        ```json\n        {\"function\": \"count\", \"alias\": \"total\"}\n        ```\n\n    Sum a numeric field:\n        ```json\n        {\"function\": \"sum\", \"field\": \"metadata.views\", \"alias\": \"total_views\"}\n        ```\n\n    Count distinct values:\n        ```json\n        {\"function\": \"count_distinct\", \"field\": \"metadata.author\", \"alias\": \"unique_authors\"}\n        ```","examples":[{"alias":"total","function":"count"},{"alias":"total_views","field":"metadata.views","function":"sum"},{"alias":"avg_relevance","field":"score","function":"avg"},{"alias":"unique_authors","field":"metadata.author","function":"count_distinct"}],"properties":{"function":{"$ref":"#/$defs/AggregationFunction","description":"REQUIRED. Aggregation function to apply. count/count_distinct: Count documents or unique values. sum/avg/min/max: Numeric aggregations on a field. first/last: Get first or last value (by score order). collect/collect_distinct: Gather values into a list. percentile/stddev/variance: Statistical aggregations on numeric fields. frequency: Top-K value distribution for a categorical field. co_occurrence: Top-K co-occurring pairs across two fields. correlation: Pearson correlation between two numeric fields.","examples":["count","sum","avg","count_distinct","percentile","stddev","frequency","correlation"]},"field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Field path to aggregate (dot notation supported). Required for: sum, avg, min, max, count_distinct, first, last, collect, collect_distinct. Not needed for: count (counts documents). Examples: 'score', 'metadata.views', 'payload.price'.","examples":["score","metadata.views","payload.price","metadata.category"],"title":"Field"},"field_b":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Second field path for two-field operations. Required for: co_occurrence, correlation. Not needed for other aggregation functions.","title":"Field B"},"percentile_value":{"anyOf":[{"maximum":100.0,"minimum":0.0,"type":"number"},{"type":"null"}],"default":null,"description":"OPTIONAL. Percentile to compute (0-100). Only used with 'percentile' function. Defaults to 50 (median).","title":"Percentile Value"},"top_k":{"anyOf":[{"maximum":100,"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"OPTIONAL. Limit results for frequency/co_occurrence. For frequency: top K most common values. For co_occurrence: top K co-occurring pairs. Defaults to 10.","title":"Top K"},"alias":{"description":"REQUIRED. Name for this aggregation result in the output. Must be unique within the stage configuration. Used as the key in the aggregation results.","examples":["total_count","avg_score","total_views","unique_categories"],"title":"Alias","type":"string"}},"required":["function","alias"],"title":"AggregationOperation","type":"object"},"GroupByFieldConfig":{"description":"Configuration for grouping documents before aggregation.\n\nWhen specified, aggregations are computed per-group rather than\nacross all documents.\n\nExample:\n    Group by category and count:\n        ```json\n        {\"field\": \"metadata.category\", \"alias\": \"category\"}\n        ```","examples":[{"alias":"category","field":"metadata.category"},{"field":"collection_id"}],"properties":{"field":{"description":"REQUIRED. Field path to group by (dot notation supported). Documents with the same value for this field are grouped together. Examples: 'metadata.category', 'collection_id', 'metadata.file_type'.","examples":["metadata.category","collection_id","metadata.file_type"],"title":"Field","type":"string"},"alias":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Alias for this group-by field in the output. Defaults to the field name (last segment after dot). Example: 'category' for field 'metadata.category'.","examples":["category","collection","file_type"],"title":"Alias"}},"required":["field"],"title":"GroupByFieldConfig","type":"object"}},"description":"Configuration for the aggregate REDUCE stage.\n\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ Stage Category: REDUCE                                                      │\n│                                                                             │\n│ Transformation: N documents → M aggregation results                         │\n│                                                                             │\n│ This stage operates on IN-MEMORY pipeline results, NOT the database.        │\n│ Use this for analytics on already-retrieved documents.                      │\n└─────────────────────────────────────────────────────────────────────────────┘\n\nPurpose:\n    Compute aggregations (counts, sums, averages, etc.) on pipeline results.\n    Useful for analytics, summaries, and metadata extraction from search results.\n\nWhen to Use:\n    - Compute statistics on retrieved documents (avg score, total count)\n    - Group results by a field and count per group\n    - Extract summary metrics for display (e.g., \"45 results in 3 categories\")\n    - Post-search analytics that don't need database queries\n\nWhen NOT to Use:\n    - For faceted search (use feature_search.facets instead - queries MVS)\n    - For full-collection analytics (use aggregation API directly)\n    - The aggregate stage only sees pipeline results, not full filtered dataset\n\nPerformance:\n    - Fast: In-memory Python operations on already-fetched documents\n    - No database queries\n    - Suitable for up to ~10K documents\n\nCommon Pipeline Position:\n    FILTER → SORT → REDUCE (this stage)\n\nExample Use Cases:\n    - \"Show average relevance score of top 100 results\"\n    - \"Count results per category from search results\"\n    - \"Get min/max prices from product search\"\n    - \"List unique authors in search results\"","examples":[{"aggregations":[{"alias":"total","function":"count"},{"alias":"avg_score","field":"score","function":"avg"}],"description":"Global count and average score"},{"aggregations":[{"alias":"count","function":"count"}],"description":"Count per category, top 10","group_by":[{"alias":"category","field":"metadata.category"}],"limit":10,"sort_by":"count","sort_order":"desc"},{"aggregations":[{"alias":"count","function":"count"},{"alias":"avg_score","field":"score","function":"avg"},{"alias":"best_score","field":"score","function":"max"}],"description":"Stats per file type","group_by":[{"alias":"type","field":"metadata.file_type"}],"include_documents":true}],"properties":{"aggregations":{"default":[{"function":"count","field":null,"field_b":null,"percentile_value":null,"top_k":null,"alias":"total"},{"function":"avg","field":"score","field_b":null,"percentile_value":null,"top_k":null,"alias":"avg_score"}],"description":"List of aggregation operations to compute. At least one aggregation is required. Multiple aggregations can be computed in a single stage. Supported functions: count, count_distinct, sum, avg, min, max, first, last, collect, collect_distinct, percentile, stddev, variance, frequency, co_occurrence, correlation.","examples":[[{"alias":"total","function":"count"}],[{"alias":"total","function":"count"},{"alias":"avg_score","field":"score","function":"avg"}]],"items":{"$ref":"#/$defs/AggregationOperation"},"minItems":1,"title":"Aggregations","type":"array"},"group_by":{"anyOf":[{"items":{"$ref":"#/$defs/GroupByFieldConfig"},"type":"array"},{"type":"null"}],"default":null,"description":"OPTIONAL. Fields to group by before aggregating. When specified, aggregations are computed per-group. When None, aggregations are computed across all documents (global). Multiple fields create composite groups (e.g., category + year).","examples":[[{"alias":"category","field":"metadata.category"}],[{"alias":"category","field":"metadata.category"},{"alias":"year","field":"metadata.year"}]],"title":"Group By"},"sort_by":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Metric alias to sort aggregation results by. Must match an alias from the aggregations list. Only applies when group_by is specified.","examples":["total","avg_score"],"title":"Sort By"},"sort_order":{"default":"desc","description":"OPTIONAL. Sort order for aggregation results. 'desc': Highest values first (default). 'asc': Lowest values first.","enum":["asc","desc"],"title":"Sort Order","type":"string"},"limit":{"anyOf":[{"maximum":1000,"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"OPTIONAL. Maximum number of aggregation results to return. Only applies when group_by is specified. Useful for 'top N' queries (e.g., top 10 categories by count).","examples":[10,20,50],"title":"Limit"},"include_documents":{"default":false,"description":"OPTIONAL. Whether to include the original documents in output. False (default): Only return aggregation results in metadata. True: Pass through documents and add aggregation results to metadata. Set to True when aggregations are supplementary to search results.","title":"Include Documents","type":"boolean"}},"title":"AggregateConfig","type":"object"}},{"stage_id":"cluster","description":"Cluster documents by embedding similarity","category":"reduce","icon":"circle-dot","parameter_schema":{"description":"Configuration for clustering documents from previous stage results.\n\nStage Category: REDUCE\n\nTransformation: N documents → K clusters (where K < N typically)\n\nPurpose: Dynamically clusters documents from the pipeline by their embeddings.\nUnlike group_by which groups by a pre-existing field, cluster discovers\nnatural groupings in the data based on vector similarity.\n\nPerformance: Calls clustering inference service. Fast for typical retriever\nresult sets (10-500 documents). For larger datasets, consider using\npre-computed clusters with group_by instead.\n\nWhen to Use:\n    - Discover themes/topics in search results\n    - Group semantically similar documents without pre-existing labels\n    - Analyze patterns in retrieved content\n    - \"Find the 3 main themes in these results\"\n    - Auto-categorize search results\n\nWhen NOT to Use:\n    - When documents already have cluster/category labels (use group_by)\n    - For very large result sets (>1000 docs) - use pre-computed clusters\n    - When you need exact groupings (clustering is approximate)\n\nOutput Modes:\n    - \"clusters\": Returns K cluster summary documents with member lists\n    - \"labeled\": Returns original N documents with cluster_label added\n    - \"representatives\": Returns K representative documents (one per cluster)\n\nCommon Pipeline Position: FILTER → cluster (this stage) → ENRICH (summarize clusters)\n\nExamples:\n    - \"Find 3 themes in 60 ads\" → cluster with n_clusters=3\n    - \"Group similar products\" → cluster with algorithm=hdbscan (auto K)\n    - \"Discover topics in articles\" → cluster with representatives output","examples":[{"algorithm":"kmeans","description":"Find 3 main themes in results (known K)","n_clusters":3,"output_mode":"clusters"},{"algorithm":"hdbscan","description":"Auto-discover topic clusters (unknown K)","min_cluster_size":5,"output_mode":"clusters"},{"algorithm":"hdbscan","description":"Label documents with cluster assignments","output_mode":"labeled"},{"algorithm":"kmeans","description":"Get representative document per theme","n_clusters":5,"output_mode":"representatives"},{"algorithm":"spectral","description":"Cluster multimodal embeddings","feature_uri":"mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding","n_clusters":4,"output_mode":"clusters"}],"properties":{"algorithm":{"default":"hdbscan","description":"Clustering algorithm to use:\n\n- hdbscan: Auto-determines number of clusters, handles noise (DEFAULT, recommended)\n- kmeans: Fast, requires n_clusters, spherical clusters\n- dbscan: Density-based, handles noise, requires eps tuning\n- agglomerative: Hierarchical, good for nested structures\n- spectral: Graph-based, good for non-convex clusters\n- gaussian_mixture: Probabilistic, soft cluster assignments\n\nRecommendation: Use 'hdbscan' for exploratory analysis, 'kmeans' when you know K.","enum":["kmeans","hdbscan","dbscan","agglomerative","spectral","gaussian_mixture"],"examples":["hdbscan","kmeans","spectral"],"title":"Algorithm","type":"string"},"n_clusters":{"anyOf":[{"maximum":100,"minimum":2,"type":"integer"},{"type":"null"}],"default":null,"description":"Number of clusters to create. Required for kmeans, spectral, agglomerative, gaussian_mixture. Ignored for hdbscan and dbscan (auto-determined).\n\nIf not specified for algorithms that need it, auto-calculated as min(8, N/10).\n\nTypical values: 3-5 for theme discovery, 5-10 for topic modeling, 10-20 for fine-grained categorization.","examples":[3,5,8,10],"title":"N Clusters"},"min_cluster_size":{"default":5,"description":"Minimum number of documents to form a cluster (HDBSCAN/DBSCAN only).\n\nLower values = more clusters, may include noise.\nHigher values = fewer, denser clusters.\n\nAuto-adjusted for small datasets: min(min_cluster_size, N/3).\nTypical values: 3-5 for small results, 10-20 for large results.","examples":[3,5,10],"maximum":100,"minimum":2,"title":"Min Cluster Size","type":"integer"},"feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Feature URI specifying which embedding to cluster on.\n\nOPTIONAL - if not provided, auto-detects from the upstream feature_search stage.\nWhen a feature_search stage runs before cluster, its feature_uri is automatically\ntracked in the pipeline state and used for clustering.\n\nUse the mixpeek:// URI format:\n  mixpeek://{extractor}@{version}/{output}\n\nExamples:\n- 'mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding'\n- 'mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1'\n- 'mixpeek://clip_extractor@v1/image_embedding'\n\nThe feature_uri is resolved to the actual embedding field name\non the documents (e.g., 'multimodal_extractor_v1_multimodal_embedding').\n\nOnly specify explicitly if you want to cluster on a different embedding\nthan the one used in the feature_search stage.","examples":["mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding","mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","mixpeek://clip_extractor@v1/image_embedding"],"title":"Feature Uri"},"output_mode":{"default":"clusters","description":"How to format the output:\n\n- 'clusters': Returns K cluster documents, each containing:\n  - cluster_id: Cluster identifier\n  - member_count: Number of documents in cluster\n  - members: List of member documents\n  - centroid: Cluster center vector\n  Use for: Theme analysis, cluster summaries\n\n- 'labeled': Returns original N documents with added fields:\n  - cluster_id: Assigned cluster\n  - cluster_score: Distance to centroid (lower = closer)\n  Use for: Downstream processing with cluster context\n\n- 'representatives': Returns K documents (one per cluster):\n  - The document closest to each cluster centroid\n  Use for: Quick sampling, representative examples","enum":["clusters","labeled","representatives"],"examples":["clusters","labeled","representatives"],"title":"Output Mode","type":"string"},"include_centroids":{"default":true,"description":"Whether to include centroid vectors in output.\nUseful for downstream similarity comparisons or visualization.\nSet to False to reduce response size.","title":"Include Centroids","type":"boolean"},"max_members_per_cluster":{"default":50,"description":"Maximum members to include per cluster in 'clusters' output mode.\nDocuments are sorted by distance to centroid (closest first).\nUse to limit response size for large result sets.","examples":[10,25,50,100],"maximum":500,"minimum":1,"title":"Max Members Per Cluster","type":"integer"}},"title":"ClusterConfig","type":"object"}},{"stage_id":"deduplicate","description":"Remove duplicate documents by field match or content similarity","category":"reduce","icon":"copy-minus","parameter_schema":{"description":"Configuration for the deduplicate stage.\n\n**Stage Category**: REDUCE\n\n**Transformation**: N documents → M documents (M ≤ N, duplicates removed)\n\n**Purpose**: Removes duplicate documents from the result set based on\nexact field matching or content similarity. When duplicates are found,\nkeeps the first occurrence (highest ranked) by default.\n\n**When to Use**:\n    - Remove exact duplicates from multi-source retrieval\n    - Collapse results to one-per-group (e.g., one per URL, one per author)\n    - Deduplicate after query_expand which may return overlapping results\n    - Remove near-duplicate content using similarity threshold\n    - Ensure unique results after merging multiple feature searches\n\n**When NOT to Use**:\n    - For grouping with aggregation (use group_by stage)\n    - For sampling unique categories (use sample with stratified)\n    - For limiting result count without dedup logic (use limit stage)\n    - When order doesn't matter and you want all groups (use group_by)\n\n**Deduplication Strategies**:\n    - `field`: Exact match on one or more fields (fastest)\n    - `content`: Content-based similarity using text comparison\n\n**Operational Behavior**:\n    - Operates on in-memory document results (no database queries)\n    - Preserves the first occurrence of each unique value (respects prior ordering)\n    - Fast operation for field-based dedup (hash-based O(N))\n    - Content similarity is more expensive (O(N²) pairwise comparison)\n\n**Common Pipeline Position**: FILTER → SORT → REDUCE (this stage)\n\nRequirements:\n    - strategy: REQUIRED, deduplication method\n    - fields: REQUIRED for field strategy, list of fields to compare\n    - similarity_threshold: OPTIONAL for content strategy\n    - keep: OPTIONAL, which duplicate to keep (first or last)\n\nUse Cases:\n    - URL dedup: Deduplicate by source URL after web search enrichment\n    - Content dedup: Remove near-identical paragraphs from chunked retrieval\n    - Author collapse: Keep one result per author\n    - Source dedup: One result per source document after cross-collection search","examples":[{"description":"Deduplicate by document ID (exact match)","fields":["document_id"],"keep":"first","strategy":"field"},{"description":"Deduplicate by source URL","fields":["metadata.source_url"],"keep":"first","strategy":"field"},{"case_sensitive":false,"description":"Deduplicate by author and title combination","fields":["metadata.author","metadata.title"],"strategy":"field"},{"content_field":"content","description":"Content-based near-duplicate removal","keep":"first","similarity_threshold":0.9,"strategy":"content"},{"description":"Strict content dedup (exact text match)","similarity_threshold":1.0,"strategy":"content"}],"properties":{"strategy":{"default":"field","description":"REQUIRED. Deduplication strategy:\n- 'field': Exact match on specified fields (fast, hash-based). Best for structured deduplication by ID, URL, or metadata.\n- 'content': Text content similarity comparison. Best for removing near-duplicate text content.","enum":["field","content"],"examples":["field","content"],"title":"Strategy","type":"string"},"fields":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"REQUIRED for strategy='field'. List of field paths to use for deduplication. Documents with identical values across ALL listed fields are considered duplicates. Supports dot notation for nested fields. Example: ['metadata.source_url'] deduplicates by URL.","examples":[["document_id"],["metadata.source_url"],["metadata.author","metadata.title"],["content"]],"title":"Fields"},"content_field":{"default":"content","description":"OPTIONAL. Field containing text content for content-based dedup. Only used when strategy='content'. Default: 'content'.","examples":["content","text","metadata.body"],"title":"Content Field","type":"string"},"similarity_threshold":{"default":0.95,"description":"OPTIONAL. Similarity threshold for content-based deduplication. Documents with similarity above this threshold are considered duplicates. Only used when strategy='content'. 1.0 = exact match only, 0.0 = everything is a duplicate. Default: 0.95 (very similar content).","examples":[0.9,0.95,0.99,1.0],"maximum":1.0,"minimum":0.0,"title":"Similarity Threshold","type":"number"},"keep":{"default":"first","description":"OPTIONAL. Which duplicate to keep when duplicates are found. 'first' (default): Keep the first occurrence (highest ranked after prior sort). 'last': Keep the last occurrence.","enum":["first","last"],"examples":["first","last"],"title":"Keep","type":"string"},"case_sensitive":{"default":true,"description":"OPTIONAL. Whether field comparisons are case-sensitive. True (default): 'Hello' != 'hello'. False: 'Hello' == 'hello' (case-insensitive comparison). Only applies to string field values.","examples":[true,false],"title":"Case Sensitive","type":"boolean"}},"title":"DeduplicateStageConfig","type":"object"}},{"stage_id":"group_by","description":"Group documents by field value (decompose/recompose)","category":"reduce","icon":"layers","parameter_schema":{"description":"Configuration for grouping documents by field value.\n\nStage Category: REDUCE\n\nTransformation: N documents → M groups (where M ≤ N)\n\nPurpose: Groups documents by a common field value, aggregating chunks\nback to parent objects. Essential for decompose/recompose workflows\nwhere chunks are searched individually then grouped to show context.\n\nPerformance: Runs in API layer (fast stage, ~10-50ms for 100-500 docs).\nIntegrates with optimizer for filter push-down before grouping. Future\noptimization will push grouping into MVS for 10-100x speedup.\n\nWhen to Use:\n    - After chunk-level search to group back to objects\n    - To deduplicate results by a field\n    - To aggregate related documents\n    - For decompose→search→recompose workflows\n    - Show top N results per category/author/parent\n\nWhen NOT to Use:\n    - For initial document retrieval (use FILTER stages: hybrid_search)\n    - For ordering documents (use SORT stages: sort_relevance)\n    - For enriching documents (use APPLY stages: document_enrich)\n    - For expanding documents (use APPLY 1-N stages: taxonomy_enrich)\n\nOperational Behavior:\n    - Fast stage: runs in API layer (no Engine delegation)\n    - In-memory grouping: Python dict-based grouping\n    - Groups documents with same field value\n    - Sorts within groups by score (highest first)\n    - Limits documents per group (configurable)\n    - Reports metrics to ClickHouse for learned optimizations\n\nCommon Pipeline Position: FILTER → SORT → REDUCE (this stage)\n\nRequirements:\n    - group_by_field: REQUIRED\n    - max_per_group: OPTIONAL, defaults to 10\n    - output_mode: OPTIONAL, defaults to \"all\"\n\nUse Cases:\n    - Decompose/recompose: Search 50 scenes, group to 10 videos\n    - Deduplication: Group by unique_id, keep top match\n    - Analytics: Group by category, show top docs per category\n    - Multi-tier results: Show top 3 products per brand\n\nExamples:\n    - Group video scenes back to parent videos\n    - Deduplicate search results by product_id\n    - Show top 3 articles per author\n    - Display best match per category","examples":[{"description":"Decompose/recompose: Group scene search results back to parent videos","group_by_field":"source_object_id","max_per_group":5,"output_mode":"all"},{"description":"Deduplication: Keep best match per video (deduplicate by video_id)","group_by_field":"video_id","max_per_group":1,"output_mode":"first"},{"description":"Category preview: Show top 3 products per category","group_by_field":"metadata.category","max_per_group":3,"output_mode":"all"},{"description":"Author aggregation: Group articles by author, return flat list","group_by_field":"metadata.author_id","max_per_group":10,"output_mode":"flatten"},{"description":"Brand showcase: One best product per brand","group_by_field":"brand_name","max_per_group":1,"output_mode":"first"}],"properties":{"group_by_field":{"default":"source_object_id","description":"Field path to group documents by using dot notation. Documents with the same field value are grouped together. Common fields: 'source_object_id' (parent object from decomposition), 'root_object_id' (top-level ancestor in hierarchy), 'metadata.category' (nested categorical field), 'video_id' (media grouping), 'product_id' (e-commerce). Use dot notation for nested fields: 'metadata.user_id', 'lineage.source_id'. Performance: Indexed fields are faster for future MVS native grouping optimization. Template support: Use {{inputs.group_field}} for dynamic grouping.","examples":["source_object_id","root_object_id","metadata.category","metadata.user_id","video_id","product_id","brand_name","author_id"],"title":"Group By Field","type":"string"},"max_per_group":{"default":10,"description":"OPTIONAL. Maximum number of documents to keep per group. Documents are sorted by score (highest first) before limiting. Default: 10. Use 1 for deduplication (keeps only highest scoring doc per group). Use 3-5 for preview results (show top chunks per parent). Use 50+ for comprehensive results (show many chunks per parent). Performance: Lower values reduce response size and improve latency. Typical values: 1 (dedup), 5 (preview), 10 (default), 20 (detailed), 50 (comprehensive).","examples":[1,3,5,10,20,50],"maximum":1000,"minimum":1,"title":"Max Per Group","type":"integer"},"output_mode":{"default":"all","description":"OPTIONAL. Controls what documents are returned per group. 'first': Return only the top document per group (deduplication, fastest).          Use for: unique results per group (e.g., one video per brand). 'all': Return all documents grouped by field (default, shows full context).        Use for: showing chunks within each parent object. 'flatten': Return all documents as flat list (loses group structure).            Use for: need all docs but don't care about grouping metadata. Default: 'all'. Performance: 'first' is fastest (smallest response), 'all' preserves grouping, 'flatten' is lightest (no group metadata).","enum":["first","all","flatten"],"examples":["first","all","flatten"],"title":"Output Mode","type":"string"}},"title":"GroupByConfig","type":"object"}},{"stage_id":"limit","description":"Truncate results to a maximum count with optional offset","category":"reduce","icon":"arrow-down-to-line","parameter_schema":{"description":"Configuration for the limit (truncation) stage.\n\n**Stage Category**: REDUCE\n\n**Transformation**: N documents → min(N, limit) documents\n\n**Purpose**: Truncates the document set to a maximum number of results,\noptionally with an offset to skip leading documents. This is the retriever\npipeline equivalent of SQL's LIMIT/OFFSET clause.\n\n**When to Use**:\n    - Cap results after expensive reranking/enrichment stages\n    - Implement pagination in multi-stage pipelines\n    - Reduce document count before costly LLM stages\n    - Return a fixed number of top results regardless of input size\n    - Skip already-processed documents in iterative retrieval\n\n**When NOT to Use**:\n    - For random sampling (use sample stage instead)\n    - For filtering by criteria (use attribute_filter or llm_filter)\n    - For the initial retrieval limit (set limit in feature_search directly)\n    - When you need statistical reduction (use aggregate)\n\n**Operational Behavior**:\n    - Operates on in-memory document results (no database queries)\n    - Preserves document order (takes first N after optional offset)\n    - Constant time operation O(1) for slicing\n    - Does not modify document schema or scores\n\n**Common Pipeline Position**: FILTER → SORT → REDUCE (this stage)\n\nRequirements:\n    - limit: REQUIRED, maximum number of documents to return\n    - offset: OPTIONAL, number of leading documents to skip (default: 0)\n\nUse Cases:\n    - Top-K results: Limit to 10 best results after reranking\n    - Pagination: offset=20, limit=10 for page 3\n    - Cost control: Limit before expensive LLM enrichment\n    - Fixed output: Guarantee exactly N results for downstream consumers","examples":[{"description":"Return top 10 results","limit":10},{"description":"Return top 5 results after skipping first 10","limit":5,"offset":10},{"description":"Page 3 of results (10 per page)","limit":10,"offset":20},{"description":"Cap at 100 for downstream LLM processing","limit":100,"offset":0},{"description":"Single best result","limit":1}],"properties":{"limit":{"default":10,"description":"REQUIRED. Maximum number of documents to return. If the input has fewer documents than limit, all are returned. Must be at least 1 and at most 10000.","examples":[5,10,25,50,100],"maximum":10000,"minimum":1,"title":"Limit","type":"integer"},"offset":{"default":0,"description":"OPTIONAL. Number of documents to skip from the beginning before applying the limit. Default: 0 (no skip). Combined with limit, enables pagination-style access. Example: offset=20, limit=10 returns documents 21-30.","examples":[0,10,20,50,100],"maximum":10000,"minimum":0,"title":"Offset","type":"integer"}},"title":"LimitStageConfig","type":"object"}},{"stage_id":"moment_group","description":"Merge temporal intervals into video moments","category":"reduce","icon":"film","parameter_schema":{"description":"Configuration for grouping search results into video moments.\n\nStage Category: REDUCE\n\nTransformation: N frame/chunk documents → M moments (where M ≤ N)\n\nPurpose: Groups frame-level or chunk-level search results by their\nparent video/object, then merges overlapping or adjacent temporal\nintervals into consolidated moments. Each moment has a start/end\ntime range and an aggregated relevance score.\n\nDesigned for the decompose/search/recompose workflow where videos\nare decomposed into frames at ingest, searched individually via\nfeature_search, then recomposed into meaningful time ranges here.\n\nPerformance: Runs in API layer (fast stage, O(n log n) per parent\nfor interval sort + single-pass merge). Typically <10ms for 500 docs.\n\nWhen to Use:\n    - After feature_search on video frame embeddings\n    - To return \"moments\" (time ranges) instead of raw frame docs\n    - To consolidate nearby matches into single time ranges\n    - For video highlight / clip extraction workflows\n\nWhen NOT to Use:\n    - For non-temporal grouping (use group_by stage)\n    - For time-series aggregation (use temporal stage)\n    - For initial retrieval (use FILTER stages)\n\nCommon Pipeline Position: FILTER (feature_search) → REDUCE (moment_group)\n\nRequirements:\n    - Results must have query_chunks with start_ms/end_ms (from\n      feature_search preprocessing) OR document-level time fields\n    - parent_field must exist on documents for grouping","examples":[{"description":"Video scene search: merge nearby frame matches into moments","max_moments_per_parent":5,"merge_tolerance_ms":2000,"output_mode":"annotated","parent_field":"source_object_id","score_strategy":"max","time_field":"query_chunks"},{"description":"Tight moment extraction with no gap tolerance","max_moments_per_parent":20,"merge_tolerance_ms":0,"min_score":0.5,"output_mode":"moments_only","parent_field":"video_id","score_strategy":"avg","time_field":"query_chunks"}],"properties":{"parent_field":{"default":"source_object_id","description":"Field to group results by parent object (video). Documents with the same parent_field value are grouped together before interval merging. Use 'source_object_id' for standard ingest lineage, 'root_object_id' for top-level ancestor, or any custom field.","examples":["source_object_id","root_object_id","video_id"],"title":"Parent Field","type":"string"},"time_field":{"default":"query_chunks","description":"Source of temporal intervals. 'query_chunks' reads the array attached by feature_search preprocessing (each entry has start_ms, end_ms, score). For document-level timestamps, use a dot-path like 'metadata.start_ms' — the stage will look for a corresponding end field by replacing 'start' with 'end'.","examples":["query_chunks","metadata.start_ms","frame_timestamp_ms"],"title":"Time Field","type":"string"},"merge_tolerance_ms":{"default":2000,"description":"Maximum gap in milliseconds between intervals to merge them into a single moment. Set to 0 for exact overlap merging only. 2000ms (default) works well for scene-level video search.","examples":[0,1000,2000,5000],"maximum":60000,"minimum":0,"title":"Merge Tolerance Ms","type":"integer"},"max_moments_per_parent":{"default":10,"description":"Maximum number of moments to return per parent object. Moments are sorted by score (highest first) before limiting.","examples":[3,5,10,20],"maximum":100,"minimum":1,"title":"Max Moments Per Parent","type":"integer"},"score_strategy":{"default":"max","description":"How to aggregate scores when merging overlapping intervals. 'max': Use the highest score among merged intervals. 'avg': Average all scores. 'sum': Sum all scores.","enum":["max","avg","sum"],"title":"Score Strategy","type":"string"},"min_score":{"anyOf":[{"maximum":1.0,"minimum":0.0,"type":"number"},{"type":"null"}],"default":null,"description":"Minimum score threshold. Moments below this are dropped.","title":"Min Score"},"output_mode":{"default":"annotated","description":"'annotated': Keep the best-scoring document per parent and attach a 'moments' array as an extra field. Preserves all original document fields. 'moments_only': Return lightweight synthetic documents containing only moment data and parent reference.","enum":["annotated","moments_only"],"title":"Output Mode","type":"string"}},"title":"MomentGroupConfig","type":"object"}},{"stage_id":"sample","description":"Sample a subset of documents using random or stratified sampling","category":"reduce","icon":"shuffle","parameter_schema":{"description":"Configuration for document sampling.\n\n**Stage Category**: REDUCE\n\n**Transformation**: N documents → M documents (where M ≤ N)\n\n**Purpose**: Sample a subset of documents using random or stratified sampling.\nOperates on in-memory results from previous stages.\n\n**When to Use**:\n    - A/B testing different pipeline configurations\n    - Reducing result set for expensive downstream stages\n    - Exploration and discovery features\n    - Ensuring proportional representation across categories\n    - Creating reproducible experiments with seeded sampling\n\n**When NOT to Use**:\n    - When you need all results (just use previous stage output)\n    - For ranking/reordering (use SORT stages)\n    - For filtering by criteria (use FILTER stages)\n\n**Sampling Strategies**:\n    - `random`: Uniform random sampling\n    - `stratified`: Proportional sampling across a field's values\n    - `reservoir`: Reservoir sampling (for streaming scenarios)\n\n**Common Pipeline Position**: feature_search → (expensive stages) → sample\n\nExamples:\n    Basic random sampling:\n        ```json\n        {\n            \"count\": 10,\n            \"strategy\": \"random\"\n        }\n        ```\n\n    Stratified sampling by category:\n        ```json\n        {\n            \"count\": 20,\n            \"strategy\": \"stratified\",\n            \"stratify_by\": \"metadata.category\"\n        }\n        ```\n\n    Reproducible sampling with seed:\n        ```json\n        {\n            \"count\": 50,\n            \"strategy\": \"random\",\n            \"seed\": 42\n        }\n        ```\n\n    Preserve top results, sample rest:\n        ```json\n        {\n            \"count\": 10,\n            \"strategy\": \"random\",\n            \"preserve_top_k\": 3\n        }\n        ```","examples":[{"count":10,"description":"Basic random sampling (simplest usage)","strategy":"random"},{"count":20,"description":"Stratified sampling by category","min_per_stratum":2,"strategy":"stratified","stratify_by":"metadata.category"},{"count":50,"description":"Reproducible sampling with seed","seed":42,"strategy":"random"},{"count":10,"description":"Preserve top 3, sample 7 more randomly","preserve_top_k":3,"strategy":"random"},{"count":100,"description":"Large sample with reservoir sampling","seed":12345,"strategy":"reservoir"}],"properties":{"count":{"default":10,"description":"REQUIRED. Number of documents to sample. If count > available documents, returns all documents.","examples":[5,10,25,50,100],"maximum":1000,"minimum":1,"title":"Count","type":"integer"},"strategy":{"default":"random","description":"OPTIONAL. Sampling strategy:\n- 'random': Uniform random sampling (default)\n- 'stratified': Proportional sampling across stratify_by field values\n- 'reservoir': Reservoir sampling (memory-efficient for large sets)","enum":["random","stratified","reservoir"],"examples":["random","stratified","reservoir"],"title":"Strategy","type":"string"},"stratify_by":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Field to stratify on (required when strategy='stratified'). Samples proportionally from each unique value of this field. Supports dot notation for nested fields.","examples":["metadata.category","metadata.source","collection_id"],"title":"Stratify By"},"min_per_stratum":{"default":1,"description":"OPTIONAL. Minimum documents per stratum (stratified mode). Ensures each category gets at least this many documents.","examples":[1,2,5],"minimum":0,"title":"Min Per Stratum","type":"integer"},"seed":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"OPTIONAL. Random seed for reproducible sampling. Same seed + same input = same output. Leave None for non-deterministic sampling.","examples":[42,12345,0],"title":"Seed"},"preserve_top_k":{"default":0,"description":"OPTIONAL. Always keep the top K documents by score, sample from remainder. Useful when you want to guarantee top results are included. Default: 0 (no preservation, sample from all).","examples":[0,3,5,10],"minimum":0,"title":"Preserve Top K","type":"integer"}},"title":"SampleStageConfig","type":"object"}},{"stage_id":"score_threshold","description":"Drop results below an absolute score; returns empty when none qualify","category":"reduce","icon":"filter","parameter_schema":{"description":"Configuration for the absolute score-threshold (quality gate) stage.\n\n**Stage Category**: REDUCE (N documents → ≤ N documents)\n\n**Transformation**: drops documents whose `score_field` fails `comparison`\nagainst `min_score`. The set may become empty.\n\n**Why this exists** (vs `score_normalize` + `attribute_filter`):\n    - `score_normalize` (min_max) always rescales the best result to 1.0, so\n      a threshold on the normalized score can NEVER reject an all-bad set —\n      the top match is always 1.0. This stage gates on the RAW/calibrated\n      score so \"everything is below the bar\" is expressible.\n    - It emits `all_below_threshold` so a client can branch to a\n      no-results / adjacent-suggestion UX deterministically.\n\n**When to Use**:\n    - Suppress weak matches so editors aren't shown unusable results\n    - Drive a \"no good results\" nudge when nothing clears the bar\n    - Hard-gate on a calibrated score (the `rerank` cross-encoder score is\n      ideal: `score_field=\"scores.rerank\"`)\n\n**When NOT to Use**:\n    - To rescale scores for comparison (use `score_normalize`)\n    - To keep top-N regardless of quality (use `limit`)\n    - For metadata filtering (use `attribute_filter`)\n\n**Operational Behavior**:\n    - In-memory, no external calls; O(N).\n    - Preserves input order of the surviving documents.\n    - Does not modify scores or schema — only filters.","examples":[{"description":"Suppress weak matches on the rerank score","min_score":0.5,"score_field":"scores.rerank"},{"comparison":"gt","description":"Hard cosine gate, strictly above 0.7","min_score":0.7}],"properties":{"min_score":{"description":"REQUIRED. Absolute minimum score a document must meet to be kept. Documents below this (per `comparison`) are dropped. Threshold on a calibrated score (e.g. the rerank cross-encoder score) for meaningful, stable values.","examples":[0.3,0.5,0.7],"title":"Min Score","type":"number"},"score_field":{"default":"score","description":"Score field to gate on. Supports dot-paths into the document (e.g. 'scores.rerank', 'metadata.quality'). Default: 'score'.","examples":["score","scores.rerank"],"title":"Score Field","type":"string"},"comparison":{"default":"gte","description":"Keep docs whose score is >= (gte) or strictly > (gt) min_score.","enum":["gte","gt"],"title":"Comparison","type":"string"},"missing_score":{"default":"drop","description":"What to do with documents that lack `score_field`. 'drop' (default) removes them; 'keep' retains them regardless of the threshold.","enum":["drop","keep"],"title":"Missing Score","type":"string"}},"required":["min_score"],"title":"ScoreThresholdStageConfig","type":"object"}},{"stage_id":"summarize","description":"Condense multiple documents into a summary using an LLM","category":"reduce","icon":"file-text","parameter_schema":{"$defs":{"LLMProvider":{"description":"Supported LLM providers for content generation.\n\nEach provider has different strengths, pricing, and multimodal capabilities.\nChoose based on your use case, performance requirements, and budget.\n\nValues:\n    OPENAI: OpenAI GPT models (GPT-4o, GPT-4.1, O3-mini)\n        - Best for: General purpose, vision tasks, structured outputs\n        - Multimodal: Text, images\n        - Performance: Fast (100-500ms), reliable\n        - Cost: Moderate to high ($0.15-$10 per 1M tokens)\n        - Use when: Need high-quality generation with vision support\n\n    GOOGLE: Google Gemini models (Gemini 3.1 Flash Lite, Gemini 2.5 Pro)\n        - Best for: Fast generation, video understanding, cost-efficiency\n        - Multimodal: Text, images, video, audio, PDFs\n        - Performance: Very fast (50-200ms)\n        - Cost: Low to moderate ($0.075-$0.40 per 1M tokens)\n        - Use when: Need video/audio/PDF support or cost-efficiency\n\n    ANTHROPIC: Anthropic Claude models (Claude 3.5 Sonnet, Claude 3.5 Haiku)\n        - Best for: Long context, complex reasoning, safety\n        - Multimodal: Text, images\n        - Performance: Moderate (200-800ms)\n        - Cost: Moderate to high ($0.25-$15 per 1M tokens)\n        - Use when: Need long context or complex reasoning\n\nExamples:\n    - Use OPENAI for production with structured JSON outputs\n    - Use GOOGLE for video summarization and cost-sensitive workloads\n    - Use ANTHROPIC for complex reasoning with long documents","enum":["openai","google","anthropic"],"title":"LLMProvider","type":"string"}},"description":"Configuration for multi-document summarization.\n\n**Stage Category**: REDUCE\n\n**Transformation**: N documents → 1 document (or N → M with group_by)\n\n**Purpose**: Condense multiple documents into a single summary using an LLM.\nUnlike llm_enrich which processes each document independently, Summarize\nprovides all documents to the LLM in a single call, enabling cross-document\nsynthesis and comparison.\n\n**When to Use**:\n    - Generate a single answer from search results (RAG output)\n    - Create executive summaries from multiple sources\n    - Synthesize information that spans multiple documents\n    - Reduce result set to key findings\n\n**When NOT to Use**:\n    - Adding fields to each document (use llm_enrich instead)\n    - Simple filtering based on content (use llm_filter instead)\n    - When you need to preserve individual documents\n\n**Common Pipeline Position**: feature_search → rerank → summarize\n\n**Template Variables**:\n    - `{{DOCUMENTS}}`: Formatted list of all documents (required in prompt)\n    - `{{DOC_COUNT}}`: Number of documents being summarized\n    - `{{INPUT.*}}`: Access query inputs\n    - `{{CONTEXT.*}}`: Access execution context\n\nExamples:\n    Basic summarization:\n        ```json\n        {\n            \"prompt\": \"Summarize these {{DOC_COUNT}} search results:\\n\\n{{DOCUMENTS}}\",\n            \"provider\": \"google\",\n            \"model_name\": \"gemini-2.5-flash-lite\"\n        }\n        ```\n\n    Question-answering from search results:\n        ```json\n        {\n            \"prompt\": \"Answer this question: {{INPUT.question}}\\n\\nBased on these documents:\\n{{DOCUMENTS}}\",\n            \"provider\": \"openai\",\n            \"model_name\": \"gpt-4o\",\n            \"include_sources\": true\n        }\n        ```\n\n    Per-category summarization:\n        ```json\n        {\n            \"prompt\": \"Summarize documents about {{GROUP_VALUE}}:\\n\\n{{DOCUMENTS}}\",\n            \"provider\": \"openai\",\n            \"model_name\": \"gpt-4o-mini\",\n            \"group_by\": \"metadata.category\"\n        }\n        ```","examples":[{"description":"Basic summarization (recommended format)","model_name":"gemini-2.5-flash-lite","prompt":"Summarize these {{DOC_COUNT}} documents:\n\n{{DOCUMENTS}}","provider":"google"},{"description":"Question-answering from search results","include_sources":true,"model_name":"gpt-4o","prompt":"Answer this question based on the search results:\n\nQuestion: {{INPUT.question}}\n\nSearch Results:\n{{DOCUMENTS}}\n\nProvide a comprehensive answer with citations.","provider":"openai","temperature":0.2},{"description":"Per-category summarization","group_by":"metadata.category","model_name":"gpt-4o-mini","prompt":"Create a summary for the '{{GROUP_VALUE}}' category:\n\n{{DOCUMENTS}}","provider":"openai"},{"description":"Structured summary with key points","model_name":"claude-haiku-4-5-20251001","output_schema":{"properties":{"summary":{"type":"string"},"key_points":{"items":{"type":"string"},"type":"array"},"confidence":{"type":"number"}},"required":["summary","key_points"],"type":"object"},"prompt":"Analyze these documents and extract key findings:\n\n{{DOCUMENTS}}","provider":"anthropic"},{"description":"Legacy format (deprecated but supported)","inference_name":"openai:gpt-4o-mini","prompt":"Summarize:\n\n{{DOCUMENTS}}"}],"properties":{"prompt":{"default":"Summarize the following {{DOC_COUNT}} documents concisely:\n\n{{DOCUMENTS}}","description":"REQUIRED. Prompt template for the LLM. Must include {{DOCUMENTS}} placeholder. \n\nAvailable placeholders:\n- {{DOCUMENTS}}: Formatted list of all documents\n- {{DOC_COUNT}}: Number of documents\n- {{GROUP_VALUE}}: Current group value (when using group_by)\n- {{INPUT.*}}: Query input values\n- {{CONTEXT.*}}: Execution context","examples":["Summarize these {{DOC_COUNT}} results:\n\n{{DOCUMENTS}}","Answer: {{INPUT.question}}\n\nBased on:\n{{DOCUMENTS}}","Key findings from {{DOC_COUNT}} documents:\n\n{{DOCUMENTS}}"],"title":"Prompt","type":"string"},"provider":{"anyOf":[{"$ref":"#/$defs/LLMProvider"},{"type":"null"}],"default":null,"description":"LLM provider to use. Supported providers:\n- openai: GPT models (GPT-4o, GPT-4o-mini)\n- google: Gemini models (Gemini 3.1 Flash Lite)\n- anthropic: Claude models (Claude 3.5 Sonnet/Haiku)\n\nIf not specified, defaults to 'google'. Can be auto-inferred from model_name.","examples":["openai","google","anthropic"]},"model_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Specific LLM model to use. If not specified, uses provider default.\nExamples: gemini-2.5-flash-lite, gpt-4o-mini, gpt-4o","examples":["gemini-2.5-flash-lite","gpt-4o-mini","gpt-4o","claude-haiku-4-5-20251001"],"title":"Model Name"},"inference_name":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"deprecated":true,"description":"DEPRECATED: Use 'provider' and 'model_name' instead.\nLegacy format: 'provider:model' (e.g., 'gemini:gemini-2.5-flash-lite').\nKept for backward compatibility only.","title":"Inference Name"},"document_template":{"default":"[{{INDEX}}] {{DOC.content}}\n","description":"OPTIONAL. Template for formatting each document in {{DOCUMENTS}}. Default: '[{{INDEX}}] {{DOC.content}}\\n'. \n\nAvailable placeholders:\n- {{INDEX}}: 1-based document index\n- {{DOC.*}}: Any document field (e.g., {{DOC.content}}, {{DOC.metadata.title}})","examples":["[{{INDEX}}] {{DOC.content}}\n","Document {{INDEX}}: {{DOC.metadata.title}}\n{{DOC.content}}\n\n","Source: {{DOC.metadata.source}}\n{{DOC.content}}\n---\n"],"title":"Document Template","type":"string"},"content_field":{"default":"content","description":"OPTIONAL. Primary field to extract content from each document. Used when {{DOC.content}} is referenced in document_template. Supports dot notation for nested fields.","examples":["content","metadata.description","text","body"],"title":"Content Field","type":"string"},"group_by":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Field to group documents by before summarization. When set, creates one summary per unique group value (N→M transformation). When not set, creates one summary for all documents (N→1 transformation). \n\nUse cases:\n- 'metadata.category': One summary per category\n- 'metadata.source': One summary per source\n- 'metadata.date': One summary per date","examples":["metadata.category","metadata.source","collection_id"],"title":"Group By"},"output_field":{"default":"summary","description":"OPTIONAL. Field name for the summary in the output document. Default: 'summary'.","examples":["summary","answer","synthesis","key_findings"],"title":"Output Field","type":"string"},"include_sources":{"default":true,"description":"OPTIONAL. Include source document IDs in output. When true, adds 'source_document_ids' field to output. Useful for citation and attribution.","title":"Include Sources","type":"boolean"},"include_metadata":{"default":true,"description":"OPTIONAL. Include metadata about summarization in output. Adds 'document_count', 'tokens_used', etc.","title":"Include Metadata","type":"boolean"},"max_input_tokens":{"default":8000,"description":"OPTIONAL. Maximum tokens to use for input documents. Documents exceeding this limit are truncated using truncation_strategy. Default: 8000 (safe for most models).","examples":[4000,8000,16000,32000],"maximum":128000,"minimum":100,"title":"Max Input Tokens","type":"integer"},"truncation_strategy":{"default":"drop_last","description":"OPTIONAL. How to handle documents exceeding max_input_tokens. \n\nStrategies:\n- 'drop_last': Include documents in order until limit, drop remaining\n- 'truncate_each': Give each document equal token budget, truncate individually\n- 'smart': Prioritize by relevance score, truncate lower-scored documents first","enum":["drop_last","truncate_each","smart"],"examples":["drop_last","truncate_each","smart"],"title":"Truncation Strategy","type":"string"},"temperature":{"default":0.3,"description":"OPTIONAL. LLM temperature for summary generation. Lower values (0.1-0.3) produce more focused, deterministic summaries. Higher values (0.7-1.0) produce more creative, varied summaries. Default: 0.3 (factual summarization).","examples":[0.0,0.3,0.7,1.0],"maximum":2.0,"minimum":0.0,"title":"Temperature","type":"number"},"max_output_tokens":{"anyOf":[{"maximum":16000,"minimum":10,"type":"integer"},{"type":"null"}],"default":1024,"description":"OPTIONAL. Maximum tokens for the summary output. Default: 1024.","examples":[256,512,1024,2048],"title":"Max Output Tokens"},"output_schema":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"OPTIONAL. JSON schema for structured output. When provided, LLM output is parsed as JSON matching this schema.","examples":[{"properties":{"summary":{"type":"string"},"key_points":{"items":{"type":"string"},"type":"array"}},"type":"object"}],"title":"Output Schema"}},"title":"SummarizeStageConfig","type":"object"}},{"stage_id":"temporal","description":"Group documents by time windows and compute trend aggregations","category":"reduce","icon":"clock","parameter_schema":{"$defs":{"DriftDetectionConfig":{"description":"Configuration for detecting changes between consecutive time windows.\n\nWhen enabled, computes absolute and percent change between consecutive\nwindows for a specified metric, optionally flagging significant changes.","properties":{"enabled":{"default":true,"description":"Whether to compute drift metrics between consecutive windows.","title":"Enabled","type":"boolean"},"metric":{"default":"count","description":"Which aggregation alias to compute drift on. Must match an alias from the aggregations list.","examples":["count","avg_score","total_views"],"title":"Metric","type":"string"},"threshold":{"anyOf":[{"minimum":0.0,"type":"number"},{"type":"null"}],"default":null,"description":"Absolute percent change threshold to flag as significant drift. E.g., 50.0 means flag windows with >50%% change from previous. None means compute drift but don't flag.","title":"Threshold"}},"title":"DriftDetectionConfig","type":"object"},"TemporalAggregation":{"description":"Aggregation to compute per time window.\n\nExamples:\n    Count documents per window:\n        ```json\n        {\"function\": \"count\", \"alias\": \"count\"}\n        ```\n\n    Average score per window:\n        ```json\n        {\"function\": \"avg\", \"field\": \"score\", \"alias\": \"avg_score\"}\n        ```","examples":[{"alias":"count","function":"count"},{"alias":"avg_score","field":"score","function":"avg"},{"alias":"unique_authors","field":"metadata.author","function":"count_distinct"}],"properties":{"function":{"description":"REQUIRED. Aggregation function to apply per time window. count: Count documents in each window. sum/avg/min/max: Numeric aggregations on a field. count_distinct: Count unique values of a field. collect_distinct: Collect unique values into a list.","enum":["count","sum","avg","min","max","count_distinct","collect_distinct"],"examples":["count","sum","avg","min","max","count_distinct"],"title":"Function","type":"string"},"field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Field path to aggregate (dot notation supported). Required for all functions except count. Examples: 'score', 'metadata.views', 'payload.price'.","examples":["score","metadata.views","payload.price"],"title":"Field"},"alias":{"description":"REQUIRED. Name for this metric in the output. Must be unique within the stage configuration. Used as the key in the per-window metrics.","examples":["count","avg_score","total_views"],"title":"Alias","type":"string"}},"required":["function","alias"],"title":"TemporalAggregation","type":"object"},"TimeWindow":{"description":"Supported time window granularities.","enum":["hour","day","week","month","quarter","year"],"title":"TimeWindow","type":"string"}},"description":"Configuration for the temporal REDUCE stage.\n\n┌─────────────────────────────────────────────────────────────────────────────┐\n│ Stage Category: REDUCE                                                      │\n│                                                                             │\n│ Transformation: N documents → M time-window results                         │\n│                                                                             │\n│ This stage operates on IN-MEMORY pipeline results, NOT the database.        │\n│ Use this for trend analysis on already-retrieved documents.                 │\n└─────────────────────────────────────────────────────────────────────────────┘\n\nPurpose:\n    Group documents by time windows and compute per-window aggregations.\n    Useful for trend analysis, drift detection, and temporal patterns.\n\nWhen to Use:\n    - Analyze how document counts or scores change over time\n    - Detect spikes or drops in activity across time windows\n    - Build time-series visualizations from search results\n    - Identify temporal patterns in retrieved documents\n\nWhen NOT to Use:\n    - For full-collection time-series (use aggregation API directly)\n    - The temporal stage only sees pipeline results, not full dataset\n\nPerformance:\n    - Fast: In-memory Python operations on already-fetched documents\n    - No database queries\n    - Suitable for up to ~10K documents\n\nCommon Pipeline Position:\n    FILTER → SORT → REDUCE (this stage)","examples":[{"aggregations":[{"alias":"count","function":"count"}],"description":"Daily document count with drift detection","drift":{"enabled":true,"metric":"count","threshold":50.0},"time_field":"created_at","window":"day"},{"aggregations":[{"alias":"total","function":"count"},{"alias":"avg_score","field":"score","function":"avg"}],"description":"Monthly average score trend","limit":12,"sort_order":"desc","time_field":"metadata.timestamp","window":"month"}],"properties":{"time_field":{"description":"REQUIRED. Field containing the timestamp (ISO 8601 string or epoch). Dot notation supported. Examples: 'created_at', 'metadata.timestamp', 'payload.published_date'.","examples":["created_at","metadata.timestamp","payload.published_date"],"title":"Time Field","type":"string"},"window":{"$ref":"#/$defs/TimeWindow","default":"day","description":"Time window granularity for grouping."},"aggregations":{"default":[{"function":"count","field":null,"alias":"count"}],"description":"List of aggregations to compute per time window. At least one aggregation is required.","items":{"$ref":"#/$defs/TemporalAggregation"},"minItems":1,"title":"Aggregations","type":"array"},"drift":{"anyOf":[{"$ref":"#/$defs/DriftDetectionConfig"},{"type":"null"}],"default":null,"description":"OPTIONAL. Drift detection between consecutive windows. When enabled, computes absolute and percent change for a metric."},"sort_order":{"default":"asc","description":"Sort order for time windows (asc = oldest first).","enum":["asc","desc"],"title":"Sort Order","type":"string"},"limit":{"anyOf":[{"maximum":1000,"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"Maximum number of time windows to return.","title":"Limit"},"include_documents":{"default":false,"description":"OPTIONAL. Whether to include the original documents in output. False (default): Only return window results in metadata. True: Pass through documents and add window results to metadata.","title":"Include Documents","type":"boolean"}},"required":["time_field"],"title":"TemporalConfig","type":"object"}},{"stage_id":"sort_attribute","description":"Sort documents by attribute","category":"sort","icon":"arrow-up-down","parameter_schema":{"$defs":{"SortDirection":{"description":"Sort direction options.","enum":["asc","desc"],"title":"SortDirection","type":"string"}},"description":"Configuration for sorting documents by an attribute field.\n\n**Stage Category**: SORT\n\n**Transformation**: N documents → N documents (same docs, different order, same schema)\n\n**Purpose**: Reorders documents in the pipeline based on any document attribute\n(not just relevance scores). Use this for sorting by metadata fields like dates,\npopularity, priority, or custom attributes.\n\n**When to Use**:\n    - Sort by timestamps (created_at, updated_at, published_date)\n    - Sort by numeric metadata (popularity, view_count, rating, price)\n    - Sort by string attributes (title, category) with optional case-insensitive comparison\n    - Apply business logic ordering (priority, status)\n    - Secondary sorting after relevance-based retrieval\n\n**When NOT to Use**:\n    - For sorting by relevance/similarity scores (use sort_relevance instead)\n    - For initial document retrieval (use FILTER stages)\n    - For removing documents (use FILTER stages)\n    - For enriching documents (use APPLY stages)\n\n**Operational Behavior**:\n    - Operates on in-memory document results (no database queries)\n    - Maintains all documents, just changes their order\n    - Fast operation (simple in-memory sort)\n    - Does not change document count or schema\n    - Handles null values gracefully (configurable placement)\n\n**Common Pipeline Position**: FILTER → SORT (this stage) → APPLY\n\nRequirements:\n    - field: REQUIRED, document field path to sort on\n    - direction: OPTIONAL, defaults to descending\n    - nulls_last: OPTIONAL, defaults to true (nulls at end)\n\nUse Cases:\n    - Recent content first: Sort by published_date desc\n    - Popular content: Sort by view_count or popularity score\n    - Alphabetical: Sort by title or name\n    - Priority-based: Sort by urgency or importance ratings\n    - Temporal ordering: Sort by event timestamps","examples":[{"description":"Sort by most recent published date first","direction":"desc","field":"metadata.published_date","nulls_last":true},{"description":"Sort by lowest price first (budget-friendly)","direction":"asc","field":"metadata.price","nulls_last":true},{"description":"Sort by highest popularity/view count","direction":"desc","field":"metadata.view_count","nulls_last":true},{"description":"Sort alphabetically by title A-Z","direction":"asc","field":"title","nulls_last":false},{"description":"Sort by priority score (business logic)","direction":"desc","field":"metadata.priority_score"}],"properties":{"field":{"default":"metadata.created_at","description":"Document field path to sort on. Use dot notation for nested fields (e.g., 'metadata.release_date'). Supports template expressions for dynamic field selection. Can sort by strings, numbers, dates, or booleans. Examples: 'metadata.created_at', 'metadata.popularity', 'title'.","examples":["metadata.published_date","metadata.popularity","metadata.price","title","metadata.priority_score"],"title":"Field","type":"string"},"direction":{"$ref":"#/$defs/SortDirection","default":"desc","description":"OPTIONAL. Sort direction. 'desc' (default): Highest/latest values first (Z-A, 100-0, newest-oldest). 'asc': Lowest/earliest values first (A-Z, 0-100, oldest-newest).","examples":["desc","asc"]},"nulls_last":{"default":true,"description":"OPTIONAL. Whether documents with null/missing field values should be placed at the end of results regardless of sort direction. true (default): Nulls always at the end. false: Nulls follow natural sort order (beginning for asc, end for desc).","examples":[true,false],"title":"Nulls Last","type":"boolean"}},"title":"SortAttributeConfig","type":"object"}},{"stage_id":"mmr","description":"Reorder results using Maximal Marginal Relevance for diversity","category":"sort","icon":"shuffle","parameter_schema":{"$defs":{"DiversityFeatureConfig":{"description":"Configuration for a single feature used in multi-feature diversity computation.\n\nWhen using multi-feature diversity mode, you can specify multiple embedding spaces\nand weight their contribution to the overall diversity score.\n\nExample:\n    Combine text and image embeddings with different weights:\n    ```python\n    features = [\n        DiversityFeatureConfig(\n            feature_uri=\"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1\",\n            weight=0.6\n        ),\n        DiversityFeatureConfig(\n            feature_uri=\"mixpeek://clip@v1/image_embedding\",\n            weight=0.4\n        )\n    ]\n    ```","examples":[{"description":"Text embedding with high weight","feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","weight":0.7},{"description":"Image embedding with lower weight","feature_uri":"mixpeek://clip@v1/image_embedding","weight":0.3}],"properties":{"feature_uri":{"description":"REQUIRED. Feature URI specifying which embedding to use for diversity. Format: 'mixpeek://extractor@version/output'. The embedding must exist on documents from the previous stage. Use state.context to find available embeddings from feature_search.","examples":["mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","mixpeek://clip@v1/image_embedding","mixpeek://multimodal@v1/embedding"],"title":"Feature Uri","type":"string"},"weight":{"default":1.0,"description":"OPTIONAL. Weight for this feature's contribution to diversity score. Weights across all features are normalized to sum to 1.0. Higher weight = more influence on diversity computation.","examples":[0.5,0.7,1.0],"maximum":1.0,"minimum":0.0,"title":"Weight","type":"number"}},"required":["feature_uri"],"title":"DiversityFeatureConfig","type":"object"}},"description":"Configuration for MMR (Maximal Marginal Relevance) result diversification.\n\n**Stage Category**: SORT\n\n**Transformation**: N documents → N documents (reordered for diversity)\n\n**Purpose**: Reorders search results to balance relevance with diversity,\npreventing result sets dominated by near-duplicate or highly similar content.\nParticularly valuable for multimodal search where visual/semantic similarity\ncan lead to repetitive results.\n\n**When to Use**:\n    - After feature_search when results may contain near-duplicates\n    - When users expect variety in search results\n    - For recommendation systems needing diverse suggestions\n    - When different facets of a query should be represented\n\n**When NOT to Use**:\n    - When exact relevance ranking is critical (use sort_relevance)\n    - For small result sets (<5 documents) where diversity matters less\n    - When duplicates have already been removed via group_by\n\n**Three Diversity Modes** (mutually exclusive):\n\n| Mode | Config Field | Description |\n|------|--------------|-------------|\n| Single Feature | `diversity_feature_uri` | Diversity in one embedding space |\n| Multi-Feature | `diversity_features` | Weighted fusion across spaces |\n| Attribute-Based | `diversity_fields` | Diversify by metadata values |\n\n**Mode Selection Logic**:\n    1. If `diversity_feature_uri` is set → Single Feature mode\n    2. If `diversity_features` is set → Multi-Feature mode\n    3. If `diversity_fields` is set → Attribute-Based mode\n    4. If none set → Auto-detect from previous feature_search stage\n\n**Common Pipeline Position**: feature_search → mmr → (optional rerank)\n\nExamples:\n    Single feature diversity (simplest):\n        ```json\n        {\n            \"lambda_\": 0.7,\n            \"top_k\": 25,\n            \"diversity_feature_uri\": \"mixpeek://clip@v1/image_embedding\"\n        }\n        ```\n\n    Multi-feature diversity (multimodal):\n        ```json\n        {\n            \"lambda_\": 0.6,\n            \"top_k\": 20,\n            \"diversity_features\": [\n                {\"feature_uri\": \"mixpeek://text@v1/embedding\", \"weight\": 0.5},\n                {\"feature_uri\": \"mixpeek://clip@v1/embedding\", \"weight\": 0.5}\n            ]\n        }\n        ```\n\n    Attribute-based diversity (no embeddings):\n        ```json\n        {\n            \"lambda_\": 0.5,\n            \"top_k\": 30,\n            \"diversity_fields\": [\"metadata.category\", \"metadata.source\"]\n        }\n        ```","examples":[{"description":"Basic MMR with auto-detected feature (simplest usage)","lambda":0.7,"top_k":25},{"description":"Single feature diversity with explicit embedding","diversity_feature_uri":"mixpeek://clip@v1/image_embedding","lambda":0.7,"top_k":25},{"description":"Multi-feature diversity for multimodal content","diversity_features":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","weight":0.5},{"feature_uri":"mixpeek://clip@v1/image_embedding","weight":0.5}],"lambda":0.6,"top_k":20},{"description":"Attribute-based diversity by category and source","diversity_fields":["metadata.category","metadata.source"],"lambda":0.5,"top_k":30},{"description":"High diversity setting for exploratory search","diversity_feature_uri":"mixpeek://multimodal@v1/embedding","lambda":0.4,"top_k":50}],"properties":{"lambda":{"default":0.7,"description":"OPTIONAL. Balance between relevance and diversity (default: 0.7). Higher values favor relevance, lower values favor diversity. \n\nGuidelines:\n- 1.0: Pure relevance (no diversity, equivalent to no MMR)\n- 0.7-0.8: Slight diversity while maintaining relevance (recommended)\n- 0.5: Balanced relevance and diversity\n- 0.3-0.4: High diversity, may sacrifice some relevance\n- 0.0: Maximum diversity (ignores relevance scores)","examples":[0.5,0.6,0.7,0.8],"maximum":1.0,"minimum":0.0,"title":"Lambda","type":"number"},"top_k":{"default":25,"description":"OPTIONAL. Number of documents to return after MMR reordering (default: 25). MMR processes all input documents but returns only top_k. Set to match your UI's result count for optimal diversity.","examples":[10,25,50],"maximum":500,"minimum":1,"title":"Top K","type":"integer"},"diversity_feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Single embedding feature URI for diversity computation. Use this for simple cases where one embedding space captures similarity. \n\nIf not specified and no other mode is set, auto-detects from the previous feature_search stage's feature_uri. \n\nFormat: 'mixpeek://extractor@version/output'","examples":["mixpeek://clip@v1/image_embedding","mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"],"title":"Diversity Feature Uri"},"diversity_features":{"anyOf":[{"items":{"$ref":"#/$defs/DiversityFeatureConfig"},"type":"array"},{"type":"null"}],"default":null,"description":"OPTIONAL. Multiple embedding features for weighted diversity fusion. Use this for multimodal content where similarity should consider multiple embedding spaces (e.g., text + image + audio). \n\nDiversity scores from each feature are combined using weights. Weights are normalized to sum to 1.0.","examples":[[{"feature_uri":"mixpeek://text@v1/embedding","weight":0.6},{"feature_uri":"mixpeek://clip@v1/embedding","weight":0.4}]],"title":"Diversity Features"},"diversity_fields":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"description":"OPTIONAL. Metadata fields to use for attribute-based diversity. No embeddings required - uses field value matching. \n\nDocuments with the same field values are considered similar. Useful for categorical diversity (one per category, source, type). \n\nDot notation supported for nested fields.","examples":[["metadata.category"],["metadata.category","metadata.source_type"],["metadata.brand","metadata.product_type"]],"title":"Diversity Fields"},"diversity_field_weights":{"anyOf":[{"additionalProperties":{"type":"number"},"type":"object"},{"type":"null"}],"default":null,"description":"OPTIONAL. Weights for each diversity field (for attribute-based mode). If not specified, all fields are weighted equally. Keys must match field paths in diversity_fields.","examples":[{"metadata.category":0.7,"metadata.source":0.3}],"title":"Diversity Field Weights"},"score_field":{"default":"score","description":"OPTIONAL. Document field containing relevance score from previous stage. Used as the 'relevance' component in MMR formula.","examples":["score","scores.relevance","relevance_score"],"title":"Score Field","type":"string"},"mmr_score_field":{"default":"scores.mmr","description":"OPTIONAL. Field path to store the computed MMR score. Useful for debugging and understanding ranking decisions.","examples":["scores.mmr","mmr_score"],"title":"Mmr Score Field","type":"string"},"similarity_metric":{"default":"cosine","description":"OPTIONAL. Similarity metric for embedding comparison. Cosine is recommended for normalized embeddings (most common).","enum":["cosine","dot","euclidean"],"examples":["cosine","dot","euclidean"],"title":"Similarity Metric","type":"string"}},"title":"MMRStageConfig","type":"object"}},{"stage_id":"sort_relevance","description":"Sort documents by relevance score","category":"sort","icon":"sparkles","parameter_schema":{"$defs":{"SortDirection":{"description":"Sort direction options.","enum":["asc","desc"],"title":"SortDirection","type":"string"}},"description":"Configuration for re-sorting documents by relevance score.\n\n**Stage Category**: SORT\n\n**Transformation**: N documents → N documents (same docs, different order, same schema)\n\n**Purpose**: Reorders documents in the pipeline based on their relevance scores.\nThis stage does NOT add/remove documents - it only changes their order.\n\n**When to Use**:\n    - After SEARCH stages to reorder results by their similarity scores\n    - After multiple SEARCH stages are merged to apply final relevance ordering\n    - When you need to sort by a custom score field (not just top-level score)\n    - To apply consistent ordering after filtering operations\n\n**When NOT to Use**:\n    - For initial document retrieval (use FILTER stages: hybrid_search)\n    - For removing documents (use FILTER stages: attribute_filter, llm_filter)\n    - For sorting by non-score attributes (use sort_attribute instead)\n    - For enriching documents (use APPLY stages)\n\n**Operational Behavior**:\n    - Operates on in-memory document results (no database queries)\n    - Maintains all documents, just changes their order\n    - Fast operation (simple in-memory sort)\n    - Does not change document count or schema\n\n**Common Pipeline Position**: FILTER → SORT (this stage) → APPLY\n\nRequirements:\n    - score_field: OPTIONAL, defaults to \"score\"\n    - direction: OPTIONAL, defaults to descending (highest scores first)\n    - feature_address: OPTIONAL, for computing similarity when scores missing\n    - missing_score: OPTIONAL, controls placement of documents without scores\n\nUse Cases:\n    - Standard relevance ranking: Sort search results by similarity scores\n    - Custom scoring: Sort by model-specific scores (metadata.rerank_score)\n    - Multi-stage pipelines: Final ordering after complex filtering\n    - Hybrid search: Order fused results by combined scores","examples":[{"description":"Basic relevance sorting - sort by standard score field","direction":"desc","score_field":"score"},{"description":"Sort by custom reranker scores with missing score handling","direction":"desc","missing_score":"bottom","score_field":"metadata.rerank_score"},{"description":"Sort with fallback similarity computation for missing scores","direction":"desc","feature_address":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","missing_score":"bottom","score_field":"score"},{"description":"Preserve original order for documents without scores","direction":"desc","missing_score":"preserve","score_field":"metadata.custom_score"}],"properties":{"score_field":{"default":"score","description":"OPTIONAL. Document field path to use for sorting. Defaults to top-level 'score' field populated by SEARCH stages. Use dot notation for nested fields (e.g., 'metadata.rerank_score'). Supports template expressions for dynamic field selection.","examples":["score","metadata.relevance_score","metadata.rerank_score"],"title":"Score Field","type":"string"},"direction":{"$ref":"#/$defs/SortDirection","default":"desc","description":"OPTIONAL. Sort direction for relevance scores. 'desc' (default): Highest scores first (most relevant). 'asc': Lowest scores first (least relevant, rarely used).","examples":["desc","asc"]},"feature_address":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Feature address to compute similarity when scores are missing. If a document lacks the score_field, the system can compute similarity using this feature and the query embedding from earlier SEARCH stages. Format: 'mixpeek://extractor@version/output'. NOT REQUIRED - only use when expecting missing scores and want to compute them.","examples":["mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","mixpeek://image_extractor@v1/embedding"],"title":"Feature Address"},"missing_score":{"default":"bottom","description":"OPTIONAL. How to handle documents without a score_field value. 'bottom' (default): Place at end of results (lowest priority). 'top': Place at start of results (highest priority, rarely used). 'preserve': Keep in original position (maintain insertion order).","enum":["bottom","top","preserve"],"examples":["bottom","top","preserve"],"title":"Missing Score","type":"string"}},"title":"SortRelevanceConfig","type":"object"}},{"stage_id":"rerank","description":"Rerank documents using cross-encoder inference","category":"sort","icon":"shuffle","parameter_schema":{"$defs":{"StageCacheBehavior":{"description":"Cache behavior modes for retriever stages.\n\nControls internal caching of stage operations for performance optimization.\nAll modes are safe and automatic with LRU eviction - no manual cache management needed.\n\nValues:\n    AUTO: Smart automatic caching (default, recommended)\n    DISABLED: Skip internal caching completely\n    AGGRESSIVE: Cache even non-deterministic operations (use with caution)\n\nCache Architecture:\n    - Redis with LRU eviction policy (memory-bounded)\n    - Namespace-isolated per organization (multi-tenant safe)\n    - Stage-specific keyspaces prevent conflicts\n    - Cache keys hash (stage_name, inputs, parameters)\n    - Automatic invalidation on parameter changes\n\nPerformance Impact:\n    - AUTO: 50-90% latency reduction for repeated operations\n    - Cache lookup overhead: <5ms\n    - Hit rates: Typically 60-80% in production\n\nWhen to Use Each Mode:\n    AUTO (default):\n        - Deterministic transformations (parsing, formatting, reshaping)\n        - Stable external API calls (embeddings, standard inference)\n        - Operations without side effects\n        - Most use cases - this is the recommended default\n\n    DISABLED:\n        - Templates with now(), random(), or time-sensitive functions\n        - External APIs that must be called every time (real-time data)\n        - Operations with side effects\n        - Rapidly changing data where caching would serve stale results\n\n    AGGRESSIVE:\n        - When you fully understand caching implications\n        - For debugging or testing cache behavior\n        - Only use if you know cache invalidation is handled elsewhere\n        - Generally not recommended for production\n\nExamples:\n    Basic usage (auto mode, no config needed):\n        {\"cache_behavior\": \"auto\"}  # or omit - this is the default\n\n    Disable for time-sensitive operations:\n        {\"cache_behavior\": \"disabled\"}  # Template has {{now()}}\n\n    With custom TTL:\n        {\"cache_behavior\": \"auto\", \"cache_ttl_seconds\": 300}","enum":["auto","disabled","aggressive"],"title":"StageCacheBehavior","type":"string"}},"description":"Configuration for reranking documents using cross-encoder models.\n\nReranking refines search results by computing query-document relevance\nscores using cross-encoder models (e.g., BGE reranker). More accurate\nthan vector similarity but slower, so typically used on top-K results.\n\nYou can use a builtin reranker (via inference_name) or bring your own\nreranker plugin (via feature_uri). When feature_uri is set, it takes\nprecedence over inference_name.\n\nCommon Pipeline:\n    feature_filter (retrieve 100) → rerank (refine to 10) → sort_relevance","examples":[{"batch_size":32,"description":"Basic text reranking with builtin model","document_field":"content","inference_name":"BAAI__bge_reranker_v2_m3","query":"{{INPUT.query}}","top_k":10},{"description":"BYO reranker via custom plugin","document_field":"content","feature_uri":"mixpeek://my_reranker@1.0.0/rerank","query":"{{INPUT.query}}","top_k":10}],"properties":{"cache_behavior":{"$ref":"#/$defs/StageCacheBehavior","default":"auto","description":"Controls internal caching behavior for this stage. OPTIONAL - defaults to 'auto' for transparent performance. \n\n'auto' (default): Automatic caching for deterministic operations. Stage intelligently caches results based on inputs and parameters. Use for transformations, parsing, formatting, stable API calls. Cache invalidates automatically when parameters change. Recommended for 95% of use cases. \n\n'disabled': Skip all internal caching. Every execution runs fresh without cache lookup. Use for templates with now(), random(), or external APIs that must be called every time (real-time data). No performance benefit but guarantees fresh execution. \n\n'aggressive': Cache even non-deterministic operations. Use ONLY when you fully understand caching implications. May cache time-sensitive or random data. Generally not recommended - prefer 'auto' or 'disabled'. \n\nNote: This controls internal stage caching. Retriever-level caching (cache_config.cache_stage_names) is separate and caches complete stage outputs.","examples":["auto","disabled","aggressive"]},"cache_ttl_seconds":{"anyOf":[{"minimum":0,"type":"integer"},{"type":"null"}],"default":null,"description":"Time-to-live for cache entries in seconds. OPTIONAL - defaults to None (LRU eviction only). \n\nWhen None (default, recommended): Cache uses Redis LRU eviction policy. Most frequently used items stay cached automatically. No manual TTL management needed. Memory bounded by Redis maxmemory setting. \n\nWhen specified: Cache entries expire after this duration regardless of usage. Useful for data that becomes stale after specific time periods. Lower values for frequently changing external data. Higher values for stable transformations. \n\nExamples:\n- None: LRU-based eviction (recommended for most cases)\n- 300: 5 minutes (for semi-static external data)\n- 3600: 1 hour (for stable transformations)\n- 86400: 24 hours (for rarely changing operations)\n\n\nPerformance Note: TTL adds minimal overhead (<1ms) but forces eviction even for frequently accessed items. Use None unless you have specific staleness requirements.","examples":[null,300,3600,86400],"title":"Cache Ttl Seconds"},"feature_uri":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Feature URI of a custom reranker plugin. When set, overrides inference_name. The plugin must accept {pairs: [[query, doc], ...]} and return {scores: [float, ...]}. Format: mixpeek://plugin_name@version/feature_name","examples":["mixpeek://my_reranker@1.0.0/rerank"],"title":"Feature Uri"},"inference_name":{"default":"BAAI__bge_reranker_v2_m3","description":"Reranking inference service name. Must be a reranking service. Use GET /engine/inference to list available rerankers. Ignored when feature_uri is set.","examples":["BAAI__bge_reranker_v2_m3"],"title":"Inference Name","type":"string"},"query":{"default":"{{INPUT.query}}","description":"Query text to compare against documents. Supports template variables: {{INPUT.query}}, etc.","examples":["{{INPUT.query}}","{{INPUT.search_term}}"],"title":"Query","type":"string"},"document_field":{"default":"content","description":"Document field path containing text to rerank against","examples":["content","metadata.description","text"],"title":"Document Field","type":"string"},"top_k":{"anyOf":[{"minimum":1,"type":"integer"},{"type":"null"}],"default":null,"description":"Number of top documents to keep after reranking. If None, returns all documents in reranked order.","title":"Top K"},"score_field":{"default":"scores.rerank","description":"Document field path to store reranking scores","title":"Score Field","type":"string"},"batch_size":{"default":32,"description":"Batch size for reranking inference calls","maximum":100,"minimum":1,"title":"Batch Size","type":"integer"},"max_document_chars":{"default":2000,"description":"Maximum characters of document text to send for reranking. The cross-encoder tokenizer truncates to ~512 tokens anyway, so sending full page content (often 10K+ chars) wastes bandwidth and increases latency. 2000 chars ≈ 500 tokens covers the tokenizer window with margin.","maximum":50000,"minimum":100,"title":"Max Document Chars","type":"integer"},"max_concurrent_batches":{"default":3,"description":"Maximum number of inference batches to process concurrently. When the number of pairs exceeds batch_size, they are split into batches and sent as concurrent inference requests.","maximum":10,"minimum":1,"title":"Max Concurrent Batches","type":"integer"}},"title":"RerankConfig","type":"object"}},{"stage_id":"score_normalize","description":"Normalize document scores to a common range","category":"sort","icon":"scaling","parameter_schema":{"description":"Configuration for score normalization stage.\n\n**Stage Category**: SORT\n\n**Transformation**: N documents → N documents (same docs, rescaled scores)\n\n**Purpose**: Rescales document scores using statistical normalization methods.\nThis makes scores from different retrieval stages comparable and enables\nconsistent score-based thresholding downstream.\n\n**When to Use**:\n    - After feature_search to normalize cosine similarity scores\n    - After rerank to rescale cross-encoder scores to [0,1]\n    - In hybrid pipelines combining text and vector search scores\n    - Before score-based filtering to set consistent thresholds\n    - When combining scores from multiple retrieval sources\n\n**When NOT to Use**:\n    - When scores are already in desired range\n    - For reordering documents (scores may change but order is preserved with min_max)\n    - For filtering by score threshold (use attribute_filter on score field)\n    - When only one search stage produces scores\n\n**Normalization Methods**:\n    - `min_max`: Scales to [0, 1] range. Best for bounded score comparison.\n    - `z_score`: Standard score (mean=0, std=1). Best for statistical analysis.\n    - `softmax`: Probability distribution (sum=1). Best for relative importance.\n    - `l2`: L2 norm (unit vector). Best for geometric comparisons.\n\n**Operational Behavior**:\n    - Operates on in-memory document scores (no external calls)\n    - Preserves document order (same documents, different score values)\n    - Fast operation (simple arithmetic on scores)\n    - Handles edge cases (single document, all same scores)\n\n**Common Pipeline Position**: FILTER → SORT (rerank) → SORT (this stage) → FILTER\n\nRequirements:\n    - method: REQUIRED, normalization method to apply\n    - score_field: OPTIONAL, which score field to normalize (default: 'score')\n    - output_field: OPTIONAL, where to write normalized score\n\nUse Cases:\n    - Hybrid search fusion: Normalize text and vector scores before combining\n    - Threshold filtering: Normalize then filter by consistent cutoff\n    - Score comparison: Make scores from different models comparable\n    - Probability ranking: Softmax for relative document importance","examples":[{"description":"Min-max normalization to [0, 1] range","method":"min_max","score_field":"score"},{"description":"Z-score normalization for statistical thresholding","method":"z_score","output_field":"z_score","score_field":"score"},{"description":"Softmax for probability distribution","method":"softmax","output_field":"probability","score_field":"score"},{"description":"Min-max with custom range bounds","max_value":1.0,"method":"min_max","min_value":0.0,"score_field":"score"},{"description":"L2 normalization for fusion","method":"l2","output_field":"normalized_rerank","score_field":"metadata.rerank_score"}],"properties":{"method":{"default":"min_max","description":"REQUIRED. Normalization method to apply:\n- 'min_max': Scale to [0, 1] range using (x - min) / (max - min). Best for bounded comparison. Preserves relative ordering.\n- 'z_score': Standard score using (x - mean) / std. Centers at 0 with unit variance. Good for statistical thresholding.\n- 'softmax': Exponential normalization where scores sum to 1. Converts scores to probability distribution. Amplifies differences.\n- 'l2': Divide by L2 norm (euclidean length). Projects scores onto unit sphere. Good for cosine-based fusion.","enum":["min_max","z_score","softmax","l2"],"examples":["min_max","z_score","softmax","l2"],"title":"Method","type":"string"},"score_field":{"default":"score","description":"OPTIONAL. The field containing the score to normalize. Default: 'score' (the standard document relevance score). Use dot notation for nested fields (e.g., 'metadata.rerank_score').","examples":["score","metadata.rerank_score","metadata.similarity"],"title":"Score Field","type":"string"},"output_field":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"OPTIONAL. Field to write the normalized score to. If None, the original score field is overwritten. If provided, the original score is preserved and the normalized value is written to this field.","examples":["normalized_score","score_norm",null],"title":"Output Field"},"min_value":{"anyOf":[{"type":"number"},{"type":"null"}],"default":null,"description":"OPTIONAL. Custom minimum value for min_max normalization. If None, uses the actual minimum score in the document set. Useful for consistent scaling across multiple queries.","examples":[0.0,-1.0,null],"title":"Min Value"},"max_value":{"anyOf":[{"type":"number"},{"type":"null"}],"default":null,"description":"OPTIONAL. Custom maximum value for min_max normalization. If None, uses the actual maximum score in the document set. Useful for consistent scaling across multiple queries.","examples":[1.0,100.0,null],"title":"Max Value"}},"title":"ScoreNormalizeStageConfig","type":"object"}}]