{"feature_extractor_name":"multimodal_extractor","version":"v1","feature_extractor_id":"multimodal_extractor_v1","description":"**Multimodal extractor for VIDEO, AUDIO, IMAGE, TEXT, and GIF content** using unified Vertex embeddings (1408D).\n\nProcesses diverse media types in a unified embedding space for cross-modal search. Videos/audio are decomposed into segments with transcription, embeddings, OCR, and descriptions. Images and text are embedded directly.\n\n**Pipeline Steps:**\n1. Filter dataset to collection (if collection_id provided)\n2. Apply input mappings\n3. Detect content types via sampling (video/audio/image/text)\n4. **Content Routing:**\n   - **Video:** FFmpeg chunking (time/scene/silence) → Steps 5-9\n   - **Audio:** FFmpeg audio chunking (time/silence) → Steps 5-7\n   - **Image:** Direct to Step 8\n   - **Text:** Direct to Step 8\n   - **Mixed:** Branch by type, process separately, then union\n5. **Conditional:** Transcription (if run_transcription=true)\n   - Whisper API or Local GPU\n   - Speech-to-text for audio tracks\n6. **Conditional:** Transcription embeddings (if run_transcription_embedding=true)\n   - E5 text embeddings (1024D) from transcribed text\n7. Multimodal embeddings (if run_multimodal_embedding=true)\n   - Vertex AI embeddings (1408D) for all content types\n   - Unified space enables cross-modal search\n8. **Conditional:** Thumbnail generation (if enable_thumbnails=true, visual content only)\n   - 640px width at 85% quality\n   - Upload to S3 with optional CDN\n9. **Conditional:** Visual analysis (if run_video_description OR run_ocr=true, visual content only)\n   - Gemini-based descriptions and/or OCR\n10. 
**Output:** Segment/document records with embeddings and features\n\n**Use for:** Unified multimodal search, video content libraries, educational content, media platforms, cross-modal retrieval.\n\n**Processing speed:** Videos 0.5-2x realtime, Images <1s, Text <100ms","icon":"film","source":"builtin","input_schema":{"description":"Input schema for the multimodal extractor.\n\nDefines the media content (video, image, text, gif) that will be processed and embedded.\nUses Google Vertex multimodal embeddings to create a unified embedding space across all media types.\n\n**Multimodal Support**:\n    - VIDEO: Decomposed into segments with transcription, visual embeddings, and OCR\n    - IMAGE: Direct visual embeddings (no decomposition)\n    - TEXT: Direct text embeddings\n    - GIF: Treated as video, decomposed frame-by-frame\n\n**Bucket Schema Mapping**:\n    When mapping from bucket schema fields to extractor inputs:\n\n    - BucketSchemaFieldType.VIDEO → maps to 'video' input\n    - BucketSchemaFieldType.IMAGE → maps to 'image' input\n    - BucketSchemaFieldType.TEXT/STRING → maps to 'text' input\n\n    **GIF files**: There is no BucketSchemaFieldType.GIF. 
Instead, GIFs can be\n    declared as either IMAGE or VIDEO in your bucket schema:\n\n    - As IMAGE: GIF detected via MIME type, embedded as static image (first frame)\n    - As VIDEO: GIF detected via MIME type, decomposed frame-by-frame\n\n    Use VIDEO schema type for animated GIFs requiring frame-level search.\n\nRequirements:\n    - Provide ONE of: video, image, text, or gif\n    - VIDEO/GIF: Formats MP4, MOV, AVI, MKV, WebM, FLV, GIF\n    - IMAGE: Formats JPG, PNG, WebP, BMP, GIF (static)\n    - TEXT: Plain text string\n    - URLs must be accessible (S3, HTTP, HTTPS)","examples":[{"description":"Video: Educational lecture","video":"s3://education-videos/lecture-machine-learning-101.mp4"},{"description":"Image: Product photo","image":"https://cdn.example.com/products/laptop-pro-2024.jpg"},{"description":"Text: Product description","text":"High-performance laptop with M3 chip, 16GB RAM, perfect for developers"}],"oneOf":[{"required":["video"]},{"required":["image"]},{"required":["text"]},{"required":["gif"]}],"properties":{"video":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL or S3 path to video file for processing. Video will be decomposed into segments based on split_method, and each segment processed for features (transcription, embeddings, OCR, etc.). Supports formats: MP4, MOV, AVI, MKV, WebM, FLV. Recommended: 720p-1080p resolution, <2 hours duration. Examples: 's3://bucket/video.mp4', 'https://example.com/video.mp4'","examples":["s3://my-bucket/videos/lecture-01.mp4","https://storage.googleapis.com/my-videos/tutorial.mp4"],"title":"Video"},"image":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL or S3 path to image file for embedding. Image will be embedded directly using Vertex multimodal embeddings (1408D). Supports formats: JPG, PNG, WebP, BMP. Recommended: <10MB file size. 
Examples: 's3://bucket/image.jpg', 'https://example.com/photo.png'","examples":["s3://my-bucket/images/product.jpg","https://cdn.example.com/photos/banner.png"],"title":"Image"},"text":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Plain text content for embedding. Text will be embedded directly using Vertex multimodal embeddings (1408D). Ideal for creating a unified embedding space with images and videos. Examples: Product descriptions, captions, summaries, labels","examples":["A red sports car driving on a mountain road at sunset","Machine learning tutorial covering neural networks and backpropagation"],"title":"Text"},"gif":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"URL or S3 path to GIF file for processing. GIF will be treated as video and decomposed frame-by-frame. Each frame processed for visual embeddings. Supports animated GIFs. Note: When using bucket schema, declare GIFs as VIDEO type for frame-level processing, or IMAGE type for static embedding. There is no separate GIF bucket schema type - the extractor auto-detects GIFs via MIME type. Examples: 's3://bucket/animation.gif', 'https://example.com/meme.gif'","examples":["s3://my-bucket/animations/loading.gif","https://media.example.com/reactions/thumbs-up.gif"],"title":"Gif"},"custom_thumbnail":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Optional custom thumbnail image URL or S3 path. When provided, this thumbnail is used instead of auto-generating one from the media. Useful when you have a pre-selected representative image for your content. Supports formats: JPG, PNG, WebP. 
Examples: 's3://bucket/thumb.jpg', 'https://example.com/poster.png'","examples":["s3://my-bucket/thumbnails/custom-thumb.jpg","https://cdn.example.com/posters/video-poster.png"],"title":"Custom Thumbnail"}},"title":"MultimodalExtractorInput","type":"object"},"output_schema":{"description":"Output schema for a single document produced by the multimodal extractor.\n\nEach video segment produces one document with multimodal features.\n\nOutput Structure:\n    - One document per video segment (timespan)\n    - Contains timing information (start/end timestamps)\n    - Includes all extracted features (transcription, embeddings, OCR, etc.)\n    - References source video via source_object_id\n    - Searchable via text (transcription) and visual (embeddings) content\n    - When response_shape is provided, includes custom structured fields (NEW)\n\nCustom Structured Extraction (NEW):\n    When response_shape parameter is set in MultimodalExtractorParams, custom fields\n    are automatically added to this output schema using get_multimodal_extractor_output_schema().\n    This enables extraction of structured data like product details, entity information,\n    or custom metadata fields that are stored directly in document metadata.\n\n    Example: With response_shape defining \"products\" and \"aesthetic\" fields,\n    each document will have those fields in addition to the base fields below.\n\nUse Cases:\n    - Search for specific moments in videos by spoken content\n    - Find visual scenes by description or similarity\n    - Extract and search text appearing in videos (signs, captions, etc.)\n    - Navigate to relevant segments via start_time/end_time\n    - Analyze video content at granular level\n    - Extract structured product/entity data for e-commerce and fashion (NEW)","properties":{"start_time":{"description":"Start time of the segment in seconds","title":"Start Time","type":"number"},"end_time":{"description":"End time of the segment in seconds","title":"End 
Time","type":"number"},"duration":{"anyOf":[{"type":"number"},{"type":"null"}],"default":null,"description":"Total duration of the entire source video in seconds. This represents the full video length, not the segment duration. Useful for calculating segment position within the video (e.g., start_time/duration). Only populated for video content; None for images and text.","title":"Duration"},"transcription":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Transcription of the audio in the segment","title":"Transcription"},"description":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Generated description of the video segment","title":"Description"},"ocr_text":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Text extracted from video frames (OCR)","title":"Ocr Text"},"json_output":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Optional raw JSON output from underlying models.","title":"Json Output"},"thumbnail_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"S3 URL of the thumbnail image for this video segment. Automatically generated during processing. Useful for UI previews and visual navigation.","examples":["s3://mixpeek-storage/ns_123/obj_456/thumbnails/thumb_0.jpg"],"title":"Thumbnail Url"},"source_video_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"S3 URL of the original source video this segment was extracted from. Maintains data lineage from bucket object to processed segment. Use for tracking provenance and accessing the full original video. 
OPTIONAL for images and text, POPULATED for video segments.","examples":["s3://mixpeek-storage/ns_123/obj_456/original.mp4","s3://user-bucket/videos/campaign_video.mp4"],"title":"Source Video Url"},"video_segment_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"S3 URL of this specific video segment file. Enables collection-to-collection decomposition pipelines where this segment can be re-processed by another collection with different settings. Example use cases:\n- Time segments (5s) → Scene detection within each segment\n- Coarse segments → Fine-grained analysis\n- Initial extraction → Enhanced processing with different models\nOPTIONAL for initial processing, REQUIRED for decomposition chains.","examples":["s3://mixpeek-storage/ns_123/obj_456/segments/segment_0.mp4","s3://mixpeek-storage/ns_123/obj_456/segments/segment_1.mp4"],"title":"Video Segment Url"},"multimodal_extractor_v1_multimodal_embedding":{"anyOf":[{"items":{"type":"number"},"type":"array"},{"type":"null"}],"default":null,"description":"Dense vector embeddings (1408D) for multimodal content (video/image/gif/text) using Google Vertex AI. Captures visual and contextual information from all media types in a unified embedding space. Used for semantic search and similarity matching across all content types.","title":"Multimodal Extractor V1 Multimodal Embedding"},"multimodal_extractor_v1_transcription_embedding":{"anyOf":[{"items":{"type":"number"},"type":"array"},{"type":"null"}],"default":null,"description":"Dense vector embeddings (1024D) for the transcription text using E5-Large. Captures semantic meaning of spoken content. Used for text-based search across video transcriptions.","title":"Multimodal Extractor V1 Transcription Embedding"},"internal_metadata":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Internal metadata for the video segment. Contains processing details, model versions, and diagnostic information. 
NOT REQUIRED for typical usage.","title":"Internal Metadata"}},"required":["start_time","end_time"],"title":"MultimodalExtractorOutput","type":"object"},"parameter_schema":{"$defs":{"GenerationConfig":{"description":"Configuration for generative models.","properties":{"candidate_count":{"default":1,"description":"Number of candidate responses to generate for video description.","title":"Candidate Count","type":"integer"},"max_output_tokens":{"default":1024,"description":"Maximum number of tokens for the generated video description.","title":"Max Output Tokens","type":"integer"},"temperature":{"default":0.7,"description":"Controls randomness for video description generation. Higher is more random.","title":"Temperature","type":"number"},"top_p":{"default":0.8,"description":"Nucleus sampling (top-p) for video description generation.","title":"Top P","type":"number"},"response_mime_type":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"MIME type for response (e.g., 'application/json')","title":"Response Mime Type"},"response_schema":{"anyOf":[{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"JSON schema for structured output","title":"Response Schema"}},"title":"GenerationConfig","type":"object"},"SplitMethod":{"description":"Split methods for video extraction.","enum":["time","scene","silence"],"title":"SplitMethod","type":"string"}},"description":"Parameters for the multimodal extractor.\n\nThe multimodal extractor processes video, audio, image, text, and GIF content in a unified embedding space.\nVideos/GIFs/Audio are decomposed into segments with transcription, visual analysis (video only), OCR, and embeddings.\nImages and text are embedded directly without decomposition.\n\n**When to Use**:\n    - Video content libraries requiring searchable segments\n    - Audio content (podcasts, lectures, music) requiring transcription and search\n    - Media platforms with search across spoken and visual 
content\n    - Educational content with lecture videos and demonstrations\n    - Surveillance/security footage requiring event detection\n    - Social media platforms with user-generated video content\n    - Broadcasting/streaming services with large video catalogs\n    - Training video repositories with instructional content\n    - Marketing/advertising analytics for video campaigns\n\n**When NOT to Use**:\n    - Static image collections → Use image_extractor instead\n    - Very short videos (<5 seconds) → Overhead not worth it\n    - Real-time live streams → Use specialized streaming extractors\n    - Extremely high-resolution videos (8K+) → Consider downsampling first\n\n**Decomposition Methods**:\n\n    | Method | Use Case | Accuracy | Segments/Min | Best For |\n    |--------|----------|----------|--------------|----------|\n    | **TIME** | Fixed intervals | N/A | 60/interval_sec | General purpose, audio/video chunking |\n    | **SCENE** | Visual changes | 85-90% | Variable (2-20) | Movies, dynamic content (video only) |\n    | **SILENCE** | Audio pauses | 80-85% | Variable (5-30) | Lectures, presentations, audio/video |\n\n**Feature Extraction Options**:\n    - Transcription: Speech-to-text using Whisper (95%+ accuracy)\n    - Multimodal Embeddings: Unified embeddings from Vertex AI (1408D) for video/image/gif/text\n    - Transcription Embeddings: Text embeddings from E5-Large (1024D)\n    - OCR: Text extraction from video frames using Gemini Vision\n    - Descriptions: AI-generated segment summaries using Gemini\n    - Thumbnails: Visual preview images for each segment\n\n**Performance Characteristics**:\n    - Processing Speed: 0.5-2x realtime (depends on features enabled)\n    - Example: 10min video → 5-20 minutes processing time\n    - Transcription: ~200ms per second of audio\n    - Visual Embedding: ~50ms per segment\n    - OCR: ~300ms per segment\n    - Description: ~2s per segment (if enabled)\n\nRequirements:\n    - Input: exactly ONE of video, image, text, or gif REQUIRED (accessible 
file URL or plain text string)\n    - All feature parameters: OPTIONAL (defaults provided)","examples":[{"description":"Standard video processing with 10-second intervals (default)","enable_thumbnails":true,"extractor_type":"multimodal_extractor","run_multimodal_embedding":true,"split_method":"time","time_split_interval":10,"use_case":"General-purpose video indexing for search and discovery"},{"description":"Educational content with transcription + embeddings","extractor_type":"multimodal_extractor","run_multimodal_embedding":true,"run_transcription":true,"run_transcription_embedding":true,"split_method":"time","time_split_interval":10,"transcription_language":"en","use_case":"Lecture videos and online courses requiring searchable spoken content"}],"properties":{"extractor_type":{"const":"multimodal_extractor","default":"multimodal_extractor","description":"Discriminator field for parameter type identification. Must be 'multimodal_extractor'.","title":"Extractor Type","type":"string"},"split_method":{"$ref":"#/$defs/SplitMethod","default":"time","description":"The PRIMARY control for video splitting strategy. This determines which splitting method is used."},"description_prompt":{"default":"Describe the video segment in detail.","description":"The prompt to use for description generation.","title":"Description Prompt","type":"string"},"time_split_interval":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":10,"description":"Interval in seconds for 'time' splitting. Used when split_method='time'.","title":"Time Split Interval"},"silence_db_threshold":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"The decibel level below which audio is considered silent. Used when split_method='silence'. Recommended value: -40 (auto-applied if not specified). 
Lower values (e.g., -50) detect more silence, higher values (e.g., -30) detect less.","title":"Silence Db Threshold"},"scene_detection_threshold":{"anyOf":[{"type":"number"},{"type":"null"}],"default":null,"description":"The threshold for scene detection (0.0-1.0). Used when split_method='scene'. Recommended value: 0.5 (auto-applied if not specified). Lower values (e.g., 0.3) detect more scenes, higher values (e.g., 0.7) detect fewer scenes.","title":"Scene Detection Threshold"},"run_transcription":{"default":false,"description":"Whether to run transcription on video segments.","title":"Run Transcription","type":"boolean"},"transcription_language":{"default":"en","description":"The language of the transcription. Used when run_transcription is True.","title":"Transcription Language","type":"string"},"run_video_description":{"default":false,"description":"Whether to generate descriptions for video segments. OPTIMIZED: Defaults to False as descriptions add 1-2 minutes. Enable only when needed.","title":"Run Video Description","type":"boolean"},"run_transcription_embedding":{"default":false,"description":"Whether to generate embeddings for transcriptions. Useful for semantic search over spoken content.","title":"Run Transcription Embedding","type":"boolean"},"run_multimodal_embedding":{"default":true,"description":"Whether to generate multimodal embeddings for all content types (video/image/gif/text). Uses Google Vertex AI to create unified 1408D embeddings in a shared semantic space. Useful for cross-modal semantic search across all media types.","title":"Run Multimodal Embedding","type":"boolean"},"run_ocr":{"default":false,"description":"Whether to run OCR to extract text from video frames. OPTIMIZED: Defaults to False as OCR adds significant processing time. 
Enable only when text extraction from video is required.","title":"Run Ocr","type":"boolean"},"sensitivity":{"default":"low","description":"Sensitivity of scene-change detection. Used when split_method='scene'.","title":"Sensitivity","type":"string"},"enable_thumbnails":{"default":true,"description":"Whether to generate thumbnail images for video segments and images. Thumbnails provide visual previews for navigation and UI display. For videos: Extracts a frame from each segment. For images: Creates an optimized thumbnail version.","title":"Enable Thumbnails","type":"boolean"},"use_cdn":{"default":false,"description":"Whether to use CloudFront CDN for thumbnail delivery. When True: Uploads to public bucket and returns CloudFront URLs. When False (default): Uploads to private bucket with presigned S3 URLs. Benefits of CDN: faster global delivery, permanent URLs, reduced bandwidth costs. Requires CLOUDFRONT_PUBLIC_DOMAIN to be configured in settings. Only applies when enable_thumbnails=True.","title":"Use Cdn","type":"boolean"},"generation_config":{"$ref":"#/$defs/GenerationConfig"},"response_shape":{"anyOf":[{"type":"string"},{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"OPTIONAL. Define custom structured output using Gemini's JSON mode. NOT REQUIRED - by default, descriptions are stored as plain text. When provided, Gemini will extract structured data matching this schema.\n\nTwo modes supported:\n1. Natural language prompt (string): Describe desired output in plain English\n   - Gemini automatically infers JSON schema from your description\n   - Example: 'Extract product names, colors, and aesthetic labels'\n\n2. 
Explicit JSON schema (dict): Provide complete JSON schema for output structure\n   - Full control over output structure, types, and constraints\n   - Use response_mime_type='application/json' in generation_config\n   - Example: {'type': 'object', 'properties': {'products': {'type': 'array', ...}}}\n\n\nUse when:\n  - Need to extract structured product/entity information from videos\n  - Want consistent, parseable output format (not free-form text)\n  - Require specific fields like visibility_percentage, product categories, etc.\n  - Building e-commerce, fashion, or product discovery applications\n\n\nOutput fields are automatically added to collection schema and stored in document metadata.\nNote: When using response_shape, set description_prompt to describe the extraction task.\n","examples":["Extract product names, colors, materials, and aesthetic style labels from this fashion segment",{"properties":{"products":{"items":{"properties":{"name":{"type":"string"},"category":{"type":"string"},"visibility_percentage":{"maximum":100,"minimum":0,"type":"integer"}},"type":"object"},"type":"array"},"aesthetic":{"type":"string"}},"type":"object"},null],"title":"Response Shape"}},"title":"MultimodalExtractorParams","type":"object"},"supported_input_types":["video","image","text","string"],"max_inputs":{"video":1,"image":1,"text":1,"string":1},"default_parameters":{},"costs":{"tier":4,"tier_label":"PREMIUM","rates":[{"unit":"minute","credits_per_unit":50,"description":"Video processing per minute"},{"unit":"image","credits_per_unit":5,"description":"Image analysis"},{"unit":"1k_tokens","credits_per_unit":2,"description":"Text processing per 1K tokens"}]},"required_vector_indexes":[{"feature_uri":"mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding","name":"vertex_multimodal_embedding","description":"Vector index for multimodal content embeddings 
(video/image/gif/text).","type":"single","index":{"name":"multimodal_extractor_v1_multimodal_embedding","description":"Dense vector embedding for multimodal content in unified embedding space.","dimensions":1408,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["video","text","image"],"inference_name":"google__vertex_multimodal","inference_service_id":"google/vertex-multimodal","purpose":null,"vector_name_override":null,"supports_multi_query":false}},{"feature_uri":"mixpeek://multimodal_extractor@v1/multilingual_e5_large_instruct_v1","name":"multilingual_e5_large_instruct_v1","description":"Vector index for transcription embeddings.","type":"single","index":{"name":"multimodal_extractor_v1_transcription_embedding","description":"Dense vector embedding for transcriptions.","dimensions":1024,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["string","text"],"inference_name":"intfloat__multilingual_e5_large_instruct","inference_service_id":"intfloat/multilingual-e5-large-instruct","purpose":null,"vector_name_override":null,"supports_multi_query":false}}],"required_payload_indexes":[],"position_fields":["start_time","end_time"]}