{"feature_extractor_name":"text_extractor","version":"v1","feature_extractor_id":"text_extractor_v1","description":"Extracts dense vector embeddings from text using E5-Large multilingual model. Optimized for semantic search, RAG applications, and general-purpose text retrieval. Supports text chunking/decomposition with multiple splitting strategies. With source_type='youtube', resolves YouTube URLs to caption text before embedding. Fast (5ms/doc) and supports 100+ languages.","icon":"file-text","category":"text","source":"builtin","type_mode":"type_specific","expected_input_types":{"text":"text"},"inference_type":"embedding","input_schema":{"description":"Input schema for the text extractor.","examples":[{"text":"How do I reset my password?"},{"text":"wireless bluetooth headphones with noise cancellation"}],"properties":{"text":{"description":"Text content to process into embeddings.","minLength":1,"title":"Text","type":"string"}},"required":["text"],"title":"TextExtractorInput","type":"object"},"output_schema":{"description":"Output schema for text extractor documents.\n\nWhen source_type='youtube', additional video metadata fields are populated.","examples":[{"text":"Meal kit delivery service with chef-crafted recipes","text_extractor_v1_embedding":[0.023,-0.041,0.018]}],"properties":{"text":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"The processed text content for this document.","title":"Text"},"text_extractor_v1_embedding":{"anyOf":[{"items":{"type":"number"},"type":"array"},{"type":"null"}],"default":null,"description":"Dense vector embedding. Dimensionality is determined by the selected embedding_model (see shared.models.embeddings registry).","title":"Text Extractor V1 Embedding"},"video_id":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"YouTube video ID.","title":"Video Id"},"title":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Video title.","title":"Title"},"channel":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"YouTube channel name.","title":"Channel"},"video_url":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Source YouTube video URL.","title":"Video Url"},"duration_seconds":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Total video duration in seconds.","title":"Duration Seconds"},"publish_date":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Video publish date (ISO format).","title":"Publish Date"},"start_ms":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Segment start time in milliseconds.","title":"Start Ms"},"end_ms":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Segment end time in milliseconds.","title":"End Ms"},"segment_index":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Index of this segment within the video (0-based).","title":"Segment Index"},"total_segments":{"anyOf":[{"type":"integer"},{"type":"null"}],"default":null,"description":"Total number of segments from this video.","title":"Total Segments"}},"title":"TextExtractorOutput","type":"object"},"parameter_schema":{"$defs":{"EmbeddingModel":{"description":"Embedding model identifiers.\n\nFormat: {provider}_{model_name}_{version}","enum":["laion_clip_vit_l_14_v1","multilingual_e5_large_instruct_v1","vertex_multimodal_embedding","multimodalembedding@001","gemini-embedding-2","google_siglip_base_v1","google_siglip_so400m_v1","text-embedding-3-small","text-embedding-3-large","face_identity_arcface_r100_v1","all_minilm_l6_v2_v1"],"title":"EmbeddingModel","type":"string"},"TextSplitStrategy":{"description":"Strategy for splitting text into chunks.","enum":["characters","words","sentences","paragraphs","pages","time_segments","none"],"title":"TextSplitStrategy","type":"string"}},"description":"Parameters for the text extractor.\n\nThe text extractor generates dense vector embeddings optimized for semantic similarity search.\nIt uses the E5-Large multilingual model to convert text into 1024-dimensional vectors.\n\nWhen ``source_type`` is ``\"youtube\"``, the extractor first resolves YouTube URLs\nto caption text via yt-dlp before chunking and embedding. Use ``split_by=\"time_segments\"``\nwith ``segment_length_seconds`` to segment captions by time window.","examples":[{"chunk_overlap":0,"chunk_size":1000,"extractor_type":"text_extractor","split_by":"none"},{"chunk_overlap":1,"chunk_size":5,"extractor_type":"text_extractor","split_by":"sentences"},{"extractor_type":"text_extractor","language":"en","segment_length_seconds":120,"source_type":"youtube","split_by":"time_segments"}],"properties":{"extractor_type":{"const":"text_extractor","default":"text_extractor","description":"Discriminator field for parameter type identification.","title":"Extractor Type","type":"string"},"source_type":{"default":"text","description":"Source content type. Use 'youtube' to resolve YouTube URLs to caption text before embedding. Default: 'text' (plain text input).","enum":["text","youtube"],"title":"Source Type","type":"string"},"split_by":{"$ref":"#/$defs/TextSplitStrategy","default":"none","description":"Strategy for splitting text into multiple documents."},"chunk_size":{"default":1000,"description":"Target size for each chunk.","maximum":10000,"minimum":1,"title":"Chunk Size","type":"integer"},"chunk_overlap":{"default":0,"description":"Number of units to overlap between consecutive chunks.","maximum":5000,"minimum":0,"title":"Chunk Overlap","type":"integer"},"segment_length_seconds":{"default":120,"description":"Length of each transcript segment in seconds (for time_segments split strategy). Shorter segments give more precise search results but more documents.","maximum":600,"minimum":30,"title":"Segment Length Seconds","type":"integer"},"language":{"default":"en","description":"Preferred language code for YouTube captions (when source_type='youtube').","title":"Language","type":"string"},"extract_captions":{"default":true,"description":"Extract auto-captions or manual subtitles from YouTube videos (when source_type='youtube'). Falls back to video description if False.","title":"Extract Captions","type":"boolean"},"response_shape":{"anyOf":[{"type":"string"},{"additionalProperties":true,"type":"object"},{"type":"null"}],"default":null,"description":"Define custom structured output using LLM extraction.","title":"Response Shape"},"llm_provider":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"LLM provider for structured extraction (openai, google, anthropic).","title":"Llm Provider"},"llm_model":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Specific LLM model for structured extraction.","title":"Llm Model"},"llm_api_key":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"API key for LLM operations (BYOK - Bring Your Own Key). Supports:\n- Direct key: 'sk-proj-abc123...'\n- Secret reference: '{{SECRET.openai_api_key}}'\n\nWhen using secret reference, the key is loaded from your organization's secrets vault at runtime. Store secrets via POST /v1/organizations/secrets.\n\nIf not provided, uses Mixpeek's default API keys.","title":"Llm Api Key"},"embedding_model":{"anyOf":[{"$ref":"#/$defs/EmbeddingModel"},{"type":"null"}],"default":null,"description":"Embedding model to use. Defaults to the current TEXT modality default in the central embedding registry. Changing this on an existing namespace requires a migration — dimensions are fixed at namespace creation."},"embedding_task":{"anyOf":[{"type":"string"},{"type":"null"}],"default":null,"description":"Embedding task hint for instruction-aware models (E5, Gemini). Prefer setting this at collection level (embedding_task on the collection) rather than here. Collection-level overrides this value. Defaults to 'retrieval_document'. Values: retrieval_document, retrieval_query, semantic_similarity, classification, clustering.","title":"Embedding Task"}},"title":"TextExtractorParams","type":"object"},"supported_input_types":["text","string"],"max_inputs":{"text":1},"default_parameters":{},"costs":{"tier":1,"tier_label":"SIMPLE","rates":[{"unit":"1k_tokens","credits_per_unit":1,"description":"Text embedding per 1K tokens"}]},"required_vector_indexes":[{"feature_uri":"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1","display_name":"text embedding","name":"multilingual_e5_large_instruct_v1","description":"Vector index for text embeddings.","type":"single","index":{"name":"text_extractor_v1_embedding","description":"Dense vector embedding for text.","dimensions":1024,"type":"dense","distance":"Cosine","datatype":"float32","on_disk":null,"supported_inputs":["string","text"],"inference_name":"intfloat__multilingual_e5_large_instruct","inference_service_id":"intfloat/multilingual-e5-large-instruct","purpose":null,"vector_name_override":null,"supports_multi_query":false}}],"required_payload_indexes":[],"position_fields":["chunk_index","video_id","segment_index"],"capabilities":["batch","realtime"],"example_usage":{"namespace":{"feature_extractors":[{"name":"text_extractor","version":"v1"}]},"collection":{"feature_extractor":{"name":"text_extractor","version":"v1","input_mappings":{"text":"<your_text_field>"},"parameters":{"source_type":"text","split_by":"none","chunk_size":1000,"chunk_overlap":0,"segment_length_seconds":120,"language":"en","extract_captions":true}}}}}