import { CodeGroup } from '@/app/components/develop/code.tsx'
import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstruction, Paragraph } from '@/app/components/develop/md.tsx'

# Knowledge API
- book Book
- web_page Web page
- paper Academic paper/article
- social_media_post Social media post
- wikipedia_entry Wikipedia entry
- personal_document Personal document
- business_document Business document
- im_chat_log Chat log
- synced_from_notion Notion document
- synced_from_github GitHub document
- others Other document types
For book:
- title Book title
- language Book language
- author Book author
- publisher Publisher name
- publication_date Publication date
- isbn ISBN number
- category Book category
For web_page:
- title Page title
- url Page URL
- language Page language
- publish_date Publish date
- author/publisher Author or publisher
- topic/keywords Topic or keywords
- description Page description
Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
For doc_type "others", any valid JSON object is accepted
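As a rough illustration of the field lists above, the sketch below checks a `doc_metadata` object against the field names listed for `book` and `web_page`. The helper `check_doc_metadata` and the `KNOWN_FIELDS` table are hypothetical names introduced here. The check only flags keys that are not listed; it makes no claim about which fields the server requires (the linked `dataset_service.py` is the reference for that).

```python
# Field names copied from the lists above. Which of them the server actually
# requires is not stated here, so this sketch only reports unlisted keys.
KNOWN_FIELDS = {
    "book": {"title", "language", "author", "publisher",
             "publication_date", "isbn", "category"},
    "web_page": {"title", "url", "language", "publish_date",
                 "author/publisher", "topic/keywords", "description"},
}


def check_doc_metadata(doc_type: str, doc_metadata: dict) -> list:
    """Return the keys in doc_metadata that are not listed for doc_type.

    For "others" any JSON object is accepted, so nothing is flagged.
    Types not covered by KNOWN_FIELDS are left unchecked as well.
    """
    if doc_type == "others" or doc_type not in KNOWN_FIELDS:
        return []
    return sorted(set(doc_metadata) - KNOWN_FIELDS[doc_type])


# Placeholder values, for illustration only.
book_meta = {"title": "Example Book", "author": "A. Author", "isbn": "000-0"}
print(check_doc_metadata("book", book_meta))
print(check_doc_metadata("web_page", {"isbn": "x"}))
```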
- high_quality High quality: embeds content with an embedding model and builds a vector database index
- economy Economy: builds an inverted index based on a keyword table
- text_model Text documents are directly embedded; `economy` mode defaults to using this form
- hierarchical_model Parent-child mode
- qa_model Q&A Mode: Generates Q&A pairs for segmented documents and then embeds the questions
- doc_language In Q&A mode, specify the language of the document, for example: English, Chinese
- mode (string) Cleaning and segmentation mode: automatic / custom
- rules (object) Custom rules (in automatic mode, this field is empty)
  - pre_processing_rules (array[object]) Preprocessing rules
    - id (string) Unique identifier for the preprocessing rule. Allowed values:
      - remove_extra_spaces Replace consecutive spaces, newlines, and tabs
      - remove_urls_emails Delete URLs and email addresses
    - enabled (bool) Whether this rule is selected. If no document ID is passed in, this value is the default.
  - segmentation (object) Segmentation rules
    - separator Custom segment delimiter; currently only one delimiter can be set. Default is \n
    - max_tokens Maximum segment length in tokens. Default is 1000
  - parent_mode Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
  - subchunk_segmentation (object) Child chunk rules
    - separator Child chunk delimiter; currently only one delimiter is allowed. Default is ***
    - max_tokens Maximum child chunk length in tokens; must be shorter than the length of the parent chunk
    - chunk_overlap Overlap between adjacent chunks (optional)
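To make the shape of these rules concrete, here is a minimal sketch that builds a `process_rule` object in custom mode. The function name `build_custom_process_rule`, the token counts, and the overlap value are illustrative assumptions, and the nesting follows the order of the fields listed above rather than a confirmed schema. The only check it performs is the stated constraint that the child chunk `max_tokens` is shorter than the parent chunk.

```python
import json


def build_custom_process_rule(parent_max_tokens: int = 1000,
                              child_max_tokens: int = 200) -> dict:
    """Build a custom-mode process_rule object (all values are illustrative)."""
    # The child chunk length must be shorter than the parent chunk length.
    if child_max_tokens >= parent_max_tokens:
        raise ValueError("child max_tokens must be shorter than the parent chunk")
    return {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": False},
            ],
            # "\n" is the documented default separator for parent segments.
            "segmentation": {"separator": "\n", "max_tokens": parent_max_tokens},
            "parent_mode": "paragraph",
            "subchunk_segmentation": {
                "separator": "***",          # documented default for child chunks
                "max_tokens": child_max_tokens,
                "chunk_overlap": 20,         # optional; value chosen for the example
            },
        },
    }


print(json.dumps(build_custom_process_rule(), indent=2))
```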
- search_method (string) Search method
  - hybrid_search Hybrid search
  - semantic_search Semantic search
  - full_text_search Full-text search
- reranking_enable (bool) Whether to enable reranking
- reranking_mode (object) Rerank model configuration
  - reranking_provider_name (string) Rerank model provider
  - reranking_model_name (string) Rerank model name
- top_k (int) Number of results to return
- score_threshold_enabled (bool) Whether to enable score threshold
- score_threshold (float) Score threshold
- original_document_id Source document ID (optional)
  - Used to re-upload the document or modify its cleaning and segmentation configuration. Any missing information is copied from the source document.
  - The source document cannot be an archived document.
  - When original_document_id is passed in, the request updates that document. process_rule is optional; if it is omitted, the segmentation method of the source document is used by default.
  - When original_document_id is not passed in, the request creates a new document, and process_rule is required.
- indexing_technique Index mode
  - high_quality High quality: embeds content with an embedding model and builds a vector database index
  - economy Economy: builds an inverted index based on a keyword table
- doc_form Format of indexed content
  - text_model Text documents are directly embedded; `economy` mode defaults to using this form
  - hierarchical_model Parent-child mode
  - qa_model Q&A mode: generates Q&A pairs for segmented documents and then embeds the questions
- doc_type Type of document (optional)
  - book Book: document records a book or publication
  - web_page Web page: document records web page content
  - paper Academic paper/article: document records an academic paper or research article
  - social_media_post Social media post: content from social media posts
  - wikipedia_entry Wikipedia entry: content from Wikipedia entries
  - personal_document Personal document: documents related to personal content
  - business_document Business document: documents related to business content
  - im_chat_log Chat log: records of instant messaging chats
  - synced_from_notion Notion document: documents synchronized from Notion
  - synced_from_github GitHub document: documents synchronized from GitHub
  - others Other document types not listed above
- doc_metadata Document metadata (required if doc_type is provided). Fields vary by doc_type.

  For book:
  - title Title of the book
  - language Language of the book
  - author Author of the book
  - publisher Name of the publishing house
  - publication_date Date when the book was published
  - isbn International Standard Book Number (ISBN)
  - category Category or genre of the book

  For web_page:
  - title Title of the web page
  - url URL of the web page
  - language Language of the web page
  - publish_date Date when the web page was published
  - author/publisher Author or publisher of the web page
  - topic/keywords Topics or keywords of the web page
  - description Description of the web page content

  Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
  For doc_type "others", any valid JSON object is accepted.
- doc_language In Q&A mode, specify the language of the document, for example: English, Chinese
- process_rule Processing rules
  - mode (string) Cleaning and segmentation mode: automatic / custom
  - rules (object) Custom rules (in automatic mode, this field is empty)
    - pre_processing_rules (array[object]) Preprocessing rules
      - id (string) Unique identifier for the preprocessing rule. Allowed values:
        - remove_extra_spaces Replace consecutive spaces, newlines, and tabs
        - remove_urls_emails Delete URLs and email addresses
      - enabled (bool) Whether this rule is selected. If no document ID is passed in, this value is the default.
    - segmentation (object) Segmentation rules
      - separator Custom segment delimiter; currently only one delimiter can be set. Default is \n
      - max_tokens Maximum segment length in tokens. Default is 1000
    - parent_mode Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
    - subchunk_segmentation (object) Child chunk rules
      - separator Child chunk delimiter; currently only one delimiter is allowed. Default is ***
      - max_tokens Maximum child chunk length in tokens; must be shorter than the length of the parent chunk
      - chunk_overlap Overlap between adjacent chunks (optional)
- search_method (string) Search method
  - hybrid_search Hybrid search
  - semantic_search Semantic search
  - full_text_search Full-text search
- reranking_enable (bool) Whether to enable reranking
- reranking_mode (object) Rerank model configuration
  - reranking_provider_name (string) Rerank model provider
  - reranking_model_name (string) Rerank model name
- top_k (int) Number of results to return
- score_threshold_enabled (bool) Whether to enable score threshold
- score_threshold (float) Score threshold
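The interaction between `original_document_id` and `process_rule` described above can be sketched as a small payload builder. This is a hedged illustration only: `build_document_data` is a hypothetical helper, the ID `doc-123` is a placeholder, and how the resulting JSON is sent (endpoint, headers, form field) is not covered in this section and is not shown.

```python
import json
from typing import Optional

INDEXING_TECHNIQUES = ("high_quality", "economy")
DOC_FORMS = ("text_model", "hierarchical_model", "qa_model")


def build_document_data(indexing_technique: str,
                        doc_form: str = "text_model",
                        process_rule: Optional[dict] = None,
                        original_document_id: Optional[str] = None) -> str:
    """Serialize the document fields described above into a JSON string."""
    if indexing_technique not in INDEXING_TECHNIQUES:
        raise ValueError("unknown indexing_technique: " + indexing_technique)
    if doc_form not in DOC_FORMS:
        raise ValueError("unknown doc_form: " + doc_form)
    # Creating a new document (no original_document_id) requires process_rule.
    if original_document_id is None and process_rule is None:
        raise ValueError("process_rule is required when creating a new document")

    data = {"indexing_technique": indexing_technique, "doc_form": doc_form}
    if original_document_id is not None:
        # Update: a missing process_rule falls back to the source document's.
        data["original_document_id"] = original_document_id
    if process_rule is not None:
        data["process_rule"] = process_rule
    return json.dumps(data)


print(build_document_data("high_quality", process_rule={"mode": "automatic"}))
print(build_document_data("economy", original_document_id="doc-123"))
```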
- book Book
- web_page Web page
- paper Academic paper/article
- social_media_post Social media post
- wikipedia_entry Wikipedia entry
- personal_document Personal document
- business_document Business document
- im_chat_log Chat log
- synced_from_notion Notion document
- synced_from_github GitHub document
- others Other document types
For book:
- title Book title
- language Book language
- author Book author
- publisher Publisher name
- publication_date Publication date
- isbn ISBN number
- category Book category
For web_page:
- title Page title
- url Page URL
- language Page language
- publish_date Publish date
- author/publisher Author or publisher
- topic/keywords Topic or keywords
- description Page description
Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
For doc_type "others", any valid JSON object is accepted
- high_quality High quality
- economy Economy

- only_me Only me
- all_team_members All team members
- partial_members Partial members

- vendor Vendor
- external External knowledge
- mode (string) Cleaning and segmentation mode: automatic / custom
- rules (object) Custom rules (in automatic mode, this field is empty)
  - pre_processing_rules (array[object]) Preprocessing rules
    - id (string) Unique identifier for the preprocessing rule. Allowed values:
      - remove_extra_spaces Replace consecutive spaces, newlines, and tabs
      - remove_urls_emails Delete URLs and email addresses
    - enabled (bool) Whether this rule is selected. If no document ID is passed in, this value is the default.
  - segmentation (object) Segmentation rules
    - separator Custom segment delimiter; currently only one delimiter can be set. Default is \n
    - max_tokens Maximum segment length in tokens. Default is 1000
  - parent_mode Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
  - subchunk_segmentation (object) Child chunk rules
    - separator Child chunk delimiter; currently only one delimiter is allowed. Default is ***
    - max_tokens Maximum child chunk length in tokens; must be shorter than the length of the parent chunk
    - chunk_overlap Overlap between adjacent chunks (optional)
- mode (string) Cleaning and segmentation mode: automatic / custom
- rules (object) Custom rules (in automatic mode, this field is empty)
  - pre_processing_rules (array[object]) Preprocessing rules
    - id (string) Unique identifier for the preprocessing rule. Allowed values:
      - remove_extra_spaces Replace consecutive spaces, newlines, and tabs
      - remove_urls_emails Delete URLs and email addresses
    - enabled (bool) Whether this rule is selected. If no document ID is passed in, this value is the default.
  - segmentation (object) Segmentation rules
    - separator Custom segment delimiter; currently only one delimiter can be set. Default is \n
    - max_tokens Maximum segment length in tokens. Default is 1000
  - parent_mode Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
  - subchunk_segmentation (object) Child chunk rules
    - separator Child chunk delimiter; currently only one delimiter is allowed. Default is ***
    - max_tokens Maximum child chunk length in tokens; must be shorter than the length of the parent chunk
    - chunk_overlap Overlap between adjacent chunks (optional)
- doc_type Type of document (optional)
  - book Book: document records a book or publication
  - web_page Web page: document records web page content
  - paper Academic paper/article: document records an academic paper or research article
  - social_media_post Social media post: content from social media posts
  - wikipedia_entry Wikipedia entry: content from Wikipedia entries
  - personal_document Personal document: documents related to personal content
  - business_document Business document: documents related to business content
  - im_chat_log Chat log: records of instant messaging chats
  - synced_from_notion Notion document: documents synchronized from Notion
  - synced_from_github GitHub document: documents synchronized from GitHub
  - others Other document types not listed above
- doc_metadata Document metadata (required if doc_type is provided). Fields vary by doc_type.

  For book:
  - title Title of the book
  - language Language of the book
  - author Author of the book
  - publisher Name of the publishing house
  - publication_date Date when the book was published
  - isbn International Standard Book Number (ISBN)
  - category Category or genre of the book

  For web_page:
  - title Title of the web page
  - url URL of the web page
  - language Language of the web page
  - publish_date Date when the web page was published
  - author/publisher Author or publisher of the web page
  - topic/keywords Topics or keywords of the web page
  - description Description of the web page content

  Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
  For doc_type "others", any valid JSON object is accepted.
- content (text) Text content / question content, required
- answer (text) Answer content (optional); pass a value if the knowledge base is in Q&A mode
- keywords (list) Keywords (optional)

- content (text) Text content / question content, required
- answer (text) Answer content (optional); pass a value if the knowledge base is in Q&A mode
- keywords (list) Keywords (optional)
- enabled (bool) false / true (optional)
- regenerate_child_chunks (bool) Whether to regenerate child chunks (optional)
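A minimal sketch of a segment object built from the fields above, assuming optional fields are simply left out when unset. The helper name `build_segment` and the sample text are hypothetical, and the request envelope that carries the segment is not described in this section, so it is not shown.

```python
from typing import Optional


def build_segment(content: str,
                  answer: Optional[str] = None,
                  keywords: Optional[list] = None,
                  enabled: Optional[bool] = None,
                  regenerate_child_chunks: Optional[bool] = None) -> dict:
    """Build one segment object; only content is required."""
    if not content:
        raise ValueError("content is required")
    segment = {"content": content}
    if answer is not None:
        # Only meaningful when the knowledge base is in Q&A mode.
        segment["answer"] = answer
    if keywords:
        segment["keywords"] = list(keywords)
    if enabled is not None:
        segment["enabled"] = enabled
    if regenerate_child_chunks is not None:
        segment["regenerate_child_chunks"] = regenerate_child_chunks
    return segment


qa_segment = build_segment("What is a segment?",
                           answer="A chunk of a document.",
                           keywords=["segment", "chunk"])
print(qa_segment)
```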
- search_method (text) Search method; one of the following four keywords is required:
  - keyword_search Keyword search
  - semantic_search Semantic search
  - full_text_search Full-text search
  - hybrid_search Hybrid search
- reranking_enable (bool) Whether to enable reranking (optional); required if the search method is semantic_search or hybrid_search
- reranking_mode (object) Rerank model configuration; required if reranking is enabled
  - reranking_provider_name (string) Rerank model provider
  - reranking_model_name (string) Rerank model name
- weights (float) Semantic search weight in hybrid search mode
- top_k (integer) Number of results to return (optional)
- score_threshold_enabled (bool) Whether to enable score threshold
- score_threshold (float) Score threshold
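The two conditional rules above (a valid `search_method`, and a rerank model configuration when reranking is enabled) can be checked on the client side before sending a request. The sketch below is one possible way to do that; `validate_retrieval_model` is a hypothetical helper, and the provider and model names are placeholders rather than real identifiers.

```python
SEARCH_METHODS = {"keyword_search", "semantic_search",
                  "full_text_search", "hybrid_search"}


def validate_retrieval_model(model: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if model.get("search_method") not in SEARCH_METHODS:
        problems.append("search_method must be one of: "
                        + ", ".join(sorted(SEARCH_METHODS)))
    if model.get("reranking_enable"):
        # reranking_mode is required once reranking is enabled.
        mode = model.get("reranking_mode") or {}
        for key in ("reranking_provider_name", "reranking_model_name"):
            if not mode.get(key):
                problems.append("reranking_mode." + key + " is required")
    return problems


hybrid = {
    "search_method": "hybrid_search",
    "reranking_enable": True,
    "reranking_mode": {
        "reranking_provider_name": "example-provider",  # placeholder
        "reranking_model_name": "example-rerank-model",  # placeholder
    },
    "weights": 0.7,
    "top_k": 5,
    "score_threshold_enabled": False,
}
print(validate_retrieval_model(hybrid))
print(validate_retrieval_model({"search_method": "semantic_search",
                                "reranking_enable": True}))
```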
| code | status | message |
|---|---|---|
| no_file_uploaded | 400 | Please upload your file. |
| too_many_files | 400 | Only one file is allowed. |
| file_too_large | 413 | File size exceeded. |
| unsupported_file_type | 415 | File type not allowed. |
| high_quality_dataset_only | 400 | Current operation only supports 'high-quality' datasets. |
| dataset_not_initialized | 400 | The dataset is still being initialized or indexing. Please wait a moment. |
| archived_document_immutable | 403 | The archived document is not editable. |
| dataset_name_duplicate | 409 | The dataset name already exists. Please modify your dataset name. |
| invalid_action | 400 | Invalid action. |
| document_already_finished | 400 | The document has been processed. Please refresh the page or go to the document details. |
| document_indexing | 400 | The document is being processed and cannot be edited. |
| invalid_metadata | 400 | The metadata content is incorrect. Please check and verify. |
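As a sketch of how a client might use the table above, the snippet below maps each error code to its status and sorts errors into rough categories. It assumes the error response is a JSON object with a `code` key, mirroring the table columns. The `classify_error` helper and the idea of treating the two "still processing" codes as retryable are assumptions of this example, not documented behavior.

```python
# Status codes copied from the table above.
ERROR_STATUS = {
    "no_file_uploaded": 400,
    "too_many_files": 400,
    "file_too_large": 413,
    "unsupported_file_type": 415,
    "high_quality_dataset_only": 400,
    "dataset_not_initialized": 400,
    "archived_document_immutable": 403,
    "dataset_name_duplicate": 409,
    "invalid_action": 400,
    "document_already_finished": 400,
    "document_indexing": 400,
    "invalid_metadata": 400,
}

# Assumption of this sketch: these codes describe work still in progress,
# so a client might wait and try again instead of failing immediately.
STILL_PROCESSING = {"dataset_not_initialized", "document_indexing"}


def classify_error(body: dict) -> str:
    """Return "retry_later", "client_error", or "unknown" for an error body."""
    code = body.get("code")
    if code not in ERROR_STATUS:
        return "unknown"
    if code in STILL_PROCESSING:
        return "retry_later"
    return "client_error"


print(classify_error({"code": "document_indexing", "status": 400}))
print(classify_error({"code": "file_too_large", "status": 413}))
```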