import { CodeGroup } from '@/app/components/develop/code.tsx'
import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstruction, Paragraph } from '@/app/components/develop/md.tsx'

# Knowledge API
- book
Book
- web_page
Web page
- paper
Academic paper/article
- social_media_post
Social media post
- wikipedia_entry
Wikipedia entry
- personal_document
Personal document
- business_document
Business document
- im_chat_log
Chat log
- synced_from_notion
Notion document
- synced_from_github
GitHub document
- others
Other document types
For book:
- title
Book title
- language
Book language
- author
Book author
- publisher
Publisher name
- publication_date
Publication date
- isbn
ISBN number
- category
Book category
For web_page:
- title
Page title
- url
Page URL
- language
Page language
- publish_date
Publish date
- author/publisher
Author or publisher
- topic/keywords
Topic or keywords
- description
Page description
Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
For doc_type "others", any valid JSON object is accepted
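The field lists above can be turned into a small client-side check before a request is sent. The following is a minimal sketch, not part of the API: the helper name `unknown_metadata_fields` and the lookup table are hypothetical, and only the `book` and `web_page` fields listed here are covered. The linked `dataset_service.py` remains the authority on which fields are actually required.

```python
# Field names copied from the lists above. Only book and web_page are
# spelled out in this document, so other doc_types are not checked.
DOC_METADATA_FIELDS = {
    "book": {"title", "language", "author", "publisher",
             "publication_date", "isbn", "category"},
    "web_page": {"title", "url", "language", "publish_date",
                 "author/publisher", "topic/keywords", "description"},
}


def unknown_metadata_fields(doc_type: str, doc_metadata: dict) -> set:
    """Return the keys of doc_metadata that are not listed for doc_type."""
    if doc_type == "others":
        # Any valid JSON object is accepted for "others".
        return set()
    allowed = DOC_METADATA_FIELDS.get(doc_type)
    if allowed is None:
        # Field list not given in this document; nothing to check here.
        return set()
    return set(doc_metadata) - allowed
```

An empty result only means that no unlisted key was found. It does not confirm that the server will accept the metadata.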
- high_quality
High quality: embeds content with an embedding model and builds a vector database index
- economy
Economy: builds an inverted index from a keyword table
- text_model
Text documents are directly embedded; `economy` mode defaults to using this form
- hierarchical_model
Parent-child mode
- qa_model
Q&A Mode: Generates Q&A pairs for segmented documents and then embeds the questions
English, Chinese
- mode
(string) Cleaning and segmentation mode: automatic / custom
- rules
(object) Custom rules (in automatic mode, this field is empty)
- pre_processing_rules
(array[object]) Preprocessing rules
- id
(string) Unique identifier for the preprocessing rule
- Enumerated values
- remove_extra_spaces
Replace consecutive spaces, newlines, and tabs
- remove_urls_emails
Delete URLs and email addresses
- enabled
(bool) Whether this rule is selected. If no document ID is passed in, this is the default value.
- segmentation
(object) Segmentation rules
- separator
Custom segment identifier; currently only one delimiter can be set. Default is \n
- max_tokens
Maximum length in tokens; defaults to 1000
- parent_mode
Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
- subchunk_segmentation
(object) Child chunk rules
- separator
Segmentation identifier. Currently, only one delimiter is allowed. The default is ***
- max_tokens
Maximum length in tokens; must be shorter than the parent chunk length
- chunk_overlap
Define the overlap between adjacent chunks (optional)
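To make the shape of a custom `process_rule` concrete, here is a minimal sketch that assembles one from the fields listed above and enforces the rule that child chunks must be shorter than parent chunks. The function name is hypothetical, and the nesting (with `parent_mode` and `subchunk_segmentation` placed under `rules`, and `chunk_overlap` under `subchunk_segmentation`) is an assumption inferred from the list order, since the original indentation is not preserved here.

```python
def build_custom_process_rule(parent_max_tokens, child_max_tokens,
                              parent_mode="paragraph", chunk_overlap=None):
    """Assemble a custom process_rule dict (nesting is an assumption)."""
    if parent_mode not in ("full-doc", "paragraph"):
        raise ValueError("parent_mode must be 'full-doc' or 'paragraph'")
    if child_max_tokens >= parent_max_tokens:
        raise ValueError("child max_tokens must be shorter than the parent chunk")

    subchunk = {"separator": "***", "max_tokens": child_max_tokens}
    if chunk_overlap is not None:
        # chunk_overlap is optional, so it is only sent when given.
        subchunk["chunk_overlap"] = chunk_overlap

    return {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": False},
            ],
            # "\n" is the documented default separator for parent chunks.
            "segmentation": {"separator": "\n", "max_tokens": parent_max_tokens},
            "parent_mode": parent_mode,
            "subchunk_segmentation": subchunk,
        },
    }
```

For example, `build_custom_process_rule(1000, 200)` yields a rule with the documented default separators, while `build_custom_process_rule(500, 500)` raises `ValueError`.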
- search_method
(string) Search method
- hybrid_search
Hybrid search
- semantic_search
Semantic search
- full_text_search
Full-text search
- reranking_enable
(bool) Whether to enable reranking
- reranking_mode
(object) Rerank model configuration
- reranking_provider_name
(string) Rerank model provider
- reranking_model_name
(string) Rerank model name
- top_k
(int) Number of results to return
- score_threshold_enabled
(bool) Whether to enable score threshold
- score_threshold
(float) Score threshold
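The retrieval settings above fit naturally into a single dictionary. The sketch below builds one and derives the two boolean flags from whether a rerank model or a score threshold was supplied. The function name and its arguments are hypothetical, and deriving the flags this way is a convenience chosen for the example rather than documented behavior.

```python
SEARCH_METHODS = {"hybrid_search", "semantic_search", "full_text_search"}


def build_retrieval_model(search_method, top_k, rerank_provider=None,
                          rerank_model=None, score_threshold=None):
    """Build a retrieval settings dict from the fields listed above."""
    if search_method not in SEARCH_METHODS:
        raise ValueError(f"unsupported search_method: {search_method}")

    model = {
        "search_method": search_method,
        "reranking_enable": rerank_model is not None,
        "top_k": top_k,
        "score_threshold_enabled": score_threshold is not None,
    }
    if rerank_model is not None:
        model["reranking_mode"] = {
            "reranking_provider_name": rerank_provider,
            "reranking_model_name": rerank_model,
        }
    if score_threshold is not None:
        model["score_threshold"] = score_threshold
    return model
```

Calling `build_retrieval_model("semantic_search", top_k=3)` produces a configuration with reranking and the score threshold both disabled.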
- original_document_id
Source document ID (optional)
- Used to re-upload the document or to modify its cleaning and segmentation configuration. Any missing information is copied from the source document
- The source document cannot be an archived document
- When original_document_id is passed in, the request updates that document. process_rule is optional; if omitted, the segmentation method of the source document is used by default
- When original_document_id is not passed in, the request creates a new document, and process_rule is required
- indexing_technique
Index mode
- high_quality
High quality: embeds content with an embedding model and builds a vector database index
- economy
Economy: builds an inverted index from a keyword table
- doc_form
Format of indexed content
- text_model
Text documents are directly embedded; `economy` mode defaults to using this form
- hierarchical_model
Parent-child mode
- qa_model
Q&A Mode: Generates Q&A pairs for segmented documents and then embeds the questions
- doc_type
Type of document (optional)
- book
Book
Document records a book or publication
- web_page
Web page
Document records web page content
- paper
Academic paper/article
Document records academic paper or research article
- social_media_post
Social media post
Content from social media posts
- wikipedia_entry
Wikipedia entry
Content from Wikipedia entries
- personal_document
Personal document
Documents related to personal content
- business_document
Business document
Documents related to business content
- im_chat_log
Chat log
Records of instant messaging chats
- synced_from_notion
Notion document
Documents synchronized from Notion
- synced_from_github
GitHub document
Documents synchronized from GitHub
- others
Other document types
Other document types not listed above
- doc_metadata
Document metadata (required if doc_type is provided)
Fields vary by doc_type:
For book:
- title
Book title
Title of the book
- language
Book language
Language of the book
- author
Book author
Author of the book
- publisher
Publisher name
Name of the publishing house
- publication_date
Publication date
Date when the book was published
- isbn
ISBN number
International Standard Book Number
- category
Book category
Category or genre of the book
For web_page:
- title
Page title
Title of the web page
- url
Page URL
URL address of the web page
- language
Page language
Language of the web page
- publish_date
Publish date
Date when the web page was published
- author/publisher
Author or publisher
Author or publisher of the web page
- topic/keywords
Topic or keywords
Topics or keywords of the web page
- description
Page description
Description of the web page content
Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
For doc_type "others", any valid JSON object is accepted
- doc_language
In Q&A mode, specify the language of the document, for example: English, Chinese
- process_rule
Processing rules
- mode
(string) Cleaning and segmentation mode: automatic / custom
- rules
(object) Custom rules (in automatic mode, this field is empty)
- pre_processing_rules
(array[object]) Preprocessing rules
- id
(string) Unique identifier for the preprocessing rule
- Enumerated values
- remove_extra_spaces
Replace consecutive spaces, newlines, and tabs
- remove_urls_emails
Delete URLs and email addresses
- enabled
(bool) Whether this rule is selected. If no document ID is passed in, this is the default value.
- segmentation
(object) Segmentation rules
- separator
Custom segment identifier; currently only one delimiter can be set. Default is \n
- max_tokens
Maximum length in tokens; defaults to 1000
- parent_mode
Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
- subchunk_segmentation
(object) Child chunk rules
- separator
Segmentation identifier. Currently, only one delimiter is allowed. The default is ***
- max_tokens
Maximum length in tokens; must be shorter than the parent chunk length
- chunk_overlap
Define the overlap between adjacent chunks (optional)
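The `original_document_id` rules earlier in this list decide whether a request is an update or a creation, and whether `process_rule` may be omitted. The sketch below expresses that decision as a client-side pre-check. The function name `resolve_operation` is hypothetical, and the server remains responsible for rejecting archived source documents, which cannot be checked from the request alone.

```python
def resolve_operation(data: dict) -> str:
    """Return 'update' or 'create' for a request body, per the rules above."""
    if data.get("original_document_id"):
        # Update: process_rule is optional; the source document's
        # segmentation method is used when it is omitted.
        return "update"
    if "process_rule" not in data:
        raise ValueError(
            "process_rule is required when original_document_id is not passed in"
        )
    return "create"
```

For instance, `resolve_operation({"process_rule": {"mode": "automatic"}})` returns `"create"`, while an empty body raises `ValueError`.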
- search_method
(string) Search method
- hybrid_search
Hybrid search
- semantic_search
Semantic search
- full_text_search
Full-text search
- reranking_enable
(bool) Whether to enable reranking
- reranking_mode
(object) Rerank model configuration
- reranking_provider_name
(string) Rerank model provider
- reranking_model_name
(string) Rerank model name
- top_k
(int) Number of results to return
- score_threshold_enabled
(bool) Whether to enable score threshold
- score_threshold
(float) Score threshold
- book
Book
- web_page
Web page
- paper
Academic paper/article
- social_media_post
Social media post
- wikipedia_entry
Wikipedia entry
- personal_document
Personal document
- business_document
Business document
- im_chat_log
Chat log
- synced_from_notion
Notion document
- synced_from_github
GitHub document
- others
Other document types
For book:
- title
Book title
- language
Book language
- author
Book author
- publisher
Publisher name
- publication_date
Publication date
- isbn
ISBN number
- category
Book category
For web_page:
- title
Page title
- url
Page URL
- language
Page language
- publish_date
Publish date
- author/publisher
Author or publisher
- topic/keywords
Topic or keywords
- description
Page description
Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
For doc_type "others", any valid JSON object is accepted
- high_quality
High quality
- economy
Economy
- only_me
Only me
- all_team_members
All team members
- partial_members
Partial members
- vendor
Vendor
- external
External knowledge
- mode
(string) Cleaning and segmentation mode: automatic / custom
- rules
(object) Custom rules (in automatic mode, this field is empty)
- pre_processing_rules
(array[object]) Preprocessing rules
- id
(string) Unique identifier for the preprocessing rule
- Enumerated values
- remove_extra_spaces
Replace consecutive spaces, newlines, and tabs
- remove_urls_emails
Delete URLs and email addresses
- enabled
(bool) Whether this rule is selected. If no document ID is passed in, this is the default value.
- segmentation
(object) Segmentation rules
- separator
Custom segment identifier; currently only one delimiter can be set. Default is \n
- max_tokens
Maximum length in tokens; defaults to 1000
- parent_mode
Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
- subchunk_segmentation
(object) Child chunk rules
- separator
Segmentation identifier. Currently, only one delimiter is allowed. The default is ***
- max_tokens
Maximum length in tokens; must be shorter than the parent chunk length
- chunk_overlap
Define the overlap between adjacent chunks (optional)
- mode
(string) Cleaning and segmentation mode: automatic / custom
- rules
(object) Custom rules (in automatic mode, this field is empty)
- pre_processing_rules
(array[object]) Preprocessing rules
- id
(string) Unique identifier for the preprocessing rule
- Enumerated values
- remove_extra_spaces
Replace consecutive spaces, newlines, and tabs
- remove_urls_emails
Delete URLs and email addresses
- enabled
(bool) Whether this rule is selected. If no document ID is passed in, this is the default value.
- segmentation
(object) Segmentation rules
- separator
Custom segment identifier; currently only one delimiter can be set. Default is \n
- max_tokens
Maximum length in tokens; defaults to 1000
- parent_mode
Retrieval mode of parent chunks: full-doc (full-text retrieval) / paragraph (paragraph retrieval)
- subchunk_segmentation
(object) Child chunk rules
- separator
Segmentation identifier. Currently, only one delimiter is allowed. The default is ***
- max_tokens
Maximum length in tokens; must be shorter than the parent chunk length
- chunk_overlap
Define the overlap between adjacent chunks (optional)
- doc_type
Type of document (optional)
- book
Book
Document records a book or publication
- web_page
Web page
Document records web page content
- paper
Academic paper/article
Document records academic paper or research article
- social_media_post
Social media post
Content from social media posts
- wikipedia_entry
Wikipedia entry
Content from Wikipedia entries
- personal_document
Personal document
Documents related to personal content
- business_document
Business document
Documents related to business content
- im_chat_log
Chat log
Records of instant messaging chats
- synced_from_notion
Notion document
Documents synchronized from Notion
- synced_from_github
GitHub document
Documents synchronized from GitHub
- others
Other document types
Other document types not listed above
- doc_metadata
Document metadata (required if doc_type is provided)
Fields vary by doc_type:
For book:
- title
Book title
Title of the book
- language
Book language
Language of the book
- author
Book author
Author of the book
- publisher
Publisher name
Name of the publishing house
- publication_date
Publication date
Date when the book was published
- isbn
ISBN number
International Standard Book Number
- category
Book category
Category or genre of the book
For web_page:
- title
Page title
Title of the web page
- url
Page URL
URL address of the web page
- language
Page language
Language of the web page
- publish_date
Publish date
Date when the web page was published
- author/publisher
Author or publisher
Author or publisher of the web page
- topic/keywords
Topic or keywords
Topics or keywords of the web page
- description
Page description
Description of the web page content
Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
For doc_type "others", any valid JSON object is accepted
- content
(text) Text content / question content (required)
- answer
(text) Answer content; pass this value when the knowledge base is in Q&A mode (optional)
- keywords
(list) Keywords (optional)
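As an illustration of how these three fields combine, the sketch below builds a single chunk object. The function name is hypothetical, and how the object is wrapped in the surrounding request body is not shown because it is not specified in this list.

```python
def build_segment(content, answer=None, keywords=None):
    """Build one chunk object from the fields listed above."""
    if not content:
        raise ValueError("content is required")
    segment = {"content": content}
    if answer is not None:
        # In Q&A mode, content holds the question and answer holds the answer.
        segment["answer"] = answer
    if keywords:
        segment["keywords"] = list(keywords)
    return segment
```

A plain text chunk needs only `content`. A Q&A chunk would pass the question as `content` and the answer as `answer`.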
- content
(text) Text content / question content (required)
- answer
(text) Answer content; pass this value when the knowledge base is in Q&A mode (optional)
- keywords
(list) Keywords (optional)
- enabled
(bool) false / true (optional)
- regenerate_child_chunks
(bool) Whether to regenerate child chunks (optional)
- search_method
(text) Search method (required); one of the following four keywords:
- keyword_search
Keyword search
- semantic_search
Semantic search
- full_text_search
Full-text search
- hybrid_search
Hybrid search
- reranking_enable
(bool) Whether to enable reranking; optional in general, required if the search method is semantic_search or hybrid_search
- reranking_mode
(object) Rerank model configuration; required if reranking is enabled
- reranking_provider_name
(string) Rerank model provider
- reranking_model_name
(string) Rerank model name
- weights
(float) Weight given to semantic search in hybrid search mode
- top_k
(integer) Number of results to return (optional)
- score_threshold_enabled
(bool) Whether to enable score threshold
- score_threshold
(float) Score threshold
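The conditions attached to these fields can be checked before a retrieval request is sent. The following sketch collects problems instead of raising on the first one. The function name is hypothetical, and the last check (a `score_threshold` value must accompany `score_threshold_enabled`) is an assumption that the list above does not state explicitly.

```python
RETRIEVE_SEARCH_METHODS = {
    "keyword_search", "semantic_search", "full_text_search", "hybrid_search",
}


def check_retrieval_model(model: dict) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if model.get("search_method") not in RETRIEVE_SEARCH_METHODS:
        problems.append(
            "search_method must be one of: "
            + ", ".join(sorted(RETRIEVE_SEARCH_METHODS))
        )
    if model.get("reranking_enable"):
        mode = model.get("reranking_mode") or {}
        for key in ("reranking_provider_name", "reranking_model_name"):
            if not mode.get(key):
                problems.append(
                    f"reranking_mode.{key} is required when reranking is enabled"
                )
    # Assumption: an enabled threshold without a value is treated as an error.
    if model.get("score_threshold_enabled") and "score_threshold" not in model:
        problems.append("score_threshold is missing although it is enabled")
    return problems
```

Returning a list lets a caller report every issue at once, which is usually friendlier than fixing one field per round trip.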
| code | status | message |
|---|---|---|
| no_file_uploaded | 400 | Please upload your file. |
| too_many_files | 400 | Only one file is allowed. |
| file_too_large | 413 | File size exceeded. |
| unsupported_file_type | 415 | File type not allowed. |
| high_quality_dataset_only | 400 | Current operation only supports 'high-quality' datasets. |
| dataset_not_initialized | 400 | The dataset is still being initialized or indexing. Please wait a moment. |
| archived_document_immutable | 403 | The archived document is not editable. |
| dataset_name_duplicate | 409 | The dataset name already exists. Please modify your dataset name. |
| invalid_action | 400 | Invalid action. |
| document_already_finished | 400 | The document has been processed. Please refresh the page or go to the document details. |
| document_indexing | 400 | The document is being processed and cannot be edited. |
| invalid_metadata | 400 | The metadata content is incorrect. Please check and verify. |
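A client can use the `code` column to decide how to react to a failed request. The sketch below is one possible approach under two assumptions: that the error response is a JSON object carrying a `code` key, matching the table's first column, and that the two codes whose messages describe ongoing processing are worth retrying later. The names `KNOWN_ERRORS`, `TRANSIENT_CODES`, and `classify_error` are hypothetical.

```python
# Error codes and HTTP statuses copied from the table above.
KNOWN_ERRORS = {
    "no_file_uploaded": 400,
    "too_many_files": 400,
    "file_too_large": 413,
    "unsupported_file_type": 415,
    "high_quality_dataset_only": 400,
    "dataset_not_initialized": 400,
    "archived_document_immutable": 403,
    "dataset_name_duplicate": 409,
    "invalid_action": 400,
    "document_already_finished": 400,
    "document_indexing": 400,
    "invalid_metadata": 400,
}

# Assumption: these two describe work still in progress, so waiting may help.
TRANSIENT_CODES = {"dataset_not_initialized", "document_indexing"}


def classify_error(body: dict) -> str:
    """Map an error body to 'retry_later', 'fix_request', or 'unknown'."""
    code = body.get("code")
    if code not in KNOWN_ERRORS:
        return "unknown"
    if code in TRANSIENT_CODES:
        return "retry_later"
    return "fix_request"
```

Matching on `code` rather than on the HTTP status matters here because most of these errors share status 400.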