Data Sources
Integrate existing content seamlessly with branchly using various data sources. We currently support integration via a website crawler, file upload, OpenAPI specification, and MySQL/Postgres databases. We are expanding our data source offerings, so if you require a particular source, please reach out to us.
Shared Settings
For all data sources except File Upload (unless you link a document), you can specify a Schedule that determines when the data source updates your content in the background.
- Schedule: Enter a valid cron expression. Based on the schedule, the crawler runs automatically and updates existing nodes or creates new ones. Use a service like https://crontab.guru/ to generate a cron expression for your data source. Please note that the minimum time between schedules is 60 minutes.
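For illustration, here are a few standard five-field cron expressions and their meanings (plain cron semantics, nothing branchly-specific):

```python
# Standard 5-field cron: minute hour day-of-month month day-of-week.
# The first three respect branchly's 60-minute minimum interval between runs;
# the last one is too frequent and would not be accepted.
schedules = {
    "0 * * * *": "every hour, on the hour",
    "0 3 * * *": "every day at 03:00",
    "0 6 * * 1": "every Monday at 06:00",
    "*/15 * * * *": "every 15 minutes -- below the 60-minute minimum",
}
for expression, meaning in schedules.items():
    print(f"{expression:15} {meaning}")
```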
Website Crawler
The Website Crawler starts with the start URLs, finds links to other pages, and recursively crawls those pages too, as long as their URLs are under the start URL. For example, if you enter the start URL https://example.com/blog/, the crawler will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else. Using the option Split Documents, we can try to identify the semantic structure of the website from its headings and create multiple Nodes for you. The quality of the result depends on the semantic quality of your website.
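As a rough sketch, the default scoping rule boils down to a prefix check on the URL (a simplification; the real crawler also normalizes URLs and follows the glob options described below):

```python
def in_scope(url: str, start_url: str) -> bool:
    """Simplified default scoping: crawl a page only if its URL is under the start URL."""
    return url.startswith(start_url)

print(in_scope("https://example.com/blog/article-1", "https://example.com/blog/"))          # True
print(in_scope("https://example.com/blog/section/article-2", "https://example.com/blog/"))  # True
print(in_scope("https://example.com/docs/something-else", "https://example.com/blog/"))     # False
```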
Configuration Options
- Split Documents: We can try to identify the semantic structure of your website and create multiple Nodes for you based on the HTML headings (`h1`, `h2`, etc.) of your page. This setting is helpful when crawling technical documentation or long pages of text.
- Crawler Type: Choose between `Adaptive` (default), `Cheerio`, and `Playwright`. We recommend the `Adaptive` type when starting out, as it can handle JavaScript and switches dynamically between plain HTTP requests and JavaScript execution.
- Actor: For users, only the `Website Content Crawler` is available. For special use cases, we can build a custom crawler.
- Include URL Globs: Glob patterns matching URLs of pages to include in crawling. Setting this option disables the default Start-URL-based scoping. Note that this affects only links found on pages. Example: `start_urls: ["example.com/test"]`, `include_url_globs: ["example.com/test/**", "example.com/test-2/**"]`. In this case, URLs found on `example.com/test` that start with `example.com/test-2/` are also included in the results. This is beneficial if the structure of your website is not hierarchical.
- Exclude URL Globs: Glob patterns matching URLs of pages to exclude from crawling. Note that this affects only links found on pages, not Start URLs, which are always crawled. Example: `start_urls: ["example.com"]`, `exclude_url_globs: ["example.com/impressum"]`. Here, the specific URL `example.com/impressum` is excluded from the results, although it would be in scope according to the `start_urls`.
- Ignore Errors: Select this if your language code is not set correctly and you want to ignore any errors (status code, language code, etc.).
- Remove HTML Elements: Define HTML elements, via their CSS selectors, to remove from processing. We have pre-configured the most common elements, such as `nav`, `header`, and `footer`. Must be provided as a comma-separated list. Do not escape custom attributes such as `div[my-custom-attribute="hello"]`, as this is done by the system. You can use the browser tools and the browser console with the method `document.querySelectorAll()` to identify the CSS selectors. Here, you may not need to escape special characters.
- Filters: Define page metadata filter criteria. We currently allow the fields `modified_time` and `published_time` to be filtered by a `max_age_days` value (e.g. pages that were last updated over a year ago).
- Custom Tags: A list of tags shown for each search result when searching via the Search Interface or Navigator. All web pages crawled using this data source will have a `tags` array set inside their `custom_metadata` (i.e. the tags are shared among all web pages from this Data Source).
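The include/exclude behaviour can be sketched with Python's `fnmatch` module. This is an approximation for illustration only: the crawler's actual glob engine may treat `**` and path separators differently.

```python
from fnmatch import fnmatchcase

include_globs = ["example.com/test/**", "example.com/test-2/**"]
exclude_globs = ["example.com/impressum"]

def should_crawl(url: str) -> bool:
    """Approximate filter: a discovered link is kept when it matches an
    include glob (if any are configured) and no exclude glob."""
    if include_globs and not any(fnmatchcase(url, g) for g in include_globs):
        return False
    return not any(fnmatchcase(url, g) for g in exclude_globs)

print(should_crawl("example.com/test/page"))     # True
print(should_crawl("example.com/test-2/other"))  # True
print(should_crawl("example.com/impressum"))     # False
```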
File Upload
We currently support reading in PDF and CSV files. Using the option Split Documents, we can try to identify the semantic structure of the PDFs and create multiple Nodes for you. As PDFs are inherently unstructured, the results of this depend on your input documents. We infer the language/locale of the documents. The maximum upload size is 20MB.
Webhooks
branchly supports Webhooks so that you can push knowledge base items (e.g. articles or web pages) directly from your system (e.g. CMS, shop system, or database) to our platform.
- Push Data Updates: Instead of branchly pulling from your API, your system sends webhook events for create, update, or delete operations to keep your data in sync.
- API Keys: When setting up a Webhook Data Source, generate an API key. Note: The API key is shown only once—store it securely. You can regenerate the key anytime; the old key will then be invalid.
- Public Documentation: For the OpenAPI specification, required payload attributes, and other technical details on how to use Webhooks, please refer to: https://api.branchly.io/public/docs#/
We are working on accepting arbitrary custom payloads using our Mapping System (similar to the OpenAPI specification data source) to match them to our data format.
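As a rough sketch of the push model, the snippet below builds (without sending) an update event request using only the standard library. The endpoint path, payload shape, and auth header are assumptions for illustration; the authoritative contract is the public documentation linked above.

```python
import json
import urllib.request

def build_update_event(api_key: str, item: dict) -> urllib.request.Request:
    """Build (but do not send) a hypothetical 'update' webhook request."""
    body = json.dumps({"event": "update", "item": item}).encode("utf-8")
    return urllib.request.Request(
        "https://api.branchly.io/public/webhooks",  # placeholder URL -- check the docs
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

req = build_update_event("YOUR_API_KEY", {"id": "1234567", "title": "Updated title"})
print(req.get_method(), req.full_url)
```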
OpenAPI specification
We provide an option to fetch your data directly from an API using the OpenAPI specification. OpenAPI is a widely used format for describing APIs, often used to generate client code. For more details on the OpenAPI specification, visit here.
Configuration Parameters
- OpenAPI Specification: Paste your OpenAPI specification in JSON format into the designated field.
- Operation ID: Specify an `operationId` to identify the endpoint to call. This ID must be unique, a default characteristic of a valid OpenAPI specification.
- Extra Options: We use default values to make requests. If you want to specify arguments without including them in the OpenAPI specification JSON, add them to the `extra_options` field. Note that these additional arguments must exist in the `schema` of the defined endpoint.
- Data Template Mapping: To align your API responses with the structure of our knowledge base, use `moustache` templating syntax. Moustache is a logic-less templating syntax widely used for this purpose; more information about moustache can be found here. Mappings for the parameters `title`, `text`, and `source` are required. The `source` value should be unique and preferably a valid URL.
- Response Content Path: This optional parameter should be included if your API's response items are nested inside an object. Use a keypath (dot notation) to define the path.
Currently, we support the following scenarios:
- No authentication, `api_key`, and basic authentication are supported.
  - To use `api_key`, add your API key to the `extra_options` field.
  - To use basic authentication, enter your username and password in the `extra_options` field.
- The API should respond with a 200 status code and Content Type `application/json`.
- The API call result should be a JSON object or an array of objects.
- A `base_url` must be specified either in the OpenAPI file under `servers` or as an extra option (example: `base_url: https://petstore3.swagger.io/api/v3`).
Example Usage
Consider the following JSON object as your API response:
{
"status": "OK",
"count": 100,
"items": [
{
"url": "example.com/1234567",
"id": "1234567",
"title": "Title of your content",
"type": "type",
"texts": {
"detail": "Detailed text description",
"summary": "Short summary"
}
},
... // more items with the same structure
]
}
If you define the following:
- Response Content Path: `items`
- Template Mapping:
  - Title: `The title of this node is: {{title}}`
  - Text: `{{texts.detail}}`
  - Source: `{{url}}`

The result for the first item in the `items` array would be:
- Title: The title of this node is: Title of your content
- Text: Detailed text description
- Source: example.com/1234567

The template is applied to all items in the `items` array, and each item is loaded into the knowledge base as an individual node.
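To make the mapping concrete, here is a minimal Python sketch of the two steps: resolving the Response Content Path, then applying the templates. branchly's actual engine is moustache; the regex substitution below only emulates simple `{{path.to.field}}` lookups.

```python
import re

response = {
    "status": "OK",
    "count": 100,
    "items": [
        {
            "url": "example.com/1234567",
            "id": "1234567",
            "title": "Title of your content",
            "type": "type",
            "texts": {"detail": "Detailed text description", "summary": "Short summary"},
        }
    ],
}

def resolve(obj, path: str):
    """Follow a dot-notation keypath, e.g. 'texts.detail'."""
    for key in path.split("."):
        obj = obj[key]
    return obj

def render(template: str, item: dict) -> str:
    """Emulate simple moustache variable substitution (no sections/partials)."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: str(resolve(item, m.group(1))), template)

items = resolve(response, "items")  # Response Content Path: items
node = {
    "title": render("The title of this node is: {{title}}", items[0]),
    "text": render("{{texts.detail}}", items[0]),
    "source": render("{{url}}", items[0]),
}
print(node)
```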
HelpSpace Docs
We are partnering with HelpSpace to easily connect your HelpSpace Docs to the branchly platform.
Required fields:
- Client ID of HelpSpace Workspace: found under `Settings > Access Token > Client ID`
- API Token of HelpSpace Workspace (read-only is sufficient): found under `Settings > Access Token > Access Token`
- Site ID of your Docs: found under `Docs > Sites > Site ID`
Custom Website Crawler (Enterprise-Subscription only)
For enterprise customers, we offer to develop a custom crawler for your use case. Please contact us for more information.
MySQL / Postgres
This feature is currently in use only for selected test customers and can only be activated by branchly employees. If you have a use case where this would be helpful, please contact us.