Data Sources
Integrate existing content seamlessly with branchly using various data sources. We currently support integration through a website crawler, file upload, an OpenAPI specification, and MySQL/Postgres databases. We are expanding our data source offerings, so if you need a particular source, please reach out to us.
Shared Settings
For all data sources except File Upload without a linked document, you can specify a Schedule that determines when the data source updates your content in the background.
- Schedule: Enter a valid cron expression. The data source runs automatically on this schedule and updates existing nodes or creates new ones. Use a service like https://crontab.guru/ to generate a cron expression for your data source. Please note that the minimum time between scheduled runs is 60 minutes.
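As a rough pre-check for the 60-minute rule, the sketch below tests whether a five-field cron expression runs at one fixed minute of the hour. The helper name fires_at_most_hourly is hypothetical and not part of branchly. The check is deliberately conservative: a single fixed minute guarantees at least 60 minutes between runs, but some valid schedules written in other ways would be rejected by it.

```python
def fires_at_most_hourly(cron_expression: str) -> bool:
    """Return True if the expression runs at one fixed minute of the hour."""
    fields = cron_expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    minute = fields[0]
    # A single number in the minute field means at most one run per hour,
    # so two consecutive runs are always at least 60 minutes apart.
    if not (minute.isascii() and minute.isdigit()):
        return False
    return 0 <= int(minute) <= 59


print(fires_at_most_hourly("0 3 * * *"))     # daily at 03:00
print(fires_at_most_hourly("*/15 * * * *"))  # every 15 minutes
```

The first expression passes the check, while the second is rejected because it would run four times per hour.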
Website Crawler
The Website Crawler starts with the start URLs, finds links to other pages, and recursively crawls those pages as well, as long as their URL is under the start URL. For example, if you enter the start URL https://example.com/blog/, the crawler will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else. With the Split Documents option, we try to identify the semantic structure of the website from its headings and create multiple Nodes for you. The results depend on the semantic quality of your website.
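The scoping rule described above can be sketched as a simple prefix comparison. This is an illustrative assumption, not the crawler's actual implementation, which may normalize URLs before comparing them. The function name in_scope is hypothetical.

```python
def in_scope(url: str, start_url: str) -> bool:
    """Return True if url lies under start_url (plain prefix comparison)."""
    return url.startswith(start_url)


start = "https://example.com/blog/"
for candidate in (
    "https://example.com/blog/article-1",
    "https://example.com/blog/section/article-2",
    "https://example.com/docs/something-else",
):
    # The two blog URLs are in scope; the docs URL is not.
    print(candidate, in_scope(candidate, start))
```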
Configuration Options
- Split documents: We can try to identify the semantic structure of your website and create multiple Nodes for you based on the HTML headings (h1, h2, etc.) of your page. This setting is helpful when crawling technical documentation or long pages of text.
- Crawler Type: Choose between Adaptive (default), Cheerio and Playwright. We recommend the Adaptive type when you are starting out, as it can handle JavaScript and switches dynamically between plain HTTP requests and executing JavaScript.
- Actor: For users, only the Website Content Crawler is available. For special use cases, we can build a custom crawler.
- Include URL Globs: Glob patterns matching URLs of pages that will be included in crawling. Setting this option disables the default scoping based on the Start URLs. Note that this affects only links found on pages. Example:
  start_urls: ["example.com/test"], include_url_globs = ["example.com/test/**", "example.com/test-2/**"]
  In this case, URLs found on example.com/test that start with example.com/test-2/ are also included in the results. This is beneficial if the structure of your website is not hierarchical.
- Exclude URL Globs: Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled. Example:
  start_urls: ["example.com"], exclude_url_globs = ["example.com/impressum"]
  Here, the specific URL example.com/impressum is excluded from the results, although it would otherwise be in scope according to start_urls.
- Ignore Errors: Select this if your language code is not set correctly and you want to ignore any errors (status code, language code, etc.).
- Remove HTML Elements: Define HTML elements via their CSS selectors to remove from processing. We have pre-configured the most common elements, such as nav, header and footer.
- Filters: Define filter criteria on page metadata. We currently allow the fields modified_time and published_time to be filtered by a max_age_days value (e.g. pages that were updated over a year ago).
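How the include and exclude globs interact with the Start URLs can be sketched as follows. This is a minimal illustration under assumptions: it uses Python's fnmatch module as a stand-in for the crawler's glob matching (here * and ** both match across slashes), and it assumes that exclude globs take precedence over include globs, which the text above does not state. The function name should_crawl is hypothetical.

```python
from fnmatch import fnmatchcase


def should_crawl(url, start_urls, include_globs=(), exclude_globs=()):
    """Decide whether a link found on a page would be followed."""
    if url in start_urls:
        return True  # Start URLs are always crawled.
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False  # Assumption: an exclude match wins over an include match.
    if include_globs:
        # Include globs replace the default scoping based on the Start URLs.
        return any(fnmatchcase(url, pattern) for pattern in include_globs)
    # Default scoping: the link must lie under one of the Start URLs.
    return any(url.startswith(start) for start in start_urls)


include = ["example.com/test/**", "example.com/test-2/**"]
print(should_crawl("example.com/test-2/page", ["example.com/test"], include_globs=include))
print(should_crawl("example.com/impressum", ["example.com"],
                   exclude_globs=["example.com/impressum"]))
```

The first call mirrors the include example (the test-2 page is followed), and the second mirrors the exclude example (the impressum page is skipped).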
File Upload
We currently support reading PDF and CSV files. With the Split Documents option, we try to identify the semantic structure of the PDFs and create multiple Nodes for you. As PDFs are inherently unstructured, the results depend on your input documents. We infer the language/locale of the documents. The maximum upload size is 20 MB.
OpenAPI Specification
We provide an option to fetch your data directly from an API using the OpenAPI specification. OpenAPI is a widely used format for describing APIs and is often used to generate clients in code. For more details, see the official OpenAPI Specification documentation.
Configuration Parameters
- OpenAPI Specification: Paste your OpenAPI specification in JSON format into the designated field.
- Operation ID: Specify an operationId to identify the endpoint to call. This ID must be unique, which is already the case in any valid OpenAPI specification.
- Extra Options: We use default values to make requests. If you want to specify arguments without including them in the OpenAPI specification JSON, add them to the extra_options field. Note that these additional arguments must exist in the schema of the defined endpoint.
- Data Template Mapping: To align your API responses with the structure of our knowledge base, use Mustache templating syntax. Mustache is a logic-less templating syntax that is widely used for this purpose. More information can be found in the Mustache documentation. Mappings for the parameters title, text, and source are required. The source value should be unique and preferably a valid URL.
- Response Content Path: This optional parameter should be included if your API's response items are nested inside an object. Use keypath or dot notation to define the path.
Currently, we support the following scenarios:
- No authentication, api_key, and basic authentication.
  - To use api_key, add your API key to the extra_options field.
  - To use basic authentication, enter your username and password in the extra_options field.
- The API should respond with a 200 status code and with Content-Type application/json.
- The API call result should be a JSON object or an array of objects.
- A base_url must be specified either in the OpenAPI file under servers or as an extra option (Example: base_url: https://petstore3.swagger.io/api/v3).
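The response requirements listed above can be expressed as a small check. This is a sketch only: the function name check_response and its arguments are hypothetical, and the actual validation performed by branchly may differ.

```python
import json


def check_response(status_code: int, content_type: str, body: str):
    """Parse the body if the response meets the stated requirements."""
    if status_code != 200:
        raise ValueError(f"expected status 200, got {status_code}")
    # Ignore parameters such as "; charset=utf-8" when comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json":
        raise ValueError(f"expected application/json, got {media_type}")
    data = json.loads(body)
    if isinstance(data, dict):
        return data
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data
    raise ValueError("expected a JSON object or an array of objects")


print(check_response(200, "application/json; charset=utf-8", '{"items": []}'))
```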
Example Usage
Consider the following JSON object as your API response:
{
  "status": "OK",
  "count": 100,
  "items": [
    {
      "url": "example.com/1234567",
      "id": "1234567",
      "title": "Title of your content",
      "type": "type",
      "texts": {
        "detail": "Detailed text description",
        "summary": "Short summary"
      }
    },
    ... // more items with the same structure
  ]
}
If you define the following:
- Response Content Path: items
- Template Mapping
  - Title: The title of this node is: {{title}}
  - Text: {{texts.detail}}
  - Source: {{url}}
The result for the first item in the items array would be:
- Title: The title of this node is: Title of your content
- Text: Detailed text description
- Source: example.com/1234567
The template will be applied to all items in the items array, and they will be loaded into the knowledge base as individual nodes.
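The mapping step can be reproduced with a short sketch. It is not a real Mustache implementation: the hypothetical helpers resolve and render only substitute {{path}} placeholders using dot notation, they do no HTML escaping, and they raise a KeyError for missing keys. The text template uses the key texts.detail, as it appears in the sample response.

```python
import re


def resolve(data, path):
    """Follow a dot-notation path such as "texts.detail" through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def render(template, item):
    """Replace each {{path}} placeholder with the matching value from item."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda match: str(resolve(item, match.group(1))),
        template,
    )


response = {
    "status": "OK",
    "count": 100,
    "items": [
        {
            "url": "example.com/1234567",
            "id": "1234567",
            "title": "Title of your content",
            "type": "type",
            "texts": {"detail": "Detailed text description", "summary": "Short summary"},
        }
    ],
}

mapping = {
    "title": "The title of this node is: {{title}}",
    "text": "{{texts.detail}}",
    "source": "{{url}}",
}

# The Response Content Path "items" selects the list; each item becomes one node.
nodes = [
    {field: render(template, item) for field, template in mapping.items()}
    for item in resolve(response, "items")
]
print(nodes[0])
```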
HelpSpace Docs
We are partnering with HelpSpace to easily connect your HelpSpace Docs to the branchly platform.
Required fields:
- Client ID of HelpSpace Workspace: You can find it under Settings > Access Token > Client ID.
- API Token of HelpSpace Workspace (read-only is sufficient): You can find it under Settings > Access Token > Access Token.
- Site ID of your Docs: You can find it under Docs > Sites > Site ID.
Custom Website Crawler (Enterprise-Subscription only)
For enterprise customers, we offer the option to develop a custom crawler for your use case. Please contact us for more information.
MySQL / Postgres
This feature is currently only in use for selected test customers and can only be activated by branchly employees. If you have a use case where this would be helpful, please contact us.