Data Sources

Integrate your existing content seamlessly with branchly using various data sources. We currently support a website crawler, file upload, an OpenAPI specification, and MySQL/Postgres databases. We are expanding our data source offerings, so if you require a particular source, please reach out to us.

Shared Settings

For every data source except File Upload without a linked document, you can specify a Schedule that determines when the data source updates your content in the background.

  • Schedule: Enter a valid cron expression. Based on the schedule, the data source (for example, the website crawler) runs automatically and updates existing nodes or creates new ones. Use a service like https://crontab.guru/ to generate a cron expression for your data source. Please note that scheduled runs must be at least 60 minutes apart.
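To check a cron expression against the 60-minute minimum before entering it, a rough sketch like the following can help. It is not branchly's own validation. The function names are hypothetical, only the minute and hour fields of a standard five-field expression are inspected, and the day and month fields are assumed to only make runs less frequent.

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field (*, a, a-b, comma lists, /step) into sorted values."""
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"invalid step in {part!r}")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = int(base)
            # "5/15" means "from 5 to the end of the range, every 15"
            end = hi if step_text else start
        if not (lo <= start <= end <= hi):
            raise ValueError(f"value out of range in {part!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def min_gap_minutes(expression: str) -> int:
    """Smallest gap in minutes between two runs, looking at minute and hour only."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    minutes = expand_field(fields[0], 0, 59)
    hours = expand_field(fields[1], 0, 23)
    times = sorted(h * 60 + m for h in hours for m in minutes)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Gap from the last run of one day to the first run of the next day.
    gaps.append(times[0] + 24 * 60 - times[-1])
    return min(gaps)


def meets_minimum(expression: str, minimum: int = 60) -> bool:
    """True if runs are at least `minimum` minutes apart."""
    return min_gap_minutes(expression) >= minimum


print(meets_minimum("0 * * * *"))     # hourly, on the hour
print(meets_minimum("*/30 * * * *"))  # every 30 minutes
```

Because the day, month, and weekday fields are ignored, the computed gap is a lower bound: an expression that passes this check is at least 60 minutes apart on any day it runs.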

Website Crawler

The Website Crawler starts with the start URLs, finds links to other pages, and recursively crawls those pages too, as long as their URL is under the start URL. For example, if you enter the start URL https://example.com/blog/, the crawler will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else. With the option Split Documents, we try to identify the semantic structure of the website from its headings and create multiple Nodes for you. The quality of the result depends on the semantic quality of your website.
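The "under the start URL" rule can be pictured as a prefix check on the URL path. The following is a minimal sketch of that idea, not the crawler's actual implementation. The function name is hypothetical, and the real crawler may normalize URLs differently.

```python
from urllib.parse import urlsplit


def is_in_scope(start_url: str, url: str) -> bool:
    """True if `url` is on the same site as `start_url` and below its path."""
    start, candidate = urlsplit(start_url), urlsplit(url)
    same_site = (start.scheme, start.netloc) == (candidate.scheme, candidate.netloc)
    return same_site and candidate.path.startswith(start.path)


start = "https://example.com/blog/"
for link in (
    "https://example.com/blog/article-1",
    "https://example.com/blog/section/article-2",
    "https://example.com/docs/something-else",
):
    print(link, is_in_scope(start, link))
```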

Configuration Options

  • Split documents: We can try to identify the semantic structure of your website and create multiple Nodes for you based on the HTML headings (h1, h2, etc.) of your page. This setting is helpful when crawling technical documentation or long pages of text.

  • Crawler Type: Choose between Adaptive (default), Cheerio, and Playwright. We recommend the Adaptive type when you are starting out, as it can handle JavaScript and switches dynamically between plain HTTP requests and executing JavaScript.

  • Actor: Users can currently select only the Website Content Crawler. For special use cases, we can build a custom crawler.

  • Include URL Globs: Glob patterns matching URLs of pages that will be included in crawling. Setting this option disables the default scoping based on the Start URLs. Note that this affects only links found on pages.

    Example: start_urls: ["example.com/test"], include_url_globs: ["example.com/test/**", "example.com/test-2/**"]. In this case, URLs found on example.com/test that start with example.com/test-2/ are also included in the results. This is useful if the structure of your website is not hierarchical.

  • Exclude URL Globs: Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled. Example: start_urls: ["example.com"], exclude_url_globs: ["example.com/impressum"]. Here, the URL example.com/impressum is excluded from the results, although it would otherwise be in scope according to the start_urls.

  • Ignore Errors: Select this if, for example, the language code of your pages is not set correctly and you want the crawler to ignore any errors (status code, language code, etc.).

  • Remove HTML Elements: Define HTML elements via their CSS selectors to remove from processing. We have pre-configured the most common elements, such as nav, header and footer.

  • Filters: Define filter criteria based on page metadata. Currently, the fields modified_time and published_time can be filtered by a max_age_days value (e.g. to filter on pages that were last updated over a year ago).
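To get a feel for how the Include URL Globs and Exclude URL Globs options above interact, here is a small sketch using Python's standard fnmatch module. It is an approximation under stated assumptions: the function name is hypothetical, and fnmatch lets * match across "/" and treats ? and [ as wildcards, so its semantics can differ from the glob matching the crawler actually uses.

```python
from fnmatch import fnmatchcase


def passes_globs(url: str, include_globs=(), exclude_globs=()) -> bool:
    """Decide whether a link found on a page would be kept by the glob options."""
    # Exclude globs win: a matching link is dropped.
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    # Include globs replace the default start-URL scoping when they are set.
    if include_globs:
        return any(fnmatchcase(url, pattern) for pattern in include_globs)
    # No include globs: the default start-URL scoping (not modelled here) applies.
    return True


include = ["example.com/test/**", "example.com/test-2/**"]
print(passes_globs("example.com/test-2/page", include_globs=include))
print(passes_globs("example.com/docs/page", include_globs=include))
print(passes_globs("example.com/impressum", exclude_globs=["example.com/impressum"]))
```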

File Upload

We currently support PDF and CSV files. With the option Split Documents, we try to identify the semantic structure of the PDFs and create multiple Nodes for you. As PDFs are inherently unstructured, the results depend on your input documents. We infer the language/locale of the documents automatically. The maximum upload size is 20 MB.

OpenAPI Specification

We provide an option to fetch your data directly from an API using the OpenAPI specification. OpenAPI is a widely used format for describing APIs and is often used to generate clients in code. For more details, see the official OpenAPI Specification documentation.

Configuration Parameters

  • OpenAPI Specification: Paste your OpenAPI specification in JSON format into the designated field.
  • Operation ID: Specify an operationId to identify the endpoint to call. This ID must be unique, a default characteristic of a valid OpenAPI specification.
  • Extra Options: We use default values to make requests. If you want to specify arguments without including them in the OpenAPI specification JSON, add them to the extra_options field. Note that these additional arguments must exist in the schema of the defined endpoint.
  • Data Template Mapping: To align your API responses with the structure of our knowledge base, use Mustache templating syntax. Mustache is a widely used, logic-less templating syntax. More information can be found in the Mustache documentation. A mapping for the parameters title, text, and source is required. The source value should be unique and preferably a valid URL.
  • Response Content Path: This optional parameter should be included if your API's response items are nested inside an object. Use keypath or dot notation to define the path.
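To illustrate why the Operation ID must be unique, the following sketch looks up an operationId in a parsed OpenAPI document and returns the HTTP method and path it belongs to. It relies only on the standard paths layout of an OpenAPI document. The function name and the tiny spec are illustrative assumptions, and this is not how branchly resolves the operation internally.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def find_operation(spec: dict, operation_id: str) -> tuple[str, str]:
    """Return (HTTP method, path) for the operation with the given operationId."""
    matches = [
        (method.upper(), path)
        for path, path_item in spec.get("paths", {}).items()
        for method, operation in path_item.items()
        # Path items may also hold non-operation keys such as "parameters".
        if method in HTTP_METHODS and operation.get("operationId") == operation_id
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one {operation_id!r}, found {len(matches)}")
    return matches[0]


# Illustrative spec fragment, not a complete OpenAPI document.
spec = {
    "paths": {
        "/pet/{petId}": {
            "parameters": [],
            "get": {"operationId": "getPetById"},
            "delete": {"operationId": "deletePet"},
        }
    }
}
print(find_operation(spec, "getPetById"))
```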

Currently, we support the following scenarios:

  • No authentication, api_key, and basic authentication.
    • To use api_key, add your API key to the extra_options field.
    • To use basic authentication, enter your username and password in the extra_options field.
  • The API should respond with a 200 status code and with Content Type application/json.
  • The API call result should be a JSON object or an array of objects.
  • A base_url must be specified either in the OpenAPI file under servers or as an extra option (Example: base_url: https://petstore3.swagger.io/api/v3).
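The base_url requirement can be sketched as a simple lookup. In this hedged example the extra option takes precedence over the servers entry, and the first servers entry is used. Both choices are assumptions for illustration, as is the function name. The actual precedence in branchly is not documented here.

```python
def resolve_base_url(spec: dict, extra_options: dict) -> str:
    """Pick a base URL from extra_options, else from the spec's servers list."""
    # Assumption: an explicit extra option overrides the specification.
    if extra_options.get("base_url"):
        return extra_options["base_url"].rstrip("/")
    servers = spec.get("servers") or []
    # Assumption: the first servers entry is the one to call.
    if servers and servers[0].get("url"):
        return servers[0]["url"].rstrip("/")
    raise ValueError("no base_url in extra_options and no servers entry in the spec")


print(resolve_base_url({}, {"base_url": "https://petstore3.swagger.io/api/v3"}))
print(resolve_base_url({"servers": [{"url": "https://petstore3.swagger.io/api/v3/"}]}, {}))
```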

Example Usage

Consider the following JSON object as your API response:

{
  "status": "OK",
  "count": 100,
  "items": [
    {
      "url": "example.com/1234567",
      "id": "1234567",
      "title": "Title of your content",
      "type": "type",
      "texts": {
        "detail": "Detailed text description",
        "summary": "Short summary"
      }
    },
    ... // more items with the same structure
  ]
}

If you define the following:

  • Response Content Path: items
  • Template Mapping
    • Title: The title of this node is: {{title}}
    • Text: {{texts.detail}}
    • Source: {{url}}

The result for the first item in the items array would be:

  • Title: The title of this node is: Title of your content
  • Text: Detailed text description
  • Source: example.com/1234567

The template is applied to every item in the items array, and each item is loaded into the knowledge base as an individual node.
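The example can be reproduced with a few lines of Python. This is a simplified stand-in for a Mustache renderer, not the library branchly uses: it only handles plain variable tags with dotted names, does no HTML escaping, and raises a KeyError for a missing key where Mustache would render an empty string. The helper names are hypothetical.

```python
import re


def get_path(data, path: str):
    """Follow a dot-separated key path through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def render(template: str, item: dict) -> str:
    """Replace each {{name}} or {{a.b}} tag with the matching value from `item`."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda match: str(get_path(item, match.group(1))),
        template,
    )


response = {
    "status": "OK",
    "count": 100,
    "items": [
        {
            "url": "example.com/1234567",
            "id": "1234567",
            "title": "Title of your content",
            "type": "type",
            "texts": {"detail": "Detailed text description", "summary": "Short summary"},
        }
    ],
}

mapping = {
    "title": "The title of this node is: {{title}}",
    "text": "{{texts.detail}}",
    "source": "{{url}}",
}

# Response Content Path "items" selects the list; each item becomes one node.
nodes = [
    {field: render(template, item) for field, template in mapping.items()}
    for item in get_path(response, "items")
]
print(nodes[0])
```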

HelpSpace Docs

We partner with HelpSpace so that you can easily connect your HelpSpace Docs to the branchly platform.

Required fields:

  • Client ID of HelpSpace Workspace: You can find it under Settings > Access Token > Client ID
  • API Token of HelpSpace Workspace (read-only is sufficient): You can find it under Settings > Access Token > Access Token
  • Site ID of your Docs: You can find it under Docs > Sites > Site ID

Custom Website Crawler (Enterprise-Subscription only)

For enterprise customers, we offer the option to develop a custom crawler for your use case. Please contact us for more information.

MySQL / Postgres

This feature is currently only available to selected test customers and can only be activated by branchly employees. If you have a use case where this would be helpful, please contact us.