Diffbot provides AI-powered tools to extract and structure data from web pages, transforming unstructured web content into structured, linked data. On Nagent, Diffbot is exposed as a fully configurable artificial intelligence integration that any agent can call, with 35 actions and API key authentication. No code is required to wire Diffbot into your workflow: connect it once via the External Integrations panel and reuse it across every agent you build.
Agent builders use Diffbot to automate tasks that artificial intelligence teams previously handled manually. Concrete examples, each a single agent step in Nagent, include:
Every action and trigger is paired with a structured input/output schema (visible in the sections below), so when you wire Diffbot into Helix, our agent builder, the editor knows exactly what each step expects and produces. Configure once, deploy anywhere across your Nagent agents.
Every operation an agent can call against Diffbot, with input parameters and output schema. Drop these into any step of an agent built in Helix.
DIFFBOT_COMBINE_ENTITY_PROFILES
Combine multiple entity profiles into a unified view using the Diffbot Knowledge Graph. Returns enhanced person or organization data by matching on identifying attributes like name, email, employer, or URL. Use this to enrich partial entity data, merge duplicate profiles, or verify entity identity.
Input parameters
IP address of the entity to enhance. Can be used with types Person and Organization.
Origin or homepage URI(s) of entity to enhance. Can be used with types Person and Organization.
Name(s) of the entity to enhance. Can be used with types Person and Organization. Multiple names can be provided for better matching.
Valid entity types for the combine API.
Email address of the entity to enhance. Can be used only with type Person.
Phone number of the entity to enhance. Can be used with types Person and Organization.
Job title of the entity to enhance. Can be used only with type Person.
Semi-colon separated path filter to include specific fields in response JSON. Use dot notation (e.g., 'skills.name') or JsonPath expression (e.g., '$.name;$.locations.country.name').
School or educational institution of the entity to enhance. Can be used only with type Person.
When true, Diffbot will attempt to search the web for origins and merge relevant results with what's found in the Knowledge Graph.
When true, Diffbot will attempt to recrawl all origins of the identified entity and reconstruct the entity from refreshed data.
User-defined ID for correlation and tracking purposes.
Employer of the entity to enhance. Can be used only with type Person.
JSON mode options for response formatting.
Location of the entity to enhance. Can be used with types Person and Organization.
Enhance similarity threshold score (0.0 to 1.0). Higher values require stronger match confidence.
Semi-colon separated path filter to exclude specific fields from response JSON. Use dot notation or JsonPath expression.
When true, returns non-canonical facts in addition to canonical ones.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
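The path filter and similarity threshold described above can be prepared before the agent step runs. The following is a minimal Python sketch, not part of Nagent or Diffbot: the helper names and the keys `filter` and `threshold` are assumptions for illustration, since the parameter names are not shown in this section.

```python
def build_path_filter(paths):
    """Join field paths into the semicolon-separated filter string described above."""
    cleaned = [p.strip() for p in paths if p and p.strip()]
    return ";".join(cleaned)


def check_threshold(value):
    """Validate that a similarity threshold lies in the documented 0.0 to 1.0 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be between 0.0 and 1.0, got {value}")
    return value


# Hypothetical step input; the key names are illustrative only.
combine_input = {
    "filter": build_path_filter(["$.name", "$.locations.country.name"]),
    "threshold": check_threshold(0.8),
}
print(combine_input["filter"])
```

Validating the threshold locally keeps an out-of-range value from reaching the action at all, which is usually easier to debug than a failed step.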
DIFFBOT_CREATE_BULK
Tool to submit a bulk extract job to process multiple URLs with Extract APIs. Use when you need to process many URLs asynchronously using any Extract API. The job will process URLs in the background and provide downloadable results.
Input parameters
Job name for identification. Must be unique.
URLs to process. Can be a single URL or comma-separated list of URLs.
Extract API endpoint URL to use for processing (e.g., 'https://api.diffbot.com/v3/article'). The token will be added automatically.
Email address to notify when the job completes.
Webhook URL to POST to when the job completes.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
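Because the URL input accepts either a single URL or a comma-separated list, it can help to normalize both cases in one place. This is a hedged Python sketch: `to_url_list` and the keys `name`, `urls`, and `apiUrl` are assumed names for illustration, not confirmed parameter names.

```python
from urllib.parse import urlparse


def to_url_list(urls):
    """Accept one URL or an iterable of URLs and return a comma-separated string."""
    items = [urls] if isinstance(urls, str) else list(urls)
    for url in items:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return ",".join(items)


# Hypothetical step input; the key names are illustrative only.
job_input = {
    "name": "news-batch-001",
    "urls": to_url_list(["https://example.com/a", "https://example.com/b"]),
    "apiUrl": "https://api.diffbot.com/v3/article",
}
print(job_input["urls"])
```

A URL that itself contains a comma would be ambiguous in this encoding, so such URLs may need to be percent-encoded first.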
DIFFBOT_CREATE_CUSTOM_API
Tool to create or update the parameters and ruleset of a Custom API. Use this when you need to define custom extraction rules for specific websites that require tailored parsing logic beyond standard Diffbot APIs. Allows defining URL patterns, CSS selectors, extraction rules, and preprocessing filters to extract structured data from websites with unique layouts.
Input parameters
The specific API being targeted. Always precede the API name with "/api/" as in "/api/article" (except for "all")
An array of strings that can be added manually. The API automatically adds a note specifying when the API was last updated
An array of objects that defines a set of rules for the specific urlPattern-api combination
A URL that can be used to check that the rule still works as intended. This is the page that will load automatically when editing the ruleset in the Dashboard UI
Used to disable proxies (when they have been set globally), by applying the value "none"
An array of CSS selector strings that should be omitted from the DOM before extraction occurs
A regex pattern that defines the URLs for which the ruleset will be applied
Rendering options for the page (e.g., 'mobile', 'desktop')
X-Forward headers configuration.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
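Two of the inputs above follow simple rules that can be checked locally: the API name carries an "/api/" prefix (except for "all"), and the URL pattern is a regex that should match the test URL. The sketch below is illustrative Python with hypothetical helper names. It uses `re.search`, which is an assumption; how Diffbot itself anchors the pattern is not stated in this section.

```python
import re


def normalize_api_name(api):
    """Prefix the API name with '/api/' as described above, leaving 'all' unchanged."""
    if api == "all" or api.startswith("/api/"):
        return api
    return "/api/" + api.lstrip("/")


def pattern_matches(url_pattern, test_url):
    """Return True when the regex URL pattern matches somewhere in the test URL."""
    return re.search(url_pattern, test_url) is not None


print(normalize_api_name("article"))
print(pattern_matches(r"example\.com/blog/.*", "https://example.com/blog/post-1"))
```

Running the pattern against the test URL before saving the rule is a cheap way to catch an unescaped dot or a missing path segment.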
DIFFBOT_CREATE_KG_BULK_ENHANCE
Tool to submit a bulk enhance job to enrich multiple entities asynchronously. Use when you need to process many Person or Organization records in batch. The API accepts entity descriptions and returns enriched data from the Diffbot Knowledge Graph.
Input parameters
Human-readable name for the bulk job to help identify it later.
Maximum number of results to return per entity. Default is 1.
If true, Diffbot will search the web for additional origins and merge results. Default is false.
If true, Diffbot will recrawl all origins and reconstruct entities from fresh data. Default is false.
List of entities to enhance. Each entity can be a Person, Organization, or a reference by ID. Minimum 1 entity required.
JSON mode options for enhance results.
Similarity threshold for matching entities (0.0 to 1.0). Higher values require closer matches.
Webhook URL to receive notifications when the job completes.
If true, returns non-canonical facts in results. Default is false.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_DELETE_CUSTOM_API
Tool to delete custom API definitions for a given URL pattern. Removes custom extraction rules from your account. Use when you need to remove previously configured custom APIs.
Input parameters
Base API of the custom API to delete. Always precede the API name with '/api/' (e.g., '/api/article')
URL pattern (regex) of the custom API to delete. This defines which URLs the custom API applies to.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_DELETE_KG_ENHANCE_BULKJOB
Tool to delete an Enhance Bulkjob. Removes the bulk job and its results from the system. Use when cleaning up completed or failed jobs.
Input parameters
Enhance Bulkjob ID to delete (e.g., 'B-a6a72339-3af7')
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_DOWNLOAD_BULK_RESULTS
Tool to download results of a bulk enhance job with filtering options via POST request. Use this to retrieve processed results from a completed or running bulk job. Supports multiple export formats (json, jsonl, csv, xls, xlsx) and various filtering options to customize the output. HTTP 200 indicates results are ready, HTTP 201 means the job is still executing.
Input parameters
Starting index for pagination (offset). Should be used together with 'size' parameter.
Return first n results. Use this to limit the number of results returned.
Maximum number of results to return. Should be specified with 'from' parameter for pagination.
Seconds to wait for bulkjob results to export. Results will continue to export in the background. Use 0 to only trigger an export without waiting.
Semi-colon separated path filter to filter response json. You can use simple dot notation like 'skills.name' or JsonPath expressions like '$.name;$.locations.country.name'.
Export format options for bulk results.
Enhance Bulkjob ID (e.g., 'B-89cfc3b2-e744'). This is the unique identifier of the bulk job whose results you want to download.
File name of the export file. Specify a custom filename for the exported results.
The spec defines the columns to export. This is applicable for csv, xls and xlsx formats. Simple spec looks like 'name;summary'. For complex specs including list handling, see Diffbot documentation.
Prefixes the enhance query parameters to the CSV export result. Only applicable for CSV exports.
Return only records that have a match. Use this to filter out records without matches.
Semi-colon separated path filter to exclude data from response json. You can use simple dot notation or JsonPath expressions.
Separator for multi-value fields when exporting columnar results (csv, xls, xlsx formats).
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
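The status codes and the 'from'/'size' pagination described for this action can be handled with two small helpers. This is a minimal Python sketch under stated assumptions: the function names and the returned labels are hypothetical, and only the meaning of HTTP 200 and 201 and the 'from' and 'size' parameter names come from the description above.

```python
def export_state(http_status):
    """Map the documented status codes to a readable state label."""
    if http_status == 200:
        return "ready"
    if http_status == 201:
        return "still_executing"
    return "unexpected"


def page_params(page, size):
    """Translate a 0-based page number into 'from' and 'size' values."""
    if page < 0 or size <= 0:
        raise ValueError("page must be >= 0 and size must be > 0")
    return {"from": page * size, "size": size}


print(export_state(201))
print(page_params(2, 50))
```

A caller would typically retry later while the state is "still_executing" and only start paging once it is "ready".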
DIFFBOT_ENHANCE_ENTITY
Enrich a person or organization with comprehensive data from the Diffbot Knowledge Graph. Provide identifiers like name, email, employer, or URL and receive detailed entity information including employment history, education, location, skills, and more. Use when you need to gather all publicly available knowledge about a specific person or organization from billions of web pages.
Input parameters
DiffbotId of entity to enhance. Can be used with types Person and Organization. If you know the exact entity ID, this is the most precise identifier
IP address of the entity to enhance. Can be used with types Person and Organization
Origin or homepage URI of entity to enhance. Can be used with types Person and Organization. Provide multiple URLs associated with the entity
Name of the entity to enhance. Can be used with types Person and Organization. Provide multiple name variations to increase match accuracy
Maximum number of results to return (default=1). Set higher to get multiple potential matches
Type of entity to enhance.
Email address(es) of the entity to enhance. Can be used only with type Person. Provide multiple emails if available
Phone number of the entity to enhance. Can be used with types Person and Organization
Job title of the entity to enhance. Can be used only with type Person
Semi-colon separated path filter to include only specific fields in response JSON. Use dot notation (e.g., 'skills.name') or JsonPath expressions (e.g., '$.name;$.locations.country.name'). Reduces response size
School or educational institution of the entity to enhance. Can be used only with type Person
If true, Diffbot will attempt to search the web for origins for the search query and merge relevant results with what's found in the KG (default=false). Useful when entity might not be in the KG yet
If true, Diffbot will attempt to recrawl all origins of the identified entity and reconstruct the entity from refreshed data (default=false). This provides the most up-to-date information but takes longer
User-defined ID for correlation and tracking purposes. Will be returned in the response for request matching
Employer of the entity to enhance. Can be used only with type Person. Helps identify the correct person when names are common
JSON mode for response formatting.
Location of the entity to enhance. Can be used with types Person and Organization. Can be city, state, country, or full address
Enhance similarity threshold (0.0 to 1.0). Only return matches with similarity score above this threshold. Higher values return fewer but more confident matches
Description of the entity to enhance. Can be used with types Person and Organization. Helps identify the correct entity
Semi-colon separated path filter to exclude specific fields from response JSON. Use dot notation or JsonPath expressions. Useful to remove verbose fields
If true, returns non-canonical facts in addition to canonical facts (default=false). Non-canonical facts are alternative values found across multiple sources
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
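Several identifiers above (email, job title, school, employer) can be used only with type Person. The following Python sketch shows one way to enforce that rule before calling the action. The function name and the keys `email`, `title`, `school`, and `employer` are assumptions for illustration, since the actual parameter names are not shown in this section.

```python
PERSON_ONLY_FIELDS = {"email", "title", "school", "employer"}
VALID_TYPES = ("Person", "Organization")


def build_enhance_input(entity_type, **identifiers):
    """Return a step input dict, rejecting Person-only identifiers on other types."""
    if entity_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {entity_type!r}")
    if entity_type != "Person":
        misplaced = sorted(PERSON_ONLY_FIELDS.intersection(identifiers))
        if misplaced:
            raise ValueError(f"{misplaced} can be used only with type Person")
    return {"type": entity_type, **identifiers}


# Hypothetical identifiers; the names and values are placeholders.
person = build_enhance_input("Person", name="Ada Example", employer="Example Corp")
print(person)
```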
DIFFBOT_EXTRACT_JOB
Tool to extract structured job posting data from job listing pages. Returns job title, company, location, salary, requirements, skills, and other job-related information. Use when you need to parse and structure data from job postings.
Input parameters
Target URL of the job listing page to extract structured data from
IP address of a custom proxy that will be used to fetch the target page. Leave empty to use default proxy
Comma-separated list of optional fields to be returned from any fully-extracted pages (e.g. 'querystring,links'). Valid values: links, extlinks, meta, querystring, breadcrumb
Maximum time in milliseconds to wait for the retrieval/fetch of content from the requested URL. Default is 30000 (30 seconds)
Set to 'default' to use Diffbot's datacenter proxy for this request. Set to 'none' to instruct Extract to not use proxies
Authentication parameters for the custom proxy specified in the proxy parameter (format: username:password)
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_EXTRACT_LIST
Tool to extract structured data from list-style pages like news indexes, product listings, and directory pages. Returns an array of items with their titles, links, and descriptions. Use when you need to extract multiple items from a page organized as a list or index.
Input parameters
Target URL to extract list data from. Must be a valid URL starting with http or https.
Specify an IP address of a custom proxy that will be used to fetch the target page.
Comma-separated list of optional fields to be returned from any fully-extracted pages (e.g., 'links,meta,querystring'). Valid values: links, extlinks, meta, querystring, breadcrumb.
Sets a value in milliseconds to wait for the retrieval/fetch of content from the requested URL. The default timeout for the third-party response is 30 seconds (30000).
Set to 'default' to use Diffbot's datacenter proxy for this request. 'none' will instruct Extract to not use proxies, even if proxies have been enabled for this particular URL globally.
Authentication parameters to use with the custom proxy specified in the proxy parameter.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_ACCOUNT
Retrieves comprehensive Diffbot account information including subscription plan details, credit balance, usage history, and account status. Returns account holder name, email, current plan, available credits, and daily usage statistics for the past 31 days. Use this to check your account's credit balance, monitor API usage patterns, verify account status, or retrieve account metadata.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_ANALYZE
Automatically analyzes a web page to determine its type and extract structured data. The Analyze API intelligently classifies pages into types (article, product, discussion, image, video, organization, etc.) and extracts relevant structured data. Use this when you need to process URLs of unknown type or want automatic extraction without specifying the page type in advance.
Input parameters
The full URL of the page to analyze, including http:// or https://
Restrict extraction to a specific page type. Options: article, product, discussion, image, video, list, event. If not specified, all types are considered.
Comma-separated list of additional fields to include or limit output fields. Options include: links, extlinks, meta, querystring, breadcrumb, or specific field names to limit output.
Maximum time (in milliseconds) to wait for API response. Default is 30000ms (30 seconds). Maximum is 300000ms (5 minutes).
API to use if page type cannot be determined. Options: article, product, discussion, image, video.
Set to false to disable automatic extraction of comments/discussions from the page. Default is true.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
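The timeout input above has a documented default of 30000 ms and a maximum of 300000 ms. A small Python sketch of how a caller might apply those bounds follows. The function name is hypothetical, and clamping to the maximum is a design choice made here; what the API does with a larger value is not stated in this section.

```python
DEFAULT_TIMEOUT_MS = 30_000   # documented default (30 seconds)
MAX_TIMEOUT_MS = 300_000      # documented maximum (5 minutes)


def resolve_timeout(timeout_ms=None):
    """Return the timeout to send, falling back to the default and capping at the maximum."""
    if timeout_ms is None:
        return DEFAULT_TIMEOUT_MS
    if timeout_ms <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")
    return min(timeout_ms, MAX_TIMEOUT_MS)


print(resolve_timeout())
print(resolve_timeout(600_000))
```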
DIFFBOT_GET_ARTICLE
Tool to extract information from articles, including authors, publication dates, and images. Use when you need structured metadata from a web article URL.
Input parameters
Full URL of the web page to analyze, must start with http or https
Extraction mode override (defaults to 'article')
Whether to include statistics like word count
List of specific fields to include in the response. If provided, only these fields are returned.
Paging token for multi-page articles (returned in previous response)
Maximum time in milliseconds to wait for page rendering
Whether to include discussion/comment data in the response
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_BULK_DATA
Tool to download extracted results from a completed bulk job. Use after a bulk job has finished processing to retrieve the data. Supports JSON and CSV formats.
Input parameters
Limit results to the N most recently processed URLs. Useful for testing or sampling large result sets.
Name of the bulk job whose results you want to download. Must match the job name specified when the job was created.
Type of data to retrieve.
Output format for bulk job data.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_BULK_JOB_STATUS
Tool to poll the status of a specific Diffbot Knowledge Graph Enhance bulk job. Use when you need to check the progress, completion status, or details of a bulk enhancement job.
Input parameters
Enhance Bulkjob ID to poll status for
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_BULK_RESULTS
Tool to download the results of a completed Enhance Bulkjob. Returns enriched records from the bulk job. Use after a bulk enhance job has completed processing.
Input parameters
Starting index for pagination (use with size parameter)
Return first n results
Maximum number of results to return. Should be specified with from parameter. Use -1 for all results.
Seconds to wait for bulkjob results to export. Results will continue to export in the background. Use 0 to only trigger an export without waiting.
Semi-colon separated path filter to filter response json. Use dot notation like 'skills.name' or JsonPath expressions like '$.name;$.locations.country.name'.
Export format options.
Enhance Bulkjob ID to retrieve results for
File name of the export file
Defines the columns to export for csv, xls and xlsx formats. Use semi-colon separated field paths like 'name;summary' or JsonPath expressions. See Diffbot documentation for advanced syntax.
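Whether to prefix the enhance query parameters to the CSV export result (exportquery). Only applicable for CSV exports.
Whether to return only records that have a match (onlyMatches).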
Semi-colon separated path filter to exclude data from response json. Use dot notation or JsonPath expressions.
Separator for multi-value fields when exporting columnar results
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_BULK_SINGLE_RESULT
Tool to download the result of a single job within a Diffbot bulk enhance job. Returns enriched entity data for a specific input record by its index. Use after a bulk enhance job has completed to retrieve individual results without downloading the entire dataset.
Input parameters
Job index within the bulkjob (0-based index of the input record)
Enhance Bulkjob ID (e.g., 'B-1ff60452-8421')
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_CRAWL_DATA
Download extracted results from a completed crawl job. Returns all structured data extracted during crawl processing (articles, products, etc.). Use after a crawl job has completed to retrieve the collected data.
Input parameters
Maximum number of results to return. Omit to return all available results.
Name of the crawl job to retrieve data from. Must be an existing crawl job created via Start Crawl.
Output format for crawl data.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_DISCUSSION
Extract structured discussion threads from web pages including forums, comment sections, product reviews, Reddit discussions, and blog comments. Returns posts with author info, timestamps, content, and hierarchical relationships. Useful for analyzing conversations, gathering feedback, or monitoring discussions. Supported platforms: native comment systems, Disqus, Facebook Comments, Reddit, forum software, and more. Use this when you need to:
- Extract all comments/posts from a discussion thread
- Analyze user feedback or reviews
- Monitor forum discussions or social media threads
- Gather structured conversation data with metadata
Input parameters
The URL of the discussion page to process (e.g., forum thread, Reddit discussion, product review page, or comment section).
Comma-separated list of additional fields to include in the response (e.g., 'sentiment', 'links', 'meta', 'breadcrumb').
Maximum time in milliseconds to wait for the response. Default is 30000 (30 seconds).
Maximum number of pages to concatenate. Default is 1 (no concatenation). Set to 'all' to retrieve all pages in the thread. Each page counts as a separate API call.
Whether to disable full page rendering. Set to True for faster responses with potentially lower extraction quality. Default is False.
Whether to extract comments/reviews. Set to False to disable comment extraction for faster response times. Default is True.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_EVENT
Tool to extract event details from web pages. Use when you need structured event data such as venue, date, and description.
Input parameters
URL of the event page to analyze
Comma-separated list of fields to return, e.g., title,date,location
Enable automatic paging of results
Maximum timeout in milliseconds for the API call
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_IMAGE
Tool to extract detailed information about images, including dimensions and recognition data. Use after confirming the image URL is publicly accessible.
Input parameters
Publicly-accessible URL of the image to analyze
Comma-separated list or array of specific fields to include in response, e.g., 'naturalWidth','captions'
Whether to include paging information for multi-image responses
Maximum time to wait for API response, in milliseconds
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_KG_COVERAGE_REPORT_BY_ID
Download Knowledge Graph coverage report by report ID. Returns detailed CSV coverage statistics showing field presence across query results. Use this after generating a coverage report from a DQL query to retrieve the statistical breakdown of field coverage.
Input parameters
Report ID to retrieve. Format: C-<hash> (e.g., 'C-a0b39dad-68bf'). Reports may expire quickly, so use immediately after generation.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_PRODUCT
Tool to extract product information such as specifications, prices, availability, and reviews. Use when you need structured product data from a product page URL.
Input parameters
URL of the product page to analyze
Extraction mode override (defaults to 'product')
List of fields to return, e.g., title,offerPrice,images
Enable automatic paging of results
Maximum timeout in milliseconds for the API call
Include discussions/comments in the response
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_GET_VIDEO
Tool to extract information from videos, including titles, descriptions, and embedded HTML. Use when you need structured video metadata from any web page.
Input parameters
Full URL of the web page to analyze for embedded videos, must start with http or https
Extraction mode override (e.g., 'auto')
Comma-separated list or array of optional fields to include in the response (e.g., 'links', 'meta', 'querystring', 'breadcrumb'). Standard fields are always returned.
Whether to return all detected results in one call (may increase runtime)
Maximum time in milliseconds to wait for extraction
Name of the JSONP callback function (if using JSONP)
Whether to try an alternate extraction method if the primary fails
Include user discussion data (comments) if available
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_LIST_BULK_JOBS
Tool to list all Bulk jobs associated with a specific token. Use after authenticating to retrieve statuses of all jobs for the account.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_LIST_BULK_JOBS_STATUS_FOR_TOKEN
Tool to get the status of all bulk enhance jobs for a token. Returns a list of all bulk jobs associated with your API token. Use when you need to monitor or retrieve the status of multiple bulk jobs at once.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_LIST_CUSTOM_APIS
Tool to retrieve all Custom APIs and their extraction rules currently defined on your Diffbot token. Use when you need to list, review, or audit custom API configurations for your account.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_MANAGE_CRAWL
Manages Diffbot crawl jobs: pause, restart, delete, or view status. Returns a list of all active crawl jobs when called without parameters. Use the 'name' parameter with action flags (pause=1, restart=1, delete=1) to control specific jobs.
Input parameters
Unique identifier of the crawl job to manage. Omit to list all active crawl jobs.
Set to 1 to pause the specified crawl job. Requires 'name' parameter.
Set to 1 to delete the specified crawl job. Requires 'name' parameter.
Set to 1 to restart the specified crawl job. Requires 'name' parameter.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
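The flag rules above (a flag set to 1, always paired with 'name', or no parameters to list all jobs) can be captured in one helper. This is an illustrative Python sketch: the function name is hypothetical, while the parameter names `name`, `pause`, `restart`, and `delete` come from the description of this action.

```python
CRAWL_ACTIONS = ("pause", "restart", "delete")


def crawl_params(name=None, action=None):
    """Build the input for a manage-crawl step from a job name and one optional action."""
    if action is None:
        # No action flag: view one job when a name is given, otherwise list all jobs.
        return {"name": name} if name else {}
    if action not in CRAWL_ACTIONS:
        raise ValueError(f"action must be one of {CRAWL_ACTIONS}, got {action!r}")
    if not name:
        raise ValueError(f"'{action}' requires the 'name' parameter")
    return {"name": name, action: 1}


print(crawl_params())
print(crawl_params("blog-crawl", "pause"))
```

Accepting a single `action` argument instead of three separate flags makes it impossible to request, say, pause and delete in the same call.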
DIFFBOT_RESOLVE_LOST_ID
Tool to resolve lost IDs in the Knowledge Graph. Use when you need to map a lost identifier to its canonical counterpart for data consistency.
Input parameters
The type of object (e.g., 'article', 'product'). If omitted, Diffbot will attempt to infer.
The lost ID which needs to be resolved to a canonical ID.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_SEARCH
Search the Diffbot Knowledge Graph using DQL (Diffbot Query Language). Query billions of entities including organizations, people, articles, products, and more. Use structured queries to filter by type, fields, and relationships.
Input parameters
Comma-separated list of custom crawl collections to search (default="all"). Only used when query_type="crawl".
Maximum number of results to return (default=50). Use -1 to return all results. Constraint: from+size ≤ 10,000 for facet queries.
DQL (Diffbot Query Language) query to search the Knowledge Graph. Use structured syntax to filter entities and documents.
Path filter to include a specific field in the response JSON, using dot notation or JsonPath. Only a single path is accepted by the Diffbot API.
Output format. Only "json" is supported by this action. Non-JSON formats (jsonl, csv, xls, xlsx) cannot be processed and will be ignored.
Starting index for pagination (API param 'from'; default=0). Constraint: from+size ≤ 10,000 for facet queries.
JSON mode: "extended" (includes origin info) or "id" (returns diffbotIds only).
Type of query: "query" (default, structured DQL), "text" (free-text search), "queryTextFallback" (tries DQL first, falls back to text), or "crawl" (search custom crawl collections).
Path filter to exclude a specific field from the response JSON. Only a single path is accepted by the Diffbot API.
Include non-canonical facts in results (default=false).
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
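The pagination inputs above carry a constraint: from+size must not exceed 10,000 for facet queries. The Python sketch below generates page windows that respect that limit. The function name is hypothetical, and applying the limit to every query (not only facet queries) is a conservative assumption made for this example.

```python
MAX_WINDOW = 10_000  # documented from+size limit for facet queries


def page_windows(total, size=50):
    """Yield (from, size) pairs covering up to `total` results within the window limit."""
    if size <= 0:
        raise ValueError("size must be positive")
    limit = min(total, MAX_WINDOW)
    start = 0
    while start < limit:
        step = min(size, limit - start)
        yield start, step
        start += step


for offset, count in page_windows(120, size=50):
    print(offset, count)
```

Each pair maps onto the 'from' and size inputs of one search step, and the last window is shortened so that the sum never passes the limit.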
DIFFBOT_SEARCH_CRAWL_DATA
Tool to query crawl job collections using DQL (Diffbot Query Language). Use when you need to search extracted data from completed crawl or bulk jobs by collection name.
Input parameters
Name of the collection (Crawl or Bulk job name) to search
Number of results to return per page (default varies by API)
Search query string using Diffbot Query Language (DQL). Supports operators like type:Article, sortby:date, etc.
Pagination offset - number of results to skip (default=0)
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_START_BULK
Tool to start a Bulk Extract job. Use when processing large numbers of URLs asynchronously. The Diffbot Bulk API uses GET requests with query parameters to create jobs.
Input parameters
Unique job name for identification. Required by the Diffbot API.
List of page URLs to process. URLs should be separated by whitespace in the API call.
Diffbot Extract API endpoint to use (e.g., 'https://api.diffbot.com/v3/article'). Do NOT include the token - it will be added automatically.
Maximum number of URLs to crawl.
Email to notify when job completes.
Maximum number of URLs to process.
Webhook URL to POST on job completion.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
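Since this action sends its inputs as GET query parameters with whitespace-separated URLs, the encoding step can be sketched with the Python standard library. The function name and the query keys `name`, `urls`, and `apiUrl` are assumptions for illustration. The token is deliberately left out because, as noted above, it is added automatically.

```python
from urllib.parse import parse_qs, urlencode


def bulk_query_string(name, urls, api_url):
    """Encode a bulk job as a query string with whitespace-separated URLs."""
    if not name:
        raise ValueError("a unique job name is required")
    if any(ch.isspace() for url in urls for ch in url):
        raise ValueError("individual URLs must not contain whitespace")
    params = {"name": name, "urls": " ".join(urls), "apiUrl": api_url}
    return urlencode(params)


query = bulk_query_string(
    "docs-batch",
    ["https://example.com/a", "https://example.com/b"],
    "https://api.diffbot.com/v3/article",
)
decoded = parse_qs(query)
print(decoded["urls"][0].split())
```

Decoding the string again with `parse_qs` is a quick check that the URL list survives the round trip unchanged.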
DIFFBOT_START_CRAWL
Initiates a Diffbot crawl job that spiders a website starting from seed URLs and processes discovered pages with a specified Extract API. The crawler follows links within the domain, collects structured data (articles, products, etc.), and stores results for download. Use this to systematically extract data from entire websites or sections. Requires Diffbot Plus plan or higher.
Input parameters
Unique identifier for the crawl job. Used to manage and retrieve the crawl.
List of seed URLs where crawling will begin. URLs will be URL-encoded automatically.
Full Diffbot Extract API endpoint URL to process crawled pages. Examples: 'https://api.diffbot.com/v3/article' for articles, 'https://api.diffbot.com/v3/product' for products, 'https://api.diffbot.com/v3/analyze' for automatic type detection.
Number of days between automatic crawl repeats. Use 7.0 for weekly, 1.0 for daily. Omit for one-time crawl.
Delay in seconds between requests to the same IP address. Default is 0.25 seconds.
Maximum number of pages to crawl/spider. Default is 100,000. Use -1 for unlimited.
Whether to respect robots.txt directives. 1 = obey (default), 0 = ignore.
Email address to notify when the crawl completes.
Maximum number of pages to process with the Extract API. Default is 100,000. Use -1 for unlimited.
Custom HTTP headers to include in crawl requests (e.g., {'User-Agent': 'MyBot/1.0'}).
Webhook URL to POST to when the crawl completes.
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_STOP_BULK_JOB
Tool to pause (stop) a running Bulk job. Pausing halts further processing of URLs while preserving existing progress. To resume, use the appropriate resume action. Specify the exact job name (case-sensitive) as provided when the job was created.
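As an illustration of the documented defaults for DIFFBOT_START_CRAWL, here is a small sketch that collects the crawl inputs and validates them before use. `CrawlConfig` and its field names are hypothetical and are not the action's actual schema. Only the default values (0.25 seconds, 100,000 pages, obey robots.txt) come from the parameter descriptions above.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class CrawlConfig:
    """Hypothetical container for the crawl inputs described above."""
    name: str
    seeds: List[str]
    api_url: str
    repeat_days: Optional[float] = None  # None = one-time crawl
    crawl_delay: float = 0.25            # seconds, documented default
    max_to_crawl: int = 100_000          # documented default; -1 = unlimited
    max_to_process: int = 100_000        # documented default; -1 = unlimited
    obey_robots: int = 1                 # 1 = obey (default), 0 = ignore

    def __post_init__(self) -> None:
        if not self.name or not self.seeds:
            raise ValueError("name and at least one seed URL are required")
        if self.obey_robots not in (0, 1):
            raise ValueError("obey_robots must be 0 or 1")
        if self.crawl_delay < 0:
            raise ValueError("crawl_delay cannot be negative")
        for limit in (self.max_to_crawl, self.max_to_process):
            if limit < -1:
                raise ValueError("page limits must be -1 (unlimited) or >= 0")

    def to_inputs(self) -> Dict[str, object]:
        """Return the inputs as a dict, dropping optional fields left unset."""
        return {key: value for key, value in asdict(self).items()
                if value is not None}


weekly = CrawlConfig(
    name="docs-weekly",
    seeds=["https://example.com/docs"],
    api_url="https://api.diffbot.com/v3/analyze",
    repeat_days=7.0,
)
print(weekly.to_inputs())
```

Leaving `repeat_days` unset drops it from the inputs, which matches the "omit for one-time crawl" guidance.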
Input parameters
Name of the Bulk job to pause/stop (as defined when creating the job)
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
DIFFBOT_STOP_KG_BULK_JOB_BY_ID
Tool to stop an active Knowledge Graph Enhance bulk job by its ID. Halts processing of a running KG bulk job immediately. Use when you need to stop a specific KG bulk job using its bulkjobId.
Input parameters
The unique identifier of the Knowledge Graph Enhance bulk job to stop (e.g., 'B-1ff60452-8421')
Output
Data from the action execution
Error if any occurred during the execution of the action
Whether the action execution was successful
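Every action above lists the same three outputs: the data, an error, and a success flag. The sketch below shows one way a downstream step could handle that envelope. The key names `data`, `error`, and `successful` are assumptions, since this page gives only the descriptions, and `unwrap_result` and `ActionFailed` are hypothetical names.

```python
from typing import Any, Dict


class ActionFailed(Exception):
    """Raised when an action reports that it did not succeed."""


def unwrap_result(result: Dict[str, Any]) -> Any:
    """Return the action's data, or raise with the reported error."""
    if not result.get("successful", False):  # assumed key name
        message = result.get("error") or "action failed without an error message"
        raise ActionFailed(message)
    return result.get("data")  # assumed key name


ok = {"successful": True, "error": None, "data": {"jobs": []}}
print(unwrap_result(ok))  # {'jobs': []}
```

Checking the success flag first means a later step never reads data from a failed call.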
No publicly available marketplace agent uses this tool yet. On Nagent, 38 privately built agents already use Diffbot.
Build on Nagent
Connect Diffbot to any Nagent agent in minutes — no API key management, no boilerplate. Just configure and deploy.
The five questions agent builders ask before adopting a new integration.
Open the External Integrations panel inside Nagent (app.nagent.ai/externalIntegration), find Diffbot, and click "Connect Now." You'll authenticate with an API key — Nagent handles credential storage and refresh automatically. Once connected, Diffbot is available to any agent in your workspace.
No. Nagent provides no-code integration for every tool. Once Diffbot is connected, you configure its 35 actions directly in the agent builder UI — no API calls, no boilerplate, no schema management.
Helix — Nagent's agentic agent builder — lets you drop Diffbot steps into any workflow visually. Pick an action (e.g., one of those listed above), fill in the inputs (Helix knows the required vs. optional schema for each parameter), and connect it to upstream/downstream steps. Triggers run as the entry point of an agent, so when a Diffbot event fires, the agent kicks off automatically.
Every Diffbot action and trigger ships with a fully-typed schema — input parameters with name, type, required flag, and description, plus the output payload shape. The schemas are documented in the sections above. Helix uses these schemas to validate your configuration at build time and to type-check the data flowing between steps.
Yes. While Diffbot ships with 35 pre-built artificial intelligence actions, you can layer custom logic around them inside Helix — pre/post-processing steps, conditional branches, retries, or stitching Diffbot together with other connected tools. For deeper customization, talk to our team about Nagent's Agentic AI Lab — forward-deployed engineers who build Diffbot-based workflows tailored to your business.