Crawling and indexing

To index pages for related link selection, Graphite’s bot crawls the pages listed in the sitemap. Its user agent is GraphiteBot/1.0 (+https://www.graphitehq.com). The daily crawl starts at 00:00 UTC and proceeds at 60-240 pages per minute.
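
The crawl rate above gives a rough bound on how long a daily crawl takes. The following is a minimal sketch of that arithmetic, assuming the rate stays within 60-240 pages per minute for the whole run; the function name and the 12,000-page sitemap are hypothetical.

```python
MIN_PAGES_PER_MINUTE = 60   # slowest crawl rate stated above
MAX_PAGES_PER_MINUTE = 240  # fastest crawl rate stated above


def crawl_minutes_range(page_count: int) -> tuple[float, float]:
    """Return the (fastest, slowest) crawl duration in minutes for a sitemap."""
    fastest = page_count / MAX_PAGES_PER_MINUTE
    slowest = page_count / MIN_PAGES_PER_MINUTE
    return fastest, slowest


# Hypothetical sitemap of 12,000 pages.
fastest, slowest = crawl_minutes_range(12_000)
print(f"{fastest:.0f}-{slowest:.0f} minutes")  # prints "50-200 minutes"
```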

API

The links for a specific page can be retrieved from the API.

Host

URL: https://api.graphitehq.com/il/{{CLIENT}}/

Endpoints

{{PAGE_TYPE}}/related-links/

Description: Returns a list of related links for a page. If the application can’t find related links, it returns randomly selected links from the index.
URL: https://api.graphitehq.com/il/{{CLIENT}}/{{PAGE_TYPE}}/related-links
Method: GET
Allowed Cross-Origin Resource Sharing: True
Special Headers Required: None
HTTP Authentication: None
Input Parameters: Query Strings

Parameters

Query String Parameters

Links for a page can be retrieved using the page’s canonical URL.

url
- Description: Unique URL of a page
- Required: Yes
- Notes:
  - Index document IDs are derived from unique page URLs, so a request by URL resolves to an ID-based lookup in the API index. It also makes it possible to trigger an immediate side crawl of a new page that has not been indexed yet.
- Example Request URL: https://api.graphitehq.com/il/{{CLIENT}}/{{PAGE_TYPE}}/related-links?url={{URL}}
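
Because the url parameter is itself a full URL, it usually needs percent-encoding before it is placed in the query string. The sketch below shows one way to build the request URL with the Python standard library. The function name and the sample client, page type, and page URL are hypothetical, and it assumes the API accepts a standard percent-encoded value.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.graphitehq.com/il"


def build_related_links_url(client: str, page_type: str, page_url: str) -> str:
    """Return the related-links request URL for one page.

    `client` and `page_type` stand in for the {{CLIENT}} and {{PAGE_TYPE}}
    placeholders; this helper is illustrative, not part of the API.
    """
    # urlencode percent-encodes ':', '/', '?' and '=' in the page URL so it
    # travels as a single `url` parameter value.
    query = urlencode({"url": page_url})
    return f"{BASE_URL}/{client}/{page_type}/related-links?{query}"


# Hypothetical values for illustration only.
print(build_related_links_url("acme", "article", "https://www.example.com/blog/post?id=1"))
```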

Schema

Status: 200


{
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "description": "Successful response from the related-links/ API endpoint.",
    "required": [
        "message",
        "related_links"
    ],
    "properties": {
        "message": {
            "type": "string",
            "description": "Response results description."
        },
        "related_links": {
            "type": "array",
            "description": "Related links array containing related links to a single page.",
            "items": {
                "type": "object",
                "description": "Related link object containing data from a single related link to a page.",
                "required": [
                    "type",
                    "title",
                    "url",
                    "url_path"
                ],
                "properties": {
                    "type": {
                        "type": "string",
                        "description": "Link type: 'related' if the link was selected using related selection logic, or 'random' if it was selected uniformly at random without replacement."
                    },
                    "title": {
                        "type": "string",
                        "description": "Page title."
                    },
                    "url": {
                        "type": "string",
                        "description": "Page URL."
                    },
                    "url_path": {
                        "type": "string",
                        "description": "Page URL path."
                    }
                }
            }
        }
    }
}

Other fields available in the API index (see Available Link Fields) can be added to each link object, if desired.

Status: 4XX, 5XX


{
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "description": "Error response from the related-links/ API endpoint.",
    "required": [
        "message"
    ],
    "properties": {
        "message": {
            "type": "string",
            "description": "Error message."
        }
    }
}
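
A client can check a response against the two schemas above before using it. The following is a minimal sketch that uses only the standard library rather than a JSON Schema validator; the function name is hypothetical, and it assumes error responses always carry a JSON body with a message field, as the error schema describes.

```python
import json

# Fields the success schema marks as required on every link object.
LINK_REQUIRED_FIELDS = ("type", "title", "url", "url_path")


def parse_related_links(status: int, body: str) -> list[dict]:
    """Return the related_links array, or raise ValueError describing the problem."""
    payload = json.loads(body)
    if not isinstance(payload, dict) or not isinstance(payload.get("message"), str):
        raise ValueError("response is missing the required 'message' string")
    if status != 200:
        # 4XX and 5XX responses only guarantee the 'message' field.
        raise ValueError(f"API error {status}: {payload['message']}")
    links = payload.get("related_links")
    if not isinstance(links, list):
        raise ValueError("response is missing the required 'related_links' array")
    for link in links:
        if not isinstance(link, dict):
            raise ValueError("each related link must be an object")
        missing = [f for f in LINK_REQUIRED_FIELDS if not isinstance(link.get(f), str)]
        if missing:
            raise ValueError(f"related link is missing string fields: {missing}")
    return links
```

Extra fields on a link object are accepted, since the schema does not forbid additional properties.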

Example call

Request

cURL

curl --location --request GET 'https://api.graphitehq.com/il/{{CLIENT}}/{{PAGE_TYPE}}/related-links?url={{URL}}'

JavaScript Fetch

var requestOptions = {
  method: 'GET',
  redirect: 'follow'
};

fetch("https://api.graphitehq.com/il/{{CLIENT}}/{{PAGE_TYPE}}/related-links?url={{URL}}", requestOptions)
  .then(response => response.text())
  .then(result => console.log(result))
  .catch(error => console.log('error', error));

Python

import requests

# Pass the page URL through `params` so that requests URL-encodes it.
endpoint = "https://api.graphitehq.com/il/{{CLIENT}}/{{PAGE_TYPE}}/related-links"
response = requests.get(endpoint, params={"url": "{{URL}}"})
print(response.text)

Response

{{EXAMPLE_RESPONSE}}

{
    "message": "...",
    "related_links": [
        {
            "type": "related",
            "title": "...",
            "url": "...",
            "url_path": "..."
        },
        ...
        {
            "type": "related",
            "title": "...",
            "url": "...",
            "url_path": "..."
        }
    ]
}
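
Since the endpoint falls back to randomly selected links when it finds no related ones, a caller may want to treat the two kinds differently. A minimal sketch, assuming the type field only takes the values 'related' and 'random' described in the schema; the function name and sample data are hypothetical.

```python
def split_by_type(links: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate links chosen by related selection logic from random fallbacks."""
    related = [link for link in links if link.get("type") == "related"]
    random_fill = [link for link in links if link.get("type") == "random"]
    return related, random_fill


# Hypothetical response data for illustration only.
sample = [
    {"type": "related", "title": "A", "url": "https://www.example.com/a", "url_path": "/a"},
    {"type": "random", "title": "B", "url": "https://www.example.com/b", "url_path": "/b"},
]
related, random_fill = split_by_type(sample)
print(len(related), len(random_fill))  # prints "1 1"
```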

Index

Available Link Fields
The index has several fields with page information obtained from crawling. All of these fields are available for export through the related-links/ endpoint response:

text (text): Page plain text content
title (text): Page title
url (text): Page canonical URL
url_path (text): Page URL path
{{ADDITIONAL_FIELDS_FROM_INDEX}}

Current Response Link Fields

title
type (added when processing the API request)
url
url_path
{{ADDITIONAL_FIELDS_FROM_INDEX}}

Uptime and latency

The API is built on standard AWS services. As of May 27, 2022, we have had no major outages and 99.9% uptime. The average response time is approximately 150 ms.

Request rate limits

The API endpoints do not enforce request rate limits; however, we encourage keeping traffic under 20 requests/second per endpoint. At that rate, updating data for a set of 10k pages takes about 500 seconds, so less than 10 minutes.

When retrieving data in batches, API users should plan their jobs around the number of pages and the data update period, which can vary from one day to one week.
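
A batch job can pace itself to stay at or below the suggested 20 requests/second. The following is a minimal client-side sketch; the function names are hypothetical, and the clock and sleep parameters exist only so the pacing can be exercised without waiting.

```python
import time
from typing import Callable, Iterable, Iterator

MAX_REQUESTS_PER_SECOND = 20  # suggested ceiling per endpoint


def minimum_batch_seconds(page_count: int, rate: float = MAX_REQUESTS_PER_SECOND) -> float:
    """Shortest time a batch can take while staying at or below `rate`."""
    return page_count / rate


def throttled(
    items: Iterable[str],
    rate: float = MAX_REQUESTS_PER_SECOND,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield items no faster than `rate` per second."""
    interval = 1.0 / rate
    next_slot = clock()
    for item in items:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)  # wait until the next request slot opens
        next_slot = max(next_slot, clock()) + interval
        yield item


print(minimum_batch_seconds(10_000))  # prints 500.0
```

A caller would wrap its list of page URLs in throttled(...) and issue one request per yielded item.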

Caching

Server-side rendering with caching is strongly recommended. The results can be cached using the endpoint URL with query string parameters as the key.

The links are updated at most daily, so a one-day TTL is appropriate.
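
The caching approach described here can be sketched as a small in-memory cache keyed by the full request URL with a one-day TTL. This is an illustration under stated assumptions, not a production cache: the class name and the fetch callable are hypothetical, and a real deployment would more likely use a shared store.

```python
import time
from typing import Callable

ONE_DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """In-memory cache keyed by the endpoint URL including its query string."""

    def __init__(self, ttl_seconds: float = ONE_DAY_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get_or_fetch(self, request_url: str, fetch: Callable[[str], str]) -> str:
        """Return the cached body for `request_url`, fetching it if absent or stale."""
        now = self._clock()
        entry = self._entries.get(request_url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the API call
        body = fetch(request_url)
        self._entries[request_url] = (now, body)
        return body
```

Here fetch is whatever function performs the GET request and returns the response body, so the cache stays independent of the HTTP client.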

GraphiteGrowth™

Enterprise API general documentation