Standard API general documentation

Consult this page if you are running, or considering running, Graphite's Standard Internal Links API.

API Host

URL: https://ilapi.graphite.io

Internal Links API endpoints

Related links endpoint

This endpoint delivers a list of related links from a source webpage to other target webpages within the same website. The source and target pages can be of the same type (e.g., blog-to-blog links) or different types (e.g., blog-to-product links).

Prerequisites

  • Upon the client's or API user's request, Graphite creates a Related links endpoint
  • The client specifies the sets of source and target pages for which related links will be built
    • For more information about source and target sets, and how to define them on your site, please review our API definitions
  • Graphite then performs an initial crawl to gather the data that will be used to calculate related links or be returned through the endpoint response
  • Following this, Graphite carries out the initial computation of related links
  • The links for pages are chosen to maximize relatedness, ensuring that each page receives 'k' incoming links on average. The semantic similarity of their text determines the relatedness between the pages.
  • The endpoints are fully functional once the data is collected and the links are computed

Request

  • Endpoint Path: /<client_id>/<page_type_relation>/related-links
  • HTTP Method: GET
  • HTTP Authentication: None
  • Required Headers: None
  • CORS: Yes, accessible from all origins without the need for authentication or headers
  • Input Parameters: Query Strings

Path parameters

  • client_id: (Required) A Graphite client (or API user) ID
    Provided by Graphite
  • page_type_relation: (Required) A label that describes the relationship between the source and target pages, for example, “blog” for same-type pages or “blog-to-product” for different types
    Provided by Graphite

📘

Informational API Endpoints

Check the Informational API Endpoints section to learn about available Internal Links API endpoints and their details for a certain client_id

Query string parameters

url

(Required) A canonical or unique webpage URL
The API derives each webpage's ID from its unique URL and uses that ID to look the page up in the API index.

Valid URLs:

  • URLs with a scheme, e.g., https://example.com/example/path.html
  • URLs without a scheme, e.g., example.com/example/path.html
  • A URL path (technically not a URL but valid for the API request). It must start with a slash (/), which indicates the root folder, e.g., /example/path.html
  • Encoded URLs, e.g., https%3A%2F%2Fexample.com%2Fexample%2Fpath.html
  • URLs with query string parameters must be encoded, e.g., https%3A%2F%2Fexample.com%2Fexample%2Fpath.html%3Fq%3Dv

Reordering the query string parameters of a URL usually doesn't change the data it leads to; however, the API treats each distinct query string as a unique webpage, even when several of them point to the same data. Therefore, URLs with query string parameters should be standardized before they are sent to the API.
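One possible way to standardize such URLs is to sort their query string parameters before calling the API. The sketch below is only an illustration: the function name standardize_url and the choice of alphabetical sorting are assumptions, not a rule defined by Graphite, and the result must still match the URL form that was crawled for your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def standardize_url(url: str) -> str:
    """Return the URL with its query string parameters in sorted order.

    Hypothetical helper: sorting is one convention among many. Note that
    urlencode re-encodes values (for example, spaces become '+').
    """
    parts = urlsplit(url)
    # parse_qsl decodes the query into (name, value) pairs.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # Rebuild the URL with every other component left untouched.
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With this convention, two URLs that differ only in parameter order map to the same string, so they resolve to the same page ID.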

Example usage:

https://ilapi.graphite.io/example-client/example-section/related-links?url=https%3A%2F%2Fexample-client.com%2Fexample-section%2Feu-wants-to-see-if-lawmakers-will-block-brexit-before-striking-new-deal-uk-s-johnson

example-client: client_id as input
example-section: page_type_relation as input
The encoded value after url=: url as input
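A request URL like the one above can be assembled programmatically. The following is a minimal sketch that uses Python's standard library to percent-encode the url parameter; the helper name build_related_links_url is hypothetical, while the host and path layout come from this page.

```python
from urllib.parse import quote

API_HOST = "https://ilapi.graphite.io"


def build_related_links_url(client_id: str, page_type_relation: str, page_url: str) -> str:
    """Build a Related links request URL (hypothetical helper)."""
    # safe="" also encodes ':', '/', '?', '&' and '=', so a page URL that
    # carries its own query string cannot leak into the API's query string.
    encoded = quote(page_url, safe="")
    return f"{API_HOST}/{client_id}/{page_type_relation}/related-links?url={encoded}"


request_url = build_related_links_url(
    "example-client",
    "example-section",
    "https://example.com/example/path.html?q=v",
)
```

The resulting request_url can be passed to any HTTP client; no authentication or headers are required.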

Response

  • HTTP Status: 200
  • Content-Type: application/json

Response Body Properties

  • message: (String, Non-null) Text that gives details about the response. Only for information.
  • related_links: (List of JSON objects, Non-null) A list of link objects. A link object is a JSON object containing link data, which supports the creation of HTML link elements from a source page to a target page the link object describes.

📘

Properties of link objects

The specific properties of the link object are outlined in Related Link Object Properties

If the random link completion feature is enabled (the default), this list can contain random links, which are link objects selected uniformly at random without replacement.

Random links are still considered related because they are the same kind of page as the target pages. Their primary purpose is to fulfill the SEO constraint of the API, which is that each page receives 'k' incoming links: they fill the gaps whenever the API cannot assign the expected number of related links.

This list could be empty under some conditions.

“Non-null” properties will always be present in the link object and never be null. Other properties may be null or absent in the link object.

curl --request GET \
'https://ilapi.graphite.io/example-client/example-section/related-links?url=https%3A%2F%2Fexample-client.com%2Fexample-section%2Feu-wants-to-see-if-lawmakers-will-block-brexit-before-striking-new-deal-uk-s-johnson'
{
  "message": "Related links found",
  "related_links": [
    {
      "type": "related",
      "author": "Reuters Editorial",
      "image_url": "https://s3.reutersmedia.net/resources/r/?m=02&d=20190903&t=2&i=1425818863&w=1200&r=LYNXNPEF821I6",
      "published_time": "2019-09-03T16:27:26Z",
      "description": "Prime Minister Boris Johnson...",
      "title": "UK public must decide next steps if parliament votes against Johnson: PM's spokesman",
      "url": "https://example-client.com/example-section/uk-public-must-decide-next-steps-if-parliament-votes-against-johnson-pm-s-spokesman",
      "url_path": "/example-section/uk-public-must-decide-next-steps-if-parliament-votes-against-johnson-pm-s-spokesman"
    },
    ...
    {
      "type": "related",
      "author": "Reuters Editorial",
      "image_url": "https://s2.reutersmedia.net/resources/r/?m=02&d=20190903&t=2&i=1425826613&w=1200&r=LYNXNPEF821KH",
      "published_time": "2019-09-03T16:53:48Z",
      "description": "Earnings and revenue expectations for European...",
      "title": "European third quarter profit outlook improves slightly but still in recession: Refinitv",
      "url": "https://example-client.com/example-section/european-third-quarter-profit-outlook-improves-slightly-but-still-in-recession-refinitv",
      "url_path": "/example-section/european-third-quarter-profit-outlook-improves-slightly-but-still-in-recession-refinitv"
    }
  ]
}

Related Link Object Properties

The standard data properties for a related link object represent the target webpages recommended by the API for linking from a specific source webpage.

  • type: (String, Non-null) Labels a link object as “related” or “random”. It could help filter out link objects when making HTML linking elements.
  • author: (List of strings) A list of author names. It is commonly found on article pages.
  • category: (List of strings) A list of category names
  • description: (String) A webpage’s description
  • image_url: (String) A webpage’s main image URL
  • published_time: (String) A webpage’s published time. It is commonly found on article pages.
  • read_time: (String) A webpage’s read time. It is commonly found on article pages.
  • title: (String, Non-null) A webpage’s title. Usually, the first <h1> element with non-empty text.
  • url: (String, Non-null) A webpage’s canonical URL and actual link. It's usually derived from the canonical link element found in the HTML <head> section, as shown: <link rel="canonical" href="http://example.com/product.html" />. If the canonical link is not specified, the final URL detected by the crawler, following any redirects, is utilized.
  • url_path: (String, Non-null) Page canonical URL path
  • url_qs_params: (String) Page canonical URL query string parameters
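To illustrate how these properties might be turned into HTML link elements, here is a minimal sketch. It relies only on the Non-null properties (type, title, url) and treats description as optional; the function name render_links and the markup it produces are assumptions chosen for the example, not a format required by the API.

```python
from html import escape


def render_links(related_links: list, include_random: bool = True) -> str:
    """Render link objects as an HTML list (hypothetical markup)."""
    items = []
    for link in related_links:
        # "type" is Non-null, so it can be used to drop random links.
        if not include_random and link["type"] == "random":
            continue
        anchor = f'<a href="{escape(link["url"])}">{escape(link["title"])}</a>'
        # Optional properties may be null or absent, so read them defensively.
        description = link.get("description")
        if description:
            anchor += f"<p>{escape(description)}</p>"
        items.append(f"<li>{anchor}</li>")
    return "<ul>" + "".join(items) + "</ul>"
```

Escaping the values before inserting them into markup guards against titles or descriptions that contain characters such as & or <.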

📘

Data extraction

Please refer to the Notes on Data Extraction for detailed information on how we extract data for these link data properties

❗️

Missing or null properties

If a property is missing or null from the endpoint response, it indicates that the crawling process didn't find any data to store

Other response statuses

This section provides information on the circumstances that lead to various HTTP statuses

HTTP 204:
  • The API endpoint exists, but data still needs to be gathered. The response does not have a body.

Error responses

This section provides information on the circumstances that lead to various HTTP error statuses

Usually, responses are delivered in the “application/json” format and include details regarding any errors that may have transpired

HTTP 400
  • Input parameters not found
  • Input parameters validation error
HTTP 404
  • API endpoint not found
HTTP 500
  • Internal server error
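A client may want to handle these statuses explicitly. The sketch below shows one way to do so, assuming the caller already has the HTTP status code and the raw response body; the function name and the choice of exception types are hypothetical, and the structure of error bodies is not documented here, so the body is passed through as plain text.

```python
import json
from typing import Optional


def extract_related_links(status: int, body: Optional[str]) -> list:
    """Map a Related links response to a list of link objects."""
    if status == 200:
        # related_links is Non-null, but the list itself may be empty.
        return json.loads(body)["related_links"]
    if status == 204:
        # The endpoint exists but data is not gathered yet; no body is sent.
        return []
    if status in (400, 404):
        # Bad input parameters or unknown endpoint: retrying will not help.
        raise ValueError(f"Request rejected with HTTP {status}: {body}")
    # HTTP 500 and anything unexpected.
    raise RuntimeError(f"Unexpected HTTP {status}: {body}")
```

Treating 204 as an empty list lets a page render without links while the initial crawl and computation are still in progress.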

Informational API endpoints

Internal Links API endpoints list

This endpoint delivers a comprehensive list of endpoints that a specific Graphite client or API user can access.

Request

  • Endpoint Path: /<client_id>/endpoints
  • HTTP Method: GET
  • HTTP Authentication: None
  • Required Headers: None

Path parameters

  • client_id: (Required) A Graphite client (or API user) ID
    Provided by Graphite

Response

  • HTTP Status: 200
  • Content-Type: application/json

Response body properties

  • message: (String, Non-null) Text that gives details about the response.
  • client_id: (String, Non-null) API user ID that matches the client_id path parameter.
  • endpoints: (List of JSON objects, Non-null) A list of endpoint information objects. An endpoint information object is a JSON object that holds information about an Internal Links API endpoint, including its path and basic configuration details.

📘

Endpoint information object properties

The specific properties of the endpoint object are outlined in Endpoint Information Object Properties

“Non-null” properties will always be present in the endpoint information object and never be null. Other properties may be null or absent in the endpoint information object.

curl --request GET 'https://ilapi.graphite.io/example-client/endpoints'
{
  "message": "Success",
  "client_id": "example-client",
  "endpoints": [
    {
      "endpoint_path": "/example-client/example-section-2/related-links",
      "is_active": true,
      "endpoint_type": "related_links",
      "source_set_id": "example-client-example-section-2",
      "target_set_id": "example-client-example-section-2",
      "links_count": 4,
      "random_links_completion": true
    },
    {
      "endpoint_path": "/example-client/example-section/related-links",
      "is_active": true,
      "endpoint_type": "related_links",
      "source_set_id": "example-client-example-section",
      "target_set_id": "example-client-example-section",
      "links_count": 4,
      "random_links_completion": true
    }
  ]
}

Endpoint information object properties

  • endpoint_path: (String, Non-null) The endpoint route
  • is_active: (Boolean, Non-null) true if the endpoint is active; otherwise, false
  • endpoint_type: (String, Non-null) Endpoint type. Possible values are: “related_links” for a Related Links Endpoint.
  • source_set_id: (String, Non-null) ID of the source set of pages. This ID is an internal value but can be helpful to check the type of links provided by the endpoint.
  • target_set_id: (String) ID of the target set of pages. This ID is an internal value but can be helpful to check the type of links provided by the endpoint.
    • It will always be present in the object for the “related_links” endpoint type
  • links_count: (Number) Default number of links returned in the endpoint response
    • It will always be present in the object for the “related_links” endpoint type
    • It is a final value for the “related_links” endpoint type, as related links cannot be computed on the fly
  • random_links_completion: (Boolean) true if random links completion is enabled; otherwise, false
    • It will always be present in the object for the “related_links” endpoint type. If set to true, the API will randomly pick links to fill up the resultant list if there aren't enough related links (links_count).
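As an illustration of how these properties can be used, the following sketch parses an endpoints response body and returns the paths of active Related links endpoints. The function name is hypothetical; the property names are the ones listed above.

```python
import json


def active_related_links_paths(response_body: str) -> list:
    """Return endpoint_path for every active related_links endpoint."""
    data = json.loads(response_body)
    return [
        endpoint["endpoint_path"]
        for endpoint in data["endpoints"]
        # is_active and endpoint_type are Non-null, so direct access is safe.
        if endpoint["is_active"] and endpoint["endpoint_type"] == "related_links"
    ]
```

Each returned path can be appended to the API host to form the base of a Related links request.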

Error responses

This section provides information on the circumstances that lead to various HTTP error statuses.

Usually, responses are delivered in the “application/json” format and include details regarding any errors that may have transpired.

HTTP 404
  • Unable to locate endpoint settings as the client_id provided could not be found
HTTP 500
  • Internal server error

Crawling

To index pages for related links selection, Graphite’s bot crawls pages from a source of URLs.

Optimal URL sources are:

  • XML sitemap index (preferred)
    • A sitemap URL pattern could also be specified to avoid crawling all sub-sitemaps in massive websites
  • XML sitemap
  • HTML sitemap
  • robots.txt file with sitemaps

We crawl at most once a day, starting at 00:00 UTC, with a request rate of 60-240 pages per minute.

📘

Crawl settings

For more information about our crawler, including speed, IP, and user agent, please visit our Crawl settings documentation

Graphite's crawling bot

These are the currently used Graphite bot user agents:

We recommend allowing GraphiteBot to crawl using the user agent. If you are worried about “User-Agent” spoofing, Graphite can provide static IP addresses for the bot's connections. These can be utilized to grant access authorization.

Indexing

Indexing applies to pages that are part of the client's pre-determined source or target sets for available endpoints. The pre-determined source or target sets are most commonly identified by URL patterns (e.g., example.com/blog/{slug}). More complex patterns or multiple patterns are also supported.

Page Status Codes

200 HTTP status

  • The GraphiteBot crawler will index every page from the URL sources that returns a 200 HTTP status and is part of the pre-determined source or target sets for available endpoints

301 and 302 HTTP status

  • The GraphiteBot crawler will follow 301 and 302 HTTP statuses by default
    • The link URL the crawler indexes will be the one in the canonical tag or, if there are redirections, the last URL seen
    • The redirected URL needs to be part of the same URL pattern identified for source and target sets for available endpoints or it will be excluded
      • The URLs in between redirects can have different patterns, but the first and last URLs crawled need to follow the identified pattern from the set

4xx HTTP status

  • By default, GraphiteBot will not include any pages that return 4xx HTTP status codes

Preventing page indexing

API users have several options to prevent page indexing:

  • Provide URL sources that omit the pages the user wishes to exclude from indexing
    • This can be done by providing a dedicated sitemap for the API that only includes pages that should be included in the index
  • Use the HTML robots meta tag to mark a page as “noindex”. This means that any pages blocked from indexing by search engines will not be indexed by GraphiteBot either. For example, use <meta name="robots" content="noindex">
  • To prevent only GraphiteBot from indexing a page, address the bot by name in a meta tag or a response header. For instance, the tag <meta name="graphitebot" content="noindex"> or the header X-Robots-Tag: graphitebot: noindex can be used.
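Before relying on these tags, it can be useful to verify that a page actually carries them. The sketch below checks an HTML document for a noindex directive in either the robots or the graphitebot meta tag. It is only a verification aid written for this example, not GraphiteBot's implementation, and it does not inspect the X-Robots-Tag header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots and graphitebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "graphitebot"):
            content = (attributes.get("content") or "").lower()
            # content may hold several comma-separated directives.
            tokens = {token.strip() for token in content.split(",")}
            self.directives.setdefault(name, set()).update(tokens)


def has_noindex(html_text: str) -> bool:
    """Return True if any relevant meta tag contains a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return any("noindex" in tokens for tokens in parser.directives.values())
```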

Notes on data extraction

GraphiteBot uses a hierarchy of data sources to extract information from HTML documents:

  • It looks at the page's first <h1> element for the webpage title
  • It then looks at schema.org's JSON structured data, particularly the CreativeWork type, to gather various details like title, description, authors, categories, images, published time, and text content
  • If needed, it also uses the Open Graph protocol to gather details like the title, description, authors, images, and published time
  • Finally, standard HTML metadata may be used to gather details from the following elements: title tag, meta author, and meta description

The order of precedence ensures that the most reliable and accurate data sources are prioritized in the extraction. If a piece of information isn't available from a high-precedence source, the system will then look to the next highest precedence source. This method helps maintain data quality and reliability.
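The fallback behavior described above can be pictured as a first-match lookup over an ordered list of sources. The sketch below is a simplified model, not GraphiteBot's code: the source labels and the function name are hypothetical, and it uses the title property as the example because the <h1> source applies to titles only.

```python
from typing import Optional

# Hypothetical labels for the sources, highest precedence first.
SOURCE_PRECEDENCE = ["h1", "schema_org", "open_graph", "html_meta"]


def resolve_property(candidates: dict) -> Optional[str]:
    """Return the value from the highest-precedence source that has one."""
    for source in SOURCE_PRECEDENCE:
        value = candidates.get(source)
        if value:  # skip sources that are missing or empty
            return value
    return None  # the property ends up missing or null in the response
```

For example, a page without an <h1> but with an Open Graph title and a title tag would resolve to the Open Graph value.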

Text content

Accurate text extraction is crucial because the API algorithms rely on page content. Text can be supplied through schema.org objects via the "text" property. If the text is not available through JSON structured data, HTML blocks within the <body> element marked with the itemprop="text" attribute can be used. If neither of these sources provides text, a heuristics-based extraction from the <body> will be performed as a last resort.

Recommendations

Generally, to build HTML linking elements, an API user only needs the target page URLs and can use their own database to get the remaining information required for building navigable links. However, when relying on the API’s extracted data, these recommendations may be helpful:

  • Use a Canonical Link: Always ensure that the canonical link you use is unique. This helps in avoiding duplicate content issues and improves SEO.
  • Use One <h1> Element: It's a good SEO practice to use one <h1> element to mark up the page's title. This helps search engines understand the content of the page better.
  • Use Structured Data: Make use of structured data from either schema.org or Open Graph, or even both. This helps in providing more detailed information about the page content to search engines and our API.
  • Add an ID to Main Content: Adding an ID to the main content of the page can help in better navigation and accessibility.
  • Add IDs to Relevant HTML Elements: Consider adding IDs to relevant HTML elements such as authors, breadcrumbs, categories, published time, images, etc. This can help in better organization and accessibility of the content.

Uptime and latency

Built on robust AWS services, the API has consistently achieved a 99% uptime, with no major disruptions. It also maintains an average response time of 150 milliseconds.

Request rate limit

Although API endpoints are not restricted by rate limits, we recommend maintaining a rate of under 20 requests/second for each endpoint when retrieving data by batches. With this rate, an API user can update links for 10,000 pages in less than 10 minutes.
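The arithmetic behind this estimate, together with one simple way to stay under the recommended rate, can be sketched as follows. The helper names are hypothetical, and the throttle is a basic fixed-interval approach chosen for the example rather than a mechanism provided by the API.

```python
import time


def batch_duration_minutes(page_count: int, requests_per_second: float) -> float:
    """Minutes needed to request page_count pages at a steady rate."""
    return page_count / requests_per_second / 60


def throttled(items, requests_per_second, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than requests_per_second."""
    interval = 1.0 / requests_per_second
    next_slot = clock()
    for item in items:
        now = clock()
        if next_slot > now:
            sleep(next_slot - now)  # wait for the next allowed slot
        # Schedule from whichever is later, so slow callers do not burst.
        next_slot = max(next_slot, now) + interval
        yield item
```

At 20 requests/second, 10,000 pages take 10,000 / 20 = 500 seconds, which is about 8.3 minutes. A batch job could iterate with `for url in throttled(urls, 20): ...` and issue one request per iteration.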

Graphite can provide guidance on the frequency of batch data retrieval jobs.

Caching

We highly recommend server-side rendering coupled with caching for optimal performance. You can cache the results using the endpoint URL (with query string parameters) serving as the key.

The links are refreshed no more than once a day, thus a 24-hour Time To Live (TTL) is suitable.
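A minimal in-memory sketch of this caching approach is shown below, with the full endpoint URL as the key and a 24-hour TTL. The class name TTLCache is hypothetical, and a production setup would more likely use an existing cache such as a CDN or a key-value store; the example only illustrates the expiry logic.

```python
import time

ONE_DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=ONE_DAY_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: force a fresh API request
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)
```

On a cache miss, the server would request the endpoint URL, store the returned related_links under that URL, and render from the cached value until it expires.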