Standard API general documentation
Consult this page if you are running, or considering running, Graphite's Standard Internal Links API.
API Host
URL: https://ilapi.graphite.io
Internal Links API endpoints
Related links endpoint
This endpoint delivers a list of related links from a source webpage to other target webpages within the same website. The source and target pages can be of the same type (e.g., blog-to-blog links) or different types (e.g., blog-to-product links).
Prerequisites
- Upon the client's or API user's request, Graphite creates a Related links endpoint
- The client specifies the sets of source and target pages for which related links will be built
- For more information about source and target sets, and how to define them on your site, please review our API definitions
- Graphite then performs an initial crawl to gather the data that will be used to calculate related links or be returned through the endpoint response
- Following this, Graphite carries out the initial computation of related links
- The links for pages are chosen to maximize relatedness, ensuring that each page receives 'k' incoming links on average. The semantic similarity of their text determines the relatedness between the pages.
- The endpoints are fully functional once the data is collected and the links are computed
Request
- Endpoint Path:
/<client_id>/<page_type_relation>/related-links
- HTTP Method:
GET
- HTTP Authentication: None
- Required Headers: None
- CORS: Yes, accessible from all origins without the need for authentication or headers
Input Parameters: Path and Query String
Path parameters
client_id
: (Required) A Graphite client (or API user) ID
Provided by Graphite
page_type_relation
: (Required) A label that describes the relationship between the source and target pages, for example, “blog” for same-type pages or “blog-to-product” for different types
Provided by Graphite
Informational API Endpoints
Check the Informational API Endpoints section to learn about the Internal Links API endpoints available for a given client_id and their details.
Query string parameters
url
: (Required) A canonical or unique webpage URL
The API IDs for each webpage are derived from unique URLs, which lead to an ID-based search on the API index.
Valid URLs:
- URLs with a scheme, e.g., https://example.com/example/path.html
- URLs without a scheme, e.g., example.com/example/path.html
- A URL path (technically not a URL but valid for the API request). It must start with a slash (/), which indicates the root folder, e.g., /example/path.html
- Encoded URLs, e.g., https%3A%2F%2Fexample.com%2Fexample%2Fpath.html
- URLs with query string parameters must be encoded, e.g., https%3A%2F%2Fexample.com%2Fexample%2Fpath.html%3Fq%3Dv
The order of parameters in a URL doesn't change the data it leads to; however, different sets of parameters pointing to the same data are treated as unique webpages. Therefore, URLs with query string parameters should be standardized.
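The encoding and standardization rules above can be sketched in Python. This is a minimal illustration, not an official client: the function names are hypothetical, and sorting query string parameters alphabetically is only one possible standardization convention (the API does not prescribe one). Note that re-encoding the query with urlencode writes spaces as "+".

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

API_HOST = "https://ilapi.graphite.io"


def standardize_url(page_url: str) -> str:
    """Sort query string parameters so equivalent URLs map to one string.

    Assumption: alphabetical ordering is the chosen convention.
    """
    parts = urlsplit(page_url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


def related_links_request_url(client_id: str, page_type_relation: str, page_url: str) -> str:
    """Build the request URL with the page URL fully percent-encoded."""
    encoded = quote(standardize_url(page_url), safe="")
    return f"{API_HOST}/{client_id}/{page_type_relation}/related-links?url={encoded}"
```

Whatever convention is chosen, the same one should be applied everywhere the url parameter is built, so that each page is always requested under a single URL.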
Example usage:
- example-client: client_id as input
- example-section: page_type_relation as input
- The value after url=: url as input
Response
- HTTP Status: 200
- Content-Type: application/json
Response Body Properties
message
: (String, Non-null) Text that gives details about the response. Only for information.
related_links
: (List of JSON objects, Non-null) A list of link objects. A link object is a JSON object containing link data, which supports the creation of HTML link elements from a source page to the target page the link object describes.
Properties of link objects
The specific properties of the link object are outlined in Related Link Object Properties
If the random link completion feature is enabled (the default), this list can contain random links, which are link objects selected uniformly at random without replacement.
Random links are still considered related because they are of the same kind as target pages. Their primary purpose is to fulfill the SEO constraint of the API, that each page receives 'k' incoming links, by filling gaps whenever the API cannot assign the expected number of related links.
This list could be empty under some conditions.
“Non-null” properties will always be present in the link object and never be null. Other properties may be null or absent in the link object.
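To illustrate what "selected uniformly at random without replacement" means for random link completion, here is a small Python sketch. It is not Graphite's implementation: the function name, the candidate pool argument, and the rule of skipping already linked URLs are assumptions made for the example.

```python
import random
from typing import Dict, List, Optional


def complete_with_random(
    related: List[Dict[str, str]],
    candidates: List[Dict[str, str]],
    links_count: int,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Pad `related` up to `links_count` with randomly chosen target pages."""
    rng = rng or random.Random()
    missing = max(0, links_count - len(related))
    already_linked = {link["url"] for link in related}
    pool = [page for page in candidates if page["url"] not in already_linked]
    # random.sample draws without replacement; every page is equally likely.
    extra = rng.sample(pool, min(missing, len(pool)))
    # Mark the added link objects so consumers can tell them apart.
    return related + [dict(page, type="random") for page in extra]
```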
curl --request GET 'https://ilapi.graphite.io/example-client/example-section/related-links?url=https%3A%2F%2Fexample-client.com%2Fexample-section%2Feu-wants-to-see-if-lawmakers-will-block-brexit-before-striking-new-deal-uk-s-johnson'
{
"message": "Related links found",
"related_links": [
{
"type": "related",
"author": "Reuters Editorial",
"image_url": "https://s3.reutersmedia.net/resources/r/?m=02&d=20190903&t=2&i=1425818863&w=1200&r=LYNXNPEF821I6",
"published_time": "2019-09-03T16:27:26Z",
"description": "Prime Minister Boris Johnson...",
"title": "UK public must decide next steps if parliament votes against Johnson: PM's spokesman",
"url": "https://example-client.com/example-section/uk-public-must-decide-next-steps-if-parliament-votes-against-johnson-pm-s-spokesman",
"url_path": "/example-section/uk-public-must-decide-next-steps-if-parliament-votes-against-johnson-pm-s-spokesman"
},
...
{
"type": "related",
"author": "Reuters Editorial",
"image_url": "https://s2.reutersmedia.net/resources/r/?m=02&d=20190903&t=2&i=1425826613&w=1200&r=LYNXNPEF821KH",
"published_time": "2019-09-03T16:53:48Z",
"description": "Earnings and revenue expectations for European...",
"title": "European third quarter profit outlook improves slightly but still in recession: Refinitv",
"url": "https://example-client.com/example-section/european-third-quarter-profit-outlook-improves-slightly-but-still-in-recession-refinitv",
"url_path": "/example-section/european-third-quarter-profit-outlook-improves-slightly-but-still-in-recession-refinitv"
}
]
}
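A response like the one above could be turned into HTML link elements as follows. This is a minimal sketch: the function name and the include_random flag are hypothetical, and it relies only on the type, url, and title properties, which are documented as Non-null.

```python
import html
import json
from typing import List


def render_links(response_body: str, include_random: bool = True) -> List[str]:
    """Build one <a> element per link object in a related-links response."""
    data = json.loads(response_body)
    anchors = []
    for link in data["related_links"]:
        # Optionally drop links added by random link completion.
        if link["type"] == "random" and not include_random:
            continue
        href = html.escape(link["url"], quote=True)
        text = html.escape(link["title"])
        anchors.append(f'<a href="{href}">{text}</a>')
    return anchors
```

Optional properties such as image_url or description may be null or absent, so any template that uses them should handle missing values.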
Related Link Object Properties
The standard data properties for a related link object represent the target webpages recommended by the API for linking from a specific source webpage.
type
: (String, Non-null) Labels a link object as “related” or “random”. It could help filter out link objects when making HTML linking elements.
author
: (List of strings) A list of author names. It is commonly found on article pages.
category
: (List of strings) A list of category names
description
: (String) A webpage’s description
image_url
: (String) A webpage’s main image URL
published_time
: (String) A webpage’s published time. It is commonly found on article pages.
read_time
: (String) A webpage’s read time. It is commonly found on article pages.
title
: (String, Non-null) A webpage’s title. Usually, the first <h1> element with non-empty text.
url
: (String, Non-null) A webpage’s canonical URL and actual link. It's usually derived from the canonical link element found in the HTML <head> section, as shown: <link rel="canonical" href="http://example.com/product.html" />. If the canonical link is not specified, the final URL detected by the crawler, following any redirects, is utilized.
url_path
: (String, Non-null) Page canonical URL path
url_qs_params
: (String) Page canonical URL query string parameters
Data extraction
Please refer to the Notes on Data Extraction for detailed information on how we extract data for these link data properties
Missing or null properties
If a property is missing or null from the endpoint response, it indicates that the crawling process didn't find any data to store
Other response statuses
This section provides information on the circumstances that lead to various HTTP statuses
HTTP 204:
- The API endpoint exists, but data still needs to be gathered. The response does not have a body.
Error responses
This section provides information on the circumstances that lead to various HTTP error statuses
Usually, responses are delivered in the “application/json” format and include details regarding any errors that may have transpired
HTTP 400
- Input parameters not found
- Input parameters validation error
HTTP 404
- API endpoint not found
HTTP 500
- Internal server error
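The statuses listed above could be handled on the client side along these lines. This is a sketch under assumptions: the function and exception names are hypothetical, and because the shape of the error body is not specified here, the body is passed through as text rather than parsed.

```python
import json
from typing import Any, Dict, List


class RelatedLinksError(Exception):
    """Raised for the documented 400, 404, and 500 responses."""


def interpret_response(status: int, body: str) -> List[Dict[str, Any]]:
    """Map a related-links HTTP response to a list of link objects."""
    if status == 200:
        return json.loads(body)["related_links"]
    if status == 204:
        # The endpoint exists but data is not gathered yet; there is no body.
        return []
    raise RelatedLinksError(f"HTTP {status}: {body}")
```

Treating 204 as an empty list lets page templates render without links until the initial crawl and computation have finished.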
Informational API endpoints
Internal Links API endpoints list
This endpoint delivers a comprehensive list of endpoints that a specific Graphite client or API user can access.
Request
- Endpoint Path:
/<client_id>/endpoints
- HTTP Method:
GET
- HTTP Authentication: None
- Required Headers: None
Path parameters
client_id
: (Required) A Graphite client (or API user) ID
Provided by Graphite
Response
- HTTP Status: 200
- Content-Type: application/json
Response body properties
message
: (String, Non-null) Text that gives details about the response.
client_id
: (String, Non-null) API user ID that matches the client_id path parameter.
endpoints
: (List of JSON objects, Non-null) A list of endpoint information objects. An endpoint information object is a JSON object that holds information about an Internal Links API endpoint, including its path and basic configuration details.
Endpoint information object properties
The specific properties of the endpoint object are outlined in Endpoint Information Object Properties
“Non-null” properties will always be present in the endpoint information object and never be null. Other properties may be null or absent in the endpoint information object.
curl --request GET 'https://ilapi.graphite.io/example-client/endpoints'
{
"message": "Success",
"client_id": "example-client",
"endpoints": [
{
"endpoint_path": "/example-client/example-section-2/related-links",
"is_active": true,
"endpoint_type": "related_links",
"source_set_id": "example-client-example-section-2",
"target_set_id": "example-client-example-section-2",
"links_count": 4,
"random_links_completion": true
},
{
"endpoint_path": "/example-client/example-section/related-links",
"is_active": true,
"endpoint_type": "related_links",
"source_set_id": "example-client-example-section",
"target_set_id": "example-client-example-section",
"links_count": 4,
"random_links_completion": true
}
]
}
Endpoint information object properties
endpoint_path
: (String, Non-null) The endpoint route
is_active
: (Boolean, Non-null) true if the endpoint is active; otherwise, false
endpoint_type
: (String, Non-null) Endpoint type. Possible values are: “related_links” for a Related Links Endpoint.
source_set_id
: (String, Non-null) ID of the source set of pages. This ID is an internal value but can be helpful to check the type of links provided by the endpoint.
target_set_id
: (String) ID of the target set of pages. This ID is an internal value but can be helpful to check the type of links provided by the endpoint.
- It will always be present in the object for the “related_links” endpoint type
links_count
: (Number) Default number of links returned in the endpoint response
- It will always be present in the object for the “related_links” endpoint type
- It is a final value for the “related_links” endpoint type, as related links cannot be computed on the fly
random_links_completion
: (Boolean) true if random links completion is enabled; otherwise, false
- It will always be present in the object for the “related_links” endpoint type. If set to true, the API will randomly pick links to fill up the resultant list if there aren't enough related links (links_count).
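As an example of using these properties, the sketch below lists the paths of active Related Links endpoints from an endpoints response. The function name is hypothetical; it reads only endpoint_path, is_active, and endpoint_type, which are documented as Non-null.

```python
import json
from typing import List


def active_related_links_paths(response_body: str) -> List[str]:
    """Return endpoint paths that are active and of type related_links."""
    data = json.loads(response_body)
    return [
        endpoint["endpoint_path"]
        for endpoint in data["endpoints"]
        if endpoint["is_active"] and endpoint["endpoint_type"] == "related_links"
    ]
```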
Error responses
This section provides information on the circumstances that lead to various HTTP error statuses.
Usually, responses are delivered in the “application/json” format and include details regarding any errors that may have transpired.
HTTP 404
- Unable to locate endpoint settings as the client_id provided could not be found
HTTP 500
- Internal server error
Crawling
To index pages for related links selection, Graphite’s bot crawls pages from a source of URLs.
Optimal URL sources are:
- XML sitemap index (preferred)
- A sitemap URL pattern could also be specified to avoid crawling all sub-sitemaps in massive websites
- XML sitemap
- HTML sitemap
- robots.txt file with sitemaps
Crawling runs at most once a day, starting at 00:00 UTC, at a request rate of 60-240 pages per minute.
Crawl settings
For more information about our crawler, including speed, IP, and user agent, please visit our Crawl settings documentation.
Graphite's crawling bot
These are the currently used Graphite bot user agents:
- GraphiteBot
- Base bot identifier.
- GraphiteBot/1.0 (+https://www.graphitehq.com)
- Identifier with version number and comment.
We recommend allowing GraphiteBot to crawl by granting access to its user agent. If you are worried about “User-Agent” spoofing, Graphite can provide static IP addresses for the bot's connections, which can be used to grant access instead.
Indexing
This applies to pages that are part of the client's pre-determined source or target sets for available endpoints. The pre-determined source or target sets are most commonly identified by URL patterns (e.g., example.com/blog/{slug}). More complex patterns or multiple patterns are also supported.
Page Status Codes
200 HTTP status
- The GraphiteBot crawler will index all pages from URL sources that return a 200 HTTP status and are part of the pre-determined source or target sets for available endpoints
301 and 302 HTTP status
- The GraphiteBot crawler will follow 301 and 302 HTTP statuses by default
- The link URL the crawler indexes will be the canonical tag or the last seen URL if there are redirections
- The redirected URL needs to be part of the same URL pattern identified for source and target sets for available endpoints or it will be excluded
- The URLs in between redirects can have different patterns, but the first and last URLs crawled need to follow the identified pattern from the set
4xx HTTP status
- By default, GraphiteBot will not include any pages that return 4xx HTTP status codes
Preventing page indexing
API users have several options to prevent page indexing:
- Provide URL sources that omit the pages the user wishes to exclude from indexing
- This can be done by providing a dedicated sitemap for the API that only includes pages that should be included in the index
- Use the HTML robots meta tag to mark a page as “noindex”. This means that any pages blocked from indexing by search engines will not be indexed by GraphiteBot either. For example, use <meta name="robots" content="noindex">
- Other tags like <meta name="googlebot" content="noindex"> are also supported.
- The X-Robots-Tag HTTP header is another supported method, as detailed in https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag.
- To specifically prevent GraphiteBot from indexing a page, use the HTML robots meta tag. For instance, the tag <meta name="graphitebot" content="noindex"> or the header X-Robots-Tag: graphitebot: noindex can be used.
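Before a crawl, it can be useful to check which of your own pages carry a noindex signal. The sketch below is a rough local pre-check, not GraphiteBot's actual logic: the set of bot names is an assumption based on the examples above, and for simplicity any noindex directive in the X-Robots-Tag value is treated as applying, regardless of which bot it names.

```python
from html.parser import HTMLParser

# Assumption: meta tag names taken from the examples in this section.
BOT_NAMES = {"robots", "googlebot", "graphitebot"}


class _RobotsMetaParser(HTMLParser):
    """Records whether any robots-style meta tag contains noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        name = attr.get("name", "").lower()
        if name in BOT_NAMES and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def is_marked_noindex(page_html: str, x_robots_tag: str = "") -> bool:
    """Return True if the HTML or the X-Robots-Tag value signals noindex."""
    parser = _RobotsMetaParser()
    parser.feed(page_html)
    # Simplification: does not check which bot the header directive targets.
    return parser.noindex or "noindex" in x_robots_tag.lower()
```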
Notes on data extraction
GraphiteBot uses a hierarchy of data sources to extract information from HTML documents:
- It looks at the page's first <h1> element for the webpage title
- Then looks at schema.org's JSON structured data, particularly the CreativeWork type, to gather various details like title, description, authors, categories, images, published time, and text content
- If needed, it also uses the Open Graph protocol to gather details like the title, description, authors, images, and published time
- Finally, standard HTML metadata may be used to gather details from the following elements: title tag, meta author, and meta description
The order of precedence ensures that the most reliable and accurate data sources are prioritized in the extraction. If a piece of information isn't available from a high-precedence source, the system will then look to the next highest precedence source. This method helps maintain data quality and reliability.
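The fallback behavior described above can be pictured as a first-match lookup over sources ordered by precedence. This is an illustrative sketch only: the function name and the dictionary representation of each source are assumptions, not GraphiteBot's internals.

```python
from typing import Mapping, Optional, Sequence


def first_available(
    field: str, sources: Sequence[Mapping[str, Optional[str]]]
) -> Optional[str]:
    """Return the value of `field` from the highest-precedence source that has it."""
    for source in sources:
        value = source.get(field)
        if value:  # skip missing, None, and empty values
            return value
    return None


# Hypothetical extracted data, ordered from highest to lowest precedence.
h1 = {"title": "Page title from h1"}
json_ld = {"title": "JSON-LD title", "description": None}
open_graph = {"description": "Open Graph description"}
html_meta = {"title": "Title tag", "description": "Meta description"}
sources = [h1, json_ld, open_graph, html_meta]
```

With these sample sources, the title comes from the first <h1> element and the description falls through to Open Graph because the structured data has none.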
Text content
To provide accurate content to the API algorithms, text extraction is crucial. This can be achieved using schema.org's objects via the "text" property. If the text is not available through JSON structured data, HTML blocks within the <body> element marked with the itemprop="text" attribute can be used. If neither of these sources provides text, a heuristics-based extraction from the <body> will be performed as a last resort.
Recommendations
Generally, to build HTML linking elements, an API user only needs the target page URLs and can use its own database to get the information required for building navigable links. However, when relying on the API’s extracted data, these recommendations may be helpful:
- Use a Canonical Link: Always ensure that the canonical link you use is unique. This helps in avoiding duplicate content issues and improves SEO.
- Use One <h1> Element: It's a good SEO practice to use one <h1> element to mark up the page's title. This helps search engines understand the content of the page better.
- Use Structured Data: Make use of structured data from either schema.org or Open Graph, or even both. This helps in providing more detailed information about the page content to search engines and our API.
- Add an ID to Main Content: Adding an ID to the main content of the page can help in better navigation and accessibility.
- Add IDs to Relevant HTML Elements: Consider adding IDs to relevant HTML elements such as authors, breadcrumbs, categories, published time, images, etc. This can help in better organization and accessibility of the content.
Uptime and latency
Built on robust AWS services, the API has consistently achieved a 99% uptime, with no major disruptions. It also maintains an average response time of 150 milliseconds.
Request rate limit
Although API endpoints are not restricted by rate limits, we recommend maintaining a rate of under 20 requests/second for each endpoint when retrieving data by batches. With this rate, an API user can update links for 10,000 pages in less than 10 minutes.
Graphite can provide guidance on the frequency of batch data retrieval jobs.
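The figures above follow from simple arithmetic: 10,000 pages at 20 requests/second take 500 seconds, or about 8.3 minutes. The sketch below shows that calculation and one simple way to pace a batch job. The function names are hypothetical, and the fixed pause after each item ignores request latency, so the real rate stays below the target.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def batch_duration_minutes(page_count: int, requests_per_second: float) -> float:
    """Minimum time to request `page_count` pages at a steady rate."""
    return page_count / requests_per_second / 60


def throttled(
    items: Iterable[T],
    requests_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items one at a time, pausing after each to cap the request rate."""
    interval = 1.0 / requests_per_second
    for item in items:
        yield item
        sleep(interval)
```

A batch job could iterate over `throttled(page_urls, 20)` and issue one request per yielded URL.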
Caching
We highly recommend server-side rendering coupled with caching for optimal performance. You can cache the results using the endpoint URL (with query string parameters) serving as the key.
The links are refreshed no more than once a day, thus a 24-hour Time To Live (TTL) is suitable.
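A cache keyed by the endpoint URL with a 24-hour TTL could look like the following in-memory sketch. The class and method names are hypothetical, and a production setup would more likely use a shared cache; the injectable clock exists only to make the expiry logic easy to test.

```python
import time
from typing import Any, Callable, Dict, Tuple

DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """Caches one value per endpoint URL until its TTL expires."""

    def __init__(
        self,
        ttl_seconds: float = DAY_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, endpoint_url: str, fetch: Callable[[str], Any]) -> Any:
        """Return the cached value, or call `fetch` and store its result."""
        now = self._clock()
        entry = self._entries.get(endpoint_url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = fetch(endpoint_url)
        self._entries[endpoint_url] = (now, value)
        return value
```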