API Reference
The Oncrawl REST API is used for accessing your crawl data as well as managing your projects and your crawls.
In order to use this API you need to have an Oncrawl account, an active subscription and an access token.
The current version of the web API is known as V2.
Although we don’t expect it to change much, it is still considered under development.
We try to keep breaking changes to a minimum, but this is not 100% guaranteed.
Requests
All API requests should be made to the /api/v2
prefix, and will return JSON as the response.
HTTP Verbs
When applicable, the API tries to use the appropriate HTTP verb for each action:
Verb | Description
---|---
GET | Used for retrieving resources.
POST | Used for creating resources.
PUT | Used for updating resources.
DELETE | Used for deleting resources.
Parameters and Data
curl "https://app.oncrawl.com/api/v2/projects" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"project": {
"name": "Project name",
"start_url": "https://www.oncrawl.com"
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/projects", json={
"project": {
"name": "Project name",
"start_url": "https://www.oncrawl.com"
}
})
Any parameters not included in the URL should be encoded as JSON with a Content-Type of application/json
.
Additional parameters are sometimes specified via the querystring, even for POST
, PUT
and DELETE
requests.
When a complex object is required to be passed via the querystring, the rison encoding format is used.
Errors
Format of an error message
{
"type": "error_type",
"code": "error_code",
"message": "Error message",
"fields": [{
"name": "parameter_name",
"type": "field_error_type",
"message": "Error message"
}]
}
When an error occurs, the API returns a JSON object with the following properties:
Property | Description
---|---
type | An attribute that groups errors based on their nature.
code optional | A more specific attribute to let you handle specific errors.
message optional | A human readable message describing the error.
fields optional | List of field-related errors.
Quota error message
{
"type": "quota_error",
"message": "Not enough quota"
}
Forbidden error message
{
"type": "forbidden",
"code": "no_active_subscription"
}
Fields related errors
{
"type": "invalid_request_parameters",
"fields": [{
"name": "start_url",
"type": "required",
"message": "The start URL is required."
}]
}
Permissions errors
The following errors occur if you are not allowed to perform a request.
Type | Description
---|---
unauthorized | Returned when the request is not authenticated. HTTP Code: 401
forbidden | Returned when the request is authenticated but the action is not allowed. HTTP Code: 403
quota_error | Returned when the current quota does not allow the action to be performed. HTTP Code: 403
The forbidden error is usually accompanied by a code key:
- unauthorized if the action is not authorized for the authenticated user, for example if you are not allowed to modify a resource.
- feature_not_available if the current subscription does not allow the usage of a feature.
- no_active_subscription if there is no active subscription.
Validations errors
The following errors are caused by an invalid request. In most cases it means the request won’t be able to complete unless the parameters are changed.
Type | Description
---|---
invalid_request | Returned when the request has incompatible values or does not match the API specification. HTTP Code: 400
invalid_request_parameters | Returned when a value does not meet the required specification for the parameter. HTTP Code: 400
resource_not_found | Returned when any resource referred to in the request is not found. HTTP Code: 404
duplicate_entry | Returned when the request provides a duplicate value for an attribute that is specified as unique. HTTP Code: 400
Operation failure errors
These errors are returned when the request was valid but the requested operation could not be completed.
Type | Description
---|---
invalid_state_for_request | Returned when the requested operation is not allowed for the current state of the resource. HTTP Code: 409
internal_error | Returned when the request couldn’t be completed due to a bug on Oncrawl’s side. HTTP Code: 500
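As an illustration, a minimal sketch of error handling from Python (assuming the requests library and the Bearer token described in the Authentication section below) branches on the HTTP status code and the type attribute of the returned JSON:
import requests

response = requests.get(
    "https://app.oncrawl.com/api/v2/projects",
    headers={'Authorization': 'Bearer {ACCESS_TOKEN}'}
)
if response.status_code >= 400:
    error = response.json()
    # "type" is always present; "code", "message" and "fields" are optional.
    if error["type"] == "quota_error":
        print("Not enough quota:", error.get("message"))
    elif error["type"] == "invalid_request_parameters":
        for field in error.get("fields", []):
            print(field["name"], field["type"], field.get("message"))
    else:
        print(error["type"], error.get("code"))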
Authentication
To authorize, use this code:
# With shell, you can just pass the correct header with each request
curl "https://app.oncrawl.com/api/v2/projects" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
response = requests.get("https://app.oncrawl.com/api/v2/projects",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
Make sure to replace
{ACCESS_TOKEN}
with your own access token.
Oncrawl uses access tokens to allow access to the API. You can create tokens from your settings panel if your subscription allows it.
Oncrawl expects the access token to be included in all API requests to the server.
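For example, a minimal sketch using the requests library sets the header once on a session so that every call is authenticated:
import requests

session = requests.Session()
session.headers.update({'Authorization': 'Bearer {ACCESS_TOKEN}'})

# Every request made through this session now carries the Authorization header.
projects = session.get("https://app.oncrawl.com/api/v2/projects").json()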
An access token may be created with various scopes:
Scope | Description
---|---
account:read | Gives read access to all account-related data. Examples: profile, invoices, subscription.
account:write | Gives write access to all account-related data. Examples: close account, update billing information.
projects:read | Gives read access to all project and crawl data. Examples: view crawl reports, export data.
projects:write | Gives write access to all project and crawl data. Examples: launch crawl, create project.
Oncrawl Query Language
Oncrawl provides a JSON-style language that you can use to execute queries.
This is referred to as the OQL
for OnCrawl Query Language.
An OQL query has a tree-like structure composed of nodes
.
A node
can be terminal and is referred to as a leaf
, or be a compound
of other nodes
.
An OQL query must start with a single root
node.
Leaf nodes
Example of OQL using a
field
node:
{
"field": [ "field_name", "filter_type", "filter_value" ]
}
Node | Description
---|---
field | Apply a filter on a field.
The value of a field
node is an array with 3 values:
- The field name to apply the filter to.
- The type of filter to apply.
- The value of the filter.
Compound nodes
Example OQL using an
and
node:
{
"and": [ {
"field": [ "field_name", "filter_type", "filter_value" ]
}, {
"field": [ "field_name", "filter_type", "filter_value" ]
}]
}
Node | Description
---|---
and | Execute a list of nodes using the logical operator AND.
or | Execute a list of nodes using the logical operator OR.
Common filters
OQL to retrieve pages found in the structure:
{
"field": [ "depth", "has_value", "" ]
}
Filter type | Description
---|---
has_no_value | The field must have no value.
has_value | The field must have any value.
String filters
OQL to retrieve pages with “cars” in title
{
"field": [ "title", "contains", "cars" ]
}
Filter type | Description
---|---
contains | The field’s value must contain the filter value.
endswith | The field’s value must end with the filter value.
startswith | The field’s value must start with the filter value.
equals | The field’s value must be strictly equal to the filter value.
Numeric filters
OQL to retrieve pages with less than 10 inlinks:
{
"field": [ "follow_inlinks", "lt", "10" ]
}
OQL to retrieve pages between depth 1 and 4
{
"field": [ "depth", "between", [ "1", "4" ]]
}
Filter type | Description
---|---
gt | The field’s value must be greater than the filter value.
gte | The field’s value must be greater than or equal to the filter value.
lt | The field’s value must be less than the filter value.
lte | The field’s value must be less than or equal to the filter value.
between | The field’s value must be between both filter values (lower inclusive, upper exclusive).
Filters options
OQL to retrieve urls within /blog/{year}/:
{
"field": [ "urlpath", "startswith", "/blog/([0-9]+)/", { "regex": true } ]
}
The filters equals
, contains
, startswith
and endswith
can take options as the fourth parameter of the field
node as a JSON object.
Property | Description
---|---
ci boolean | true if the match should be case insensitive.
regex boolean | true if the filter value is a regex.
Pagination
The majority of endpoints returning resources such as projects and crawls are paginated.
HTTP request
Example of paginated query
curl "https://app.oncrawl.com/api/v2/projects?offset=50&limit=100&sort=name:desc" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
response = requests.get(
"https://app.oncrawl.com/api/v2/projects?offset={offset}&limit={limit}&sort={sort}"
.format(
offset=50,
limit=100,
sort='name:desc'
),
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
The HTTP query expects the following parameters:
Parameter | Description
---|---
offset optional | The offset for matching items. Defaults to 0.
limit optional | The maximum number of matching items to return. Defaults to 10.
sort optional | How to sort matching items; the order can be asc or desc. Natural ordering is from most recent to least recent.
filters optional | The OQL filters used for the query. Defaults to null.
Because filters is a JSON object that needs to be passed in the querystring, the rison encoding format is used.
The sort parameter is expected to be in the format {name}:{order} where:
- {name} is the field’s name to sort on.
- {order} is the sort order, either asc or desc.
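As an illustration, here is a sketch of passing a rison-encoded filters value in the querystring. The hand-written rison string below and the choice of the equals filter on start_url are assumptions made for this example, not values mandated by the API:
import requests

# OQL filter: {"field": ["start_url", "equals", "https://www.oncrawl.com"]}
# Hand-encoded rison equivalent: objects use (key:value), arrays use !(...),
# and strings with special characters are wrapped in single quotes.
filters = "(field:!(start_url,equals,'https://www.oncrawl.com'))"

response = requests.get(
    "https://app.oncrawl.com/api/v2/projects",
    params={"offset": 0, "limit": 10, "filters": filters},
    headers={'Authorization': 'Bearer {ACCESS_TOKEN}'}
).json()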
HTTP response
Example of paginated response
{
"meta": {
"offset": 0,
"limit": 10,
"total": 100,
"filters": "<OQL>",
"sort": [
[ "name", "desc" ]
]
},
"projects": [ "..." ]
}
The HTTP response always follows the same pattern:
- a key with the list of resources, whose name depends on the paginated resource.
- a meta key containing the pagination information.
The meta
key returns a JSON object that allows you to easily paginate through the resources:
Property | Description
---|---
offset | The offset used for the query. Defaults to 0.
limit | The limit used for the query. Defaults to 10.
total | The total number of matching items.
sort | The sort used for the query. Defaults to null.
filters | The OQL filters used for the query. Defaults to {}.
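Putting the request parameters and the meta block together, a minimal sketch (assuming the requests library) for walking through every page of a paginated listing such as the projects endpoint could look like this:
import requests

def iter_paginated(url, resource_key, access_token, limit=100):
    # Walk a paginated listing by advancing the offset until the reported total is reached.
    offset = 0
    while True:
        page = requests.get(
            url,
            params={"offset": offset, "limit": limit},
            headers={"Authorization": "Bearer " + access_token},
        ).json()
        # The resource key depends on the endpoint ("projects", "crawls", ...).
        yield from page[resource_key]
        offset += page["meta"]["limit"]
        if offset >= page["meta"]["total"]:
            break

for project in iter_paginated("https://app.oncrawl.com/api/v2/projects", "projects", "{ACCESS_TOKEN}"):
    print(project["name"])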
Data API
The Data API allows you to explore, aggregate and export your data.
There are 3 main sources:
- Crawl Report
- Crawl over Crawl
- Log monitoring
Each source can have one or several data types behind it.
Data types
For Crawl Reports:
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
response = requests.get("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
For Crawl over Crawl:
curl "https://app.oncrawl.com/api/v2/data/crawl_over_crawl/<coc_id>/<data_type>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
response = requests.get("https://app.oncrawl.com/api/v2/data/crawl_over_crawl/<coc_id>/<data_type>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
For Log Monitoring (events):
curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
response = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
For Log Monitoring (pages):
curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/<granularity>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
response = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/<granularity>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
A data type is the nature of the objects you are exploring; each data type has its own schema and purpose.
Source | Data type | Description
---|---|---
Crawl report | pages | List of crawled pages of your website.
Crawl report | links | List of all links of your website.
Crawl report | clusters | List of duplicate clusters of your website.
Crawl report | structured_data | List of structured data of your website.
Crawl over Crawl | pages | List of compared pages.
Logs monitoring | pages | List of all URLs.
Logs monitoring | events | List of all events.
- Pages: Represents an HTML page of the website.
- Links: Represents a link between two pages. Example: an ‘href’ link to another page.
- Clusters: Represents a cluster of pages that are considered similar. A cluster has a size and an average similarity ratio.
- Structured data: Represents a structured data item found on a page. Supported formats are JSON-LD, RDFa and microdata.
- Events: Represents a single line of a log file. Available only in logs monitoring.
Data granularity
A granularity, only available for pages in log monitoring, defines how the metrics will be aggregated for a page.
- days: Data will be aggregated by days. A day field will be available with the format YYYY-MM-DD.
- weeks: Data will be aggregated by weeks. A week field will be available with the format YYYY-[W]WW. A week may start on Monday or Sunday depending on the project’s configuration.
- months: Data will be aggregated by months. A month field will be available with the format YYYY-MM.
You can find more information on what is available using the /metadata endpoint.
HTTP Request
Example of HTTP request
curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
fields = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
# For Logs Monitoring
GET /api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata
HTTP Response
Example of HTTP response
{
"bot_kinds": [
"seo",
"vertical"
],
"dates": [
{
"from": "2018-09-21",
"granularity": "days",
"to": "2019-11-18"
},
{
"from": "2018-09-16",
"granularity": "weeks",
"to": "2019-11-16"
},
{
"from": "2018-06-01",
"granularity": "months",
"to": "2019-10-31"
}
],
"search_engines": [
"google"
],
"week_definition": "sunday_start"
}
Property | Description
---|---
bot_kinds | Bot kind can be seo, sea or vertical.
dates | List of available granularities with their min/max dates.
search_engines | Search engine can be google.
week_definition | Can be sunday_start or iso.
Data Schema
HTTP Request
Example of field’s request
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>/fields" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
fields = requests.get("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>/fields",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])
HTTP Response
Example of HTTP response
{
"fields": [{
"name": "canonical_evaluation",
"type": "enum",
"arity": "one",
"values": [
"matching",
"not_matching",
"not_set"
],
"actions": [
"has_no_value",
"not_equals",
"equals",
"has_value"
],
"agg_dimension": true,
"agg_metric_methods": [
"value_count",
"cardinality"
],
"can_display": true,
"can_filter": true,
"can_sort": false,
"user_select": true,
"category": "HTML Quality"
}, "..."]
}
Property | Description
---|---
name | The name of the field.
type | The field’s type (natural, float, hash, enum, bool, string, percentage, object, date, datetime, ratio).
arity | Whether the field is multivalued; can be one or many.
values | List of possible values for the enum type.
actions | List of possible filters for this field.
agg_dimension | true if the field can be used as a dimension in aggregate queries.
agg_metric_methods | List of available aggregation methods for this field.
can_display | true if the field can be retrieved in search or export queries.
can_filter | true if the field can be used in filter queries.
can_sort | true if the field can be sorted on in search or export queries.
category deprecated | Do not use.
user_select deprecated | Do not use.
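As a usage sketch, the fields endpoint can be used to discover which fields are filterable or can feed an aggregation; the crawl ID below is a placeholder:
import requests

fields = requests.get(
    "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/fields",
    headers={'Authorization': 'Bearer {ACCESS_TOKEN}'}
).json().get('fields', [])

# Fields that can be used in OQL filters.
filterable = [f["name"] for f in fields if f["can_filter"]]

# Fields on which an average can be computed in aggregate queries.
averageable = [f["name"] for f in fields if "avg" in f.get("agg_metric_methods", [])]

print(len(filterable), "filterable fields")
print("avg available on:", averageable)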
Search Queries
Search queries allow you to explore your data by filtering, sorting and paginating.
HTTP Request
Search for crawled pages with a 301 or 404 HTTP status code.
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"offset": 0,
"limit": 10,
"fields": [ "url", "status_code" ],
"sort": [
{ "field": "status_code", "order": "asc" }
],
"oql": {
"and":[
{"field":["fetched","equals",true]},
{"or":[
{"field":["status_code","equals",301]},
{"field":["status_code","equals",404]}
]}
]
}
}
EOF
import requests
response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"offset": 0,
"limit": 10,
"fields": [ "url", "status_code" ],
"sort": [
{ "field": "status_code", "order": "asc" }
],
"oql": {
"and":[
{"field":["fetched","equals",true]},
{"or":[
{"field":["status_code","equals",301]},
{"field":["status_code","equals",404]}
]}
]}
}
}
).json()
The HTTP request expects a JSON object as its payload with the following properties:
Property | Description
---|---
limit optional | Maximum number of matching results to return.
offset optional | An offset for the returned matching results.
oql optional | An Oncrawl Query Language object.
fields optional | List of fields to retrieve for each matching result.
sort optional | Ordering of the returned matching results.
The sort parameter is expected to be an array of objects, each with a field key and an order key, where:
- field is the field’s name to sort on.
- order is the sort order, either asc or desc.
HTTP response
{
"meta": {
"columns": [
"url",
"inrank",
"status_code",
"meta_robots",
"fetched"
],
"total_hits": 1,
"total_pages": 1
},
"oql": {
"and": [
{ "field": [ "fetched", "equals", true ] },
{
"or": [
{ "field": [ "status_code", "equals", 301 ] },
{ "field": [ "status_code", "equals", 404 ] }
]
}
]
},
"urls": [
{
"fetched": true,
"inrank": 8,
"meta_robots": null,
"status_code": 301,
"url": "http://www.website.com/redirect/"
}
]
}
The response will be a JSON object with an urls key, an oql key and a meta key.
The urls key contains an array of matching results.
The oql key contains the Oncrawl Query Language object used for filtering.
The meta key contains the following properties:
Property | Description
---|---
columns | List of returned fields. They are the keys used in the urls objects.
total_hits | Total number of matching results.
total_pages deprecated | Total number of pages according to limit and total_hits.
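A sketch of paging through search results by combining offset, limit and meta.total_hits is shown below; for large result sets, the export queries described later avoid the 10K items limitation:
import requests

url = "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages"
headers = {'Authorization': 'Bearer {ACCESS_TOKEN}'}
query = {
    "offset": 0,
    "limit": 100,
    "fields": ["url", "status_code"],
    "oql": {"field": ["fetched", "equals", True]}
}

while True:
    page = requests.post(url, headers=headers, json=query).json()
    for item in page["urls"]:
        print(item["status_code"], item["url"])
    query["offset"] += query["limit"]
    if query["offset"] >= page["meta"]["total_hits"]:
        break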
Aggregate Queries
Average load time of crawled pages
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"aggs": [{
"oql": {
"field": ["fetched", "equals", "true"]
},
"value": "load_time:avg"
}]
}
EOF
import requests
response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"aggs": [{
"oql": {
"field": ["fetched", "equals", "true"]
},
"value": "load_time:avg"
}]
}
).json()
The returned JSON looks like:
{
"aggs": [
{
"cols": [
"load_time:avg"
],
"rows": [
[
183.41091954022988
]
]
}
]
}
HTTP Request
This HTTP endpoint expects a JSON object as its payload, with a single aggs key and an array of aggregate queries as its value.
An aggregate query is an object with the following properties:
Property | Description
---|---
oql optional | An Oncrawl Query Language object to match a set of items. By default it will match all items.
fields optional | Specify how to create buckets of matching items.
value optional | Specify how to aggregate matching items. By default it will return the number of matching items.
How to aggregate items
By default an aggregate request returns the count, but you can also perform a different aggregation using the value parameter.
The expected format is <field_name>:<aggregation_type>.
For example:
- inrank:avg returns the average Inrank.
- weight:sum returns the sum of all weights.
Not all fields can be aggregated, and not all aggregations are available on all fields.
To know which aggregations are available on a field you can check the agg_metric_methods value returned by the Data Schema endpoint.
The available methods are:
- min: Returns the minimal value for this field.
- max: Returns the maximal value for this field.
- avg: Returns the average value for this field.
- sum: Returns the sum of all the values for this field.
- value_count: Returns how many items have a value for this field.
- cardinality: Returns the number of different values for this field.
How to create simple buckets
Average inrank by depth
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"aggs": [{
"fields": [{
"name": "depth"
}],
"value": "inrank:avg"
}]
}
EOF
import requests
response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"aggs": [{
"fields": [{
"name": "depth"
}],
"value": "inrank:avg"
}]
}
).json()
Pages count by range of inlinks
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"aggs": [{
"fields": [{
"name": "nb_inlinks_range",
"ranges": [
{
"name": "under_10",
"to": 10
},
{
"name": "10_50",
"from": 10,
"to": 51
},
{
"name": "more_50",
"from": 51
}
]
}]
}]
}
EOF
import requests
response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"aggs": [{
"fields": [{
"name": "nb_inlinks_range",
"ranges": [
{
"name": "under_10",
"to": 10
},
{
"name": "10_50",
"from": 10,
"to": 51
},
{
"name": "more_50",
"from": 51
}
]
}]
}]
}
).json()
When performing an aggregation, you can create buckets for your matching items using the fields parameter, which takes an array of JSON objects.
The simplest way is to use the field’s name like so: {"name": "field_name"}.
It returns the item count for each distinct value of field_name.
But not all fields can be used to create a bucket.
To know which fields are available as a bucket you can check the agg_dimension
value returned by the Data Schema endpoint.
How to create ranges buckets
If the field_name returns too many different values, it could be useful to group them as ranges.
To do so you can add a ranges key that takes an array of ranges. A range is a JSON object with the following expected keys:
Property | Description
---|---
name required | The name that will be returned in the JSON response for this range.
from optional | The lower bound (inclusive) of this range.
to optional | The upper bound (exclusive) of this range.
Only numeric fields can be used with ranges buckets.
Export Queries
Export all pages from the structure.
curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF > my_export.csv
{
"fields": ["url"],
"oql": {
"field":["depth","has_value", ""]
}
}
EOF
import requests
response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"fields": ["url"],
"oql": {
"field":["depth","has_value", ""]
}
}
)
An export query allows you to save the result of your search query as a CSV file.
It does not suffer from the 10K items limitation and allows you to export all of the matching results.
To export the result of your search query as CSV, simply add ?export=true to the URL.
Property | Description
---|---
file_type optional | Can be csv or json (exported as JSONL). Defaults to csv.
HTTP response
The response of the query will be a streamed csv file.
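Since the response is streamed, a practical sketch in Python is to download it in chunks rather than load it into memory; the output file name is arbitrary:
import requests

response = requests.post(
    "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true",
    headers={'Authorization': 'Bearer {ACCESS_TOKEN}'},
    json={
        "fields": ["url"],
        "oql": {"field": ["depth", "has_value", ""]}
    },
    stream=True
)

# Write the streamed CSV to disk chunk by chunk.
with open("my_export.csv", "wb") as output:
    for chunk in response.iter_content(chunk_size=8192):
        output.write(chunk)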
Projects API
The Projects API allows you to manage all your projects and your crawls.
With this API you can, for example:
- Launch a new crawl
- Pilot a crawl’s state (pause, resume, cancel)
- List your projects
- Create a new project
- Check the progression of your crawl
Projects
List projects
Get list of projects.
curl "https://app.oncrawl.com/api/v2/projects" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
projects = requests.get("https://app.oncrawl.com/api/v2/projects",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The projects can be paginated and filtered using the parameters described in the pagination section.
The fields available for the sort
and filters
are:
Property | Description
---|---
id | The project ID.
name | The project’s name.
start_url | The project’s start URL.
features | The project’s enabled features.
HTTP Response
{
"meta":{
"filters":{},
"limit":100,
"offset":0,
"sort":null,
"total":1
},
"projects": [
"<Project Object>",
"<Project Object>"
]
}
A JSON object with a meta key, described in the pagination section, and a projects key with the list of projects.
Get a project
Get a project.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
project = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Response
{
"project": {
"id": "592c1e1cf2c3a42743d14350",
"name": "Oncrawl",
"start_url": "http://www.oncrawl.com/",
"user_id": "54dce0f264b65e1eef3ef61b",
"is_verified_by": "google_analytics",
"domain": "oncrawl.com",
"features": [
"at_internet",
"google_search_console"
],
"last_crawl_created_at": 1522330515000,
"last_crawl_id": "5abceb9303d27a70f93151cb",
"limits": {
"max_custom_dashboard_count": null,
"max_group_count": null,
"max_segmentation_count": null,
"max_speed": 100
},
"log_monitoring_data_ready": true,
"log_monitoring_processing_enabled": true,
"log_monitoring_ready": true,
"crawl_config_ids": [
"5aa80a1303d27a729113bb2d"
],
"crawl_ids": [
"5abceb9303d27a70f93151cb"
],
"crawl_over_crawl_ids": [
"5abcf43203d27a1ecf100b2c"
]
},
"crawl_configs": [
"<CrawlConfig Object>"
],
"crawls": [
"<Crawl Object>"
]
}
The HTTP response is a JSON object with three keys:
- A project key with the project’s data.
- A crawl_configs key with the list of all the project’s crawl configurations.
- A crawls key with the list of all the project’s crawls.
The project’s properties are:
Property | Description
---|---
id | The project ID.
name | The project’s name.
start_url | The project’s start URL.
user_id | The ID of the project’s owner.
is_verified_by | Holds how the project’s ownership was verified. Can be google_analytics, google_search_console, admin or null.
domain | The start URL’s domain.
features | List of the project’s enabled features.
last_crawl_id | The ID of the latest created crawl.
last_crawl_created_at | UTC timestamp of the latest created crawl, in milliseconds. Defaults to null.
limits | An object with customized limits for this project.
log_monitoring_data_ready | true if the project’s log monitoring index is ready to be searched.
log_monitoring_processing_enabled | true if the project’s files for the log monitoring are automatically processed.
log_monitoring_ready | true if the project’s log monitoring configuration was submitted.
crawl_over_crawl_ids | The list of Crawl over Crawl IDs attached to this project.
crawl_ids | The list of crawl IDs for this project.
crawl_config_ids | The list of crawl configuration IDs for this project.
Create a project
Create a project.
curl -X POST "https://app.oncrawl.com/api/v2/projects" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"project": {
"name": "Project name",
"start_url": "https://www.oncrawl.com"
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/projects", json={
"project": {
"name": "Project name",
"start_url": "https://www.oncrawl.com"
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
Property | Description
---|---
name required | The project’s name, must be unique.
start_url required | The project’s start URL, starting with http:// or https://.
HTTP Response
Examples of HTTP response
{
"project": "<Project Object>"
}
An HTTP 200
status code is returned with the created project returned directly as the response within a project
key.
Delete a project
Delete a project.
curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
No HTTP parameters.
HTTP Response
Returns an HTTP 204
status code if successful.
Scheduling
Crawl scheduling allows you to start your crawl at a later date, run it automatically on a periodic basis, or both.
Schedule your crawls to run every week or every month and never think about it again.
List scheduled crawls
Get list of scheduled crawls.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
projects = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The scheduled crawls can be paginated using the parameters described in the pagination section.
There are no sort or filters available.
HTTP Response
{
"meta":{
"filters": {},
"limit":50,
"offset":0,
"sort": null,
"total":1
},
"scheduled_crawls": [
{
"config_id":"59f3048cc87b4428618d7c44",
"id":"5abdeb0f03d27a69ef169c52",
"project_id":"592c1e1cf2c3a42743d14350",
"recurrence":"week",
"start_date":1522482300000
}
]
}
A JSON object with a meta key, described in the pagination section, and a scheduled_crawls key with the list of scheduled crawls for this project.
Create a scheduled crawl
HTTP request
Create a scheduled crawl.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"scheduled_crawl": {
"config_id": "59f3048cc87b4428618d7c49",
"recurrence": "week",
"start_date": 1522482300000
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/projects", json={
"scheduled_crawl": {
"config_id": "59f3048cc87b4428618d7c49",
"recurrence": "week",
"start_date": 1522482300000
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
The request is expected to be a JSON object with a scheduled_crawl
key and the following properties:
Property | Description
---|---
config_id required | The ID of the crawl configuration to schedule.
recurrence optional | Can be day, week, 2weeks or month.
start_date required | A UTC timestamp in milliseconds for when to start the first crawl.
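Because start_date is a UTC timestamp in milliseconds, it can be convenient to compute it with the standard library; the date below is only an example:
from datetime import datetime, timezone

# Schedule the first crawl for 2024-06-01 at 03:00 UTC (example date).
start = datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc)
start_date = int(start.timestamp() * 1000)

scheduled_crawl = {
    "config_id": "59f3048cc87b4428618d7c49",
    "recurrence": "week",
    "start_date": start_date
}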
HTTP Response
Examples of HTTP response
{
"scheduled_crawl":{
"config_id":"59f3048cc87b4428618d7c29",
"id":"5abdeb0f03d27a69ef169c53",
"project_id":"592c1e1cf2c3a42743d14350",
"recurrence":"week",
"start_date":1522482300000
}
}
An HTTP 200
status code is returned with the created scheduled crawl returned directly as the response within a scheduled_crawl
key.
Delete a scheduled crawl
Delete a scheduled crawl.
curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls/<scheduled_crawl_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls/<scheduled_crawl_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
No HTTP parameters.
HTTP Response
Returns an HTTP 204
status code if successful.
Crawls
Launch a crawl
Launch a crawl.
curl -X POST "https://app.oncrawl.com/api/v2/projects/<project_id>/launch-crawl?configId=<crawl_config_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/launch-crawl?configId=<crawl_config_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
You have to pass a configId
parameter in the query string with the ID of the crawl configuration you want to launch.
HTTP Response
Example of HTTP response
{
"crawl": "<Crawl Object>"
}
Returns an HTTP 200
status code if successful with the created crawl returned directly as the response within a crawl
key.
List crawls
Get list of crawls.
curl "https://app.oncrawl.com/api/v2/crawls" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
crawls = requests.get("https://app.oncrawl.com/api/v2/crawls",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The crawls can be paginated and filtered using the parameters described in the pagination section.
The fields available for the sort
and filters
are:
Property | Description
---|---
id | The crawl’s ID.
user_id | The crawl’s owner ID.
project_id | The crawl’s project ID.
status | The crawl’s status. Can be running, done, cancelled, terminating, pausing, paused, archiving, unarchiving, archived.
created_at | The crawl’s creation date as a UTC timestamp in milliseconds.
HTTP Response
{
"meta":{
"filters":{},
"limit":100,
"offset":0,
"sort":null,
"total":1
},
"crawls": [
"<Crawl Object>",
"<Crawl Object>"
]
}
A JSON object with a meta key, described in the pagination section, and a crawls key with the list of crawls.
Get a crawl
Get a crawl.
curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
project = requests.get("https://app.oncrawl.com/api/v2/crawls/<crawl_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Response
{
"crawl": {
"id":"5a57819903d27a7faa253683",
"project_id":"592c1e1cf2c3a42743d14341",
"user_id":"54dce0f264b65e1eef3ef61b",
"link_status":"live",
"status":"done",
"created_at":1515684249000,
"ended_at":1515685455000,
"fetched_urls":10,
"end_reason":"max_url_reached",
"features":[
"at_internet"
],
"crawl_config": "<CrawlConfig Object>",
"cross_analysis_access_logs": null,
"cross_analysis_at_internet": {
"dates":{
"from":"2017-11-26",
"to":"2018-01-10"
}
},
"cross_analysis_google_analytics": {
"error": "No quota remaining."
},
"cross_analysis_majestic_back_links": {
"stores":[
{
"name":"www.oncrawl.com",
"success": true,
"sync_date":"2017-10-27"
}
],
"tld":{
"citation_flow":35,
"name":"oncrawl.com",
"trust_flow":29
}
}
}
}
The HTTP response is a JSON object with a single crawl key containing the crawl’s data.
The crawl’s properties are:
Property | Description
---|---
id | The crawl ID.
project_id | The crawl’s project ID.
user_id | The crawl’s owner ID.
link_status | The links index status. Can be live or archived.
status | The crawl’s status. Can be running, done, cancelled, terminating, pausing, paused, archiving, unarchiving, archived.
created_at | Date of the crawl creation as a UTC timestamp in milliseconds.
ended_at | Date of the crawl termination as a UTC timestamp in milliseconds.
fetched_urls | Number of URLs that were fetched for this crawl.
last_depth | At what depth the crawl ended.
end_reason | A code describing why the crawl stopped. This value may not be present.
features | List of features available for this crawl.
crawl_config | The crawl configuration object used for this crawl.
cross_analysis_access_logs | Dates used by the Logs monitoring cross analysis. null if no cross analysis was done.
cross_analysis_at_internet | Dates used by the AT Internet cross analysis. null if no cross analysis was done.
cross_analysis_google_analytics | Dates used by the Google Analytics cross analysis. null if no cross analysis was done.
cross_analysis_majestic_back_links | Majestic cross analysis metadata. null if no cross analysis is available.
End reasons
- ok: All the URLs of the structure have been crawled.
- crawl_already_running: A crawl with the same configuration was already running.
- quota_reached_before_start: A scheduled crawl could not run because of missing quota.
- quota_reached: The URL quota was reached during the crawl.
- max_url_reached: The maximum number of URLs defined in the crawl configuration was reached.
- max_depth_reached: The maximum depth defined in the crawl configuration was reached.
- user_cancelled: The crawl was manually cancelled.
- user_requested: The crawl was manually terminated and a partial crawl report was produced.
- no_active_subscription: No active subscription was available.
- stopped_progressing: Technical end reason: at the end of the crawl there are still unfetched URLs, but for some reason the crawler is unable to fetch them. To prevent the crawler from iterating indefinitely, we abort the fetch phase when, after three attempts, it still has not managed to crawl those pages.
- max_iteration_reached: Technical end reason: the crawl progressed abnormally slowly. It can happen, for example, when the website server is very busy and randomly drops connections. We abort the fetch phase after 500 iterations when we detect this pathological server behavior.
Get a crawl progress
Get a crawl progress
curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
project = requests.get("https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
You can call this endpoint for a running crawl in order to follow its progression.
It allows you, for example, to monitor whether the crawler encounters an abnormal number of errors.
HTTP request
This endpoint takes no parameters.
HTTP Response
{
"progress": {
"crawler": {
"by_depth": [
{
"depth": 1,
"fetched_2xx": 1,
"fetched_3xx": 1,
"fetched_4xx": 0,
"fetched_5xx": 0,
"unfetched_exception": 0,
"unfetched_robots_denied": 0
},
{
"depth": 2,
"fetched_2xx": 912,
"fetched_3xx": 117,
"fetched_4xx": 0,
"fetched_5xx": 0,
"unfetched_exception": 0,
"unfetched_robots_denied": 161
}
],
"counts": {
"fetched_2xx": 3,
"fetched_3xx": 118,
"fetched_4xx": 0,
"fetched_5xx": 0,
"queued_urls": 500,
"unfetched_exception": 0,
"unfetched_robots_denied": 161
},
"depth": 6,
"samples": [
{
"error_code": "robots_denied",
"fetch_date": "2022-08-20T06:58:26Z",
"fetch_status": "unfetched_robots_denied",
"url": "https://www.oncrawl.com/error"
},
{
"fetch_date": "2022-08-20T06:58:14Z",
"fetch_duration": 733,
"fetch_status": "fetched_2xx",
"status_code": 200,
"url": "https://www.oncrawl.com/"
}
],
"status": "done"
},
"status": "running",
"steps": [
{
"name": "fetch",
"status": "done"
},
{
"jobs": [
{
"name": "google_search_console",
"status": "done"
}
],
"name": "connectors",
"status": "done"
},
{
"jobs": [
{
"name": "parse",
"status": "running"
},
{
"name": "inlinks",
"status": "done"
},
{
"name": "sitemaps",
"status": "done"
},
{
"name": "outlinks",
"status": "done"
},
{
"name": "redirect",
"status": "done"
},
{
"name": "scoring",
"status": "done"
},
{
"name": "top_ngrams",
"status": "waiting"
},
{
"name": "hreflang",
"status": "waiting"
},
{
"name": "duplicate_description",
"status": "waiting"
},
{
"name": "duplicate_title",
"status": "waiting"
},
{
"name": "duplicate_h1",
"status": "waiting"
},
{
"name": "duplicate_simhash",
"status": "waiting"
},
{
"name": "cluster_similarities",
"status": "waiting"
}
],
"name": "analysis",
"status": "running"
},
{
"name": "cross_analysis",
"status": "waiting"
},
{
"name": "export",
"status": "waiting"
}
]
},
"timestamp": 1661159435000,
"version": 3
}
The HTTP response is a JSON object with a progress key containing the crawl’s progression.
The properties are:
Property | Description
---|---
status | The crawl’s status.
crawler.counts | Crawler’s fetch progression.
crawler.by_depth | A detailed progression per depth.
crawler.depth | The current crawler’s depth.
crawler.status | Crawler’s fetch status.
crawler.samples | A list of URL samples per status. It varies during the crawl and may not have a sample for every status.
steps | A detailed progression per step.
Fetch statuses
- fetched_2xx: Status code between 200 and 299.
- fetched_3xx: Status code between 300 and 399.
- fetched_4xx: Status code between 400 and 499.
- fetched_5xx: Status code between 500 and 599.
- unfetched_robots_denied: URL access denied by robots.txt.
- unfetched_exception: Unable to fetch a URL (e.g. a server timeout).
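As a usage sketch, the progress endpoint can be polled until the crawl leaves the running state; the polling interval is arbitrary:
import time
import requests

def wait_for_crawl(crawl_id, access_token, poll_seconds=60):
    # Poll the progress endpoint and return the final progress object.
    url = "https://app.oncrawl.com/api/v2/crawls/{}/progress".format(crawl_id)
    headers = {'Authorization': 'Bearer ' + access_token}
    while True:
        progress = requests.get(url, headers=headers).json()["progress"]
        counts = progress["crawler"]["counts"]
        print("status:", progress["status"],
              "fetched 2xx:", counts["fetched_2xx"],
              "errors:", counts["unfetched_exception"])
        if progress["status"] != "running":
            return progress
        time.sleep(poll_seconds)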
Update crawl state
HTTP request
Pause a running crawl
curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/pilot" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"command": "pause"
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/crawls/<crawl_id>/pilot", json={
"command": "pause"
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
You have to pass a JSON object with a command
key and the desired state.
The crawl’s commands are:
Command | Description
---|---
cancel | Cancel the crawl. It won’t produce a report. Crawl must be running or paused.
resume | Resume a paused crawl. Crawl must be paused.
pause | Pause a crawl. Crawl must be running.
terminate | Terminate a crawl early. It will produce a report. Crawl must be running or paused.
unarchive | Un-archive all the crawl’s data. Crawl must be archived or link_status must be archived.
unarchive-fast | Un-archive the crawl’s data except links. Crawl must be archived.
HTTP Response
Example of HTTP response
{
"crawl": "<Crawl Object>"
}
Returns an HTTP 200
status code if successful with the updated crawl returned directly as the response within a crawl
key.
Delete a crawl
Delete a crawl.
curl -X DELETE "https://app.oncrawl.com/api/v2/crawls/<crawl_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.delete("https://app.oncrawl.com/api/v2/crawls/<crawl_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
No HTTP parameters.
HTTP Response
Returns an HTTP 204
status code if successful.
Crawls Configurations
List configurations
Get list of crawl configurations.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
crawl_configs = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The endpoint does not take any parameter.
HTTP Response
{
"crawl_configs": [
"<CrawlConfig Object>",
"<CrawlConfig Object>"
]
}
A JSON object with a crawl_configs key containing the list of crawl configurations.
Get a configuration
Get a configuration.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
crawl_config = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Response
{
"crawl_config": {
"agent_kind":"web",
"ajax_crawling":false,
"allow_query_params":true,
"alternate_start_urls": [],
"at_internet_params": {},
"crawl_subdomains":false,
"custom_fields":[],
"dns": [],
"extra_headers": {},
"filter_query_params":false,
"google_analytics_params":{},
"google_search_console_params":{},
"http_auth":{ },
"id":"592c1f53973cb53b75287a79",
"js_rendering":false,
"majestic_params":{},
"max_depth":15,
"max_speed":10,
"max_url":2000000,
"name":"default",
"notifications": {
"email_recipients": [],
"custom_webhooks": []
},
"query_params_list":"",
"resource_checker":false,
"reuse_cookies":false,
"robots_txt":[],
"scheduling_period":null,
"scheduling_start_date":null,
"scheduling_timezone":"Europe/Paris",
"sitemaps":[],
"start_url":"http://www.oncrawl.com/",
"strict_sitemaps":true,
"trigger_coc":false,
"use_cookies":true,
"use_proxy":false,
"user_agent":"Oncrawl",
"user_input_files":[],
"watched_resources":[],
"whitelist_params_mode":true
}
}
The HTTP response is a JSON object with the crawl configuration inside a crawl_config key.
The crawl configuration’s base properties are:
Property | Description
---|---
agent_kind | The type of user agent. Values are web or mobile.
ajax_crawling | true if the website should be crawled as a pre-rendered JavaScript website, false otherwise.
allow_query_params | true if the crawler should follow URLs with query parameters, false otherwise.
alternate_start_urls | List of alternate start URLs. All those URLs will start with a depth of 1. They must all belong to the same domain.
at_internet_params | Configuration for the AT Internet cross analysis. The AT Internet cross analysis feature is required.
crawl_subdomains | true if the crawler should follow links of all the subdomains. Example: http://blog.domain.com for http://www.domain.com.
custom_fields | Configuration for custom fields scraping. The Data Scraping feature is required.
dns | Override the crawler’s default DNS.
extra_headers | Defines additional headers for the HTTP requests done by the crawler.
filter_query_params | true if the query string of URLs should be stripped.
google_analytics_params | Configuration for the Google Analytics cross analysis. The Google Analytics cross analysis feature is required.
google_search_console_params | Configuration for the Google Search Console cross analysis. The Google Search Console cross analysis feature is required.
http_auth | Configuration for the HTTP authentication of the crawler.
id | The ID of this crawl configuration.
js_rendering | true if the crawler should render the crawled pages using JavaScript. The Crawl JS feature is required.
majestic_params | Configuration for the Majestic Back-links cross analysis. The Majestic Back-Links feature is required.
max_depth | The maximum depth after which the crawler will stop following links.
max_speed | The maximum speed at which the crawler should go, in URLs per second. Valid values are 0.1, 0.2, 0.5, 1, 2, 5, then every multiple of 5 up to your maximum allowed crawl speed. To crawl above 1 URL/s you need to verify the ownership of the project.
max_url | The maximum number of fetched URLs after which the crawler will stop.
name | The name of the configuration. Only used as a label to easily identify it.
notifications | The notification channels for crawls that ended or failed to start. By default the owner of the workspace will receive the notifications.
query_params_list | If filter_query_params is true, this is a comma-separated list of query parameter names to filter. The whitelist_params_mode parameter defines how to filter them.
resource_checker | true if the crawler should watch for requested resources during the crawl, false otherwise. This feature requires js_rendering:true.
reuse_cookies deprecated | Not used anymore.
robots_txt | List of configured virtual robots.txt. The project’s ownership must be verified to use this option.
scheduling_period deprecated | Not used anymore.
scheduling_start_date deprecated | Not used anymore.
scheduling_timezone deprecated | Not used anymore.
sitemaps | List of sitemap URLs.
start_url | The start URL of the crawl. This URL should not be a redirection to another URL.
strict_sitemaps | true if the crawler should strictly follow the sitemaps protocol, false otherwise.
trigger_coc | true if the crawler should automatically generate a Crawl over Crawl at the end. The Crawl over Crawl feature is required.
use_cookies | true if the crawler should keep the cookies returned by the server between requests, false otherwise.
use_proxy | true if the crawler should use the Oncrawl proxy, which allows it to keep a static range of IP addresses during its crawl.
user_agent | Name of the crawler; this name will appear in the user agent sent by the crawler.
user_input_files | List of ingested data file IDs to use in this crawl. The Data Ingestion feature is required.
watched_resources | List of patterns to watch if resource_checker is set to true.
webhooks deprecated | List of webhooks V1 to call at the end of the crawl.
whitelist_params_mode | true if the query_params_list should be used as a whitelist, false if it should be used as a blacklist.
Create a configuration
Create a crawl configuration.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"crawl_config": {
"name": "New crawl configuration",
"start_url": "https://www.oncrawl.com",
"user_agent": "Oncrawl",
"max_speed": 1
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
"crawl_config": {
"name": "New crawl configuration",
"start_url": "https://www.oncrawl.com",
"user_agent": "Oncrawl",
"max_speed": 1
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
The expected HTTP request is exactly the same format as the response when you retrieve a crawl configuration.
The id
is automatically generated by the API for any new crawl configuration and must not be part of the payload.
The only required fields are name
, start_url
, user_agent
and max_speed
.
HTTP Response
Examples of HTTP response
{
"crawl_config": "<CrawlConfig Object>"
}
An HTTP 200
status code is returned with the created crawl configuration returned directly as the response within a crawl_config
key.
AT Internet
{
"at_internet_params": {
"api_key": "YOUR_API_KEY",
"site_id": "YOUR_SITE_ID"
}
}
A subscription with the AT Internet
feature is required to use this configuration.
You can request an API Key in API Accounts within the settings area of your AT Internet homepage.
This API Key is necessary to allow Oncrawl to access your AT Internet data.
The site_id specifies from which site we should collect the data.
The HTTP requests that you need to whitelist are:
Note: You must replace the {site_id}
of both URLs with the actual site ID.
Without this, Oncrawl won’t be able to fetch the data.
Google Analytics
{
"google_analytics_params": {
"email": "local@domain.com",
"account_id": "12345678",
"website_id": "UA-12345678-9",
"profile_id": "12345678"
}
}
A subscription with the Google Analytics
feature is required to use this configuration.
You have to provide the following properties:
Property | Description
---|---
email | Email of your Google account.
account_id | ID of your Google Analytics account.
website_id | ID of your website in Google Analytics.
profile_id | ID of the website’s profile to use for cross analysis.
To use a Google account, you must first give Oncrawl access to your analytics data using OAuth2.
For now you must use the Oncrawl web client to add your Google account.
Google Search Console
{
"google_search_console_params": {
"email": "local@domain.com",
"websites": [
"https://www.oncrawl.com"
],
"branded_keywords": [
"oncrawl",
"on crawl",
"oncrowl"
]
}
}
A subscription with the Google Search Console
feature is required to use this configuration.
You have to provide the following properties:
Property | Description
---|---
email | Email of your Google account.
websites | List of the website URLs from your Google Search Console to use.
branded_keywords | List of keywords that the crawler should consider as part of the brand.
To use a Google account, you must first give Oncrawl access to your analytics data using OAuth2.
For now you must use the Oncrawl web client to add your Google account.
Majestic
{
"majestic_params": {
"access_token": "ABCDEF1234"
}
}
A subscription with the Majestic
feature is required to use this configuration.
You have to provide the following properties:
Property | Description
---|---
access_token | An access token that the crawler can use to access your data.
You can create an access token authorizing Oncrawl to access your Majestic data here.
Custom fields
Documentation not available yet.
Notifications
{
"notifications": {
"email_recipients": ["<email1>", "<email2>"],
"custom_webhooks": [{
"url": "<webhook1_url>"
}, {
"url": "<webhook2_url>",
"secret": "<webhook_secret>"
}]
}
}
A notification is sent when:
- A crawl for that crawl configuration has ended.
- A scheduled crawl for that crawl configuration failed to start.
The supported notification channels are:
- Emails
- Custom HTTP Webhooks
Emails
You can configure up to 10 recipients; if an empty list is provided, no emails are sent.
If you do not provide a notifications.email_recipients configuration, an email is sent to the workspace owner by default.
Verify webhook payload signature
import hashlib
import hmac
def verify_signature(webhook_payload, webhook_secret, signature_header):
    if not signature_header:
        raise Exception("signature header is missing")
    hash_object = hmac.new(webhook_secret.encode('utf-8'), msg=webhook_payload, digestmod=hashlib.sha256)
    if not hmac.compare_digest(hash_object.hexdigest(), signature_header):
        raise Exception("Signatures didn't match")
Test a custom webhook endpoint
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/validate_custom_webhook" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-X POST \
-d @- <<EOF
{
"url": "<webhook_url>",
"secret": "<webhook_secret>"
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/validate_custom_webhook", json={
"url": "<webhook_url>",
"secret": "<webhook_secret>"
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
Custom HTTP Webhooks
You can configure up to 10 custom HTTP webhooks.
The URLs:
- must be in HTTPS
- must be unique within
custom_webhooks
To protect your endpoint and verify that the request is coming from Oncrawl, you can specify a secret with the webhook URL. This
secret can be any value between 8 and 100 characters.
When a secret is set, an HTTP header X-Oncrawl-Webhook-Signature is sent with the payload so you can verify the authenticity of the request.
The API never returns the webhook’s secret in the crawl config; it only indicates whether a secret is configured through secret_enabled.
To remove the secret associated with the webhook, you have to pass null as the value for secret.
If you change the webhook URL itself, any previously associated secret will be removed.
To test your webhook endpoint you can use the /validate_custom_webhook
endpoint.
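For instance, the verify_signature helper above could be used like this when your endpoint receives a notification; the raw request body and the header value come from whatever web framework you use, so they appear here as plain function parameters:
# raw_body is the request body as bytes, exactly as received;
# signature is the value of the X-Oncrawl-Webhook-Signature header.
def handle_notification(raw_body, signature, webhook_secret):
    try:
        verify_signature(raw_body, webhook_secret, signature)
    except Exception:
        return False  # reject: the request does not come from Oncrawl
    # Safe to process the payload here.
    return True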
DNS
{
"dns": [{
"host": "www.oncrawl.com",
"ips": [ "82.34.10.20", "82.34.10.21" ]
}, {
"host": "fr.oncrawl.com",
"ips": [ "82.34.10.20" ]
}]
}
The dns configuration allows you to resolve one or several domains to a different IP address than they would normally resolve to.
This can be useful to crawl a website in pre-production as if it were already deployed on the real domain.
Extra HTTP headers
{
"extra_headers": {
"Cookie": "lang=fr;",
"X-My-Token": "1234"
}
}
The extra_headers configuration allows you to inject custom HTTP headers into each of the crawl’s HTTP requests.
HTTP Authentication
{
"http_auth": {
"username": "user",
"password": "1234",
"scheme": "Digest",
"realm": null
}
}
The http_auth configuration allows you to crawl sites behind authentication.
It can be useful to crawl a website in pre-production that is password protected before its release.
Property | Description
---|---
username required | Username to authenticate with.
password required | Password to authenticate with.
scheme required | How to authenticate. Available values are Basic, Digest and NTLM.
realm optional | The authentication realm. For NTLM this corresponds to the domain.
Robots.txt
{
"robots_txt": [{
"host": "www.oncrawl.com",
"content": "CONTENT OF YOUR ROBOTS.TXT"
}]
}
The robots_txt configuration allows you to override, for a given host, its robots.txt.
It can be used to:
- allow the crawler to crawl pages normally denied by the robots.txt.
- override a Crawl-Delay directive to allow the crawler to crawl faster.
Because you can make the crawler ignore the robots.txt of a website, it is necessary to verify the ownership of the project to use this feature.
For now you can only verify the ownership using the Oncrawl application.
Update a configuration
Update a crawl configuration.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-X PUT \
-d @- <<EOF
{
"crawl_config": {
"name": "New crawl configuration",
"start_url": "https://www.oncrawl.com",
"user_agent": "Oncrawl",
"max_speed": 1
}
}
EOF
import requests
requests.put("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
"crawl_config": {
"name": "New crawl configuration",
"start_url": "https://www.oncrawl.com",
"user_agent": "Oncrawl",
"max_speed": 1
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
It takes the same parameters as a crawl configuration creation, except for the name
, which cannot be modified and must remain the same.
HTTP response
It returns the same response as a crawl configuration creation.
Delete a configuration
Delete a configuration.
curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
No HTTP parameters.
HTTP Response
Returns an HTTP 204
status code if successful.
Ingest data
Data ingestion is the process of integrating additional data for URLs in your analysis.
The ingest data API allows you to upload, delete and retrieve data files used by the Data Ingestion feature.
List ingest files
Get list of ingest files.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
projects = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The ingest files can be paginated and filtered using the parameters described in the pagination section.
The fields available for the sort
and filters
are:
Property | Description |
---|---|
id |
The file’s ID. |
name |
The file’s name. |
status |
The file’s status, can be UPLOADING , UPLOADED , PROCESSING , PROCESSED , ERROR . |
kind |
The file’s kind, can be ingest , seed . |
created_at |
The file’s creation date. |
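For example, a paginated request could look like the sketch below; the limit and offset querystring parameters follow the pagination section and the values are illustrative.
import requests
# A minimal sketch: fetch a page of ingest files using the pagination
# parameters described in the pagination section (values are illustrative).
response = requests.get(
    "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files",
    params={"limit": 50, "offset": 0},
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
)
ingest_files = response.json().get("user_files", [])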
HTTP Response
{
"meta": {
"filters": {},
"limit": 20,
"offset": 0,
"sort": null,
"total": 10
},
"user_files": [
"<Ingest File Object>",
"<Ingest File Object>"
]
}
A JSON object with a meta
, described by the pagination section and a user_files
key with the list of ingest files.
Get an ingest file
Get an ingest file.
curl "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
project = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Response
{
"user_file": {
"created_at": 1587473932000,
"detailed_invalid_lines": {
"invalid_url_format": 1
},
"error_message": null,
"fields": {
"median_position": "string",
"sum_search_volume": "double",
"total_keyword_count": "double"
},
"id": "5e9eee0c451c95288a2f8f8d",
"invalid_lines": 1,
"kind": "ingest",
"lines_errors_messages": {
"invalid_url_format": [
"Error!"
]
},
"name": "some_ingest_file.zip",
"project_id": "58fe0dd3451c9573f1d2adea",
"size": 2828,
"status": "PROCESSED",
"valid_lines": 62
}
}
The ingest file’s properties are:
Property | Description |
---|---|
id |
The file’s ID. |
project_id |
The project ID. |
name |
The file’s name. |
status |
The file’s status, can be UPLOADING , UPLOADED , PROCESSING , PROCESSED , ERROR . |
kind |
The file’s kind, can be ingest , seed . |
created_at |
The file’s creation date. |
detailed_invalid_lines |
The detail of invalid lines. |
error_message |
The error message. |
fields |
The fields parsed in the file. |
valid_lines |
The number of valid lines. |
invalid_lines |
The number of invalid lines. |
lines_errors_messages |
A map of error messages grouped by category. |
size |
The number of characters in the file. |
Create an ingest file
Create an ingest file.
curl -X POST "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: multipart/form-data" \
-F "file=@<file_path>"
import requests
requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files", files={
"file": <binary>
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
Property | Description |
---|---|
file |
Binary data. |
HTTP Response
Returns an HTTP 204
status code if successful.
Delete an ingest file
Delete an ingest file.
curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/ingest_files/<ingest_file_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
No HTTP parameters.
HTTP Response
Returns an HTTP 204
status code if successful.
Ranking Performance API
RP Quotas
While you are using Ranking Performance, you are subject to usage quotas. These quotas apply whether you are using the web client or calling the API directly. Most users will not exceed them, but if you do, you will receive a “Ranking Performance query quota reached.” error message (403).
The quotas work this way: the more data you request, the more quota you use. For that reason, it is important to narrow down your requests as much as you can. A tight filter on the date
field can greatly decrease the amount of data that is fetched and will consume very little of your quota.
More generally, if you hit the quotas too often, you can try and use more specific filters for your queries.
The quota is project-based: that means that if you have reached it for one of your projects, you can still perform requests to the API for your other projects.
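For instance, an aggregate query restricted to a single week with a between filter on date (used elsewhere in this documentation) fetches far less data than an unbounded one. A minimal sketch, with illustrative dates:
# A tightly scoped aggregate query: the `between` filter on `date` limits
# the amount of data scanned, which keeps quota usage low.
tight_query = {
    "aggs": [{
        "fields": ["url"],
        "value": [{"field": "clicks", "method": "sum", "alias": "clicks_sum"}],
        "oql": {"field": ["date", "between", ["2023-10-01", "2023-10-07"]]}
    }]
}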
RP Data Schema
HTTP Request
Example of fields request
curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/fields" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
fields = requests.get("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/fields",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])
HTTP Response
Example of HTTP response
{
"fields": [{
"actions": [
"equals",
"not_equals"
],
"agg_dimension": true,
"agg_metric_methods": [
"cardinality"
],
"arity": "one",
"can_display": true,
"can_filter": true,
"can_sort": true,
"name": "device",
"type": "enum"
}, "..."]
}
Property | Description |
---|---|
name |
The name of the field |
type |
The field’s type (natural, float, enum, bool, string, date) |
arity |
Whether the field is multivalued; can be one or many . |
values |
List of possible values for enum type. |
actions |
List of possible filters for this field. |
agg_dimension |
true if the field can be used as a dimension in aggregate queries. |
agg_metric_methods |
List of available aggregation methods for this field. |
can_display |
true if the field can be retrieved in queries. |
can_filter |
true if the field can be used in query filters. |
can_sort |
true if the field can be sorted on in queries. |
RP Aggregate Queries
Sum of clicks for each url/query pair, sorted on url in alphabetical order
curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-d '{
"aggs": [
{
"fields": [
"url",
"query"
],
"value": [
{
"field": "clicks",
"method": "sum",
"alias": "clicks_sum"
}
],
"oql": {
"field": [
"date",
"lt",
"2022-10-01"
]
},
"sort": {
"field": "url",
"order": "asc"
}
}
]
}'
import requests
response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"aggs": [
{
"fields": [
"url",
"query"
],
"value": [
{
"field": "clicks",
"method": "sum",
"alias": "clicks_sum"
}
],
"oql": {
"field": [
"date",
"lt",
"2022-10-01"
]
},
"sort": {
"field": "url",
"order": "asc"
}
}
]
}
).json()
The returned JSON looks like:
{
"aggs": [
{
"cols": [
"url",
"query",
"clicks_sum"
],
"rows": [
[
"https://www.oncrawl.com/",
"analyse backlinks free",
0
],
[
"https://www.oncrawl.com/",
"seo servers",
0
],
...
]
}
]
}
Sum of impressions for each url where the sum is greater than 2000, sorted in descending order
curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-d '{
"aggs": [
{
"fields": [
"url"
],
oql": {
"field": [
"date",
"lt",
"2022-10-01"
]
},
"value": [
{
"field": "impressions",
"method": "sum",
"alias": "impressions_sum"
}
],
"sort": {
"field": "impressions_sum",
"order": "desc"
},
"post_aggs_oql": {
"field": [
"impressions_sum",
"gt",
2000
]
}
}
]
}'
import requests
response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"aggs": [
{
"fields": [
"url"
],
"oql": {
"field": [
"date",
"lt",
"2022-10-01"
]
},
"value": [
{
"field": "impressions",
"method": "sum",
"alias": "impressions_sum"
}
],
"sort": {
"field": "impressions_sum",
"order": "desc"
},
"post_aggs_oql": {
"field": [
"impressions_sum",
"gt",
2000
]
}
}
]
}
).json()
The returned JSON looks like:
{
"aggs": [
{
"cols": [
"url",
"impressions_sum"
],
"rows": [
[
"https://www.oncrawl.com/oncrawl-seo-thoughts/12-great-tools-for-keyword-tracking-campaigns/",
2445803
],
[
"https://www.oncrawl.com/technical-seo/submit-website-bing-webmaster-tools/",
2256877
],
...
]
}
]
}
HTTP Request
This HTTP endpoint expects a JSON object as its payload with a single aggs
key and an array of aggregate queries as its value.
An aggregate query is an object with the following properties:
Property | Description |
---|---|
fields optional |
Specify how to create buckets of matching items. |
oql optional |
An Oncrawl Query Language object to filter on fields. |
value optional |
Specify how to aggregate matching items. By default it will return the number of matching items. |
sort optional |
Ordering of the returned matching results. |
post_aggs_oql optional |
An Oncrawl Query Language object to filter on metric aggregations. |
limit optional |
Maximum number of matching results to return. |
offset optional |
An offset for the returned matching results. |
The
value
parameter is expected to be an array of objects with a field
key, a method
key and optionally an alias
key, where {field}
is the field's name to perform the aggregation on, {method}
is the aggregation method and {alias}
is the alias for the metric aggregation.
The
sort
parameter is expected to be either an array of objects or an object. Those objects must have a field
key and an order
key, where {field}
is the field's name or the alias of a metric aggregation to sort on and {order}
is the sort order, either asc
or desc
.
The
post_aggs_oql
parameter is expected to be a regular OQL object, but it can only filter on the aliases of metric aggregations.
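The limit and offset properties can be used to page through the buckets of an aggregate query. A minimal sketch, with illustrative values:
import requests
# A minimal sketch: page through aggregation buckets with `limit`/`offset`,
# sorting on the alias of a metric aggregation.
page = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
        "aggs": [{
            "fields": ["query"],
            "value": [{"field": "clicks", "method": "sum", "alias": "clicks_sum"}],
            "oql": {"field": ["date", "lt", "2022-10-01"]},
            "sort": {"field": "clicks_sum", "order": "desc"},
            "limit": 100,
            "offset": 200
        }]
    }
).json()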
How to aggregate on items
Url cardinality for each query, sorted by cardinality in descending order and by query in alphabetical order
curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-d '{
"aggs": [
{
"fields": [
"query"
],
"value": [
{
"field": "url",
"method": "cardinality",
"alias": "url_cardinality"
}
],
"oql": {
"field": [
"date",
"lt",
"2022-10-01"
]
},
"sort": [
{
"field": "url_cardinality",
"order": "desc"
},
{
"field": "query",
"order": "asc"
}
]
}
]
}'
import requests
response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/aggs",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"aggs": [
{
"fields": [
"query"
],
"value": [
{
"field": "url",
"method": "cardinality",
"alias": "url_cardinality"
}
],
"oql": {
"field": [
"date",
"lt",
"2022-10-01"
]
},
"sort": [
{
"field": "url_cardinality",
"order": "desc"
},
{
"field": "query",
"order": "asc"
}
]
}
]
}
).json()
The returned JSON looks like:
{
"aggs": [
{
"cols": [
"query",
"url_cardinality"
],
"rows": [
[
"site:oncrawl.com",
370
],
[
"site:www.oncrawl.com",
279
],
...
]
}
]
}
The expected format is {"field": <field_name>, "method": <aggregation_type>, "alias": <alias>}
.
For example:
{"field": "clicks", "method": "sum", "alias": "clicks_sum"}
will return the sum of clicks.
{"field": "url", "method": "cardinality"}
will return the total number of URLs.
But not all fields can be aggregated and not all aggregations are available on all fields.
To know which aggregations are available on a field you can check the agg_metric_methods
value returned by the Data Schema endpoint.
The available methods are:
- sum
- Returns the sum of all the values for this field.
- cardinality
- Returns the number of different values for this field.
- weighted_average
- Returns the weighted average for this field (available only for position).
- ctr
- Returns the ctr (available only on clicks).
When performing an aggregation, you can create buckets for your matching items using the fields
parameter which takes an array of JSON objects.
The simplest way is to use the field's name.
But not all fields can be used to create a bucket.
To know which fields are available as a bucket you can check the agg_dimension
value returned by the Data Schema endpoint.
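A minimal sketch that inspects the Data Schema response to check which aggregation methods and bucket dimensions a field supports before building a query:
import requests
# Inspect the Data Schema: `agg_dimension` tells whether a field can be used
# as a bucket, `agg_metric_methods` lists the aggregations it supports.
fields = requests.get("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/fields",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])
by_name = {field["name"]: field for field in fields}
print(by_name["clicks"].get("agg_metric_methods", []))  # aggregations available on clicks
print(by_name["query"].get("agg_dimension", False))     # whether query can be a bucket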
RP Export queries
HTTP Request
Example of async-export request
curl "https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/async-export" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-d '{
"request": {
"oql": {
"and": [
{
"field": [
"date",
"between",
[
"2023-09-18",
"2023-11-02"
]
]
},
{
"field": [
"url",
"equals",
"https://www.oncrawl.com"
]
}
]
},
"post_aggs_oql": {
"field": [
"nb_of_ranking_queries",
"gt",
0
]
},
"value": [
{
"field": "url",
"method": "cardinality",
"alias": "nb_of_ranking_pages"
},
{
"field": "query",
"method": "cardinality",
"alias": "nb_of_ranking_queries"
}
],
"fields": [
{
"name": "query"
}
]
},
"output_format": "csv",
"output_format_parameters": {
"csv_delimiter": ","
}
}'
import requests
response = requests.post("https://app.oncrawl.com/api/search/v2/data/project/<project_id>/ranking_performance/async-export",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
json={
"request": {
"oql": {
"and": [
{
"field": [
"date",
"between",
[
"2023-09-18",
"2023-11-02"
]
]
},
{
"field": [
"url",
"equals",
"https://www.oncrawl.com"
]
}
]
},
"post_aggs_oql": {
"field": [
"nb_of_ranking_queries",
"gt",
0
]
},
"value": [
{
"field": "url",
"method": "cardinality",
"alias": "nb_of_ranking_pages"
},
{
"field": "query",
"method": "cardinality",
"alias": "nb_of_ranking_queries"
}
],
"fields": [
{
"name": "query"
}
]
},
"output_format": "csv",
"output_format_parameters": {
"csv_delimiter": ","
}
}
).json()
The returned JSON looks like:
{
"data_export": {
"data_type": "keyword",
"expiration_date": 1701526732000,
"export_failure_reason": null,
"id": "00000020f51bb4362eee2a4c",
"output_format": "csv",
"output_format_parameters": null,
"output_row_count": null,
"output_size_in_bytes": null,
"requested_at": 1698934732000,
"resource_id": "00000020f51bb4362eee2a4d",
"status": "REQUESTED",
"target": "download_center",
"target_parameters": {}
}
}
When performing an async-export request, the request oql must contain a filter on dates.
The available “request” properties can be found in the RP Aggregate Queries section.
The other properties are:
Property | Description |
---|---|
output_format |
Specify the output format. Can be either “csv” or “json” . |
output_format_parameters optional |
Specify parameters for the output format. A property csv_delimiter can be defined. The supported values for that property are “,” , “;” , “\t” . |
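The export runs asynchronously and its status starts at REQUESTED. A minimal polling sketch, assuming the returned export id can also be retrieved through the Account API's data export endpoint described below; the exact retrieval endpoint for Ranking Performance exports may differ.
import time
import requests
# A minimal polling sketch. Assumption: the export `id` returned by
# /async-export can be fetched with the Account API's "Get a data export"
# endpoint described below.
export_id = response["data_export"]["id"]  # `response` from the async-export call above
while True:
    export = requests.get(
        "https://app.oncrawl.com/api/v2/account/data_exports/" + export_id,
        headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    ).json()["data_export"]
    if export["status"] == "DONE" or export["export_failure_reason"]:
        break
    time.sleep(30)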
Account API
The Account API allows you to manage account settings and data.
With this API you can, for example, manage secrets and create data exports.
Secrets
List secrets
Get a list of secrets.
curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
secrets = requests.get("https://app.oncrawl.com/api/v2/account/secrets",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The secrets can be paginated and filtered using the parameters described in the pagination section.
The fields available for the sort
and filters
are:
Property | Description |
---|---|
name |
Name of the secret |
type |
Can be gcs_credentials or s3_credentials . |
HTTP Response
{
"meta":{
"filters":{},
"limit":100,
"offset":0,
"sort":null,
"total":1
},
"secrets": [
"<Secret Object>",
"<Secret Object>"
]
}
A JSON object with a meta
, described by the pagination section and a secrets
key with the list of secrets.
Create a secret
Create a secret.
curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"secret": {
"name": "secret_name",
"type": "gcs_credentials",
"value": "secret value"
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/account/secrets", json={
"secret": {
"name": "secret_name",
"type": "gcs_credentials",
"value": "secret value"
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP Request
Property | Description |
---|---|
name |
Must be unique and match ^[a-z-A-Z][a-zA-Z0-9_-]{2,63}$ . |
type |
Can be gcs_credentials or s3_credentials . |
value |
The secret’s value is a JSON string encoded in base64. |
Secret value
- For
gcs_credentials
, this is the key of a service account exported using JSON format, see documentation. - For
s3_credentials
, this is a JSON file containing your S3 credentials.
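For example, a gcs_credentials secret can be created by base64-encoding the JSON key file of the service account. A minimal sketch; the file path and secret name are placeholders:
import base64
import requests
# Base64-encode the JSON key of a GCS service account and create the secret.
with open("<service_account.json>", "rb") as key_file:
    encoded_value = base64.b64encode(key_file.read()).decode("ascii")
requests.post("https://app.oncrawl.com/api/v2/account/secrets", json={
    "secret": {
        "name": "my_gcs_secret",
        "type": "gcs_credentials",
        "value": encoded_value
    }
},
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)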
HTTP Response
Examples of HTTP response
{
"secret": {
"creation_date": 1623320422000,
"id": "6039166bcde251bfdcf624aa",
"name": "my_secret",
"owner_id": "60c1e79b554c6c975f218bad",
"type": "gcs_credentials"
}
}
An HTTP 200
status code is returned with the created secret returned directly as the response within a secret
key.
Delete a secret
Delete a secret.
curl -X DELETE "https://app.oncrawl.com/api/v2/account/secrets/<secret_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
requests.delete("https://app.oncrawl.com/api/v2/account/secrets/<secret_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
HTTP request
No HTTP parameters.
HTTP Response
Returns an HTTP 204
status code if successful.
Data exports
List data exports
Get a list of data exports.
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
account = requests.get("https://app.oncrawl.com/api/v2/account/data_exports",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Request
The data exports can be paginated and filtered using the parameters described in the pagination section.
The fields available for the sort
and filters
are:
Property | Description |
---|---|
id |
The export ID. |
status |
The status of the export. |
requested_at |
The date of the export request. |
created_at |
The create date of the export. |
size_bytes |
The size of the export in bytes. |
row_count |
The number of rows produced. |
data_type |
The data type of the export, can be page , link . |
output_format |
The output format, can be json , csv , parquet . |
resource_id |
The ID of the crawl object that was exported. |
HTTP Response
{
"meta":{
"filters":{},
"limit":100,
"offset":0,
"sort":null,
"total":1
},
"account": [
"<Data Export Object>",
"<Data Export Object>"
]
}
A JSON object with a meta
, described by the pagination section and an account
key with the list of data exports.
Get a data export
Get a data export.
curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export_id>" \
-H "Authorization: Bearer {ACCESS_TOKEN}"
import requests
export = requests.get("https://app.oncrawl.com/api/v2/account/data_exports/<data_export_id>",
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()
HTTP Response
{
"data_export": {
"data_type": "page",
"export_failure_reason": null,
"id": "6039166bcde251bfdcf624aa",
"output_format": "parquet",
"output_format_parameters": null,
"output_row_count": 1881.0,
"output_size_in_bytes": 517706.0,
"requested_at": 1614354027000,
"resource_id": "5feeac5c5567fd69557a1855",
"status": "DONE",
"target": "gcs",
"target_parameters": {}
}
}
The data_export’s properties are:
Property | Description |
---|---|
id |
The unique identifier of the export. |
data_type |
The data type of the export, can be page , link . |
export_failure_reason |
The reason of the export failure. |
output_format |
The output format, can be csv , json , parquet . |
output_format_parameters |
Parameters for the output format; currently only the delimiter to be used when the format is CSV. |
output_row_count |
Number of items that were exported. |
output_size_in_bytes |
Total size of the exported data. |
requested_at |
The date of the export request. |
resource_id |
The ID of the crawl object that was exported. |
status |
The status of the export. |
target |
The destination, can be gcs , s3 . |
target_parameters |
An object with the configuration for the selected target. |
Create a data export
S3
Create an S3 data export.
curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"data_export": {
"data_type": "page",
"resource_id": "666f6f2d6261722d71757578",
"output_format": "csv",
"output_format_parameters": {
"csv_delimiter": ";"
},
"target": "s3",
"target_parameters": {
"s3_credentials": "secrets://60c1dc72d61c55b9a313e5b4/my_secret",
"s3_bucket": "my-bucket",
"s3_prefix": "some-prefix",
"s3_region": "us-west-2"
},
"include_all_page_group_lists": true
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/account/data_exports", json={
"data_export": {
"data_type": "page",
"resource_id": "666f6f2d6261722d71757578",
"output_format": "csv",
"output_format_parameters": {
"csv_delimiter": ";"
},
"target": "s3",
"target_parameters": {
"s3_credentials": "secrets://60c1dc72d61c55b9a313e5b4/my_secret"
"s3_bucket": "my-bucket",
"s3_prefix": "some-prefix",
"s3_region": "us-west-2"
},
"include_all_page_group_lists": True
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
The target parameters for S3 buckets are:
Property | Description |
---|---|
s3_credentials |
URI of the secret. |
s3_bucket |
Name of the bucket to upload data to. |
s3_region |
Valid S3 region where the bucket is located. |
s3_prefix |
Path on the bucket where the files will be uploaded. |
GCS
To export data in a GCS bucket, you must allow our service account to write in the desired bucket.
Our service account is `oncrawl-data-transfer@oncrawl.iam.gserviceaccount.com`.
You MUST grant the following roles to our service account (using IAM):
- roles/storage.legacyBucketReader
  - storage.objects.list
  - storage.bucket.get
- roles/storage.legacyObjectReader
  - storage.objects.get
Create a GCS data export.
curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"data_export": {
"data_type": "page",
"resource_id": "666f6f2d6261722d71757578",
"output_format": "csv",
"output_format_parameters": {
"csv_delimiter": ";"
},
"target": "gcs",
"target_parameters": {
"gcs_bucket": "test_bucket",
"gcs_prefix": "some_bucket_prefix"
},
"include_all_page_group_lists": true
}
}
EOF
import requests
requests.post("https://app.oncrawl.com/api/v2/account/data_exports", json={
"data_export": {
"data_type": "page",
"resource_id": "666f6f2d6261722d71757578",
"output_format": "csv",
"output_format_parameters": {
"csv_delimiter": ";"
},
"target": "gcs",
"target_parameters": {
"gcs_bucket": "test_bucket",
"gcs_prefix": "some_bucket_prefix"
},
"include_all_page_group_lists": True
}
},
headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)
The target parameters for GCS buckets are:
Property | Description |
---|---|
gcs_bucket |
Name of the bucket to upload data to. |
gcs_prefix |
Path on the bucket where the files will be uploaded. |
HTTP request
Property | Description |
---|---|
data_type |
The data type of the export, can be page , link . |
output_format |
The output format, can be csv , json , parquet . |
output_format_parameters |
Parameters for the output format; currently only the delimiter to be used when the format is CSV. |
resource_id |
The ID of the crawl object to export. |
target |
The destination, can be gcs , s3 . |
target_parameters |
An object with the configuration for the selected target. |
page_group_lists_included |
A list of segmentations to export |
include_all_page_group_lists |
Whether or not all segmentations should be exported, can be true or false (overrides page_group_lists_included if true , defaults to false ) |
HTTP Response
If successful, the API will respond with the same output as Get a data export.
Fields
This is the list of Oncrawl fields that are exported while using our Data Studio connector.
They are listed below by category. For each field you’ll find the following information: name, definition, type and arity.
The Oncrawl field type can be one of the following:
Type | Definition |
---|---|
integer |
integer number |
natural |
non-negative integer (>= 0) |
float |
floating-point number |
percentage |
floating-point number between 0 and 1 |
string |
a sequence of characters, text |
enum |
a string from a defined list of values |
bool |
boolean (true or false ) |
datetime |
a timestamp, in the following format: yyyy/MM/dd HH:mm:ss z |
date |
a date in the following format: yyyy-MM-dd |
object |
a raw JSON object |
hash |
a hashed string (the output of a hash function) |
Note: These types are the ones exposed by the Oncrawl API. The underlying storage for the Data Studio connector may use slightly different type names or date/time formats.
There are two possible values for arity:
Arity | Definition |
---|---|
one |
the field holds a single value |
many |
the field holds a list of values |
Content
Field name | Definition | Type | Arity |
---|---|---|---|
language |
Language code, in ISO 639 two-letter format. Either parsed from the HTML or detected from the text. | string | one |
text_to_code |
Number of text characters divided by the total number of characters in the HTML. | percentage | one |
word_count |
The number of words on the page. | natural | one |
Duplicate content
Field name | Definition | Type | Arity |
---|---|---|---|
clusters |
Oncrawl IDs of the groups of URLs with similar content that this URL belongs to. | hash | many |
nearduplicate_content |
Whether this page's content is very similar to another page, according to our SimHash-based algorithm. | bool | one |
nearduplicate_content_similarity |
Highest ratio of content similarity, compared to other pages in the cluster. | percentage | one |
duplicate_description_status |
Status of duplication issues for the group of pages with the same meta description as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) | enum | one |
duplicate_h1_status |
Status of duplication issues for the group of pages with the same H1 as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) | enum | one |
duplicate_title_status |
Status of duplication issues for the group of pages with the same title tag as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) | enum | one |
has_duplicate_description_issue |
Whether there are duplications of this page's meta description on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. | bool | one |
has_duplicate_h1_issue |
Whether there are duplications of this page's H1 on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. | bool | one |
has_duplicate_title_issue |
Whether there are duplications of this page's title tag on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. | bool | one |
has_nearduplicate_issue |
Whether the similarity of this page with other pages is handled using canonical or hreflang descriptions. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. | bool | one |
nearduplicate_status |
Status of duplication issues for the group of pages with similar content to this one: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) | enum | one |
Hreflang errors
Field name | Definition | Type | Arity |
---|---|---|---|
hreflang_cluster_id |
Oncrawl ID of the group of pages that reference one another through hreflang declarations. | hash | one |
Indexability
Field name | Definition | Type | Arity |
---|---|---|---|
meta_robots |
List of values in the meta robots tag. | string | many |
meta_robots_follow |
Whether the links on the page should be followed (true) or not (false) according to the meta robots. | bool | one |
meta_robots_index |
Whether the page should be indexed (true) or not (false) according to the meta robots. | bool | one |
robots_txt_denied |
Whether the crawler was denied by the robots.txt file while visiting this page. | bool | one |
Linking & popularity
Field name | Definition | Type | Arity |
---|---|---|---|
depth |
Page depth in number of clicks from the crawl's Start URL. | natural | one |
external_follow_outlinks |
Number of followable outlinks to pages on other domains. | natural | one |
external_nofollow_outlinks |
Number of nofollow outlinks to pages on other domains. | natural | one |
external_outlinks |
Number of outlinks to other domains. | natural | one |
external_outlinks_range |
Range of the number of outlinks to pages on other domains: 0-50, 50-100, 100-150, 150-200, >200 | enum | one |
follow_inlinks |
Number of followable links pointing to a URL from other pages on the same site. | natural | one |
inrank |
Whole number from 0-10 indicating the URL's relative PageRank within the site. Higher numbers indicate better popularity. | natural | one |
inrank_decimal |
Decimal number from 0-10 indicating the URL's relative PageRank within the site. Higher numbers indicate better popularity. | float | one |
internal_follow_outlinks |
Number of followable links from this page to other pages on the same site. | natural | one |
internal_nofollow_outlinks |
Number of nofollow links from this page to other pages on the same site. | natural | one |
internal_outlinks |
Number of outlinks to pages on the same site. | natural | one |
internal_outlinks_range |
Range of the number of outlinks to pages on the same site: 0-50, 50-100, 100-150, 150-200, >200 | enum | one |
nb_inlinks |
Number of links pointing to this page from other pages on this site. | natural | one |
nb_inlinks_range |
Range of values that the number of links to this page from other pages on the site falls into. Ranges are: 0-50, 50-100, 100-150, 150-200, >200 | enum | one |
nb_outlinks_range |
Range of values that the number of links from this page falls into. Ranges are: 0-50, 50-100, 100-150, 150-200, >200 | enum | one |
nofollow_inlinks |
Number of links pointing to this page with a rel="nofollow" tag. | natural | one |
Oncrawl bot
Field name | Definition | Type | Arity |
---|---|---|---|
fetch_date |
Date on which the Oncrawl bot obtained the URL's source code expressed as yyyy/MM/dd HH:mm:ss z | datetime | one |
fetch_status |
Whether the Oncrawl bot successfully obtained the URL's source code. Indicates "success" when true. | string | one |
fetched |
Whether the Oncrawl bot obtained the URL's source code (true) or not (false). | bool | one |
parsed_html |
Whether the Oncrawl bot was able to obtain an HTTP status and textual content for this page. | bool | one |
sources |
List of sources for this page. Sources may be: OnCrawl bot, at_internet, google_analytics, google_search_console, ingest_data, logs_cross_analysis, majestic, adobe_analytics, sitemaps | string | many |
Payload
Field name | Definition | Type | Arity |
---|---|---|---|
load_time |
Time (in milliseconds) it took to fetch the entire HTML of the page, excluding external resources. Also known as "time to last byte" (TTLB). | natural | one |
weight |
The size of the page in KB, excluding resources. | natural | one |
Core Web Vitals
Field name | Definition | Type | Arity |
---|---|---|---|
cwv_bytes_saving |
The number of bytes which can be saved by optimising the page | natural | one |
cwv_cls |
Cumulative Layout Shift reported by Lighthouse | float | one |
cwv_fcp |
First Contentful Paint reported by Lighthouse | natural | one |
cwv_lcp |
Largest Contentful Paint reported by Lighthouse | natural | one |
cwv_performance_score |
Performance score reported by Lighthouse | float | one |
cwv_si |
Speed index reported by Lighthouse | natural | one |
cwv_tbt |
Total blocking time reported by Lighthouse | natural | one |
cwv_tti |
Time to interactive reported by Lighthouse | natural | one |
cwv_time_saving |
The time which can be saved by optimising the page | natural | one |
Redirect chains & loops
Field name | Definition | Type | Arity |
---|---|---|---|
final_redirect_location |
Final URL reached after following a chain of one or more 3xx redirects. | string | one |
final_redirect_status |
HTTP status code of the final URL reached after following a chain of one or more 3xx redirects. | natural | one |
is_redirect_loop |
Whether the chain of redirects loops back to a URL in the chain. | bool | one |
is_too_many_redirects |
Whether the chain contains more than 16 redirects. | bool | one |
redirect_cluster_id |
The Oncrawl ID of this page's redirect cluster. The redirect cluster is the group of pages found in all branches of a redirect chain or loop. | hash | one |
redirect_count |
Number of redirects needed from this page to reach the final target in the redirect chain. | natural | one |
Rel alternate
Field name | Definition | Type | Arity |
---|---|---|---|
canonical_evaluation |
Canonical status of the URL: matching (declares itself as canonical), not_matching (declares a different page as canonical), not_set (has no canonical declaration) | enum | one |
rel_canonical |
URL declared in the rel canonical tag. | string | one |
rel_next |
URL declared in the rel next tag. | string | one |
rel_prev |
URL declared in the rel prev tag. | string | one |
Scraping
Field name | Definition | Type | Arity |
---|---|---|---|
custom_qsdd |
Custom field created through user-defined scraping rules. | string | one |
SEO tags
Field name | Definition | Type | Arity |
---|---|---|---|
description_evaluation |
Duplication status of the page's meta description: unique, duplicated (another URL has the same meta description), not_set | enum | one |
description_length |
Length of the URL's meta description in number of characters. | natural | one |
description_length_range |
Evaluation of the URL's meta description length: perfect (135-159), good (110-134 or 160-169), too short (<110), too long (>=170) | enum | one |
h1 |
First H1 on the page. | string | one |
h1_evaluation |
Duplication status of the page's H1 text: unique, duplicated (another URL has the same H1), not_set | enum | one |
meta_description |
Meta description for this page. | string | one |
num_h1 |
Number of H1 tags on this page. | natural | one |
num_h2 |
Number of H2 tags on this page. | natural | one |
num_h3 |
Number of H3 tags on this page. | natural | one |
num_h4 |
Number of H4 tags on this page. | natural | one |
num_h5 |
Number of H5 tags on this page. | natural | one |
num_h6 |
Number of H6 tags on this page. | natural | one |
num_img |
Number of images on this page. | natural | one |
num_img_alt |
Number of image 'alt' attributes on this page. | natural | one |
num_img_range |
Whether the page contains no images, one image, or more than one. | enum | one |
num_missing_alt |
Number of missing 'alt' attributes for images on this page. | natural | one |
semantic_item_count |
Number of semantic tags on the page. | natural | one |
semantic_types |
List of semantic tags found on the page. | string | many |
title |
Page title found in the <title> tag. | string | one |
title_evaluation |
Duplication status of the title tag: unique, duplicated (another page has the same title), not_set. | enum | one |
title_length |
Length of the title tag in characters. | natural | one |
Sitemaps
Field name | Definition | Type | Arity |
---|---|---|---|
sitemaps_file_origin |
List of URLs of the Sitemaps files where this page was found. | string | many |
sitemaps_num_alternate |
Number of alternates to this page that were found in the sitemaps. | natural | one |
sitemaps_num_images |
Number of images for this page that were found in the sitemaps. | natural | one |
sitemaps_num_news |
Number of news publications for this page that were found in the sitemaps. | natural | one |
sitemaps_num_videos |
Number of videos for this page that were found in the sitemaps. | natural | one |
Status code
Field name | Definition | Type | Arity |
---|---|---|---|
redirect_location |
URL this page redirects to. | string | one |
status_code |
HTTP status code returned by the server when crawling the page. | natural | one |
status_code_range |
HTTP status code class. Classes are: ok, redirect, client_error, server_error. | enum | one |
URL
Field name | Definition | Type | Arity |
---|---|---|---|
querystring_key |
List of keys found in the querystring of this page's URL. | string | one |
querystring_keyvalue |
List of key-value pairs found in the querystring of this page's URL. | string | one |
url |
Full URL including the protocol (https://). | string | one |
url_ext |
URL's file extension. | string | one |
url_first_path |
First directory following the URL's domain, or / if there is no directory | string | one |
url_has_params |
Whether the URL has query parameters. | bool | one |
url_host |
Hostname or subdomain found in the URL. | string | one |
urlpath |
URL path. | string | one |