
API Reference

The OnCrawl REST API is used for accessing your crawl data as well as managing your projects and your crawls.

In order to use this API you need to have an OnCrawl account, an active subscription and an access token.

The current version of the web API is known as V2. Although we don’t expect it to change much, it is still considered under development.

We try to keep breaking changes to a minimum, but this is not guaranteed.

Requests

All API requests should be made to the /api/v2 prefix, and will return JSON as the response.

HTTP Verbs

When applicable, the API tries to use the appropriate HTTP verb for each action:

Verb Description
GET Used for retrieving resources.
POST Used for creating resources.
PUT Used for updating resources.
DELETE Used for deleting resources.

Parameters and Data

curl "https://app.oncrawl.com/api/v2/projects" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "project": {
            "name": "Project name",
            "start_url": "https://www.oncrawl.com"
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects", json={
  "project": {
    "name": "Project name",
    "start_url": "https://www.oncrawl.com"
  }
})

Any parameters not included in the URL should be encoded as JSON with a Content-Type of application/json.

Additional parameters are sometimes specified via the querystring, even for POST, PUT and DELETE requests.

When a complex object is required to be passed via the querystring, the rison encoding format is used.

Errors

Format of an error message

{
  "type": "error_type",
  "code": "error_code",
  "message": "Error message",
  "fields": [{
    "name": "parameter_name",
    "type": "field_error_type",
    "message": "Error message"
  }]
}

When an error occurs, the API returns a JSON object with the following properties:

Property Description
type An attribute that groups errors based on their nature.
code
optional
A more specific attribute to let you handle specific errors.
message
optional
A human readable message describing the error.
fields
optional
List of field-related errors.
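
As a rough illustration of how a client might consume this error object, the sketch below turns a decoded error payload into a single readable line. The helper name `describe_error` is hypothetical and not part of the API; it relies only on the `type`, `code`, `message` and `fields` properties listed above.

```python
def describe_error(payload):
    """Build a one-line summary from a decoded OnCrawl error object.

    `describe_error` is a hypothetical helper, not part of the API.
    Only `type` is assumed to be always present; the rest is optional.
    """
    parts = [payload["type"]]
    if payload.get("code"):
        parts.append("code=" + payload["code"])
    if payload.get("message"):
        parts.append(payload["message"])
    # Each entry of `fields` describes one invalid parameter.
    for field in payload.get("fields") or []:
        parts.append("{}: {}".format(field["name"], field["type"]))
    return " | ".join(parts)


example = {
    "type": "invalid_request_parameters",
    "fields": [{
        "name": "start_url",
        "type": "required",
        "message": "The start URL is required.",
    }],
}
print(describe_error(example))
```

Keeping the formatting in one place makes it easier to log API failures consistently, whichever endpoint produced them.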

Quota error message

{
 "type": "quota_error",
 "message": "Not enough quota" 
}

Forbidden error message

{
  "type": "forbidden",
  "code": "no_active_subscription"
}

Field-related errors

{
 "type": "invalid_request_parameters",
 "fields": [{
  "name": "start_url",
  "type": "required",
  "message": "The start URL is required."
 }]
}

Permissions errors

The following errors occur if you are not allowed to perform a request.

Type Description
unauthorized Returned when the request is not authenticated.
HTTP Code: 401
forbidden Returned when the request is authenticated but the action is not allowed.
HTTP Code: 403
quota_error Returned when the current quota does not allow the action to be performed.
HTTP Code: 403

The forbidden error is usually accompanied by a code key, such as no_active_subscription in the example above.

Validation errors

The following errors are caused by an invalid request. In most cases, the request cannot complete unless its parameters are changed.

Type Description
invalid_request Returned when the request has incompatible values or does not match the API specification.
HTTP Code: 400
invalid_request_parameters Returned when the value does not meet the required specification for the parameter.
HTTP Code: 400
resource_not_found Returned when a resource referred to in the request is not found.
HTTP Code: 404
duplicate_entry Returned when the request provides a duplicate value for an attribute that is specified as unique.
HTTP Code: 400

Operation failure errors

These errors are returned when the request was valid but the requested operation could not be completed.

Type Description
invalid_state_for_request Returned when the requested operation is not allowed for the current state of the resource.
HTTP Code: 409
internal_error Returned when the request couldn’t be completed due to a bug on OnCrawl’s side.
HTTP Code: 500

Authentication

To authorize, use this code:

# With shell, you can just pass the correct header with each request
curl "https://app.oncrawl.com/api/v2/projects" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/projects",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

Make sure to replace {ACCESS_TOKEN} with your own access token.

OnCrawl uses access tokens to allow access to the API. You can create tokens from your settings panel if your subscription allows it.

OnCrawl expects the access token to be included in all API requests to the server.

An access token may be created with various scopes:

Scope Description
account:read Gives read access to all account-related data.
Examples: Profile, invoices, subscription.
account:write Gives write access to all account-related data.
Examples: Close account, update billing information.
projects:read Gives read access to all project and crawl data.
Examples: View crawl reports, export data.
projects:write Gives write access to all project and crawl data.
Examples: Launch crawl, create project.

OnCrawl Query Language

OnCrawl provides a JSON-style language that you can use to execute queries.

It is referred to as OQL, for OnCrawl Query Language.

An OQL query has a tree-like structure composed of nodes.

A node is either terminal, in which case it is called a leaf, or a compound of other nodes.

An OQL query must start with a single root node.

Leaf nodes

Example of OQL using a field node:

{
  "field": [ "field_name", "filter_type", "filter_value" ]
}
Node Description
field Apply a filter on a field.

The value of a field node is an array with 3 values: the field name, the filter type and the filter value.

Compound nodes

Example OQL using an and node:

{
  "and": [ {
    "field": [ "field_name", "filter_type", "filter_value" ]
  }, {
    "field": [ "field_name", "filter_type", "filter_value" ]   
  }]
}
Node Description
and Execute a list of nodes using the logical operator AND.
or Execute a list of nodes using the logical operator OR.
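
Writing nested OQL by hand gets error-prone as queries grow. The sketch below shows one possible way to assemble the tree from small helper functions; the names `field`, `all_of` and `any_of` are hypothetical conveniences, not part of the API, and they only produce the JSON structure described above.

```python
import json


def field(name, filter_type, value):
    """Leaf node: apply one filter to one field."""
    return {"field": [name, filter_type, value]}


def all_of(*nodes):
    """Compound node combining its children with AND."""
    return {"and": list(nodes)}


def any_of(*nodes):
    """Compound node combining its children with OR."""
    return {"or": list(nodes)}


# A single root node, as OQL requires.
query = all_of(
    field("fetched", "equals", True),
    any_of(
        field("status_code", "equals", 301),
        field("status_code", "equals", 404),
    ),
)
print(json.dumps(query, indent=2))
```

The resulting dictionary can be serialized as JSON and sent wherever an OQL object is expected.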

Common filters

OQL to retrieve pages found in the structure:

{
  "field": [ "depth", "has_value", "" ]
}
Filter type Description
has_no_value The field must have no value.
has_value The field must have a value.

String filters

OQL to retrieve pages with “cars” in the title:

{
  "field": [ "title", "contains", "cars" ]
}
Filter type Description
contains The field’s value must contain the filter value.
endswith The field’s value must end with the filter value.
startswith The field’s value must start with the filter value.
equals The field’s value must be strictly equal to the filter value.

Numeric filters

OQL to retrieve pages with less than 10 inlinks:

{
  "field": [ "follow_inlinks", "lt", "10" ]
}

OQL to retrieve pages between depth 1 and 4:

{
  "field": [ "depth", "between", [ "1", "4" ]]
}
Filter type Description
gt The field’s value must be greater than the filter value.
gte The field’s value must be greater than or equal to the filter value.
lt The field’s value must be less than the filter value.
lte The field’s value must be less than or equal to the filter value.
between The field’s value must be between both filter values (lower inclusive, upper exclusive).
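
To make the boundary behaviour of these filters concrete, here is a small local model of their semantics. It is only an illustration of the table above, not the server's implementation; the function name `matches_numeric` is hypothetical, and converting string filter values to numbers is an assumption based on the examples, which pass numbers as strings.

```python
import operator

# Comparison used by each single-value numeric filter.
_COMPARATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches_numeric(value, filter_type, filter_value):
    """Return True if `value` passes the given numeric filter (local model)."""
    if filter_type == "between":
        low, high = (float(bound) for bound in filter_value)
        # Lower bound inclusive, upper bound exclusive.
        return low <= value < high
    return _COMPARATORS[filter_type](value, float(filter_value))


print(matches_numeric(1, "between", ["1", "4"]))
print(matches_numeric(4, "between", ["1", "4"]))
```

Under this reading, a between filter on depth with bounds 1 and 4 keeps depths 1, 2 and 3 but not 4.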

Filter options

OQL to retrieve URLs within /blog/{year}/:

{
  "field": [ "urlpath", "startswith", "/blog/([0-9]+)/", { "regex": true } ]
}

The filters equals, contains, startswith and endswith accept a JSON object of options as the fourth element of the field node.

Property Description
ci
boolean
true if the match should be case insensitive.
regex
boolean
true if the filter value is a regex.

Pagination

The majority of endpoints returning resources such as projects and crawls are paginated.

HTTP request

Example of paginated query

curl "https://app.oncrawl.com/api/v2/projects?offset=50&limit=100&sort=name:desc" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get(
    "https://app.oncrawl.com/api/v2/projects?offset={offset}&limit={limit}&sort={sort}"
    .format(
        offset=50,
        limit=100,
        sort='name:desc'
    ),
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

The HTTP query expects the following parameters:

Parameter Description
offset
optional
The offset for matching items.
Defaults to 0.
limit
optional
The maximum number of matching items to return.
Defaults to 10.
sort
optional
How to sort matching items, order can be asc or desc.
Natural ordering is from most recent to least recent.
filters
optional
The OQL filters used for the query.
Defaults to null.

Because filters is a JSON object that needs to be passed in the querystring, the rison encoding format is used.
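
For readers unfamiliar with rison, the sketch below encodes an OQL object into a querystring value. It is a minimal hand-rolled encoder written from the general rison rules as understood here (objects as `(k:v)`, arrays as `!(a,b)`, `!t`, `!f`, `!n` for true, false and null, quoted strings escaped with `!`); it assumes string keys and does not handle floats in exponent notation. The function name `to_rison` is hypothetical, and a maintained rison library is likely a better choice in practice.

```python
import re
from urllib.parse import quote

# Strings that can stay unquoted: no leading digit or "-", and none of the
# reserved characters (whitespace is also quoted here, to be safe).
_UNQUOTED = re.compile(r"[^-0-9\s'!:(),*@$][^\s'!:(),*@$]*")


def to_rison(value):
    """Encode a JSON-like Python value as a rison string (minimal sketch)."""
    if value is True:
        return "!t"
    if value is False:
        return "!f"
    if value is None:
        return "!n"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        if _UNQUOTED.fullmatch(value):
            return value
        # Quote the string and escape "'" and "!" with a leading "!".
        return "'" + re.sub(r"(['!])", r"!\1", value) + "'"
    if isinstance(value, (list, tuple)):
        return "!(" + ",".join(to_rison(item) for item in value) + ")"
    if isinstance(value, dict):
        pairs = sorted(value.items())
        return "(" + ",".join(to_rison(k) + ":" + to_rison(v) for k, v in pairs) + ")"
    raise TypeError("unsupported type: " + type(value).__name__)


filters = to_rison({"field": ["name", "contains", "cars"]})
# The rison string still has to be percent-encoded for use in a URL.
url = "https://app.oncrawl.com/api/v2/projects?filters=" + quote(filters, safe="")
print(filters)
print(url)
```

Note that a numeric-looking string such as "10" is emitted quoted, so it stays a string after decoding.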

The sort parameter is expected to be in the format {name}:{order}, where name is the field to sort on and order is asc or desc.

HTTP response

Example of paginated response

{
  "meta": {
    "offset": 0,
    "limit": 10,
    "total": 100,
    "filters": "<OQL>",
    "sort": [
      [ "name", "desc" ]
    ]
  },
  "projects": [ "..." ]
}

The HTTP response always follows the same pattern:

The meta key returns a JSON object that allows you to easily paginate through the resources:

Property Description
offset The offset used for the query.
Defaults to 0.
limit The limit used for the query.
Defaults to 10.
total The total number of matching items.
sort The sort used for the query.
Defaults to null.
filters The OQL filters used for the query.
Defaults to {}.

Data API

The Data API allows you to explore, aggregate and export your data.

There are 3 main sources: Crawl Reports, Crawl over Crawl and Log Monitoring.

Each source can have one or several data types behind it.

Data types

For Crawl Reports:

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

For Crawl over Crawl:

curl "https://app.oncrawl.com/api/v2/data/crawl_over_crawl/<coc_id>/<data_type>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/crawl_over_crawl/<coc_id>/<data_type>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

For Log Monitoring (events):

curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

For Log Monitoring (pages):

curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/<granularity>" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

response = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/<granularity>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

A data type is the nature of the objects you are exploring. Each data type has its own schema and purpose.

Source Data type Description
Crawl report pages Lists of crawled pages of your website.
Crawl report links Lists of all links of your website.
Crawl report clusters Lists of duplicate clusters of your website.
Crawl report structured_data Lists of structured data of your website.
Crawl over Crawl pages List of compared pages.
Logs monitoring pages Lists of all urls.
Logs monitoring events Lists of all events.
Pages
Represents an HTML page of the website.
Links
Represents a link between two pages.
Example: an ‘href’ link to another page.
Clusters
Represents a cluster of pages that are considered similar.
A cluster has a size and an average similarity ratio.
Structured data
Represents a structured data item found on a page.
Supported formats are: JSON-LD, RDFa and microdata.
Events
Represents a single line of a log file.
Available only in logs monitoring.

Data granularity

A granularity, only available for pages in log monitoring, defines how the metrics will be aggregated for a page.

days
Data will be aggregated by days, a day field will be available with the format YYYY-MM-DD.
weeks
Data will be aggregated by weeks, a week field will be available with the format YYYY-[W]WW.
A week may start on Monday or Sunday depending on the project’s configuration.
months
Data will be aggregated by months, a month field will be available with the format YYYY-MM.

You can find more information on what is available using the /metadata endpoint.

HTTP Request

Example of HTTP request

curl "https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

fields = requests.get("https://app.oncrawl.com/api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

# For Log Monitoring
GET /api/v2/data/project/<project_id>/log_monitoring/<data_type>/metadata

HTTP Response

Example of HTTP response

{
   "bot_kinds": [
      "seo",
      "vertical"
   ],
   "dates": [
      {
         "from": "2018-09-21",
         "granularity": "days",
         "to": "2019-11-18"
      },
      {
         "from": "2018-09-16",
         "granularity": "weeks",
         "to": "2019-11-16"
      },
      {
         "from": "2018-06-01",
         "granularity": "months",
         "to": "2019-10-31"
      }
   ],
   "search_engines": [
      "google"
   ],
   "week_definition": "sunday_start"
}
Property Description
bot_kinds Bot kind can be seo, sea or vertical
dates List of available granularities with their min/max date
search_engines Search engine can be google
week_definition Can be sunday_start or iso

Data Schema

HTTP Request

Example of a fields request

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>/fields" \
  -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

fields = requests.get("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/<data_type>/fields",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json().get('fields', [])

HTTP Response

Example of HTTP response

{
  "fields": [{
    "name": "canonical_evaluation", 
    "type": "enum", 
    "arity": "one", 
    "values": [
         "matching", 
         "not_matching", 
         "not_set"
    ],
    "actions": [
     "has_no_value", 
     "not_equals", 
     "equals", 
     "has_value"
    ], 
    "agg_dimension": true, 
    "agg_metric_methods": [
     "value_count", 
     "cardinality"
    ], 
    "can_display": true, 
    "can_filter": true, 
    "can_sort": false,
    "user_select": true, 
    "category": "HTML Quality"
  }, "..."]
}
Property Description
name The name of the field
type The field’s type (natural, float, hash, enum, bool, string, percentage, object, date, datetime, ratio)
arity Whether the field is multivalued. Can be one or many.
values List of possible values for enum type.
actions List of possible filters of this field.
agg_dimension true if can be used as a dimension in aggregate queries.
agg_metric_methods List of available aggregation methods for this field.
can_display true if the field can be retrieved in search or export queries.
can_filter true if the field can be used in filters queries.
can_sort true if the field can be sorted on in search or export queries.
category
deprecated
Do not use.
user_select
deprecated
Do not use.
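
Since the schema reports what each field supports, it can be used to pick valid fields before building a query. The sketch below assumes the decoded `fields` list has the shape shown in the example response; the helper name `fields_with` is hypothetical.

```python
def fields_with(fields, capability):
    """Return the names of fields whose boolean `capability` flag is true.

    `capability` is one of the boolean schema properties, for example
    "can_display", "can_filter", "can_sort" or "agg_dimension".
    """
    return [field["name"] for field in fields if field.get(capability)]


# One entry, abridged from the example response above.
schema = [{
    "name": "canonical_evaluation",
    "type": "enum",
    "agg_dimension": True,
    "can_display": True,
    "can_filter": True,
    "can_sort": False,
}]

print(fields_with(schema, "can_filter"))
print(fields_with(schema, "can_sort"))
```

Checking the flags up front avoids sending a sort or filter that the API would reject.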

Search Queries

Search queries allow you to explore your data by filtering, sorting and paginating.

HTTP Request

Search for crawled pages with a 301 or 404 HTTP status code.

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "offset": 0,
        "limit": 10,
        "fields": [ "url", "status_code" ],
        "sort": [
            { "field": "status_code", "order": "asc" }
        ],
        "oql": {
            "and":[
                {"field":["fetched","equals",true]},
                {"or":[
                    {"field":["status_code","equals",301]},
                    {"field":["status_code","equals",404]}
                ]}
            ]
        }
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "offset": 0,
      "limit": 10,
      "fields": [ "url", "status_code" ],
      "sort": [
          { "field": "status_code", "order": "asc" }
      ],
      "oql": {
        "and":[
            # Python booleans are capitalized; requests serializes True as JSON true
            {"field":["fetched","equals",True]},
            {"or":[
                {"field":["status_code","equals",301]},
                {"field":["status_code","equals",404]}
            ]}
        ]
      }
    }
).json()

The HTTP request expects a JSON object as its payload with the following properties:

Property Description
limit
optional
Maximum number of matching results to return.
offset
optional
An offset for the returned matching results.
oql
optional
An OnCrawl Query Language object.
fields
optional
List of fields to retrieve for each matching result.
sort
optional
Ordering of the returned matching results.

The sort parameter is expected to be an array of objects, each with a field key and an order key, where field is the name of the field to sort on and order is asc or desc.

HTTP response

{
  "meta": {
    "columns": [
      "url", 
      "inrank", 
      "status_code", 
      "meta_robots", 
      "fetched"
    ], 
    "total_hits": 1, 
    "total_pages": 1
  }, 
  "oql": {
    "and": [
      { "field": [ "fetched",  "equals",  true ] }, 
      {
        "or": [
          { "field": [ "status_code", "equals", 301 ] }, 
          { "field": [ "status_code", "equals", 404 ] }
        ]
      }
    ]
  }, 
  "urls": [
    {
      "fetched": true, 
      "inrank": 8, 
      "meta_robots": null, 
      "status_code": 301, 
      "url": "http://www.website.com/redirect/"
    }
  ]
}

The response will be a JSON object with a urls key, an oql key and a meta key.

The urls key will contain an array of matching results.

The oql key will contain the OnCrawl Query Language object used for filtering.

The meta key will contain the following keys:

Property Description
columns List of returned fields. They are the keys used in urls objects.
total_hits Total number of matching results.
total_pages
deprecated
Total number of pages according to limit and total_hits.

Aggregate Queries

Average load time of crawled pages

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "aggs": [{
        "oql": {
          "field": ["fetched", "equals", "true"]
        },
        "value": "load_time:avg"
      }]
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "oql": {
          "field": ["fetched", "equals", "true"]
        },
        "value": "load_time:avg"
      }]
    }
).json()

The returned JSON looks like:

{
  "aggs": [
    {
      "cols": [
        "load_time:avg"
      ],
      "rows": [
        [
          183.41091954022988
        ]
      ]
    }
  ]
}
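
The cols and rows arrays are positional, so pairing them up makes the result easier to work with. A minimal sketch, assuming the response shape shown above (`agg_rows_as_dicts` is a hypothetical helper name):

```python
def agg_rows_as_dicts(agg):
    """Pair each row of one aggregate result with its column names."""
    return [dict(zip(agg["cols"], row)) for row in agg["rows"]]


# The example response above, decoded from JSON.
response = {
    "aggs": [
        {
            "cols": ["load_time:avg"],
            "rows": [[183.41091954022988]],
        }
    ]
}

for agg in response["aggs"]:
    print(agg_rows_as_dicts(agg))
```

Each entry of aggs corresponds, in order, to one aggregate query of the request.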

HTTP Request

This HTTP endpoint expects a JSON object as its payload with a single aggs key and an array of aggregate queries as its value.

An aggregate query is an object with the following properties:

Property Description
oql
optional
An OnCrawl Query Language object to match a set of items.
By default it will match all items.
fields
optional
Specify how to create buckets of matching items.
value
optional
Specify how to aggregate matching items.
By default it will return the number of matching items.

How to aggregate items

By default an aggregate request returns the count of matching items, but you can perform a different aggregation using the value parameter.

The expected format is <field_name>:<aggregation_type>.

For example, load_time:avg returns the average value of the load_time field.

But not all fields can be aggregated and not all aggregations are available on all fields.

To know which aggregations are available on a field you can check the agg_metric_methods value returned by the Data Schema endpoint.

The available methods are:

min
Returns the minimal value for this field.
max
Returns the maximal value for this field.
avg
Returns the average value for this field.
sum
Returns the sum of all the values for this field.
value_count
Returns how many items have a value for this field.
cardinality
Returns the number of different values for this field.
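
Because not every method applies to every field, it can help to validate the aggregation string before sending it. The sketch below builds the `<field_name>:<aggregation_type>` value and optionally checks it against a field's schema entry; `agg_value` is a hypothetical helper, and the schema entry is assumed to have the shape returned by the Data Schema endpoint.

```python
# The aggregation methods listed above.
AGG_METHODS = ("min", "max", "avg", "sum", "value_count", "cardinality")


def agg_value(field_name, method, field_schema=None):
    """Build the `value` string of an aggregate query.

    If `field_schema` (one entry of the schema's `fields` list) is given,
    the method is also checked against its `agg_metric_methods`.
    """
    if method not in AGG_METHODS:
        raise ValueError("unknown aggregation method: " + method)
    if field_schema is not None:
        allowed = field_schema.get("agg_metric_methods", [])
        if method not in allowed:
            raise ValueError(method + " is not available on " + field_name)
    return field_name + ":" + method


print(agg_value("load_time", "avg"))
```

Failing early on the client side gives a clearer message than a rejected request.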

How to create simple buckets

Average inrank by depth

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "aggs": [{
        "fields": [{
            "name": "depth"
        }],
        "value": "inrank:avg"
      }]
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "fields": [{
            "name": "depth"
        }],
        "value": "inrank:avg"
      }]
    }
).json()

Pages count by range of inlinks

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
      "aggs": [{
        "fields": [{
          "name": "nb_inlinks_range",
          "ranges": [
            {
              "name": "under_10",
              "to": 10
            },
            {
              "name": "10_50",
              "from": 10,
              "to": 51
            },
            {
              "name": "more_50",
              "from": 51
            }
          ]
        }]
      }]
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages/aggs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "aggs": [{
        "fields": [{
          "name": "nb_inlinks_range",
          "ranges": [
            {
              "name": "under_10",
              "to": 10
            },
            {
              "name": "10_50",
              "from": 10,
              "to": 51
            },
            {
              "name": "more_50",
              "from": 51
            }
          ]
        }]
      }]
    }
).json()

When performing an aggregation, you can create buckets for your matching items using the fields parameter which takes an array of JSON objects.

The simplest way is to use the field’s name like so: {"name": "field_name"}.

It will return the item count for each distinct value of field_name.

But not all fields can be used to create a bucket.

To know which fields are available as a bucket you can check the agg_dimension value returned by the Data Schema endpoint.

How to create range buckets

If the field_name returns too many different values it could be useful to group them as ranges.

To do so, add a ranges key that takes an array of ranges. A range is a JSON object with the following keys:

Property Description
name
required
The name that will be returned in the JSON response for this range.
from
optional
The lower bound of this range (inclusive).
to
optional
The upper bound of this range (exclusive).

Only numeric fields can be used with range buckets.
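
Contiguous ranges share their boundaries: the upper bound of one range is the lower bound of the next. The sketch below generates such a list from a few cut points; `ranges_from_bounds` and the generated range names are hypothetical choices, not something the API prescribes.

```python
def ranges_from_bounds(bounds):
    """Turn sorted cut points into contiguous, non-overlapping ranges.

    Each range includes its `from` value and excludes its `to` value, so
    reusing a cut point as both bounds leaves no gap and no overlap.
    """
    edges = [None] + sorted(bounds) + [None]
    ranges = []
    for low, high in zip(edges, edges[1:]):
        entry = {"name": "{}_to_{}".format(
            "min" if low is None else low,
            "max" if high is None else high,
        )}
        if low is not None:
            entry["from"] = low
        if high is not None:
            entry["to"] = high
        ranges.append(entry)
    return ranges


bucket = {"name": "nb_inlinks_range", "ranges": ranges_from_bounds([10, 51])}
print(bucket)
```

The first and last ranges are open-ended, matching the optional nature of from and to.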

Export Queries

Export all pages from the structure.

curl "https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- > my_export.csv <<EOF
    {
        "fields": ["url"],
        "oql": {
            "field": ["depth", "has_value", ""]
        }
    }
EOF
import requests

response = requests.post("https://app.oncrawl.com/api/v2/data/crawl/<crawl_id>/pages?export=true",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' },
    json={
      "fields": ["url"],
       "oql": {
          "field":["depth","has_value", ""]
      }
    }
)

An export query allows you to save the result of your search query as a CSV file.

It does not suffer from the 10K items limitation and allows you to export all of the matching results.

To export the result of your search query as CSV, add ?export=true to the URL.

Property Description
file_type
optional
Can be csv or json (exported as JSONL), defaults to csv.

HTTP response

The response of the query will be a streamed csv file.
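
When the export is requested as JSON, the result is JSONL: one JSON document per line. The sketch below parses such a stream line by line; it assumes that each line is a JSON object keyed by the requested fields, which is not spelled out here, and `iter_jsonl` is a hypothetical helper name. The sample lines are made up for illustration.

```python
import json


def iter_jsonl(lines):
    """Yield one decoded object per non-empty line of a JSONL stream.

    `lines` can be any iterable of str or bytes, for example the lines of
    a file saved from the export, or a streamed HTTP response body.
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample, standing in for a real export.
sample = [
    '{"url": "http://www.website.com/"}',
    "",
    b'{"url": "http://www.website.com/redirect/"}',
]
for row in iter_jsonl(sample):
    print(row["url"])
```

Processing the stream line by line keeps memory usage flat, which matters for exports well beyond the 10K items of a search query.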

Projects API

The Projects API allows you to manage all your projects and your crawls.

With this API you can, for example:

Projects

List projects

Get list of projects.

curl "https://app.oncrawl.com/api/v2/projects" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

projects = requests.get("https://app.oncrawl.com/api/v2/projects",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The projects can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
id The project ID.
name The project’s name.
start_url The project’s start URL.
features The project’s enabled features.

HTTP Response

{
   "meta":{
      "filters":{},
      "limit":100,
      "offset":0,
      "sort":null,
      "total":1
   },
   "projects": [
      "<Project Object>",
      "<Project Object>"
   ]
}

A JSON object with a meta key, described in the pagination section, and a projects key with the list of projects.

Get a project

Get a project.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

project = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
  "project": {
    "id": "592c1e1cf2c3a42743d14350",
    "name": "OnCrawl",
    "start_url": "http://www.oncrawl.com/",
    "user_id": "54dce0f264b65e1eef3ef61b",
    "is_verified_by": "google_analytics",
    "domain": "oncrawl.com",
    "features": [
        "at_internet",
        "google_search_console"
    ],
    "last_crawl_created_at": 1522330515000,
    "last_crawl_id": "5abceb9303d27a70f93151cb",
    "limits": {
        "max_custom_dashboard_count": null,
        "max_group_count": null,
        "max_segmentation_count": null,
        "max_speed": 100
    },
    "log_monitoring_data_ready": true,
    "log_monitoring_processing_enabled": true,
    "log_monitoring_ready": true,
    "crawl_config_ids": [
        "5aa80a1303d27a729113bb2d"
    ],
    "crawl_ids": [
        "5abceb9303d27a70f93151cb"
    ],
    "crawl_over_crawl_ids": [
        "5abcf43203d27a1ecf100b2c"
    ]
  },
  "crawl_configs": [
    "<CrawlConfig Object>"
  ],
  "crawls": [
    "<Crawl Object>"
  ]
}

The HTTP response is a JSON object with three keys: project, crawl_configs and crawls.

The project’s properties are:

Property Description
id The project ID.
name The project’s name.
start_url The project’s start URL.
user_id The ID of the project’s owner.
is_verified_by Holds how the project’s ownership was verified.
Can be google_analytics, google_search_console, admin or null.
domain The start URL’s domain.
features List of project’s enabled features.
last_crawl_id The ID of the latest created crawl.
last_crawl_created_at UTC timestamp of the latest created crawl, in milliseconds.
Defaults to null.
limits An object with customized limits for this project.
log_monitoring_data_ready true if the project’s log monitoring index is ready to be searched.
log_monitoring_processing_enabled true if the project’s files for the log monitoring are automatically processed.
log_monitoring_ready true if the project’s log monitoring configuration was submitted.
crawl_over_crawl_ids The list of Crawl over Crawl IDs attached to this project.
crawl_ids The list of Crawl IDs for this project.
crawl_config_ids The list of crawl configuration IDs for this project.
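
Timestamps such as last_crawl_created_at are expressed in milliseconds, while most Python date functions expect seconds. A minimal conversion sketch (`from_millis` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def from_millis(ms):
    """Convert a UTC timestamp in milliseconds to an aware datetime.

    Returns None when the value is null, as can happen for a project
    that has no crawl yet.
    """
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Value taken from the example project above.
print(from_millis(1522330515000).isoformat())
```

Keeping the result timezone-aware avoids accidentally interpreting the value in local time.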

Create a project

Create a project.

curl -X POST "https://app.oncrawl.com/api/v2/projects" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "project": {
            "name": "Project name",
            "start_url": "https://www.oncrawl.com"
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects", json={
      "project": {
        "name": "Project name",
        "start_url": "https://www.oncrawl.com"
      }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

Property Description
name
required
The project’s name, must be unique.
start_url
required
The project’s start URL, starting with http:// or https://.

HTTP Response

Example of HTTP response

{
  "project": "<Project Object>"
}

An HTTP 200 status code is returned, with the created project in the response under a project key.

Delete a project

Delete a project.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Scheduling

Scheduling allows you to start a crawl at a later date, run it automatically at regular intervals, or both.

Schedule your crawls to be run every week or every month and never think about it again.

List scheduled crawls

Get list of scheduled crawls.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

scheduled_crawls = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The scheduled crawls can be paginated using the parameters described in the pagination section.

No sort or filters are available.

HTTP Response

{
   "meta":{
      "filters": {},
      "limit":50,
      "offset":0,
      "sort": null,
      "total":1
   },
   "scheduled_crawls": [
      {
         "config_id":"59f3048cc87b4428618d7c44",
         "id":"5abdeb0f03d27a69ef169c52",
         "project_id":"592c1e1cf2c3a42743d14350",
         "recurrence":"week",
         "start_date":1522482300000
      }
   ]
}

A JSON object with a meta key, described in the pagination section, and a scheduled_crawls key with the list of scheduled crawls for this project.

Create a scheduled crawl

HTTP request

Create a scheduled crawl.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "scheduled_crawl": {
            "config_id": "59f3048cc87b4428618d7c49",
            "recurrence": "week",
            "start_date": 1522482300000
        }
    }
EOF
import requests

# The scheduled crawl is created under the project, not on /projects itself
requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls", json={
    "scheduled_crawl": {
        "config_id": "59f3048cc87b4428618d7c49",
        "recurrence": "week",
        "start_date": 1522482300000
    }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

The request is expected to be a JSON object with a scheduled_crawl key and the following properties:

Property Description
config_id
required
The ID of the crawl configuration to schedule.
recurrence
optional
Can be day, week, 2weeks or month.
start_date
required
A UTC timestamp in milliseconds for when to start the first crawl.
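
Computing start_date by hand is easy to get wrong, so the sketch below derives it from a timezone-aware datetime and assembles the request body. The helper name `scheduled_crawl_payload` is hypothetical; the recurrence values are those listed above.

```python
from datetime import datetime, timezone

RECURRENCES = ("day", "week", "2weeks", "month")


def scheduled_crawl_payload(config_id, start, recurrence=None):
    """Build the JSON body for creating a scheduled crawl.

    `start` must be timezone-aware so that the conversion to a UTC
    timestamp in milliseconds is unambiguous.
    """
    if start.tzinfo is None:
        raise ValueError("start must be a timezone-aware datetime")
    scheduled_crawl = {
        "config_id": config_id,
        "start_date": int(start.timestamp() * 1000),
    }
    if recurrence is not None:
        if recurrence not in RECURRENCES:
            raise ValueError("unknown recurrence: " + recurrence)
        scheduled_crawl["recurrence"] = recurrence
    return {"scheduled_crawl": scheduled_crawl}


payload = scheduled_crawl_payload(
    "59f3048cc87b4428618d7c49",
    datetime(2018, 3, 31, 7, 45, tzinfo=timezone.utc),
    recurrence="week",
)
print(payload)
```

The resulting dictionary can be passed as the JSON body of the POST request.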

HTTP Response

Example of HTTP response

{
   "scheduled_crawl":{
      "config_id":"59f3048cc87b4428618d7c29",
      "id":"5abdeb0f03d27a69ef169c53",
      "project_id":"592c1e1cf2c3a42743d14350",
      "recurrence":"week",
      "start_date":1522482300000
   }
}

An HTTP 200 status code is returned, with the created scheduled crawl inside a scheduled_crawl key.

Delete a scheduled crawl

Delete a scheduled crawl.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls/<scheduled_crawl_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/scheduled_crawls/<scheduled_crawl_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Crawls

Launch a crawl

Launch a crawl.

curl -X POST "https://app.oncrawl.com/api/v2/projects/<project_id>/launch-crawl?configId=<crawl_config_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/launch-crawl?configId=<crawl_config_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

You have to pass a configId parameter in the query string with the ID of the crawl configuration you want to launch.

HTTP Response

Example of HTTP response

{
  "crawl": "<Crawl Object>"
}

Returns an HTTP 200 status code if successful, with the created crawl inside a crawl key.

List crawls

Get list of crawls.

curl "https://app.oncrawl.com/api/v2/crawls" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawls = requests.get("https://app.oncrawl.com/api/v2/crawls",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The crawls can be paginated and filtered using the parameters described in the pagination section.

The fields available for the sort and filters are:

Property Description
id The crawl’s ID.
user_id The crawl’s owner ID.
project_id The crawl’s project ID.
status The crawl’s status.
Can be running, done, cancelled, terminating, pausing, paused, archiving, unarchiving, archived.
created_at The crawl’s creation date as UTC timestamp in milliseconds.

HTTP Response

{
   "meta":{
      "filters":{},
      "limit":100,
      "offset":0,
      "sort":null,
      "total":1
   },
   "crawls": [
      "<Crawl Object>",
      "<Crawl Object>"
   ]
}

A JSON object with a meta key, described in the pagination section, and a crawls key with the list of crawls.
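
The meta object can be used to walk through every page of results. The sketch below is one way to do it, assuming the pagination parameters are named limit and offset like the meta keys (refer to the pagination section for the authoritative names). The iter_crawls helper and the fetch_page callable are hypothetical; a stand-in replaces the HTTP call so the example runs offline.

```python
def iter_crawls(fetch_page, limit=100):
    """Yield every crawl by walking pages until meta.total is reached.

    fetch_page(offset, limit) is a caller-supplied function returning the
    decoded JSON of one "List crawls" response.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        crawls = page["crawls"]
        for crawl in crawls:
            yield crawl
        offset += len(crawls)
        # Stop once everything was read, or if the server sends an empty page.
        if not crawls or offset >= page["meta"]["total"]:
            break


# Stand-in for the HTTP call: serves five fake crawls page by page.
FAKE_CRAWLS = [{"id": str(i)} for i in range(5)]


def fake_fetch_page(offset, limit):
    return {
        "meta": {"limit": limit, "offset": offset, "total": len(FAKE_CRAWLS)},
        "crawls": FAKE_CRAWLS[offset:offset + limit],
    }


all_ids = [crawl["id"] for crawl in iter_crawls(fake_fetch_page, limit=2)]
```

In real use, fetch_page would wrap a requests.get call to /api/v2/crawls and return its .json() result.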

Get a crawl

Get a crawl.

curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawl = requests.get("https://app.oncrawl.com/api/v2/crawls/<crawl_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
   "crawl": {
     "id":"5a57819903d27a7faa253683",
     "project_id":"592c1e1cf2c3a42743d14341",
     "user_id":"54dce0f264b65e1eef3ef61b",
     "link_status":"live",
     "status":"done",
     "created_at":1515684249000,
     "ended_at":1515685455000,
     "fetched_urls":10,  
     "end_reason":"max_url_reached",
     "features":[
        "at_internet"
     ],
      "crawl_config": "<CrawlConfig Object>",
      "cross_analysis_access_logs": null,
      "cross_analysis_at_internet": {
         "dates":{
            "from":"2017-11-26",
            "to":"2018-01-10"
         }
      },
      "cross_analysis_google_analytics": {
        "error": "No quota remaining."
      },
      "cross_analysis_majestic_back_links": {
         "stores":[
            {
               "name":"www.oncrawl.com",
               "success": true,
               "sync_date":"2017-10-27"
            }
         ],
         "tld":{
            "citation_flow":35,
            "name":"oncrawl.com",
            "trust_flow":29
         }
      }
   }
}

The HTTP response is a JSON object with a single crawl key containing the crawl’s data.

The crawl’s properties are:

Property Description
id The crawl ID.
project_id The crawl’s project ID.
user_id The crawl’s owner ID.
link_status The links index status.
Can be live or archived.
status The crawl’s status.
Can be running, done, cancelled, terminating, pausing, paused, archiving, unarchiving, archived.
created_at Date of the crawl creation as a UTC timestamp in milliseconds.
ended_at Date of the crawl termination as a UTC timestamp in milliseconds.
fetched_urls Number of URLs that were fetched for this crawl.
last_depth At what depth the crawl ended.
end_reason A code describing why the crawl stopped.
This value may not be present.
features List of features available for this crawl.
crawl_config The crawl configuration object used for this crawl.
cross_analysis_access_logs Dates used by the Logs monitoring cross analysis.
null if no cross analysis was done.
cross_analysis_at_internet Dates used by the AT Internet cross analysis.
null if no cross analysis was done.
cross_analysis_google_analytics Dates used by the Google Analytics cross analysis.
null if no cross analysis was done.
cross_analysis_majestic_back_links Majestic cross analysis metadata.
null if no cross analysis available.

End reasons

ok
All the URLs of the structure have been crawled.
crawl_already_running
A crawl with the same configuration was already running.
quota_reached_before_start
When a scheduled crawl could not run because of missing quota.
quota_reached
When the URL quota was reached during the crawl.
max_url_reached
When the maximum number of URLs defined in the crawl configuration was reached.
max_depth_reached
When the maximum depth defined in the crawl configuration was reached.
user_cancelled
When the crawl was manually cancelled.
user_requested
When the crawl was manually terminated and a partial crawl report was produced.
no_active_subscription
When no active subscription was available.
stopped_progressing
Technical end reason: at the end of the crawl, there are still unfetched URLs, but for some reason the crawler is unable to fetch them. To prevent the crawler from iterating indefinitely, we abort the fetch phase when, after three attempts, it still has not managed to crawl those pages.
max_iteration_reached
Technical end reason: the crawl progressed abnormally slowly. It can happen, for example, when the website server is very busy and randomly drops connections. We abort the fetch phase after 500 iterations when we detect this pathological server behavior.
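
When processing many crawls, it can help to group these codes. The sketch below is only an illustration: the category names and the end_reason_category helper are hypothetical and are not returned by the API. It also accounts for the fact that end_reason may be absent.

```python
# Illustrative grouping of the end reasons documented above.
END_REASON_CATEGORIES = {
    "complete": {"ok"},
    "limit": {"max_url_reached", "max_depth_reached"},
    "account": {"quota_reached_before_start", "quota_reached", "no_active_subscription"},
    "user": {"user_cancelled", "user_requested"},
    "skipped": {"crawl_already_running"},
    "technical": {"stopped_progressing", "max_iteration_reached"},
}


def end_reason_category(crawl):
    """Return a coarse category for a crawl object's end_reason."""
    reason = crawl.get("end_reason")  # this value may not be present
    if reason is None:
        return "unknown"
    for category, reasons in END_REASON_CATEGORIES.items():
        if reason in reasons:
            return category
    # A code not listed in this documentation.
    return "other"
```

For instance, crawls in the technical category are the ones worth investigating on the website server side.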

Get a crawl progress

Get the progress of a crawl.

curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

progress = requests.get("https://app.oncrawl.com/api/v2/crawls/<crawl_id>/progress",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

You can call this endpoint on a running crawl to follow its progress.

For example, it allows you to monitor whether the crawler encounters an abnormal number of errors.

HTTP request

This endpoint takes no parameters.

HTTP Response

{
   "progress":{
      "analysis":{
         "progress":0.0
      },
      "fetch":{
         "depth_progress":[
            {
               "cumulative_fetched":1,
               "depth":1,
               "statuses":{
                  "fetched_2xx":1,
                  "fetched_3xx":0,
                  "fetched_4xx":0,
                  "fetched_5xx":0,
                  "unfetched_exception":0,
                  "unfetched_robots_denied":0
               },
               "total_known_urls":1
            },
            {
               "cumulative_fetched":36,
               "depth":2,
               "statuses":{
                  "fetched_2xx":29,
                  "fetched_3xx":6,
                  "fetched_4xx":0,
                  "fetched_5xx":0,
                  "unfetched_exception":0,
                  "unfetched_robots_denied":0
               },
               "total_known_urls":36
            }
         ],
         "max_depth":5,
         "samples":{
            "fetched_2xx":[
               {
                  "fetch_date":1522400745000,
                  "fetch_duration":214,
                  "status_code":200,
                  "url":"https://www.oncrawl.com/"
               }
            ]
         },
         "total_fetched":127,
         "total_known_urls":127
      }
   }
}

The HTTP response is a JSON object with a single progress key containing the crawl’s progress.

The properties are:

Property Description
fetch.total_known_urls The total number of discovered URLs during the crawl.
fetch.total_fetched The total number of fetched URLs during the crawl.
fetch.max_depth The crawler’s current depth.
fetch.samples A list of sample URLs per status.
It varies during the crawl and may not have a sample for a status.
fetch.depth_progress A detailed status per depth.
analysis.progress A decimal between 0.0 and 1.0 that gives the progress of the analysis.

Fetch statuses

fetched_2xx
Status code between 200 and 299
fetched_3xx
Status code between 300 and 399
fetched_4xx
Status code between 400 and 499
fetched_5xx
Status code between 500 and 599
unfetched_exception
Unable to fetch a URL (e.g. a server timeout).
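
To monitor errors during a crawl, the statuses of each depth can be summed. The following sketch computes the share of URLs that ended in an error status. The error_ratio helper is hypothetical, and treating fetched_4xx, fetched_5xx and unfetched_exception as errors is an assumption made for the example.

```python
# Statuses counted as errors in this sketch (an assumption, adjust as needed).
ERROR_STATUSES = ("fetched_4xx", "fetched_5xx", "unfetched_exception")


def error_ratio(progress):
    """Return the share of counted URLs whose status is an error.

    progress is the object found under the "progress" key of the response.
    """
    totals = {}
    for level in progress["fetch"]["depth_progress"]:
        for status, count in level["statuses"].items():
            totals[status] = totals.get(status, 0) + count
    counted = sum(totals.values())
    if counted == 0:
        return 0.0
    errors = sum(totals.get(status, 0) for status in ERROR_STATUSES)
    return errors / counted


sample = {
    "fetch": {
        "depth_progress": [
            {"depth": 1, "statuses": {"fetched_2xx": 8, "fetched_4xx": 1, "fetched_5xx": 1}},
        ]
    }
}
ratio = error_ratio(sample)
```

A caller could compare the ratio with a threshold of its choice and pause the crawl when it is exceeded.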

Update crawl state

HTTP request

Pause a running crawl

curl "https://app.oncrawl.com/api/v2/crawls/<crawl_id>/pilot" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "command": "pause"
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/crawls/<crawl_id>/pilot", json={
      "command": "pause"
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

You have to pass a JSON object with a command key and the desired state.

The crawl’s commands are:

Command Description
cancel Cancel the crawl. It won’t produce a report.
Crawl must be running or paused.
resume Resume a paused crawl.
Crawl must be paused.
pause Pause a crawl.
Crawl must be running.
terminate Terminate a crawl early. It will produce a report.
Crawl must be running or paused.
unarchive Un-archive all of the crawl’s data.
Crawl must be archived or link_status must be archived.
unarchive-fast Un-archive crawl’s data except links.
Crawl must be archived.
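
The preconditions in this table can be checked on the client side before sending a command. This is a minimal sketch: the can_send_command helper is hypothetical, and it assumes the links index status is read from the link_status property of the crawl object.

```python
# Crawl statuses in which each command is accepted, taken from the table above.
ALLOWED_STATUSES = {
    "cancel": {"running", "paused"},
    "resume": {"paused"},
    "pause": {"running"},
    "terminate": {"running", "paused"},
    "unarchive": {"archived"},
    "unarchive-fast": {"archived"},
}


def can_send_command(command, crawl):
    """Tell whether a command matches the crawl's current state."""
    if command not in ALLOWED_STATUSES:
        raise ValueError("unknown command: %r" % (command,))
    if crawl.get("status") in ALLOWED_STATUSES[command]:
        return True
    # unarchive is also accepted when only the links index is archived.
    return command == "unarchive" and crawl.get("link_status") == "archived"
```

The API remains the source of truth; this check only avoids requests that are expected to fail.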

HTTP Response

Example of HTTP response

{
  "crawl": "<Crawl Object>"
}

Returns an HTTP 200 status code if successful, with the updated crawl inside a crawl key.

Delete a crawl

Delete a crawl.

curl -X DELETE "https://app.oncrawl.com/api/v2/crawls/<crawl_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/crawls/<crawl_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Crawl Configurations

List configurations

Get list of crawl configurations.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawl_configs = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Request

The endpoint does not take any parameter.

HTTP Response

{
   "crawl_configs": [
      "<CrawlConfig Object>",
      "<CrawlConfig Object>"
   ]
}

A JSON object with a crawl_configs key with the list of crawl configurations.

Get a configuration

Get a configuration.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

crawl_config = requests.get("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
).json()

HTTP Response

{
  "crawl_config": {
       "agent_kind":"web",
       "ajax_crawling":false,
       "allow_query_params":true,
       "alternate_start_urls": [],
       "at_internet_params": {},
       "crawl_subdomains":false,
       "custom_fields":[],
       "dns": [],
       "extra_headers": {},
       "filter_query_params":false,
       "google_analytics_params":{},
       "google_search_console_params":{},
       "http_auth":{ },
       "id":"592c1f53973cb53b75287a79",
       "js_rendering":false,
       "majestic_params":{},
       "max_depth":15,
       "max_speed":10,
       "max_url":2000000,
       "name":"default",
       "query_params_list":"",
       "resource_checker":false,
       "reuse_cookies":false,
       "robots_txt":[],
       "scheduling_period":null,
       "scheduling_start_date":null,
       "scheduling_timezone":"Europe/Paris",
       "sitemaps":[],
       "start_url":"http://www.oncrawl.com/",
       "strict_sitemaps":true,
       "trigger_coc":false,
       "use_cookies":true,
       "use_proxy":false,
       "user_agent":"OnCrawl",
       "user_input_files":[],
       "watched_resources":[],
       "webhooks":[],
       "whitelist_params_mode":true
    }
}

The HTTP response is a JSON object with the crawl configuration inside a crawl_config key.

The crawl configuration’s base properties are:

Property Description
agent_kind The type of user agent.
Values are web or mobile.
ajax_crawling true if the website should be crawled as a pre-rendered JavaScript website, false otherwise.
allow_query_params true if the crawler should follow URLs with query parameters, false otherwise.
alternate_start_urls List of alternate start URLs. All those URLs will start with a depth of 1.
They must all belong to the same domain.
at_internet_params Configuration for AT Internet cross analysis.
The AT Internet cross analysis feature is required.
crawl_subdomains true if the crawler should follow links of all the subdomains.
Example: http://blog.domain.com for http://www.domain.com.
custom_fields Configuration for custom fields scraping.
The Data Scraping feature is required.
dns Override the crawler’s default DNS.
extra_headers Defines additional headers for the HTTP requests done by the crawler.
filter_query_params true if the query string of URLs should be stripped.
google_analytics_params Configuration for the Google Analytics cross analysis.
The Google Analytics cross analysis feature is required.
google_search_console_params Configuration for the Google Search Console cross analysis.
The Google Search Console cross analysis feature is required.
http_auth Configuration for the HTTP authentication of the crawler.
id The ID of this crawl configuration.
js_rendering true if the crawler should render the crawled pages using JavaScript.
The Crawl JS feature is required.
majestic_params Configuration for the Majestic Back-links cross analysis.
The Majestic Back-Links feature is required.
max_depth The maximum depth after which the crawler will stop following links.
max_speed The maximum speed at which the crawler should go, in number of URLs per second. Valid values are 0.1, 0.2, 0.5, 1, 2, 5, then every multiple of 5 up to your maximum allowed crawl speed.
To crawl above 1 URL/s you need to verify the ownership of the project.
max_url The maximum number of fetched URLs after which the crawler will stop.
name The name of the configuration.
Only used as a label to easily identify it.
query_params_list If filter_query_params is true, this is a comma-separated list of query parameter names to filter. The whitelist_params_mode parameter defines how to filter them.
resource_checker true if the crawler should watch for requested resources during the crawl, false otherwise. This feature requires js_rendering:true.
reuse_cookies
deprecated
Not used anymore.
robots_txt List of configured virtual robots.txt.
The project’s ownership must be verified to use this option.
scheduling_period
deprecated
Not used anymore.
scheduling_start_date
deprecated
Not used anymore.
scheduling_timezone
deprecated
Not used anymore.
sitemaps List of sitemap URLs.
start_url The start URL of the crawl.
This URL should not be a redirection to another URL.
strict_sitemaps true if the crawler should follow strictly the sitemaps protocol, false otherwise.
trigger_coc true if the crawler should automatically generate a Crawl over Crawl at the end.
The Crawl over Crawl feature is required.
use_cookies true if the crawler should keep the cookies returned by the server between requests, false otherwise.
use_proxy true if the crawler should use the OnCrawl proxy which allows it to keep a static range of IP addresses during its crawl.
user_agent Name of the crawler. This name will appear in the user agent sent by the crawler.
user_input_files List of IDs of the ingested data files to use in this crawl.
The Data Ingestion feature is required.
watched_resources List of patterns to watch if resource_checker is set to true.
webhooks List of webhooks to call during the crawl.
whitelist_params_mode true if the query_params_list should be used as a whitelist, false if it should be used as a blacklist.
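
The rule for max_speed can be checked before sending a configuration. The sketch below follows the list of valid values given above. The is_valid_max_speed helper is hypothetical, and max_allowed stands for your maximum allowed crawl speed, which this endpoint does not return.

```python
# Valid values below the first multiple of 5, as listed in the table above.
LOW_SPEEDS = (0.1, 0.2, 0.5, 1, 2, 5)


def is_valid_max_speed(speed, max_allowed):
    """Check a max_speed value, in URLs per second, against the documented rule."""
    if speed <= 0 or speed > max_allowed:
        return False
    if speed in LOW_SPEEDS:
        return True
    # Above 5, only multiples of 5 are valid.
    return speed > 5 and speed % 5 == 0
```

Keep in mind that crawling above 1 URL/s also requires verifying the ownership of the project, which this check does not cover.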

Create a configuration

Create a crawl configuration.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
    {
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
    }
EOF
import requests

requests.post("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

The expected HTTP request is exactly the same format as the response when you retrieve a crawl configuration.

The id is automatically generated by the API for any new crawl configuration and must not be part of the payload.

The only required fields are name, start_url, user_agent and max_speed.
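
As a sketch of these rules, the payload can be assembled with a small helper that rejects an id and checks the required fields. The build_crawl_config_payload name is hypothetical and not part of the API.

```python
# Required fields for a new crawl configuration, as stated above.
REQUIRED_FIELDS = ("name", "start_url", "user_agent", "max_speed")


def build_crawl_config_payload(**fields):
    """Build the JSON body used to create a crawl configuration."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    if "id" in fields:
        raise ValueError("id is generated by the API and must not be sent")
    return {"crawl_config": dict(fields)}


payload = build_crawl_config_payload(
    name="New crawl configuration",
    start_url="https://www.oncrawl.com",
    user_agent="OnCrawl",
    max_speed=1,
    max_depth=15,
)
```

Optional properties, such as max_depth here, are passed the same way as the required ones.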

HTTP Response

Example of HTTP response

{
  "crawl_config": "<CrawlConfig Object>"
}

An HTTP 200 status code is returned, with the created crawl configuration inside a crawl_config key.

AT Internet

{
  "at_internet_params": {
    "api_key": "YOUR_API_KEY",
    "site_id": "YOUR_SITE_ID"
  }
}

A subscription with the AT Internet feature is required to use this configuration.

You can request an API Key in API Accounts within the settings area of your AT Internet homepage.

This API Key is necessary to allow OnCrawl to access your AT Internet data.

The site_id specifies which site we should collect the data from.

The HTTP requests that you need to whitelist are:

Note: You must replace the {site_id} of both URLs with the actual site ID.

Without this, OnCrawl won’t be able to fetch the data.

Google Analytics

{
  "google_analytics_params": {
    "email": "local@domain.com",
    "account_id": "12345678",
    "website_id": "UA-12345678-9",
    "profile_id": "12345678"
  }
}

A subscription with the Google Analytics feature is required to use this configuration.

You have to provide the following properties:

Property Description
email Email of your Google account.
account_id ID of your Google Analytics account.
website_id ID of your website in Google Analytics.
profile_id ID of the website’s profile to use for cross analysis.

To use a Google Account you must first give access to your analytics data to OnCrawl using OAuth2.

For now you must use the OnCrawl web client to add your Google account.

Google Search Console

{
  "google_search_console_params": {
    "email": "local@domain.com",
    "websites": [
      "https://www.oncrawl.com"
    ],
    "branded_keywords": [
      "oncrawl",
      "on crawl",
      "oncrowl"
    ]
  }
}

A subscription with the Google Search Console feature is required to use this configuration.

You have to provide the following properties:

Property Description
email Email of your Google account.
websites List of the websites URLs from your Google Search Console to use.
branded_keywords List of keywords that the crawler should consider as part of a brand.

To use a Google Account you must first give access to your analytics data to OnCrawl using OAuth2.

For now you must use the OnCrawl web client to add your Google account.

Majestic

{
  "majestic_params": {
    "access_token": "ABCDEF1234"
  }
}

A subscription with the Majestic feature is required to use this configuration.

You have to provide the following property:

Property Description
access_token An access token that the crawler can use to access your data.

You can create an access token authorizing OnCrawl to access your Majestic data here.

Custom fields

Documentation not available yet.

Webhooks

Documentation not available yet.

DNS

{
  "dns": [{
    "host": "www.oncrawl.com",
    "ips": [ "82.34.10.20", "82.34.10.21" ]
  }, {
    "host": "fr.oncrawl.com",
    "ips": [ "82.34.10.20" ]
  }]
}

The dns configuration allows you to resolve one or several domains to IP addresses other than the ones they would normally resolve to.

This can be useful to crawl a website in pre-production as if it was already deployed on the real domain.
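
A minimal sketch of how such a configuration could be generated from a mapping of hosts to IP addresses. The build_dns_config helper is hypothetical. The syntax check relies on the standard ipaddress module, which accepts both IPv4 and IPv6; whether the crawler supports IPv6 overrides is not stated here.

```python
import ipaddress


def build_dns_config(overrides):
    """Build the dns property from a {host: [ip, ...]} mapping."""
    entries = []
    for host, ips in overrides.items():
        for ip in ips:
            # Raises ValueError when the address is malformed.
            ipaddress.ip_address(ip)
        entries.append({"host": host, "ips": list(ips)})
    return {"dns": entries}


dns_config = build_dns_config({
    "www.oncrawl.com": ["82.34.10.20", "82.34.10.21"],
    "fr.oncrawl.com": ["82.34.10.20"],
})
```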

Extra HTTP headers

{
  "extra_headers": {
    "Cookie": "lang=fr;",
    "X-My-Token": "1234"
  }
}

The extra_headers configuration allows you to inject custom HTTP headers to each of the crawl’s HTTP requests.

HTTP Authentication

{
  "http_auth": {
    "username": "user",
    "password": "1234",
    "scheme": "Digest",
    "realm": null
  }
}

The http_auth configuration allows you to crawl sites that require HTTP authentication.

It can be useful to crawl a website in pre-production that is password protected before its release.

Property Description
username
required
Username to authenticate with.
password
required
Password to authenticate with.
scheme
required
How to authenticate. Available values are Basic, Digest and NTLM.
realm
optional
The authentication realm.
For NTLM, this corresponds to the domain.
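
The properties above can be assembled as follows. This is only a sketch: the build_http_auth helper is hypothetical, and it simply rejects a scheme that is not listed in the table.

```python
# Authentication schemes listed in the table above.
SCHEMES = ("Basic", "Digest", "NTLM")


def build_http_auth(username, password, scheme, realm=None):
    """Build the http_auth property of a crawl configuration."""
    if scheme not in SCHEMES:
        raise ValueError("scheme must be one of: " + ", ".join(SCHEMES))
    return {
        "http_auth": {
            "username": username,
            "password": password,
            "scheme": scheme,
            # Optional; for NTLM it holds the domain.
            "realm": realm,
        }
    }


http_auth = build_http_auth("user", "1234", "Digest")
```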

Robots.txt

{
  "robots_txt": [{
    "host": "www.oncrawl.com",
    "content": "CONTENT OF YOUR ROBOTS.TXT"
  }]
}

The robots_txt configuration allows you to override, for a given host, its robots.txt.

It can be used to:

Because you can make the crawler ignore the robots.txt of a website, it is necessary to verify the ownership of this project to use this feature.

For now you can only verify the ownership using the OnCrawl application.

Update a configuration

Update a crawl configuration.

curl "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs" \
    -H "Authorization: Bearer {ACCESS_TOKEN}" \
    -H "Content-Type: application/json" \
    -X PUT \
    -d @- <<EOF
    {
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
    }
EOF
import requests

requests.put("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs", json={
        "crawl_config": {
            "name": "New crawl configuration",
            "start_url": "https://www.oncrawl.com",
            "user_agent": "OnCrawl",
            "max_speed": 1
        }
  },
  headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

It takes the same parameters as a crawl configuration creation, except that the name cannot be modified and must stay the same.
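
One possible way to prepare an update is to start from the configuration returned by the API and apply the changes on top of it. The sketch below assumes, as for a creation, that the id must be left out of the payload. The build_update_payload helper is hypothetical.

```python
def build_update_payload(current_config, **changes):
    """Build the body of an update from an existing crawl configuration."""
    if "name" in changes and changes["name"] != current_config["name"]:
        raise ValueError("the name of a crawl configuration cannot be modified")
    # Copy the existing configuration without its id, then apply the changes.
    updated = {key: value for key, value in current_config.items() if key != "id"}
    updated.update(changes)
    updated.pop("id", None)
    return {"crawl_config": updated}


current = {
    "id": "592c1f53973cb53b75287a79",
    "name": "default",
    "start_url": "http://www.oncrawl.com/",
    "user_agent": "OnCrawl",
    "max_speed": 10,
}
payload = build_update_payload(current, max_speed=5)
```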

HTTP response

It returns the same response as a crawl configuration creation.

Delete a configuration

Delete a configuration.

curl -X DELETE "https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>" \
    -H "Authorization: Bearer {ACCESS_TOKEN}"
import requests

requests.delete("https://app.oncrawl.com/api/v2/projects/<project_id>/crawl_configs/<crawl_config_id>",
    headers={ 'Authorization': 'Bearer {ACCESS_TOKEN}' }
)

HTTP request

No HTTP parameters.

HTTP Response

Returns an HTTP 204 status code if successful.

Fields

This is the list of OnCrawl fields that are exported while using our Data Studio connector.

They are listed below by category. For each field you’ll find the following information: name, definition, type and arity.

The OnCrawl field type can be one of the following:

Type Definition
integer integer number
natural non-negative integer (>= 0)
float floating-point number
percentage floating-point number between 0 and 1
string a sequence of characters, text
enum a string from a defined list of values
bool boolean (true or false)
datetime a timestamp, in the following format: yyyy/MM/dd HH:mm:ss z
date a date in the following format: yyyy-MM-dd
object a raw JSON object
hash a hashed string (the output of a hash function)

Note: These types are the ones exposed by the OnCrawl API. The underlying storage for the Data Studio connector may use slightly different type names or date/time formats.

There are two possible values for arity:

Arity Definition
one the field holds a single value
many the field holds a list of values
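
To illustrate how type and arity combine, here is a small sketch that checks decoded values against a few of the types above. The check_value helper is hypothetical, it covers only natural, percentage and date, and it assumes dates arrive as strings.

```python
from datetime import datetime


def check_value(value, field_type, arity="one"):
    """Check a value against a field type and arity (partial sketch)."""
    if arity == "many":
        # A "many" field holds a list of values of the given type.
        return isinstance(value, list) and all(check_value(v, field_type) for v in value)
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "natural":
        return isinstance(value, int) and not isinstance(value, bool) and value >= 0
    if field_type == "percentage":
        return is_number and 0 <= value <= 1
    if field_type == "date":
        try:
            datetime.strptime(value, "%Y-%m-%d")  # yyyy-MM-dd
        except (TypeError, ValueError):
            return False
        return True
    raise ValueError("type not covered by this sketch: %s" % field_type)
```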

Content

Field name Definition Type Arity
language Language code, in ISO 639 two-letter format. Either parsed from the HTML or detected from the text. string one
text_to_code Number of text characters divided by the total number of characters in the HTML. percentage one
word_count The number of words on the page. natural one

Duplicate content

Field name Definition Type Arity
clusters OnCrawl IDs of the groups of URLs with similar content that this URL belongs to. hash many
nearduplicate_content Whether this page's content is very similar to another page, according to our SimHash-based algorithm. bool one
nearduplicate_content_similarity Highest ratio of content similarity, compared to other pages in the cluster. percentage one
duplicate_description_status Status of duplication issues for the group of pages with the same meta description as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one
duplicate_h1_status Status of duplication issues for the group of pages with the same H1 as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one
duplicate_title_status Status of duplication issues for the group of pages with the same title tag as this page: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one
has_duplicate_description_issue Whether there are duplications of this page's meta description on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
has_duplicate_h1_issue Whether there are duplications of this page's H1 on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
has_duplicate_title_issue Whether there are duplications of this page's title tag on other pages that are not handled using canonical or hreflang declarations. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
has_nearduplicate_issue Whether the similarity of this page with other pages is handled using canonical or hreflang descriptions. 'true' if problems remain and 'false' if correctly handled or if there's no duplication. bool one
nearduplicate_status Status of duplication issues for the group of pages with similar content to this one: canonical_ok (duplication is correctly handled using canonical declarations), hreflang_ok (duplication is correctly handled using hreflang declarations), canonical_not_matching (canonical declarations within the group do not match), hreflang_error (the implementation of hreflang declarations within the group has errors), canonical_not_set (no hreflang or canonical declarations) enum one

Hreflang errors

Field name Definition Type Arity
hreflang_cluster_id OnCrawl ID of the group of pages that reference one another through hreflang declarations. hash one

Indexability

Field name Definition Type Arity
meta_robots List of values in the meta robots tag. string many
meta_robots_follow Whether the links on the page should be followed (true) or not (false) according to the meta robots. bool one
meta_robots_index Whether the page should be indexed (true) or not (false) according to the meta robots. bool one
robots_txt_denied Whether the crawler was denied by the robots.txt file while visiting this page. bool one

Linking & popularity

Field name Definition Type Arity
depth Page depth in number of clicks from the crawl's Start URL. natural one
external_follow_outlinks Number of followable outlinks to pages on other domains. natural one
external_nofollow_outlinks Number of nofollow outlinks to pages on other domains. natural one
external_outlinks Number of outlinks to other domains. natural one
external_outlinks_range Range of the number of outlinks to pages on other domains: 0-50, 50-100, 100-150, 150-200, >200 enum one
follow_inlinks Number of followable links pointing to a URL from other pages on the same site. natural one
inrank Whole number from 0-10 indicating the URL's relative PageRank within the site. Higher numbers indicate better popularity. natural one
inrank_decimal Decimal number from 0-10 indicating the URL's relative PageRank within the site. Higher numbers indicate better popularity. float one
internal_follow_outlinks Number of followable links from this page to other pages on the same site. natural one
internal_nofollow_outlinks Number of nofollow links from this page to other pages on the same site. natural one
internal_outlinks Number of outlinks to pages on the same site. natural one
internal_outlinks_range Range of the number of outlinks to pages on the same site: 0-50, 50-100, 100-150, 150-200, >200 enum one
nb_inlinks Number of links pointing to this page from other pages on this site. natural one
nb_inlinks_range Range of values that the number of links to this page from other pages on the site falls into. Ranges are: 0-50, 50-100, 100-150, 150-200, >200 enum one
nb_outlinks_range Range of values that the number of links from this page falls into. Ranges are: 0-50, 50-100, 100-150, 150-200, >200 enum one
nofollow_inlinks Number of links pointing to this page with a rel="nofollow" tag. natural one

OnCrawl bot

Field name Definition Type Arity
fetch_date Date on which the OnCrawl bot obtained the URL's source code expressed as yyyy/MM/dd HH:mm:ss z datetime one
fetch_status Whether the OnCrawl bot successfully obtained the URL's source code. Indicates "success" when true. string one
fetched Whether the OnCrawl bot obtained the URL's source code (true) or not (false). bool one
parsed_html Whether the OnCrawl bot was able to obtain an HTTP status and textual content for this page. bool one
sources List of sources for this page. Sources may be: OnCrawl bot, at_internet, google_analytics, google_search_console, ingest_data, logs_cross_analysis, majestic, adobe_analytics, sitemaps string many

Payload

Field name Definition Type Arity
load_time Time (in milliseconds) it took to fetch the entire HTML of the page, excluding external resources. Also known as "time to last byte" (TTLB). natural one
weight The size of the page in KB, excluding resources. natural one

Redirect chains & loops

Field name Definition Type Arity
final_redirect_location Final URL reached after following a chain of one or more 3xx redirects. string one
final_redirect_status HTTP status code of the final URL reached after following a chain of one or more 3xx redirects. natural one
is_redirect_loop Whether the chain of redirects loops back to a URL in the chain. bool one
is_too_many_redirects Whether the chain contains more than 16 redirects. bool one
redirect_cluster_id The OnCrawl ID of this page's redirect cluster. The redirect cluster is the group of pages found in all branches of a redirect chain or loop. hash one
redirect_count Number of redirects needed from this page to reach the final target in the redirect chain. natural one
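
To illustrate how the redirect fields relate to each other, here is a rough sketch that walks a chain of redirects held in a plain dictionary. It is not OnCrawl's implementation: summarize_redirect_chain and the redirects mapping are hypothetical, only the threshold of 16 comes from the table above, and the value reported for redirect_count when a loop is found is an assumption.

```python
MAX_REDIRECTS = 16  # threshold given for is_too_many_redirects


def summarize_redirect_chain(start_url, redirects):
    """Follow redirects (a dict of URL -> target URL) from start_url.

    Returns a dict keyed by the field names used in the table above.
    """
    seen = [start_url]
    current = start_url
    is_loop = False
    while current in redirects:
        target = redirects[current]
        if target in seen:
            # The chain points back to a URL already visited.
            is_loop = True
            break
        seen.append(target)
        current = target
    hops = len(seen) - 1  # redirects followed before stopping
    return {
        "final_redirect_location": None if is_loop else current,
        "is_redirect_loop": is_loop,
        "is_too_many_redirects": hops > MAX_REDIRECTS,
        "redirect_count": hops,
    }


chain = summarize_redirect_chain("/a", {"/a": "/b", "/b": "/c"})
loop = summarize_redirect_chain("/a", {"/a": "/b", "/b": "/a"})
```

In this sketch, chain describes a two-hop chain ending at /c, while loop is flagged as a redirect loop and has no final location.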

Rel alternate

Field name Definition Type Arity
canonical_evaluation Canonical status of the URL: matching (declares itself as canonical), not_matching (declares a different page as canonical), not_set (has no canonical declaration) enum one
rel_canonical URL declared in the rel canonical tag. string one
rel_next URL declared in the rel next tag. string one
rel_prev URL declared in the rel prev tag. string one
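
The three canonical_evaluation values can be read as a simple comparison between url and rel_canonical. The sketch below assumes an exact string comparison; whether OnCrawl normalizes URLs (trailing slashes, case, protocol) before comparing is not stated here, so treat the function as illustrative only.

```python
def canonical_evaluation(url, rel_canonical):
    """Classify a page's canonical status from its own URL and its
    declared canonical URL (None or empty when nothing is declared).

    Assumption: URLs are compared as exact strings, with no normalization.
    """
    if not rel_canonical:
        return "not_set"
    if rel_canonical == url:
        return "matching"
    return "not_matching"


print(canonical_evaluation("https://www.oncrawl.com/a", "https://www.oncrawl.com/a"))
print(canonical_evaluation("https://www.oncrawl.com/a", "https://www.oncrawl.com/b"))
print(canonical_evaluation("https://www.oncrawl.com/a", None))
```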

Scraping

Field name Definition Type Arity
custom_qsdd Custom field created through user-defined scraping rules. string one

SEO tags

Field name Definition Type Arity
description_evaluation Duplication status of the page's meta description: unique, duplicated (another URL has the same meta description), not_set enum one
description_length Length of the URL's meta description in number of characters. natural one
description_length_range Evaluation of the URL's meta description length: perfect (135-159), good (110-134 or 160-169), too short (<110), too long (>=170) enum one
h1 First H1 on the page. string one
h1_evaluation Duplication status of the page's H1 text: unique, duplicated (another URL has the same H1), not_set enum one
meta_description Meta description for this page. string one
num_h1 Number of H1 tags on this page. natural one
num_h2 Number of H2 tags on this page. natural one
num_h3 Number of H3 tags on this page. natural one
num_h4 Number of H4 tags on this page. natural one
num_h5 Number of H5 tags on this page. natural one
num_h6 Number of H6 tags on this page. natural one
num_img Number of images on this page. natural one
num_img_alt Number of image 'alt' attributes on this page. natural one
num_img_range Whether the page contains no images, one image, or more than one. enum one
num_missing_alt Number of missing 'alt' attributes for images on this page. natural one
semantic_item_count Number of semantic tags on the page. natural one
semantic_types List of semantic tags found on the page. string many
title Page title found in the <title> tag. string one
title_evaluation Duplication status of the title tag: unique, duplicated (another page has the same title), not_set. enum one
title_length Length of the title tag in characters. natural one
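
The thresholds listed for description_length_range can be expressed as a small classifier, which may help when reproducing the buckets outside of OnCrawl. This is a sketch based only on the ranges in the table above. The function name is hypothetical, the labels are copied from the definition (the exact enum spelling returned by the API may differ), and pages without a meta description are not handled.

```python
def classify_description_length(length):
    """Map a meta description length (in characters) to the bucket
    described for description_length_range."""
    if length < 110:
        return "too short"
    if length >= 170:
        return "too long"
    if 135 <= length <= 159:
        return "perfect"
    # Remaining lengths are 110-134 or 160-169.
    return "good"


for sample in (90, 120, 150, 165, 170):
    print(sample, classify_description_length(sample))
```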

Sitemaps

Field name Definition Type Arity
sitemaps_file_origin List of URLs of the Sitemaps files where this page was found. string many
sitemaps_num_alternate Number of alternates to this page that were found in the sitemaps. natural one
sitemaps_num_images Number of images for this page that were found in the sitemaps. natural one
sitemaps_num_news Number of news publications for this page that were found in the sitemaps. natural one
sitemaps_num_videos Number of videos for this page that were found in the sitemaps. natural one

Status code

Field name Definition Type Arity
redirect_location URL this page redirects to. string one
status_code HTTP status code returned by the server when crawling the page. natural one
status_code_range HTTP status code class. Classes are: ok, redirect, client_error, server_error. enum one
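
As an illustration of how status_code relates to status_code_range, the sketch below maps a numeric code to one of the four classes. The numeric boundaries are an assumption based on the standard HTTP status classes (2xx, 3xx, 4xx, 5xx); the table above names the classes but not their ranges, and the function name is hypothetical.

```python
def classify_status_code(status_code):
    """Return the status_code_range class for an HTTP status code,
    or None when the code falls outside the four listed classes.

    Assumption: classes follow the standard 2xx/3xx/4xx/5xx split.
    """
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if 400 <= status_code < 500:
        return "client_error"
    if 500 <= status_code < 600:
        return "server_error"
    return None


for code in (200, 301, 404, 503):
    print(code, classify_status_code(code))
```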

URL

Field name Definition Type Arity
querystring_key List of keys found in the querystring of this page's URL. string one
querystring_keyvalue List of key-value pairs found in the querystring of this page's URL. string one
url Full URL including the protocol (https://). string one
url_ext URL's file extension. string one
url_first_path First directory following the URL's domain, or / if there is no directory. string one
url_has_params Whether the URL has query parameters. bool one
url_host Hostname or subdomain found in the URL. string one
urlpath URL path. string one
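
To make the URL fields more concrete, here is a minimal sketch that derives similar values from a URL with Python's standard urllib.parse module. It is an approximation, not OnCrawl's own logic: the url_fields helper is hypothetical, and the exact formats (for example, whether url_first_path includes slashes, or whether url_ext includes the leading dot) are assumptions.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def url_fields(url):
    """Derive approximations of the URL fields listed above."""
    parts = urlsplit(url)
    path = parts.path or "/"
    segments = path.split("/")[1:]
    # A first directory exists only if something follows the first segment.
    first_path = "/" + segments[0] + "/" if len(segments) > 1 else "/"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url": url,
        "url_host": parts.hostname,
        "urlpath": path,
        "url_first_path": first_path,  # assumed format: "/dir/"
        "url_ext": posixpath.splitext(path)[1].lstrip("."),  # assumed: no dot
        "url_has_params": bool(parts.query),
        "querystring_key": [key for key, _ in pairs],
        "querystring_keyvalue": [f"{key}={value}" for key, value in pairs],
    }


fields = url_fields("https://www.oncrawl.com/blog/post.html?utm_source=x&page=2")
for name, value in fields.items():
    print(name, value)
```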