In general, a web project can contain all of the following types of URLs.
- Total URLs: every URL the project contains internally.
- Crawlable URLs: those that a crawler can reach.
- Crawled URLs: those that a search engine bot has actually visited.
- Indexable URLs: URLs that a search engine can index, given their characteristics at the code and strategy level.
- Indexed URLs: those that a specific search engine has incorporated into its index.
- Positioned URLs: those with visibility in the SERPs for one or more search terms.
- Linked URLs: those that receive links internally from other URLs on our website, or externally from other websites.
- Important URLs: those that are relevant at a strategic level: traffic, sales, etc.
To start working with a website’s URLs, you first need to know how to obtain and classify them correctly. A good starting point is usually crawlers and other existing tools, some of which are free to use. Let’s take it step by step.
Total URLs
The totality of URLs that a project can have internally.
When starting SEO work, the first thing is to know the total number of URLs in the project. This figure is easy to obtain by:
- Downloading all the logs via FTP and uploading them to tools such as SEOlyzer, Screaming Frog Log File Analyser, or FandangoSEO.
- Extracting all URLs with a crawler: Screaming Frog, Sitebulb, Xenu, FandangoSEO.
- Extracting all the URLs that Analytics reports as landing pages for any traffic.
- Extracting the pages from the Search Console performance report.
- Extracting data from any other tool that tracks and stores this information (Yandex Metrica, Bing Webmaster Tools, etc.).
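The exports from the sources above can also be combined programmatically rather than by hand. A minimal Python sketch, assuming each export has already been reduced to a plain list of URL strings (the normalization rule here is an assumption; adapt it to your own conventions):

```python
def normalize(url: str) -> str:
    """Strip whitespace and a trailing slash so near-duplicates collapse."""
    url = url.strip()
    if url.endswith("/") and url.count("/") > 3:  # keep the root URL's slash
        url = url.rstrip("/")
    return url

def merge_url_lists(*url_lists):
    """Union several iterables of URLs into one deduplicated, sorted list."""
    total = set()
    for urls in url_lists:
        total.update(normalize(u) for u in urls if u.strip())
    return sorted(total)
```

For example, `merge_url_lists(crawler_urls, analytics_urls, log_urls)` would give the total URL count in one step.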
Crawlable and non-crawlable URLs
Crawlable URLs are those that a given crawler can reach.
For example, with Screaming Frog we can crawl the site to find out which URLs Googlebot can reach through our internal linking. By crossing both lists in Excel (using the remove-duplicates option), we obtain the complete list of URLs on the website.
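That Excel cross can also be done with Python sets, which additionally surfaces the URLs that appear in only one of the two lists. A sketch, assuming `crawl_urls` comes from a crawler export and `log_urls` from the logs:

```python
def cross_url_lists(crawl_urls, log_urls):
    """Cross a crawler export with a log-based URL list using set operations."""
    crawl, logs = set(crawl_urls), set(log_urls)
    return {
        "all": sorted(crawl | logs),        # complete deduplicated list
        "orphans": sorted(logs - crawl),    # requested by bots but not reached by the crawl
        "uncrawled": sorted(crawl - logs),  # reached by the crawl but never requested
    }
```

The two difference sets are often as interesting as the union: they point at orphan pages and at sections bots are ignoring.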
Generally, to compile both types, I find it easier to start with the non-crawlable ones, which we can infer from the directories and paths blocked in the robots.txt file and in the file hidden in the root (usually) of our FTP called .htaccess (Linux servers) or web.config (Windows servers). Additionally, crawlers usually mark them as “blocked” in their reports.
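Python’s standard library includes a robots.txt parser, so splitting a URL list into crawlable and blocked can be scripted. A sketch, assuming the robots.txt body has already been downloaded as a string:

```python
from urllib.robotparser import RobotFileParser

def split_by_robots(urls, robots_txt: str, user_agent: str = "Googlebot"):
    """Split URLs into (crawlable, blocked) according to a robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    crawlable, blocked = [], []
    for url in urls:
        (crawlable if rp.can_fetch(user_agent, url) else blocked).append(url)
    return crawlable, blocked
```

Note this only covers robots.txt; blocks coming from .htaccess or web.config rules (authentication, IP filters, redirects) would still need to be checked against the server configuration itself.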
Crawled URLs
Within crawlable URLs, we can distinguish between those that have already been crawled and those that have not. A URL counts as crawled when Googlebot has requested it in the last 90 days, which we can check by filtering the logs by date.
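The log filtering described above can be sketched in Python. This assumes Combined Log Format; the regex’s field positions are an assumption, so adapt them to your server’s log format, and note that real Googlebot verification would also require reverse-DNS checks on the IP:

```python
import re
from datetime import datetime, timedelta, timezone

# One Combined Log Format line: host ident user [timestamp] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|POST|HEAD) (?P<path>\S+) [^"]*"'
    r' \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_urls(log_lines, now=None, days=90):
    """Return the set of paths Googlebot requested within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    crawled = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m or "Googlebot" not in m.group("ua"):
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if ts >= cutoff:
            crawled.add(m.group("path"))
    return crawled
```

Subtracting this set from the crawlable list then gives the URLs still waiting to be crawled.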
Indexable URLs
In turn, we can distinguish between indexed and non-indexed URLs. We can carry out this check with tools such as URL Profiler (with proxies) or Greenlane’s indexation checker (fine for a few URLs). Normally, non-crawlable URLs are not indexed (assuming we blocked them on purpose as a strategy, with the robots configuration done properly in code and on the server). Therefore, any that do appear in the index should be reviewed to determine why: an error? Confusion? Google doing whatever it wants? Or do we actually want it that way for some reason?
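One indexability signal can be checked without third-party tools: whether a page carries a noindex directive in its meta robots tag or X-Robots-Tag header. A sketch only, using the standard-library HTML parser and assuming the HTML body and response headers have already been fetched; real indexability also depends on canonicals, status codes, and robots.txt:

```python
from html.parser import HTMLParser

class _MetaRobots(HTMLParser):
    """Collects whether a <meta name="robots"> tag declares noindex."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots" and "noindex" in (d.get("content") or "").lower():
                self.noindex = True

def is_indexable(html: str, headers: dict) -> bool:
    """True unless a noindex directive appears in meta robots or X-Robots-Tag."""
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return False
    parser = _MetaRobots()
    parser.feed(html)
    return not parser.noindex
```

Running this over the crawled list gives a first-pass split into indexable and non-indexable before reaching for the paid tools.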
Indexed URLs
Indexed URLs are usually fewer than crawled ones, since canonicalized URLs, URLs with X-Robots-Tag or noindex meta directives, and errors such as duplicates are all URLs that search engines tend to leave out of the index.
Positioned URLs
URLs that have visibility within the SERPs for one or more search terms.
Linked URLs
Those that receive links internally from other URLs on our website or externally on other websites.
Important URLs
Those relevant at a strategic level, traffic, sales, etc…
Have I forgotten any? Leave a comment and we’ll complete the list.