According to Google's URL check, my image URL is blocked by robots.txt – I do not even have one!

I've just noticed that Google stopped crawling our image-system domain some time ago.
The reason seems to be that every URL is blocked by robots.txt – but I don't even have one!

Disclaimer: because of some configuration tests, I now have a generic allow-all robots.txt in the root of the site. I had none before this hour.
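For context, a generic allow-all robots.txt is just two lines (this is the standard form; I'm assuming the file mentioned above looks like this):

```
User-agent: *
Allow: /
```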

We operate an image-resizing system on a subdomain of our website.
I'm seeing very strange behavior: Search Console claims the URLs are blocked by robots.txt, although I don't have one at all.

All the URLs on this subdomain give me this result when I test them live:

[Screenshot: URL unknown to Google]

[Screenshot: URL allegedly blocked by robots.txt]

While trying to fix the problem, I created a robots.txt in the root directory:

[Screenshot: valid robots.txt]

The robots.txt file even shows up in Google's search results:

[Screenshot: robots.txt indexed]

The response headers also seem to be okay:

HTTP/2 200
date: Sun, 27 Oct 2019 02:22:49 GMT
content-type: image/jpeg
set-cookie: __cfduid=d348a8xxxx; expires=Mon, 26-Oct-20 02:22:49 GMT; path=/; domain=.legiaodosherois.com.br; HttpOnly; Secure
access-control-allow-origin: *
cache-control: public, max-age=31536000
via: 1.1 vegur
cf-cache-status: HIT
age: 1233
expires: Mon, 26 Oct 2020 02:22:49 GMT
alt-svc: h3-23=":443"; ma=86400
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 52c134xxx-IAD

Here are some sample URLs for testing:

https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_zg1YXWVbJwFkxT_ZQR534L90lnm8d2IsjPUGruhqAe.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_FPutcVi19O8wWo70IZEAkrY3HJfK562panvxblm4SL.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/09/legiao_gTnwjab0Cz4tp5X8NOmLiWSGEMH29Bq7ZdhVPlUcFu.png.jpeg

What should I do?

Why does Google index our robots.txt file and display it in the search results?

For some reason, Google is indexing the robots.txt file for some of our sites and displaying it in the search results. See the screenshots below.

Our robots.txt file is not linked from anywhere on the site and contains only the following:

User-agent: *
Crawl-delay: 5

This only happens for some websites. Why is this happening and how do we stop it?

Screenshot 1: Google Search Console

Screenshot 2: Google search results
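One widely used way to handle this (assuming Apache with mod_headers enabled) is to serve an X-Robots-Tag: noindex response header for robots.txt only. Google can still fetch and obey the file, but will drop it from the search results:

```
# Apache config or .htaccess: noindex the robots.txt file itself.
# A noindex *header* is required here, since a robots meta tag
# cannot be placed inside a plain-text file.
<Files "robots.txt">
  Header set X-Robots-Tag "noindex"
</Files>
```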

cms – Using a variable to refer to the domain of a MediaWiki site in robots.txt

I have a website (built with the MediaWiki 1.33.0 CMS) that contains a robots.txt file.
One line in this file contains the site's literal domain:

Sitemap: https://example.com/sitemap/sitemap-index-example.com.xml

I generally prefer to replace literal domain references with some kind of variable value call (VVC) that is expanded at run time, by whatever mechanism fits the specific case, into the domain itself.

An example of such a VVC would be a Bash variable substitution.


Many CMSs have a global configuration file that contains the site's base address.
In MediaWiki 1.33.0 this file is LocalSettings.php, which contains the base address on line 32:

$wgServer = "https://example.com";

How could I reference this value from robots.txt with such a variable value call?
This would help me avoid confusion and breakage when the site's domain changes, since I would not have to update the value manually.
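robots.txt itself is static, so the usual workaround is to generate it dynamically. A sketch (the file name robots.php and the rewrite rule are my own assumptions, not MediaWiki core functionality) that reuses $wgServer from LocalSettings.php:

```php
<?php
// robots.php -- hypothetical dynamic robots.txt for a MediaWiki site.
// LocalSettings.php typically exits unless the MEDIAWIKI constant is defined.
define('MEDIAWIKI', true);
require_once __DIR__ . '/LocalSettings.php';

// $wgServer is e.g. "https://example.com"; extract the bare host
// for the sitemap file name used above.
$host = parse_url($wgServer, PHP_URL_HOST);

header('Content-Type: text/plain');
echo "User-agent: *\n";
echo "Sitemap: {$wgServer}/sitemap/sitemap-index-{$host}.xml\n";
```

You would then map requests for /robots.txt to this script, e.g. with an Apache rewrite such as `RewriteRule ^robots\.txt$ robots.php [L]`, so crawlers still see a plain-text robots.txt at the standard location.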

Search engines – Does the order of the Disallow and Sitemap lines in robots.txt matter?

You can order robots.txt this way:

User-agent: DESIRED_INPUT
Sitemap: https://example.com/sitemap-index.xml
Disallow: /

instead of:

User-agent: DESIRED_INPUT
Disallow: /
Sitemap: https://example.com/sitemap-index.xml

I assume both are fine, since crawlers will generally parse the whole file and apply its directives regardless of order.
Is it best practice to put Disallow: before Sitemap: anyway, to guard against the highly improbable case of a buggy crawler mis-parsing the file and ignoring Disallow:?