NGINX – Improving the speed of requests to ads.txt and robots.txt

One of our sites has many subdomains, millions of them (for a good reason, of course). This means that the Google crawler must fetch /ads.txt and /robots.txt for each subdomain, and our server sees hundreds of requests for these files every second.

In this case, what would be the most efficient way to process these requests?

I am currently using this:

    location = /robots.txt  {
        access_log off;
        log_not_found off;
        alias /home/sys/example.com/public/robots.txt;
    }
    location = /ads.txt  {
        access_log off;
        log_not_found off;
        alias /home/sys/example.com/public/ads.txt;
    }

My intent here is to point requests for these TXT files from every subdomain to the same files on disk, though I don't know if that makes any difference. Disabling logging with "access_log off" also seemed like a good idea.

I think it is possible to return the contents of the txt file directly from nginx – would that be faster (less demanding for the nginx service)?
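For comparison, here is a minimal sketch of the return-based variant; the body below is only a placeholder, so you would substitute the real contents of your robots.txt (and add an equivalent block for ads.txt):

    location = /robots.txt {
        access_log off;
        log_not_found off;
        # Serve the response straight from the config, with no file lookup.
        default_type text/plain;
        # Placeholder body; replace with your actual robots.txt contents.
        return 200 "User-agent: *\nDisallow:\n";
    }

In practice the difference is probably small: with alias, the file will normally be served from the OS page cache after the first hit, so either way little real disk work happens per request.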

SharePoint 2013 – Hide the robots.txt file for websites connected to SharePoint

I think the search engine sitemap settings can be used to activate or deactivate the robots.txt file. However, if we want to restrict certain paths, we can add entries to the file as follows:

    User-agent: *
    Disallow: /_layouts/
    Disallow: /_vti_bin/
    Disallow: /_catalogs/

If you want SharePoint 2010 or 2013 to crawl your website, add the following to your robots.txt file.

    User-agent: MS Search 6.0 Robot
    Disallow:

Source: The correct robots.txt settings so that SharePoint can crawl your website

Forum robots.txt file | Forum Promotion

Hey FP,

We've all been there: so many guests on your forum that you wonder what is going on. Of course, most of them are bots, and some are malicious, mostly looking for email addresses that can be used to send spam.

I came across this beast of a robots.txt file and thought I would share it. It is not made by me; all credit goes to Mitchellkrogza.

The robots.txt file can be found at https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

Excluding bots from a directory with robots.txt

I recently switched forum platforms from vBulletin to Xenforo. I renamed the vBulletin folder to mysite.com/forums-old/ and moved Xenforo into the original vBulletin folder, i.e. mysite.com/forums/.

Now I want to exclude mysite.com/forums-old/ from being crawled by Google because it’s mostly duplicate content.

My current robots.txt is:

Code:

User-agent: *
Allow: /

Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /search/*/feed
Disallow: /search/*/*

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

What do I have to change to exclude Mr Googlebot from crawling mysite.com/forums-old/ and its subfolders?
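A common way to do this, sketched under the assumption that the old forum lives entirely under /forums-old/, is to add one Disallow rule for that folder to the existing catch-all group (keeping the other lines as they are):

Code:

User-agent: *
Allow: /
Disallow: /forums-old/

A Disallow path matches that directory and everything beneath it, so the subfolders are covered as well.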

robots.txt – Why don't we have a robots inclusion standard?

As far as I know, the Robots Exclusion Standard is a way to tell a web robot which areas of a website may not be processed or scanned. So I'm wondering: why not use a similar, but "inclusion", standard? The website would have a robots file that specifies which parts of the site may be accessed, and everything not listed would be unavailable to the web robot by default. Why is everything available to the web robot by default?
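For what it's worth, allow-list behavior can already be approximated with the existing directives. This is only a sketch, and /public/ is a made-up example path; support for Allow also varies by crawler, though Googlebot honors it:

    # Block everything by default, then open up only selected paths.
    # /public/ is a hypothetical example path.
    User-agent: *
    Disallow: /
    Allow: /public/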

According to Google's URL Inspection, my image URL is blocked by robots.txt – I do not even have one!

I've just noticed that our image system's domain has not been crawled by Google for a long time.
The reason is that all of its URLs are apparently blocked by robots.txt – but I do not even have one.

Disclaimer: As part of some configuration checks, I now have a generic allow-all robots.txt in the root of the site. I had none until about an hour ago.

We run an image-resizing system on a subdomain of our website.
I am seeing very strange behavior: Search Console claims the URLs are blocked by robots.txt, although I have no such file at all.

All the URLs in this subdomain give me this result when I test it live:

[Screenshot: URL is not known to Google; allegedly blocked by robots.txt]

While trying to fix the problem, I created a robots.txt in the root directory:

[Screenshot: robots.txt reported as valid]

The robots.txt file is even visible in the search results:

[Screenshot: robots.txt indexed]

The response headers also seem to be okay:

HTTP/2 200
date: Sun, 27 Oct 2019 02:22:49 GMT
content-type: image/jpeg
set-cookie: __cfduid=d348a8xxxx; expires=Mon, 26-Oct-20 02:22:49 GMT; path=/; domain=.legiaodosherois.com.br; HttpOnly; Secure
access-control-allow-origin: *
cache-control: public, max-age=31536000
via: 1.1 vegur
cf-cache-status: HIT
age: 1233
expires: Mon, 26 Oct 2020 02:22:49 GMT
alt-svc: h3-23=":443"; ma=86400
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 52c134xxx-IAD

Here are some sample URLs for testing:

https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_zg1YXWVbJwFkxT_ZQR534L90lnm8d2IsjPUGruhqAe.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_FPutcVi19O8wWo70IZEAkrY3HJfK562panvxblm4SL.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/09/legiao_gTnwjab0Cz4tp5X8NOmLiWSGEMH29Bq7ZdhVPlUcFu.png.jpeg

What should I do?
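A first diagnostic step I would try, assuming curl is available: check exactly what is being served at the subdomain's /robots.txt, because Google treats a robots.txt URL that returns a server error (for example a 5xx response or a CDN challenge page) as if the whole site were disallowed.

    # Inspect the status code and body actually served for robots.txt
    # on the affected subdomain (URL taken from the question above)
    curl -sI https://kanto.legiaodosherois.com.br/robots.txt
    curl -s https://kanto.legiaodosherois.com.br/robots.txt

If curl shows a 200 with an empty or allow-all file but Search Console still reports the URLs as blocked, compare against what Googlebot sees, since a CDN or firewall may serve different responses to Googlebot's IP ranges.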
