Sitemap and/or robots.txt question

**Turan Mirza** · 01 August 2022, 11:31 AM

I have a custom 404 page, as I felt that was the right thing to do. Then a few days ago I realised it should maybe be disallowed in the robots.txt file so that search engines would not crawl the page.

Now I have just received an email from Google saying:

--
Page indexing issues detected on feel-good.today
To the owner of feel-good.today:
Search Console has identified that your site is affected by 1 Page indexing issue(s):
Top critical issues
Critical issues prevent your page or feature from appearing in Search results. The following critical issues were found on your site:
Submitted URL seems to be a Soft 404
We recommend that you fix these issues when possible to enable the best experience and coverage in Google Search.
--

The 404 page has been there for some time, so it seems that my asking for it to be disallowed in robots.txt has in fact forced the page to be crawled. The simple answer is to take that line out of the robots.txt file, but I'm just wondering why this is happening.

Can anyone help clarify?

I've also asked for the thanks.html page to be disallowed, as it is just the page that someone sees when they have submitted a webform, i.e. the "Thanks for submitting the form, I'll be in touch soon" message. This page has not flagged an error (YET!).

Their help says "If the rendered page is blank, nearly blank, or the content has an error message, it could be that your page references many resources that can't be loaded (images, scripts, and other non-textual elements), which can be interpreted as a soft 404. Reasons that resources can't be loaded include blocked resources (blocked by robots.txt), having too many resources on a page, various server errors, or slow loading or very large resources."

I read that as "We found the page but can't load it, maybe because it has been disallowed by the robots file".
Which is exactly right: it has been blocked by the robots.txt file! This page is only pointed to by the control panel in my web hosting, i.e. no page links to it, SO my understanding was that no search engine would crawl to it!! If Google looks at all HTM / HTML files in a directory, it might actually find it even if it's not linked from any other page, hence I disallowed any bot from looking at that file.

Anyone?

**Acorn** · 01 August 2022, 11:49 AM

Turan, is your 404 created in Xara as part of the site and referenced through your server setting, or is it wholly server-based?
Is the robots.txt pathing correct in either case?
You can check with Google's robots.txt Tester, https://support.google.com/webmasters/answer/6062598.
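For a quick local check of the pathing, something like the following might help. It is only a sketch using Python's standard `urllib.robotparser`; the robots.txt content and the `/404.html` path are assumptions for illustration (only `thanks.html` is named in this thread), and the helper name `is_blocked` is made up. It does not replicate how Google itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file on the site may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /404.html
Disallow: /thanks.html
"""


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the robots.txt text disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The page paths below are illustrative, not taken from the live site.
    for path in ("/404.html", "/thanks.html", "/index.html"):
        url = "https://feel-good.today" + path
        state = "blocked" if is_blocked(ROBOTS_TXT, url) else "allowed"
        print(path, state)
```

If a path you expected to be blocked comes back as allowed, the Disallow line probably does not match the URL as served (for example a missing leading slash or a different folder).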

Google has a process: https://support.google.com/webmaster...make_permanent.
I rarely touch robots.txt and have added <meta name="googlebot" content="noindex"> instead to the page head.

If you, for some reason, have direct links to the 404, you can add the Nofollow option to the link in later XDAs.

Acorn

**Turan Mirza** · 01 August 2022, 12:21 PM

Ah! That is it! It's in the auto-generated sitemap. So I am actually pointing to it, not from my site but from my sitemap!!
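If the sitemap generator cannot be told to skip those pages, one option might be to post-process the generated file. The sketch below assumes a standard sitemaps.org `urlset` file; the sample URLs, the `/404.html` filename and the function name `strip_urls` are hypothetical and only there to show the idea.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sitemap; the auto-generated one will list the site's real pages.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://feel-good.today/index.html</loc></url>
  <url><loc>https://feel-good.today/404.html</loc></url>
  <url><loc>https://feel-good.today/thanks.html</loc></url>
</urlset>
"""


def strip_urls(sitemap_xml: str, unwanted_suffixes: tuple[str, ...]) -> str:
    """Return the sitemap with every <url> whose <loc> ends in an unwanted suffix removed."""
    # Keep the default namespace unprefixed when serialising the result.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    # Iterate over a copy, since entries are removed from root inside the loop.
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if loc.endswith(unwanted_suffixes):
            root.remove(url)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_urls(SAMPLE, ("/404.html", "/thanks.html")))
```

The cleaned file would then be uploaded in place of the generated one, so the sitemap no longer submits the 404 and thank-you pages.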

Keep me right here, can I add a noindex for all bots? i.e. is <meta content="noindex"> ok in the header of just that page?

Also, I thought they were retiring 'noindex' in favour of 'disallow', but that might just be in the context of robots.txt. It's getting harder to just design a page these days.

Many thanks for your fast response.

**Acorn** · 01 August 2022, 02:45 PM

I would try and use <META NAME="robots" CONTENT="noindex"> or even <META NAME="robots" CONTENT="noindex, nofollow">.

googlebot (Google labels it Googlebot) covers all of Google's user agent crawlers.
I can only assume 'robots' will do every other one as well; else the list is overwhelming: https://www.keycdn.com/blog/web-crawlers.
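To confirm that the published page really carries the tag, a small check along these lines could be run against the page's HTML. It is a sketch using Python's standard `html.parser`; the class and function names are made up for illustration. It also shows why the `name` attribute matters: as far as I know, a meta tag with only `content="noindex"` and no `name` is not treated as a robots directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="..." content="..."> tags, keyed by name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <META NAME=...> works too.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if not name:
            return  # a meta tag without a name is ignored here
        parts = attr.get("content", "").split(",")
        found = {part.strip().lower() for part in parts if part.strip()}
        self.directives.setdefault(name, set()).update(found)


def is_noindex(html: str, agent: str = "googlebot") -> bool:
    """Return True if a robots meta tag, or one aimed at `agent`, contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(agent.lower(), set())
    return "noindex" in (generic | specific)


if __name__ == "__main__":
    print(is_noindex('<head><META NAME="robots" CONTENT="noindex, nofollow"></head>'))
    print(is_noindex('<head><meta content="noindex"></head>'))
```

One thing to keep in mind: a crawler can only read this tag if it is allowed to fetch the page, so the noindex approach and a robots.txt Disallow for the same page work against each other.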

Acorn
