Markdown Preview

Live markdown preview with GitHub Flavored Markdown support

Markdown Preview

This is a markdown preview tool.

Features

  • Real-time preview
  • GitHub Flavored Markdown
  • Code highlighting
function hello() {
  console.log("Hello, World!");
}

How the Robots.txt Generator Works

A Robots.txt Generator is a webmaster utility used to create the robots.txt file, which acts as the "Gatekeeper" for your website. This file instructs automated crawlers (like Googlebot, Bingbot, and AI scrapers) which parts of your site they are allowed to access and which parts are off-limits. This tool is essential for SEO specialists and site administrators managing their Crawl Budget, preventing the indexing of admin pages, and blocking AI bots from training on their content.

The generation engine handles the directive construction through a standards-compliant pipeline:

  1. Agent Selection: The tool allows you to target "All Bots" (*) or specific ones (e.g., GPTBot for OpenAI, Googlebot-Image for images).
  2. Path Definition: You define "Allow" and "Disallow" paths. The tool handles the syntax ensuring correct use of wildcards (*) and trailing slashes.
  3. Sitemap Integration: The engine appends the Sitemap: directive at the bottom, which is the primary signal for crawlers to discover your Content Hierarchy.
  4. Output Formatting: The tool generates a clean, whitespace-formatted text file ready to be uploaded to your server's root directory (yourdomain.com/robots.txt).
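The four-step pipeline above can be sketched as a small generator. This is a minimal sketch, not the tool's actual implementation; the names (RobotsRule, build_robots_txt) and the example paths are illustrative assumptions:

```python
# Sketch of the directive-construction pipeline: agent selection,
# path definition, sitemap appended last, clean text output.
from dataclasses import dataclass, field

@dataclass
class RobotsRule:
    user_agent: str = "*"                       # step 1: "*" targets all bots
    allow: list = field(default_factory=list)   # step 2: Allow paths
    disallow: list = field(default_factory=list)  # step 2: Disallow paths

def build_robots_txt(rules, sitemap=None):
    lines = []
    for rule in rules:
        lines.append(f"User-agent: {rule.user_agent}")
        for path in rule.allow:
            lines.append(f"Allow: {path}")
        for path in rule.disallow:
            lines.append(f"Disallow: {path}")
        lines.append("")                        # blank line between groups
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")     # step 3: sitemap at the bottom
    return "\n".join(lines).strip() + "\n"      # step 4: clean, uploadable text

print(build_robots_txt(
    [RobotsRule("*", disallow=["/admin/"]),
     RobotsRule("GPTBot", disallow=["/"])],
    sitemap="https://yourdomain.com/sitemap.xml",
))
```

The resulting text is exactly what gets uploaded to the server's root as robots.txt.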

The History of the Robots Exclusion Protocol (REP)

Managing the "Wild West" of web crawling has been a challenge since the first search engines.

  • Martijn Koster (1994): While managing the Nexor web server, Koster was overwhelmed by crawlers crashing his site. He proposed the robots.txt standard to the W3C mailing list as a way for servers to maintain politeness.
  • The "Gentleman's Agreement": The protocol was never made an official internet standard (RFC) for decades; it relied on the cooperation of search engines.
  • RFC 9309 (2022): The IETF finally published the official standard for REP, solidifying the rules for how Allow, Disallow, and Crawl-delay must be interpreted.
  • The AI Era (2023): The rise of LLMs led to a massive update in usage, with sites rushing to block GPTBot, CCBot (Common Crawl), and others to protect their Intellectual Property.

Common Bot User-Agents

User-Agent           Organization   Purpose
Googlebot            Google         Search Indexing
Bingbot              Microsoft      Search Indexing
GPTBot               OpenAI         AI Training Data
CCBot                Common Crawl   Public Dataset
Twitterbot           X (Twitter)    Link Previews
FacebookExternalHit  Meta           Link Previews

Technical Depth: The "Crawl Budget"

Large sites (10k+ pages) have a limited "Crawl Budget"—the amount of time Google is willing to spend indexing them. By using this tool to Disallow useless pages (like /search?q=... or /admin), you force Googlebot to spend its time indexing your money pages instead.
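A crawl-budget-focused file along these lines might look like the following (the paths and domain are illustrative, not output from the tool):

```
User-agent: *
Disallow: /search
Disallow: /admin
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```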

How It's Tested

We verify the generator against the Google Search Console robots testing tool logic.

  1. The "Wildcard" Pass:
    • Action: Disallow all .pdf files.
    • Expected: The generated syntax must be Disallow: /*.pdf$.
  2. The "Precedence" Check:
    • Action: Allow /blog but Disallow /blog/private.
    • Expected: The tool must warn or structure the file so specific rules aren't overwritten by broad ones (though most modern bots handle longest-match correctly).
  3. The "Sitemap" Validation:
    • Action: Enter a Sitemap URL.
    • Expected: It appears on its own line at the absolute bottom of the file.
  4. The "AI Block" Test:
    • Action: Select "Block AI Bots."
    • Expected: Automatically adds User-agent: GPTBot and User-agent: CCBot with Disallow: /.
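The "Wildcard" and "Precedence" checks above can be mimicked with a small matcher. This is a sketch of RFC 9309 longest-match semantics with Google-style * and $ wildcards, not the tool's actual test harness:

```python
import re

def _pattern_to_regex(pattern):
    # Translate a robots.txt path pattern into a regex:
    # "*" matches any run of characters, a trailing "$" anchors the end.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(url_path, allow=(), disallow=()):
    # RFC 9309: the most specific (longest) matching rule wins;
    # on a length tie, Allow wins. No matching rule means allowed.
    best_len, best_allowed = -1, True
    for paths, verdict in ((allow, True), (disallow, False)):
        for p in paths:
            if _pattern_to_regex(p).match(url_path) and len(p) >= best_len:
                if len(p) > best_len or verdict:  # tie goes to Allow
                    best_len, best_allowed = len(p), verdict
    return best_allowed

# The "Wildcard" pass: Disallow: /*.pdf$ blocks PDFs and nothing else.
print(is_allowed("/guide.pdf", disallow=["/*.pdf$"]))       # False
print(is_allowed("/guide.pdf.html", disallow=["/*.pdf$"]))  # True

# The "Precedence" check: the longer /blog/private rule beats /blog.
print(is_allowed("/blog/post", allow=["/blog"],
                 disallow=["/blog/private"]))                # True
print(is_allowed("/blog/private/x", allow=["/blog"],
                 disallow=["/blog/private"]))                # False
```

Because modern bots apply this longest-match rule, ordering Allow before Disallow in the file is a readability convention rather than a functional requirement.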

Technical specifications and guides are available in the Google Search Central robots.txt guide, at Robotstxt.org, and in IETF RFC 9309.

Frequently Asked Questions

Can bots simply ignore my robots.txt file?

Technically, yes. Robots.txt is a "voluntary" standard. Good bots (Google, Bing) follow it strictly. Bad bots (scrapers, malware) will ignore it. For real protection, you need server-side blocking.