Building an llms.txt Generator

Photo by Andrew Neel on Unsplash

Building an llms.txt Generator

Bridging Websites and AI Models

ยท

8 min read

As a developer who's deeply interested in both web technologies and AI, I've been fascinated by how we can better connect these two worlds. That's why I created the llms.txt Generator - a tool that helps websites communicate their content more effectively with AI models. The project is based on the llms.txt specification, a proposed standard released in September 2023 that aims to help websites provide structured, AI-friendly content similar to how robots.txt helps search engines.

The llms.txt specification, created by Jeremy Howard, introduces a standardized way for websites to provide AI-readable content through a /llms.txt file in their root directory. This file contains markdown-formatted summaries, documentation, and links to key resources, making it easier for AI models to understand and process website content during inference time. Today, I want to share the journey of building this tool and the technical decisions that went into it.

Understanding llms.txt

Before diving into the implementation, let's talk about what llms.txt is and why it matters. Similar to robots.txt, llms.txt is a proposed standard that helps AI models better understand website content. It provides AI-generated summaries of your website's pages in a structured format. Here's what a basic llms.txt file looks like:

# Example Website

> This is the homepage of Example.com, featuring our main products and services.

## Main Pages

- [/about](/about): About page describing our company history and mission
- [/products](/products): Comprehensive catalog of our product offerings
- [/contact](/contact): Contact information and support resources

Technical Architecture

I built two versions of this tool: a modern Next.js web application and a Python implementation using Gradio. The Next.js version offers a polished UI with real-time updates, while the Gradio version provides a simpler interface that's perfect for quick deployments and testing.

Let's look at both implementations, starting with the Next.js version. Here's how the sitemap discovery process works:

const handleSubmit = async (e: React.FormEvent) => {
  // ... initialization code ...

  try {
    const normalizedUrl = normalizeUrl(url);
    const robotsUrl = new URL('/robots.txt', normalizedUrl).toString();

    let sitemapUrls: string[] = [];

    try {
      const robotsTxt = await fetchWithProxy(robotsUrl);
      sitemapUrls = extractSitemapUrlsFromRobotsTxt(robotsTxt);
    } catch (error) {
      addDebugMessage(`Failed to fetch robots.txt: ${error.message}`);
    }

    // If no sitemaps found in robots.txt, try common locations
    if (sitemapUrls.length === 0) {
      const commonLocations = getCommonSitemapUrls(normalizedUrl);
      // ... sitemap discovery logic ...
    }
  }
  // ... rest of the implementation
}

Working with AI Models

One of the most exciting aspects of this project was integrating multiple AI providers. I chose to support both Hyperbolic and Groq, as they offer distinct and complementary advantages:

Hyperbolic specializes in decentralized AI computing by aggregating idle computing resources, making AI development more accessible and cost-effective. Their open-access platform enables individuals and organizations to train and host AI models collaboratively.

Groq focuses on delivering exceptional AI inference speed and efficiency through their Language Processing Unit (LPU) technology. Designed specifically for AI inference, Groq's hardware and software platform offers instant speed and energy efficiency at scale.

Here's how we generate summaries using each provider.

First, the Hyperbolic implementation:

export async function generateHyperbolicSummary(url: string, apiKey: string): Promise<string> {
  const response = await fetch('https://api.hyperbolic.xyz/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: 'meta-llama/Meta-Llama-3.1-8B-Instruct',
      messages: [
        {
          role: 'user',
          content: `Generate a concise 1-sentence summary of the purpose of this webpage: ${url}`
        }
      ],
      max_tokens: 200,
      temperature: 0.7,
      top_p: 0.9,
      stream: false
    }),
  });

  const json = await response.json();
  return json.choices[0].message.content;
}

And here's the Groq implementation:

export async function generateGroqSummary(url: string, content: str, api_key: string): Promise<string> {
  const groq = new Groq({ apiKey });

  const completion = await groq.chat.completions.create({
    messages: [{
      role: 'user',
      content: `Generate a concise 1-sentence summary of this webpage content:\n\nURL: ${url}\n\nContent: ${content}`
    }],
    model: "llama-3.2-1b-preview",
    temperature: 0.7,
    max_tokens: 200
  });

  return completion.choices[0]?.message?.content || '';
}

Content Extraction Challenge

One of the biggest challenges was extracting clean, meaningful content from web pages. Raw HTML is messy and full of navigation elements, ads, and other noise. That's where the Markdowner API comes in - it helps us get clean, formatted content that's perfect for AI processing.

async function getPageContent(url: string, markdownerKey?: string): Promise<string> {
  const headers = {"Accept": "text/plain"};
  if (markdownerKey) {
    headers["Authorization"] = `Bearer ${markdownerKey}`;
  }

  const response = await fetch("https://md.dhr.wtf/", {
    method: "POST",
    headers,
    body: JSON.stringify({ url })
  });

  return response.text();
}

Python Implementation with Gradio

In addition to the Next.js version, I also built a Python implementation using Gradio for a more streamlined deployment option. The Gradio version offers a simple yet powerful interface for generating llms.txt files. Here's how the core processing pipeline works:

def process_website(
    url: str,
    hyperbolic_key: str = "",
    groq_key: str = "",
    markdowner_key: str = "",
    use_hyperbolic: bool = True,
    progress=gr.Progress()
) -> Tuple[str, str, List[str], str]:
    try:
        if not (use_hyperbolic and hyperbolic_key) and not (not use_hyperbolic and groq_key):
            return "Error: Please provide an API key for the selected AI provider", None, [], ""

        base_url = normalize_url(url)
        progress(0, desc="Initializing...")

        # Try robots.txt first
        sitemap_urls = []
        try:
            robots_url = urljoin(base_url, '/robots.txt')
            robots_content = fetch_with_proxy(robots_url)
            sitemap_urls = extract_sitemap_urls_from_robots(robots_content)
        except:
            pass

        # Process sitemaps and generate summaries
        progress(0.4, desc="Processing sitemaps...")

        # ... rest of processing logic ...

        return llms_txt, json.dumps(summaries), all_urls, llms_full_txt

    except Exception as e:
        print(f"Error in process_website: {str(e)}")
        return f"Processing failed: {str(e)}", None, [], ""

The Gradio interface is built with a clean, user-friendly design that includes features like:

with gr.Blocks(title="llms.txt Generator", theme=gr.themes.Soft()) as demo:
    gr.Markdown("""
    # llms.txt Generator ๐Ÿค–โœจ
    Generate AI-powered llms.txt files for any website ๐ŸŒ
    """)

    with gr.Row():
        url_input = gr.Textbox(
            label="Website URL",
            placeholder="Enter website URL"
        )
        markdowner_key = gr.Textbox(
            label="Markdowner API Key (Optional)",
            placeholder="For higher rate limits",
            type="password",
            container=True,
            scale=2
        )

    # AI Provider Selection
    with gr.Row():
        with gr.Column():
            use_hyperbolic = gr.Checkbox(
                label="Use Hyperbolic",
                value=True,
                interactive=True
            )
            hyperbolic_key = gr.Textbox(
                label="Hyperbolic API Key",
                type="password",
                visible=True,
                placeholder="Enter your Hyperbolic API key"
            )

        with gr.Column():
            use_groq = gr.Checkbox(
                label="Use Groq",
                value=False,
                interactive=True
            )
            groq_key = gr.Textbox(
                label="Groq API Key",
                type="password",
                visible=False,
                placeholder="Enter your Groq API key"
            )

One of the key features of the Gradio version is its robust sitemap processing and URL extraction:

def extract_urls_from_sitemap(content: str) -> List[str]:
    urls = []
    try:
        root = ET.fromstring(content)
        ns = {
            'ns': root.tag.split('}')[0].strip('{')
        } if '}' in root.tag else {}

        # Handle sitemap index
        if 'sitemapindex' in root.tag:
            for sitemap in root.findall('.//ns:loc', ns):
                try:
                    sitemap_content = fetch_with_proxy(sitemap.text.strip())
                    urls.extend(extract_urls_from_sitemap(sitemap_content))
                except Exception:
                    continue
        # Handle urlset
        else:
            for url in root.findall('.//ns:loc', ns):
                urls.append(url.text.strip())
    except ET.ParseError:
        pass
    return urls

The Gradio implementation offers several advantages over the web version:

  1. Easy deployment to platforms like Hugging Face Spaces

  2. Built-in rate limiting and progress tracking

  3. Simplified dependency management

  4. Automatic file download handling

  5. Real-time processing feedback

Both versions of the tool complement each other - the Next.js version provides a polished web experience, while the Gradio version offers a more straightforward path for quick deployments and testing.

Implementation Details

The project includes several key features that make it production-ready:

  • Automatic sitemap discovery through robots.txt and common locations

  • Rate limiting to respect API constraints

  • Progress tracking for long-running operations

  • Error handling and retry logic

  • Clean, modern UI with dark mode support

Future Improvements

While the current version is fully functional, I have several improvements planned:

  • Support for more AI providers

  • Batch processing for large websites

  • Custom summary templates

  • API endpoint for programmatic access

Conclusion

Building the llms.txt Generator has been an exciting journey into the intersection of web technologies and AI. It's open source and available on GitHub, and I'm looking forward to seeing how the community uses and improves it.

The llms.txt standard is still evolving, and tools like this will help shape how websites and AI models interact in the future. ## Getting Started

There are several ways to start using the llms.txt Generator:

  1. Hosted Version: If you want to try it out immediately without any setup, visit llms.website for a fully managed version that's always up to date. This is the easiest way to get started and requires no technical knowledge.

  2. Self-Hosting: For developers who want full control, you can self-host either the Next.js or Gradio version. The complete source code is available on GitHub.

  3. Quick Deploy Options:

    • Next.js Version: Use our Next.js Replit template to get a web version running in minutes

    • Gradio Version: Try our Gradio Replit template for a Python-based implementation

    • Both templates can also be easily deployed to platforms like Hugging Face Spaces or your preferred hosting provider

Each option has its advantages:

  • The hosted version at llms.website is perfect for content teams and non-technical users

  • Self-hosting gives you complete control over the environment and API keys

  • Platform deployments offer a middle ground with some customization while maintaining ease of use

Choose the option that best fits your needs and technical comfort level. The hosted version is recommended for most users, while the templates and self-hosting options are perfect for developers who want to customize the tool or integrate it into their own workflows.

If you have any questions about deployment options or need help getting started, feel free to reach out to me on Twitter or through the GitHub repository. I'm always happy to help users get started with whichever deployment option they choose!

ย