Can we RAG the whole web?

RAG, or Retrieval-Augmented Generation, is a method where a language model such as ChatGPT first searches for useful information in a large database and then uses this information to improve its responses.

This article assumes some prior knowledge of vector embeddings. If you’re not quite sure what those are, I strongly recommend this article from Simon Willison on the matter; it will help you get a better grasp of this blog post.

As I was reading Anyscale.com’s benchmark analysis on RAG, a question came to me: what would be the most feasible way to vectorize the whole web, and let LLMs query specific domains or groups of domains when needed? So here is my, most likely, dead-in-the-water Google-killer idea, which I believe is worth sharing with the community. It has also been an interesting piece of writing that led me somewhere I did not foresee when I started this article.

In a nutshell

Our objective is to be able to get this kind of exchange from an AI assistant:

USER:

Hi Mistral, who won the match between FC Barcelona and Madrid last night?

ASSISTANT:

It appears FC Barcelona won 2 - 0 against Madrid.

I used information from the following sources. Give them a visit if you want a deeper analysis of the game:

An LLM on its own is not capable of answering this sort of question, as a model is only retrained on new data every once in a while and has not yet been trained on “yesterday’s” data. While continuous training of a model is an active area of research, the cheapest route to this sort of answer today is via RAG.

The problem: the internet is vast

Google’s index is made of 50 billion pages, while Bing has 4 billion in its index. This is a lot of data, and the idea of creating a crawler of this magnitude is a bit daunting. A simpler approach could be to use the XML sitemaps that a domain may declare in its robots.txt, which list every url the site specifically wants indexed by search engines. This has the advantage of collecting every url without having to crawl the domain. But even with a perfect XML parser that yields a huge index of urls, we would still need to make a request for each url found, extract the main content, chunk that content into smaller pieces, tokenize it, and get the embeddings for each chunk. Doing this for a few urls is easy, but doing it for billions of urls starts to get tricky and expensive (although not completely out of reach).
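To make that sitemap route concrete, here is a minimal sketch of the url-collection step using only the Python standard library (example.com is a placeholder, and sitemap index files that point to other sitemaps are not handled):

import urllib.robotparser
import urllib.request
import xml.etree.ElementTree as ET

# Read the Sitemap: entries declared in robots.txt (Python 3.8+).
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()
sitemap_urls = robots.site_maps() or []

# List every <loc> entry from each sitemap, i.e. the urls the domain wants indexed.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for sitemap_url in sitemap_urls:
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    for loc in tree.findall(".//sm:loc", NS):
        print(loc.text)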

Another, bigger problem with this implementation is that at no point did I ask permission to use the content. We sent crawlers to domains, overloaded their servers, took the content, and stored it in our own database with no intention of letting others see or use it. With this implementation, we’re just taking what is not ours for our own benefit.

Open data at domain level

The XML sitemap that I mentioned earlier is a protocol invented by Google back in 2005 and widely adopted across the web as a way to indicate urls to crawl. A similar protocol also exists for sharing full content: RSS feeds. Traditionally, an RSS feed contains the last 10 or 20 articles of a blog, and was not usually used to hold all the content of a domain (although it could). RSS feeds have existed for many years but have lost steam; they were, however, an excellent way to keep up to date with your favourite blogs.

I have been wondering for a while now whether a new protocol should exist to share a domain’s content and matching embeddings in the form of a dataset. A domain could decide to expose its content entirely or partially and, potentially, include the associated tokens or embeddings, so it could be integrated into any RAG implementation as quickly as possible, with no extra effort required of each implementation.

After ruminating on this idea for a bit, I’ve decided to experiment using SQLite. SQLite keeps an entire relational database contained within a single file, allowing for straightforward deployment and minimal setup while still delivering powerful features natively or via extensions. It absolutely shines in the context of a read-only database, which is what we need, and because it is just one file, it’s a perfect way to share datasets.

Content sharing proposals

I would propose that this dataset be found by default at the path /contentmap.db. For the sake of this article, I have already created this dataset for my blog, so you can download it and see for yourself at philippeoger.com/contentmap.db.

The easiest and most straightforward way to create a content dataset is a table with only 2 columns: the url and the matching content as raw text. This is the DDL currently applied for the contentmap.db of my blog:

create table content
(
   url text,
   content text
);

If you run a blog that has 100 posts so far, that would mean 100 rows in the dataset, one for each blog article currently on your domain. This is the content in its purest form, and if such a dataset existed on every domain, it would most likely reduce server loads as well, as crawlers, which account for about half of online traffic, would no longer need to go through every url to get the content. A simple download of contentmap.db would already give you all the content of the domain.
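To illustrate what a consumer of such a dataset would do, here is a minimal sketch of that download step (my assumed workflow, not part of any existing tooling), reading the content table with Python’s built-in sqlite3 module and opening the file read-only since we only query it:

import sqlite3
import urllib.request

# Download the dataset once, then query it locally; opened read-only because
# a shared contentmap.db is never meant to be written to by consumers.
urllib.request.urlretrieve("https://philippeoger.com/contentmap.db", "contentmap.db")

connection = sqlite3.connect("file:contentmap.db?mode=ro", uri=True)
for url, content in connection.execute("SELECT url, content FROM content"):
    print(url, len(content))  # one row per page of the domain
connection.close()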

Implementing RAG

Having such a content table is a perfect way to start building a RAG implementation. Typically, RAG is implemented with a vector similarity search algorithm, and thankfully, SQLite has an extension written by Alex Garcia, called sqlite-vss, that does exactly that for us. I had the pleasure of integrating sqlite-vss into langchain as a vector store option, so you can see how to use it in these examples. The code is rather simple, and yet, if someone had to build this database to follow the protocol, it would be best to have code that does it for us.
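Before getting to that, for reference, here is roughly what the underlying langchain integration looks like on its own, based on its documentation (a sketch rather than a definitive recipe: the texts, table name, and embedding model below are illustrative choices, and the sqlite-vss package must be installed):

from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import SQLiteVSS

# Embed a couple of toy documents and store them in an SQLite file that
# sqlite-vss can search with vector similarity.
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

db = SQLiteVSS.from_texts(
    texts=["this is a foo article", "this is a bar article"],
    embedding=embedding_function,
    table="content_chunks",
    db_file="vss_demo.db",
)

print(db.similarity_search("what is foo?", k=1)[0].page_content)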

Pip install contentmap

For fun, I built a Python library that can help anyone create their own contentmap.db in just a few lines of code, which may or may not include vector search (your choice). You can simply install it with “pip install contentmap”, and you can check the code on the GitHub repo.

Assuming we already have the contents, all we need to do is create data in the following format, and pass it to the ContentMapCreator class to build the DB for us.

from contentmap import ContentMapCreator


contents = [
    {
        "url": "https://example.com/foo",
        "content": "this is a foo article"
    },
    {
        "url": "https://example.com/bar",
        "content": "this is a bar article"
    },
]


cm = ContentMapCreator(contents=contents)
cm.build()

This will build an SQLite database called contentmap.db, with a table called content as described above.

If you want to add the vector similarity search, it’s the same code with an extra parameter.

cm = ContentMapCreator(contents=contents, include_vss=True)
cm.build()

The include_vss=True flag simply uses the sqlite-vss modules from langchain to add all the tables and virtual tables needed for vector search capabilities. If you want to run a vss search on the newly created dataset, simply do it as follows:

from contentmap.vss import ContentMapVSS


vss = ContentMapVSS(db_file="contentmap.db")
data = vss.similarity_search(query="who is Mistral ai company?", k=4)

Using your XML sitemap as a source

The previous code assumes that you can easily pull the content from your own database. However, someone may be using a blog platform such as WordPress, a Drupal website, or basically anything else, and not have easy access to their database. Or the content stored in the database may simply contain HTML tags. Various scenarios make it somewhat hard to get your own content in its purest form. To accelerate adoption, the contentmap library can also build the contentmap.db directly from your existing XML sitemap.

from contentmap.sitemap import SitemapToContentDatabase


sitemap = "https://philippeoger.com/sitemap.xml"


db = SitemapToContentDatabase(sitemap_url=sitemap, include_vss=True)
db.build()

Under the hood, this simply parses the XML sitemap to extract every url, then crawls each url and extracts the content using the fantastic Trafilatura library. To be clear: it will crawl your website, so if you use the contentmap library, make sure your domain can handle it. You can control how many concurrent requests are sent to your domain this way:

db = SitemapToContentDatabase(
   sitemap_url=sitemap,
   include_vss=True,
   concurrency=5
)
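For illustration, the content-extraction step looks roughly like this when done by hand with Trafilatura (a sketch of the idea, not the library’s internal code):

import trafilatura

# Fetch one page and extract its main content, with navigation, boilerplate
# and HTML markup stripped out.
downloaded = trafilatura.fetch_url("https://philippeoger.com/")
main_text = trafilatura.extract(downloaded)
print(main_text[:200] if main_text else "nothing extracted")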

Caveats & comments

In conclusion, the idea of using Retrieval-Augmented Generation to vectorize the entire web is quite ambitious and challenging if one wanted to do it alone. While it may be theoretically possible for a single entity to RAG the whole web, a more practical and ethical approach could be a decentralized one. By establishing a standardized protocol for domain owners to provide their content and matching embeddings, we can enable LLMs to query specific domains when needed without the need for extensive crawling. This approach would not only reduce server loads but could also address ethical concerns around content ownership and usage.