01. Use BeautifulSoup4 to crawl web data
The first step of any machine learning (ML) project is to collect the required data. In this project, we use web scraping to build the knowledge base: the requests library fetches the web pages, and BeautifulSoup4 parses the HTML and extracts the paragraphs.
Import BeautifulSoup4 and Requests libraries for web scraping
Run
pip install beautifulsoup4 sentence-transformers
to install BeautifulSoup and Sentence Transformers. For the scraping itself, we only need to import requests and BeautifulSoup. Next, create a dictionary containing the URL formats we want to scrape. In this example, we only scrape content from Towards Data Science, but you can also scrape from other websites.
Now, get the data from each archive page in the format shown in the following code:
import requests
from bs4 import BeautifulSoup
urls = {
    'Towards Data Science': 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}'
}
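To illustrate, formatting this template with a year, month, and day produces a concrete archive URL (the date here is just an example):

# e.g. the archive page for August 30, 2023
print(urls['Towards Data Science'].format(2023, 8, 30))
# https://towardsdatascience.com/archive/2023/08/30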
In addition, we need two auxiliary functions for web scraping. The first converts a day of the year into a (month, day) pair. The second extracts the number of claps (likes) from an article.
The day conversion function is relatively simple: we hard-code the number of days in each month and walk through that list. Since this project only scrapes data for 2023, we don't need to handle leap years. If you want, you can adjust the month lengths for other years.
The clap count function parses the number of claps on a Medium article. Medium abbreviates large counts with a "K" suffix (1K = 1,000), so the function needs to handle that suffix.
def convert_day(day):
    # Convert a day of the year (1-365) into (month, day) for a non-leap year
    month_list = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    while day > 0:
        m += 1
        d = day
        day -= month_list[m - 1]
    return (m, d)

def get_claps(claps_str):
    # Medium abbreviates large clap counts with a "K" suffix (1K = 1,000)
    if (claps_str is None) or (claps_str == ''):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    return int(claps * 1000) if len(split) == 2 else int(claps)
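As a quick sanity check of both helpers (expected outputs shown as comments):

print(convert_day(242))   # (8, 30): day 242 of a non-leap year is August 30
print(get_claps('1.2K'))  # 1200
print(get_claps('85'))    # 85
print(get_claps(None))    # 0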
Parsing BeautifulSoup4 web scraping responses
Now that we have the necessary components set up, we can start scraping. To avoid 429 errors (too many requests), we use the time library to introduce a delay between requests. We also use the sentence-transformers library to load the embedding model from Hugging Face (the MiniLM model).
As mentioned before, we only want to fetch data for 2023, so we set the year to 2023, and we only need data from day 1 (January 1) to day 242 (August 30, matching the loop below). For each day, we call time.sleep() between requests, convert the day number into a month and day, format them into strings, assemble the full URL from the urls dictionary, and send a request to get the HTML response.
After getting the HTML response, we parse it with BeautifulSoup and search for div elements with the specific class name shown in the code, which marks each one as an article. From each article we parse the title, subtitle, article URL, number of claps, reading time, and number of responses. Then we use requests again to fetch the article's content, calling time.sleep() after each of these requests as well.
At this point, we have most of the required article metadata. Next, we extract each paragraph of the article, use our Hugging Face model to compute the corresponding vector, and build a dictionary containing all the meta information for each article paragraph.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
data_batch = []
year = 2023
for i in range(1, 243):  # days 1-242: January 1 through August 30
    month, day = convert_day(i)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    for publication, url in urls.items():
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        if not response.url.startswith(url.format(year, month, day)):
            continue  # redirected away, so there is no archive page for this day
        time.sleep(8)
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all("div", "postArticle postArticle--short js-postArticle js-trkPostPresentation js-trackPostScrolls")
        for article in articles:
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = str(title.contents[0]).replace(u'\xA0', u' ').replace(u'\u200a', u' ')
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = str(subtitle.contents[0]).replace(u'\xA0', u' ').replace(u'\u200a', u' ') if subtitle is not None else ''
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            claps = get_claps(article.find_all("button")[1].contents[0])
            reading_time = article.find("span", class_="readingTime")
            reading_time = int(reading_time['title'].split(' ')[0]) if reading_time is not None else 0
            responses = article.find_all("a", class_="button")
            responses = int(responses[6].contents[0].split(' ')[0]) if len(responses) == 7 else (0 if len(responses) == 0 else int(responses[0].contents[0].split(' ')[0]))
            # Fetch the full article, pausing again to avoid 429 errors
            article_res = requests.get(article_url)
            time.sleep(8)
            paragraphs = BeautifulSoup(article_res.content, 'html.parser').select('[class*="pw-post-body-paragraph"]')
            for para_idx, paragraph in enumerate(paragraphs):  # a separate counter, so we don't shadow the day loop's i
                embedding = model.encode([paragraph.text])[0].tolist()
                data_batch.append({
                    "_id": f"{article_url}+{para_idx}",
                    "article_url": article_url,
                    "title": title,
                    "subtitle": subtitle,
                    "claps": claps,
                    "responses": responses,
                    "reading_time": reading_time,
                    "publication": publication,
                    "date": date,
                    "paragraph": paragraph.text,
                    "embedding": embedding
                })
The last step is to persist the data using pickle.
import pickle

filename = "TDS_8_30_2023"
with open(f'{filename}.pkl', 'wb') as f:
    pickle.dump(data_batch, f)
Data presentation
Data visualization is very useful. Here is what the data looks like in Zilliz Cloud. Note the embedding field: it holds the document vectors we generated from the article paragraphs.
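If you want a quick look locally before importing, you can print one record from data_batch; a minimal sketch using the field names defined above:

# Inspect the first scraped paragraph record
sample = data_batch[0]
for key, value in sample.items():
    if key == "embedding":
        print(f"{key}: vector of length {len(value)}")  # 384 dimensions from MiniLM
    else:
        print(f"{key}: {value}")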
02. Import TDS data into a vector database
After acquiring the data, the next step is to import it into a vector database. In this project, we use a separate notebook for importing the data into Zilliz Cloud, distinct from the one used for scraping Towards Data Science.
To insert data into Zilliz Cloud, follow these steps:
- Connect to Zilliz Cloud
- Define the collection
- Insert the data into Zilliz Cloud
Setting up Jupyter Notebook
Run
pip install pymilvus python-dotenv
to set up the Jupyter Notebook and start the data import process. We use the dotenv library to manage environment variables. From the pymilvus package, we need to import the following modules:
- utility: used to check the status of a collection
- connections: used to connect to a Milvus instance
- FieldSchema: used to define a field's schema
- CollectionSchema: used to define the collection schema
- DataType: the type of data stored in a field
- Collection: how we access the collection
Then, open the previously pickled data, load the environment variables, and connect to Zilliz Cloud.
import pickle
import os
from dotenv import load_dotenv
from pymilvus import utility, connections, FieldSchema, CollectionSchema, DataType, Collection

filename = "TDS_8_30_2023"
with open(f'{filename}.pkl', 'rb') as f:
    data_batch = pickle.load(f)

load_dotenv()  # reads ZILLIZ_URI and ZILLIZ_TOKEN (example names) from your .env file
zilliz_uri = os.getenv("ZILLIZ_URI")
zilliz_token = os.getenv("ZILLIZ_TOKEN")
connections.connect(
    uri=zilliz_uri,
    token=zilliz_token
)
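Before defining the collection in the next step, you can use the utility module we imported to check for and drop a leftover collection from a previous run. A small optional sketch (the collection name matches the constant defined below):

# Start from a clean state if the collection already exists
if utility.has_collection("tds_articles"):
    utility.drop_collection("tds_articles")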
Set up the Zilliz Cloud vector database and import data
Next, we need to set up Zilliz Cloud. We must create a collection to store and organize the data we scraped from the TDS website. Two constants are required: the dimension and the collection name. The dimension is the number of dimensions our vectors have; in this project, we use the MiniLM model, which produces 384-dimensional vectors.
Milvus's new dynamic schema feature lets us define only the ID and vector fields for a collection, regardless of the number and data types of the other fields. Note that you need to remember the specific field names you saved, as this is crucial for retrieving the fields correctly.
DIMENSION = 384
COLLECTION_NAME = "tds_articles"

fields = [
    # The primary key name matches the "_id" key in data_batch
    FieldSchema(name='_id', dtype=DataType.VARCHAR, max_length=200, is_primary=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields, enable_dynamic_field=True)
collection = Collection(name=COLLECTION_NAME, schema=schema)
index_params = {
    "index_type": "AUTO_INDEX",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
There are two options for inserting data into the collection:
- Iterate over the data and insert each record one by one
- Insert the data in batches (a sketch follows the code below)
After all the data has been inserted, it is important to flush the collection to commit the inserts and ensure consistency. Importing large amounts of data may take some time.
for data in data_batch:
    collection.insert([data])
collection.flush()
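For the second option, here is a minimal batch-insert sketch; the chunk size is an arbitrary example, and it relies on pymilvus accepting a list of row dictionaries when dynamic fields are enabled:

BATCH_SIZE = 128  # example chunk size; tune to your payload

for start in range(0, len(data_batch), BATCH_SIZE):
    collection.insert(data_batch[start:start + BATCH_SIZE])
collection.flush()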
03. Query TDS article snippets
Once everything is ready, you can make a query.
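As a minimal sketch (assuming the same MiniLM model and the dynamic fields saved during ingestion; the question string, limit, and output fields are illustrative), a query embeds a question and searches the embedding field:

from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections

connections.connect(uri=zilliz_uri, token=zilliz_token)  # same credentials as before

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
collection = Collection(name="tds_articles")
collection.load()  # the collection must be loaded before searching

# Example question; embed it with the same model used for ingestion
query = "What is a large language model?"
query_embedding = model.encode([query])[0].tolist()

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "L2"},
    limit=3,
    output_fields=["paragraph", "article_url"]  # dynamic fields saved during ingestion
)
for hits in results:
    for hit in hits:
        print(hit.entity.get("article_url"))
        print(hit.entity.get("paragraph"))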