How to build a RAG chatbot with a personal data knowledge base?


01. Use BeautifulSoup4 to crawl web data

The first step of any machine learning (ML) project is to collect the required data. In this project, we use web scraping to build the knowledge base: the requests library fetches the web pages, and BeautifulSoup4 parses the HTML and extracts the paragraphs.

Import BeautifulSoup4 and Requests libraries for web scraping

Run pip install beautifulsoup4 sentence-transformers to install BeautifulSoup and Sentence Transformers. For the scraping step itself, we only need to import requests and BeautifulSoup. Next, create a dictionary containing the URL format we want to scrape. In this example, we only scrape content from Towards Data Science, but you can also scrape from other websites.

Now, get the data from each archive page in the format shown in the following code:

import requests
from bs4 import BeautifulSoup
urls = {
    'Towards Data Science': 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}'
    }
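
To see how the date placeholders expand, here is a quick check with an arbitrary example date:

print(urls['Towards Data Science'].format(2023, 1, 1))
# https://towardsdatascience.com/archive/2023/01/01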

In addition, we need two auxiliary functions for web scraping. The first function converts the day of the year into month and day format. The second function gets the number of likes from an article.

The day conversion function is relatively simple: list the number of days in each month and use that list to do the conversion. Since this project only crawls data for 2023, we don't need to consider leap years. If you want, you can adjust the number of days per month for other years.

The clap count function parses the number of claps (likes) an article has on Medium. Large counts are expressed in units of "K" (1K = 1,000), so the function needs to handle the "K" suffix.

def convert_day(day):
    # Convert a day-of-year number into a (month, day) tuple; 2023 is not a leap year.
    month_list = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    while day > 0:
        m += 1
        d = day
        day -= month_list[m-1]
    return (m, d)

def get_claps(claps_str):
    # Medium reports large clap counts with a "K" suffix, e.g. "1.2K" for 1,200.
    if claps_str is None or claps_str == '':
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    return int(claps*1000) if len(split) == 2 else int(claps)
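
As a quick sanity check, here is how the two helpers behave on a few example values:

print(convert_day(243))    # (8, 31), i.e. August 31
print(get_claps('1.2K'))   # 1200
print(get_claps('87'))     # 87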

Parsing BeautifulSoup4 web scraping responses

Now that we have the necessary components set up, we can start scraping the web. To avoid encountering 429 errors (too many requests) during the process, we use the time library to introduce a delay between sending requests. In addition, we use the sentence transformers library to get the embedding model from Hugging Face - the MiniLM model.

As mentioned before, we only want to fetch data for 2023, so we set the year to 2023. We also only need data from day 1 (January 1) through day 243 (August 31). For each day in that range, we convert the day number to a month and day, format them into strings, build the full URL from the urls dictionary, and send a request to get the HTML response, calling time.sleep() after each request to throttle the scraper.

After getting the HTML response, we parse it with BeautifulSoup and search for div elements with the specific class name shown in the code, which marks an article. From each article we parse the title, subtitle, article URL, number of claps, reading time, and number of responses. Then we use requests again to fetch the article content itself, followed by another time.sleep() call. At this point we have most of the required article metadata. We extract each paragraph of the article, use our Hugging Face model to compute the corresponding vector, and create a dictionary containing the paragraph together with all of the article's metadata.

import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
data_batch = []
year = 2023
for i in range(1, 244):  # days 1 through 243 cover January 1 to August 31, 2023
    month, day = convert_day(i)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    for publication, url in urls.items():
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        if not response.url.startswith(url.format(year, month, day)):
            continue
        time.sleep(8)
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all("div","postArticle postArticle--short js-postArticle js-trkPostPresentation js-trackPostScrolls")
        for article in articles:
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = str(title.contents[0]).replace(u'\xA0', u' ').replace(u'\u200a', u' ')
            title = article.find("h4", class_="graf--subtitle")
            subtitle = str(subtitle.contents[0]).replace(u'\xA0', u' ').replace(u'\u200a', u' ') if subtitle is not None else ''
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            claps = get_claps(article.find_all("button")[1].contents[0])
            reading_time = article.find("span", class_="readingTime")
            reading_time = int(reading_time['title'].split(' ')[0]) if reading_time is not None else 0
            responses = article.find_all("a", class_="button")
            responses = int(responses[6].contents[0].split(' ')[0]) if len(responses) == 7 else (0 if len(responses) == 0 else int(responses[0].contents[0].split(' ')[0]))
            article_res = requests.get(article_url)
            time.sleep(8)
            paragraphs = BeautifulSoup(article_res.content, 'html.parser').select('[class*="pw-post-body-paragraph"]')
            for para_idx, paragraph in enumerate(paragraphs):
                embedding = model.encode([paragraph.text])[0].tolist()
                data_batch.append({
                    "_id": f"{article_url}+{i}",
                    "article_url": article_url,
                    "title": title,
                    "subtitle": subtitle,
                    "claps": claps,
                    "responses": responses,
                    "reading_time": reading_time,
                    "publication": publication,
                    "date": date,
                    "paragraph": paragraph.text,
                    "embedding": embedding
                })

The last step is to save the scraped data to a file with pickle.

filename = "TDS_8_30_2023"
with open(f'{filename}.pkl', 'wb') as f:
    pickle.dump(data_batch, f)

Data presentation

Data visualization is very useful. Here is what the data looks like in Zilliz Cloud. Note the embedding field, which holds the vector we generated from each article paragraph.
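
If you want to inspect a record locally rather than in the Zilliz Cloud console, here is a minimal sketch, assuming the data_batch list from the scraping step is still in memory:

sample = data_batch[0]
for key, value in sample.items():
    if key == "embedding":
        print(f"{key}: vector of length {len(value)}")  # 384-dimensional MiniLM embedding
    else:
        print(f"{key}: {value}")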

02. Import TDS data into vector database

After acquiring the data, the next step is to import it into a vector database. In this project, we use a separate notebook for importing the data into Zilliz Cloud, distinct from the notebook used for scraping Towards Data Science.

To insert data into Zilliz Cloud, follow these steps:

  • Connect to Zilliz Cloud
  • Define the collection
  • Insert the data into Zilliz Cloud

Setting up Jupyter Notebook

Run pip install pymilvus python-dotenv to set up the Jupyter Notebook and start the data import process. We use the dotenv library to manage environment variables. From the pymilvus package, you need to import the following modules:

  • utility: used to check the status of a collection
  • connections: used to connect to the Milvus instance
  • FieldSchema: used to define the schema of a field
  • CollectionSchema: used to define the schema of a collection
  • DataType: the type of data stored in a field
  • Collection: how we access the collection

Then, open the previously pickled data, load the environment variables, and connect to Zilliz Cloud.

import pickle
import os
from dotenv import load_dotenv
from pymilvus import utility, connections, FieldSchema, CollectionSchema, DataType, Collection


filename="TDS_8_30_2023"
with open(f'{filename}.pkl', 'rb') as f:
    data_batch = pickle.load(f)

zilliz_uri = "your_zilliz_uri"
zilliz_token = "your_zilliz_token"
connections.connect(
    uri= zilliz_uri,
    token= zilliz_token
)

Set up the Zilliz Cloud vector database and import data

Next, we need to set up Zilliz Cloud. We must create a Collection to store and organize the data we scraped from the TDS website. Two constants are required: dimension and collection name. Dimension refers to the number of dimensions our vector has. In this project, we use the 384-dimensional MiniLM model.

Milvus's new Dynamic schema feature allows us to set only the ID and vector fields for a Collection, regardless of the number and data type of other fields. Note that you need to remember the specific field names you saved, as this is crucial for correctly retrieving the fields.

DIMENSION=384
COLLECTION_NAME="tds_articles"
fields = [
    FieldSchema(name='id', dtype=DataType.VARCHAR, max_length=200, is_primary=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields, enable_dynamic_field=True)
collection = Collection(name=COLLECTION_NAME, schema=schema)
index_params = {
    "index_type": "AUTO_INDEX",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)

There are two options for inserting data into the collection:

  • Loop over the data and insert each row one by one
  • Batch insert the data (a minimal sketch follows the loop below)

After all the data has been inserted, it is important to flush the collection to persist the inserts and ensure consistency. Importing large amounts of data may take some time.

for data in data_batch:
    collection.insert([data])
collection.flush()
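
For the second option, you can insert the rows in chunks instead of one at a time. A minimal sketch of batch insertion, where the batch size of 1,000 is an arbitrary choice:

BATCH_SIZE = 1000  # arbitrary chunk size; tune it to your data volume
for start in range(0, len(data_batch), BATCH_SIZE):
    collection.insert(data_batch[start:start + BATCH_SIZE])
collection.flush()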

03. Query TDS article snippets

Once everything is ready, you can make a query.
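
As a rough sketch of what such a query can look like with pymilvus, assuming the same MiniLM model and the tds_articles collection defined above (the question text, limit, and output fields are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")  # same 384-dim model used for the paragraphs

collection.load()  # load the collection into memory before searching

question = "How do I fine-tune a large language model?"  # example question
query_embedding = model.encode([question])[0].tolist()

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "L2"},
    limit=3,
    output_fields=["title", "paragraph", "article_url"]
)
for hits in results:
    for hit in hits:
        print(hit.distance, hit.entity.get("title"))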
