01. Use BeautifulSoup4 to crawl web data
The first step of any machine learning (ML) project is to collect the required data. In this project, we use web scraping to build the knowledge base: the requests library fetches the web pages, and BeautifulSoup4 parses the HTML and extracts the paragraphs.
Import BeautifulSoup4 and Requests libraries for web scraping
Run
pip install beautifulsoup4 sentence-transformers
to install BeautifulSoup and Sentence Transformers. For the scraping itself, we only need to import requests and BeautifulSoup. Next, create a dictionary containing the URL formats we want to scrape. In this example, we only scrape content from Towards Data Science, but you can also scrape from other websites.
Now, get the data from each archive page in the format shown in the following code:
import requests
from bs4 import BeautifulSoup
urls = {
    'Towards Data Science': 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}'
}
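To illustrate, formatting this template with a year, month, and day produces a concrete archive URL (the date here is just an example):

# e.g. the archive page for August 30, 2023
print(urls['Towards Data Science'].format(2023, 8, 30))
# https://towardsdatascience.com/archive/2023/08/30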
In addition, we need two auxiliary functions for web scraping. The first converts a day of the year into a (month, day) pair. The second extracts the number of claps (likes) from an article.
The day conversion function is relatively simple: we hard-code the number of days in each month and walk through that list. Since this project only scrapes data for 2023, we don't need to handle leap years. If you want, you can adjust the month lengths for other years.
The clap count function parses the number of claps on a Medium article. Medium abbreviates large counts with a "K" suffix (1K = 1,000), so the function needs to handle that suffix.
def convert_day(day):
    # Convert a day of the year (1-365) into (month, day) for a non-leap year
    month_list = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    while day > 0:
        m += 1
        d = day
        day -= month_list[m - 1]
    return (m, d)

def get_claps(claps_str):
    # Medium abbreviates large clap counts with a "K" suffix (1K = 1,000)
    if (claps_str is None) or (claps_str == ''):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    return int(claps * 1000) if len(split) == 2 else int(claps)
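As a quick sanity check of both helpers (expected outputs shown as comments):

print(convert_day(242))   # (8, 30): day 242 of a non-leap year is August 30
print(get_claps('1.2K'))  # 1200
print(get_claps('85'))    # 85
print(get_claps(None))    # 0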
Parsing BeautifulSoup4 web scraping responses
Now that we have the necessary components set up, we can start scraping. To avoid 429 errors (too many requests), we use the time library to introduce a delay between requests. We also use the sentence-transformers library to load the embedding model from Hugging Face (the MiniLM model).
As mentioned before, we only want to fetch data for 2023, so we set the year to 2023, and we only need data from day 1 (January 1) to day 242 (August 30, matching the loop below). For each day, we call time.sleep() between requests, convert the day number into a month and day, format them into strings, assemble the full URL from the urls dictionary, and send a request to get the HTML response.
After getting the HTML response, we parse it with BeautifulSoup and search for div elements with the specific class name shown in the code, which marks each one as an article. From each article we parse the title, subtitle, article URL, number of claps, reading time, and number of responses. Then we use requests again to fetch the article's content, calling time.sleep() after each of these requests as well.
At this point, we have most of the required article metadata. Next, we extract each paragraph of the article, use our Hugging Face model to compute the corresponding vector, and build a dictionary containing all the meta information for each article paragraph.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
data_batch = []
year = 2023
for i in range(1, 243):  # days 1-242: January 1 through August 30
    month, day = convert_day(i)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    for publication, url in urls.items():
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        if not response.url.startswith(url.format(year, month, day)):
            continue  # redirected away, so there is no archive page for this day
        time.sleep(8)
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all("div", "postArticle postArticle--short js-postArticle js-trkPostPresentation js-trackPostScrolls")
        for article in articles:
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = str(title.contents[0]).replace(u'\xA0', u' ').replace(u'\u200a', u' ')
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = str(subtitle.contents[0]).replace(u'\xA0', u' ').replace(u'\u200a', u' ') if subtitle is not None else ''
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            claps = get_claps(article.find_all("button")[1].contents[0])
            reading_time = article.find("span", class_="readingTime")
            reading_time = int(reading_time['title'].split(' ')[0]) if reading_time is not None else 0
            responses = article.find_all("a", class_="button")
            responses = int(responses[6].contents[0].split(' ')[0]) if len(responses) == 7 else (0 if len(responses) == 0 else int(responses[0].contents[0].split(' ')[0]))
            # Fetch the full article, pausing again to avoid 429 errors
            article_res = requests.get(article_url)
            time.sleep(8)
            paragraphs = BeautifulSoup(article_res.content, 'html.parser').select('[class*="pw-post-body-paragraph"]')
            for para_idx, paragraph in enumerate(paragraphs):  # a separate counter, so we don't shadow the day loop's i
                embedding = model.encode([paragraph.text])[0].tolist()
                data_batch.append({
                    "_id": f"{article_url}+{para_idx}",
                    "article_url": article_url,
                    "title": title,
                    "subtitle": subtitle,
                    "claps": claps,
                    "responses": responses,
                    "reading_time": reading_time,
                    "publication": publication,
                    "date": date,
                    "paragraph": paragraph.text,
                    "embedding": embedding
                })
The last step is to persist the data using pickle.
import pickle

filename = "TDS_8_30_2023"
with open(f'{filename}.pkl', 'wb') as f:
    pickle.dump(data_batch, f)
Data presentation
Data visualization is very useful. Here is what the data looks like in Zilliz Cloud. Note the embedding field: it holds the document vectors we generated from the article paragraphs.
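If you want a quick look locally before importing, you can print one record from data_batch; a minimal sketch using the field names defined above:

# Inspect the first scraped paragraph record
sample = data_batch[0]
for key, value in sample.items():
    if key == "embedding":
        print(f"{key}: vector of length {len(value)}")  # 384 dimensions from MiniLM
    else:
        print(f"{key}: {value}")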
02. Import TDS data into a vector database
After acquiring the data, the next step is to import it into a vector database. In this project, we use a separate notebook for importing the data into Zilliz Cloud, distinct from the one used for scraping Towards Data Science.
To insert data into Zilliz Cloud, follow these steps:
- Connect to Zilliz Cloud
- Define the collection
- Insert the data into Zilliz Cloud
Setting up Jupyter Notebook
Run
pip install pymilvus python-dotenv
to set up the Jupyter Notebook and start the data import process. We use the dotenv library to manage environment variables. From the pymilvus package, we need to import the following modules:
- utility: used to check the status of a collection
- connections: used to connect to a Milvus instance
- FieldSchema: used to define a field's schema
- CollectionSchema: used to define the collection schema
- DataType: the type of data stored in a field
- Collection: how we access the collection
Then, open the previously pickled data, load the environment variables, and connect to Zilliz Cloud.
import pickle
import os
from dotenv import load_dotenv
from pymilvus import utility, connections, FieldSchema, CollectionSchema, DataType, Collection

filename = "TDS_8_30_2023"
with open(f'{filename}.pkl', 'rb') as f:
    data_batch = pickle.load(f)

load_dotenv()  # reads ZILLIZ_URI and ZILLIZ_TOKEN (example names) from your .env file
zilliz_uri = os.getenv("ZILLIZ_URI")
zilliz_token = os.getenv("ZILLIZ_TOKEN")
connections.connect(
    uri=zilliz_uri,
    token=zilliz_token
)
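Before defining the collection in the next step, you can use the utility module we imported to check for and drop a leftover collection from a previous run. A small optional sketch (the collection name matches the constant defined below):

# Start from a clean state if the collection already exists
if utility.has_collection("tds_articles"):
    utility.drop_collection("tds_articles")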
Set up the Zilliz Cloud vector database and import data
Next, we need to set up Zilliz Cloud. We must create a collection to store and organize the data we scraped from the TDS website. Two constants are required: the dimension and the collection name. The dimension is the number of dimensions our vectors have; in this project, we use the MiniLM model, which produces 384-dimensional vectors.
Milvus's new dynamic schema feature lets us define only the ID and vector fields for a collection, regardless of the number and data types of the other fields. Note that you need to remember the specific field names you saved, as this is crucial for retrieving the fields correctly.
DIMENSION = 384
COLLECTION_NAME = "tds_articles"

fields = [
    # The primary key name matches the "_id" key in data_batch
    FieldSchema(name='_id', dtype=DataType.VARCHAR, max_length=200, is_primary=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields, enable_dynamic_field=True)
collection = Collection(name=COLLECTION_NAME, schema=schema)
index_params = {
    "index_type": "AUTO_INDEX",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
There are two options for inserting data into the collection:
- Iterate over the data and insert each record one by one
- Insert the data in batches (a sketch follows the code below)
After all the data has been inserted, it is important to flush the collection to commit the inserts and ensure consistency. Importing large amounts of data may take some time.
for data in data_batch:
    collection.insert([data])
collection.flush()
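For the second option, here is a minimal batch-insert sketch; the chunk size is an arbitrary example, and it relies on pymilvus accepting a list of row dictionaries when dynamic fields are enabled:

BATCH_SIZE = 128  # example chunk size; tune to your payload

for start in range(0, len(data_batch), BATCH_SIZE):
    collection.insert(data_batch[start:start + BATCH_SIZE])
collection.flush()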
03. Query TDS article snippets
Once everything is ready, you can make a query.
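As a minimal sketch (assuming the same MiniLM model and the dynamic fields saved during ingestion; the question string, limit, and output fields are illustrative), a query embeds a question and searches the embedding field:

from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections

connections.connect(uri=zilliz_uri, token=zilliz_token)  # same credentials as before

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
collection = Collection(name="tds_articles")
collection.load()  # the collection must be loaded before searching

# Example question; embed it with the same model used for ingestion
query = "What is a large language model?"
query_embedding = model.encode([query])[0].tolist()

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "L2"},
    limit=3,
    output_fields=["paragraph", "article_url"]  # dynamic fields saved during ingestion
)
for hits in results:
    for hit in hits:
        print(hit.entity.get("article_url"))
        print(hit.entity.get("paragraph"))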