

Some people browse Xiaohongshu to discover new things; others use it to find the latest trends in AI technology.

Updated: 2022-04-25 10:36
Yuyang from Aofei Temple
Quantum Bit | WeChat official account QbitAI

Xiaohongshu has changed.

You might think it is still all about "beauty" and "fashion", but many recent comments about Xiaohongshu on social media are somewhat surprising.

It seems to have taken on a bit of the flavor of a "search engine".

What happened?

A look at the data shows that last year, Xiaohongshu's technology and digital content grew 500% year-on-year, sports content grew 1,140% year-on-year, and at one point the DAU of food content even surpassed that of beauty content.

The drop-down menu on Xiaohongshu's homepage now offers more than 30 category labels: cooking tutorials, home guides, outdoor camping, travel guides, postgraduate entrance exams, civil service exams, even entrepreneurship. The content has long outgrown the beauty vertical the platform built its name on.

An even more interesting statistic: Xiaohongshu has previously disclosed that up to 30% of users go straight to search after opening the app.

In other words, increasingly generalized UGC keeps expanding the boundaries of Xiaohongshu's community content, and the resulting user behavior looks quite different from the outside world's long-held image of the platform.

From the outside, Xiaohongshu has changed a great deal. From the perspective of its internal technology, the challenges it faces have grown just as dramatically.

Content generalization and high-frequency search, combined with a mix of content in different modalities such as images, text, and video, place higher demands on search and recommendation optimization.

Meanwhile, internet users' expectations for content quality keep rising, and platforms and their machines are increasingly expected to understand what users actually want.

So how does the platform handle the increasingly complex search and recommendation machinery behind all this?

Multimodal Challenges in Content Communities

As one of the few content communities that mixes large volumes of images, text, and short videos, Xiaohongshu's answer comes down to one keyword: multimodal learning.

Multimodality refers to the different forms in which information is presented, such as text, images, and sound.

Multimodal learning aims to build a unified model that can combine these different types of information.

Simply put, once AI can integrate different forms of information, such as images and text, it can take a step further in "understanding".
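As a concrete, highly simplified illustration of what such a unified model can look like, here is a minimal sketch in PyTorch that projects image features and text features into a shared space and fuses them by concatenation. The encoder dimensions, the late-fusion strategy, and all names are illustrative assumptions, not Xiaohongshu's actual architecture.

```python
# Minimal late-fusion sketch (illustrative only, not Xiaohongshu's model):
# project image and text features into a shared space, then fuse them.
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)  # e.g. pooled CNN features
        self.txt_proj = nn.Linear(txt_dim, shared_dim)  # e.g. BERT [CLS] vector
        self.fusion = nn.Sequential(
            nn.Linear(shared_dim * 2, shared_dim),
            nn.ReLU(),
        )

    def forward(self, img_feat, txt_feat):
        z_img = self.img_proj(img_feat)
        z_txt = self.txt_proj(txt_feat)
        # Late fusion by concatenating the two projected modalities.
        return self.fusion(torch.cat([z_img, z_txt], dim=-1))

# The fused vector could feed a classifier, a ranker, or a quality scorer.
model = SimpleMultimodalFusion()
fused = model(torch.randn(4, 2048), torch.randn(4, 768))  # shape: (4, 256)
```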

Multimodal integration also makes effects like the following possible:

Ask an AI to draw from the prompt "Angel in the sky, Unreal Engine effect", and it produces an image like the one shown below.

If AI that reads text and draws pictures still feels like something you "don't quite understand but are impressed by", what practical significance does multimodal technology hold for internet products?

Recently, Xiaohongshu's technical team held an open AI class to share its exploration of multimodal algorithms. It is a good opportunity to glimpse the chemistry between "multimodal learning", currently a hot topic in academia, and a content community built on massive amounts of UGC.

Multimodal search

Let’s look at search first.

On the Xiaohongshu search results page, the app also recommends related search terms to users:

These query terms used to be plain text.

With multimodal technology applied, each query term now comes with a more attractive and relevant "background": the AI automatically selects images that match the query term and displays them on the search results page.

Don't underestimate this seemingly simple change. Tang Shen, head of Xiaohongshu's multimodal algorithm group, revealed that after this feature was added, UVCTR (unique-visitor click-through rate) and PVCTR (page-view click-through rate) increased two to three times.
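How might such a background image be chosen? One plausible, purely illustrative formulation is to embed the query term and candidate cover images with a shared multimodal encoder and pick the image whose embedding is most similar to the query's. The sketch below assumes precomputed embeddings and cosine similarity; it is not the team's disclosed implementation.

```python
# Illustrative only: pick the background image whose embedding best matches
# the query term's embedding. Embeddings are assumed to come from some
# shared text/image encoder and to be precomputed elsewhere.
import numpy as np

def pick_background(query_vec: np.ndarray, image_vecs: np.ndarray) -> int:
    """Return the index of the candidate image closest to the query text."""
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ q  # cosine similarity per candidate
    return int(np.argmax(sims))

# Example with random stand-in vectors (256-dim, 100 candidate images).
best = pick_background(np.random.rand(256), np.random.rand(100, 256))
```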

Another key application of multimodal technology in search is image search.

Searching by image for specific items such as products, plants, and flowers is nothing new. But what if a user wants to search for a certain atmosphere or an overall style?

This poses a new challenge for AI: object detection and recognition in complex environments.

(Example: searching for meme stickers by image)

To solve this, Xiaohongshu's technical team built the offline index-construction and online retrieval capabilities around three core modules:

  • Front (pre-processing) module

  • Large-scale feature retrieval module

  • Ranking module

In the front module, the team developed a variety of multimodal tags covering dimensions such as object detection, subject recognition, product attributes, and human attributes.

In the feature retrieval module, the team addressed the problem of inconsistent categories among recalled results through multi-task learning based on a Norm Classifier.

In the ranking module, the team fused in multimodal signals such as OCR text and NLP-derived information like brand words extracted from titles, significantly improving retrieval accuracy.
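To make the division of labor among the three modules concrete, here is a purely illustrative skeleton of such an offline-building / online-retrieval pipeline. Every function body is a stand-in: the actual detectors, index, and ranking features are internal to Xiaohongshu and have not been published.

```python
# Illustrative pipeline skeleton for the three modules described above.
# All logic here is placeholder code under assumed interfaces.
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    note_id: str
    score: float

def front_module(image) -> dict:
    # Stand-in for detection/recognition: attach multimodal tags to the query image.
    return {"category": "home_decor", "subjects": ["sofa", "lamp"]}

def feature_retrieval(image, tags: dict, top_k: int = 100) -> List[Candidate]:
    # Stand-in for embedding the image and running approximate nearest-neighbor
    # search, constrained by the predicted category to keep recall consistent.
    return [Candidate(f"note_{i}", 1.0 - i / top_k) for i in range(top_k)]

def ranking_module(cands: List[Candidate], query_text: str = "") -> List[Candidate]:
    # Stand-in for re-ranking with extra signals (e.g. OCR text, brand words).
    return sorted(cands, key=lambda c: c.score, reverse=True)

def image_search(image, query_text: str = "") -> List[Candidate]:
    tags = front_module(image)
    candidates = feature_retrieval(image, tags)
    return ranking_module(candidates, query_text)[:10]
```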

Content quality evaluation system

If the changes to search are the easiest to see, the application of multimodal technology to content quality evaluation shapes Xiaohongshu's overall "look and feel" at a deeper level.

Starting around July and August last year, building on category labeling of notes and a purely classification-based multimodal system, Xiaohongshu's technical team turned more of its attention to establishing a note content quality evaluation system.

In other words, teaching AI to judge which notes are more "useful" and more aesthetically appealing.

To this end, the team identified two core atomic capabilities (a toy combination of the two is sketched after this list):

  • Cover image aesthetic quality model

  • Multimodal note quality score model
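As a toy illustration only, the two capabilities could be blended into a single note-quality signal roughly as follows; the weights, value ranges, and function names are assumptions rather than the team's disclosed formulation.

```python
# Toy blend of the two atomic capabilities into one quality signal.
# Both model outputs are assumed to be normalized to [0, 1].
def note_quality(cover_aesthetic: float, multimodal_quality: float,
                 w_aesthetic: float = 0.4, w_quality: float = 0.6) -> float:
    """Blend the cover-image aesthetic score with the multimodal note quality score."""
    return w_aesthetic * cover_aesthetic + w_quality * multimodal_quality

# Example: a note with a striking cover but mediocre overall content.
print(note_quality(cover_aesthetic=0.9, multimodal_quality=0.5))  # 0.66
```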

The background images for recommended search terms mentioned earlier are in fact built on these capabilities. Relying on this content quality evaluation system, the platform can also structure different types of notes (image, text, and video) and implement optimizations such as deduplicating search result pages.

To sum up briefly: the biggest impact of applying multimodal technology in business scenarios is that high-quality content becomes easier for the people who need it to find, and the overall style and aesthetics of the content presented to users improve.

For a UGC-based community, this makes a virtuous cycle between users and content creators easier to achieve, which clearly benefits the overall community atmosphere.

It is also key to why the platform's notes, and its user base, have become increasingly diverse.

Why did Xiaohongshu change?

As mentioned earlier, the refinement of Xiaohongshu's "style" is closely tied to new technology trends across the internet industry.

Image-and-text posts and short videos have become mainstream on social media, and a traditional single-modality approach clearly cannot fully describe information that interweaves text, images, and sound.

Fusing feature information from multiple modalities has gradually become a new challenge in many practical applications, especially in search, recommendation, and other areas with high demands on content understanding.

In terms of both scenarios and business, Xiaohongshu already has the key conditions, and an urgent need, for this technology.

First, in terms of scenarios, the content published on Xiaohongshu consists mainly of images, text, and video, so it naturally contains massive amounts of multimodal data.

Behind this multimodal data also lies rich user feedback data.

Second, as its business grows rapidly, Xiaohongshu runs into all kinds of corner cases. User-posted content not only spans many categories such as food, beauty, home furnishings, and tech products, but may also include notes with images and no text, notes with images and music, and short videos without titles.

These new challenges and distinctive application scenarios also provide ample room for multimodal technology to be put into practice.

From meeting internal business needs to sharing technology externally

In fact, to cope with changing user needs, Xiaohongshu began accumulating technology internally quite early, and it has now reached a new stage: moving from meeting internal business needs to sharing its technology externally.

For example, this year Xiaohongshu's technical team had two papers accepted at CVPR, one on video retrieval and the other on video content understanding.

And just in the past few days, Xiaohongshu has also opened its "AI Open Class" to the public. Professors and doctoral supervisors from Shanghai Jiao Tong University, Beihang University, and ShanghaiTech University took part, drawing considerable attention from the academic community.

This online live broadcast, titled "REDtech is here", focuses on the latest multimodal developments in academia and industry.

In the first half of the event, held on April 20, Liu Si, professor and doctoral supervisor at Beihang University; Gao Shenghua, associate professor and doctoral supervisor at the School of Information Science and Technology, ShanghaiTech University; Xie Weidi, associate professor and doctoral supervisor at the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; and Tang Shen, head of Xiaohongshu's multimodal algorithm group, shared their work on multimodal content understanding.

Beyond the Xiaohongshu multimodal practice described above, the talks also included "AI + Music", "Cross-modal image content understanding and video generation", and "Self-supervised learning techniques and their application in multimodal content understanding".

The experts also shared many insightful views on the current state of industry-academia-research collaboration in multimodal work.

Xie Weidi said:

"Each modality carries different invariances and co-occurrences. For example, when the word 'guitar' appears in text, it may correspond to thousands of different guitars visually; and when we hear a dog barking, there is a high probability that we will also see the dog.

Therefore, making sensible use of the characteristics of data from different modalities for collaborative training can achieve more efficient representation learning and generalize to downstream inference tasks."

"Weakly correlated data sets are just correlation issues. There is no such thing as weak correlation. If you do machine learning, it must be from input to output, and you just learn some functions in the middle."

"The misalignment between modes must not be a weak correlation, there must be a strong correlation, otherwise the network cannot learn it. Of course, we are now trying to do causality, and most of the causality we think of is determined by correlation."

Beyond content understanding, AI content creation, that is, multimodal human-computer interaction including digital-human technology, has also drawn much attention amid the boom in multimodal learning research.

For example, the text-to-image AI tool "Dream by WOMBO" recently topped the App Store's graphics and design chart for days on end.

This is another multimodal direction that Xiaohongshu is exploring.

Accordingly, the technical talks in the second half of "REDtech is here" will revolve around "multimodal understanding and creation".

If you are interested, join the live broadcast on the [Xiaohongshu Technical Team] video account on April 27.


- End -

Quantum bit QbitAI

Tracking new trends in AI technology and products


Advances in science and technology are happening every day ~


