KDD 2017 | Paper details: Didi big data predicts user destinations with an accuracy rate of over 90%
Leifeng.com will provide a detailed interpretation of this paper.
Paper Interpretation
Compared to finding a desired web page in a search engine, it is more complicated to match a vehicle to take you to your destination in the vast traffic. Because the web page can be displayed continuously for a whole day or even half a month; but the vehicle is moving at high speed, and the relative position of the passenger and the driver is constantly changing in real time. The matching process and method are also extremely important. In a given area, there are many passengers and many vehicles. The system needs to consider the demand and supply in the area globally, calculate at a speed of milliseconds, and make the most reasonable order distribution in real time to maximize the user's travel efficiency and travel experience.
This paper introduces the order splitting model based on combinatorial optimization used in Didi Taxi . Compared with other order splitting models, this model has improved the overall transaction rate. In addition, in order to further improve the user experience when calling a taxi, Didi has also developed a destination prediction model that can recommend the most likely destination to the user in 2 milliseconds when the user opens the software. Currently, the prediction accuracy of this function has exceeded 90%.
1. Optimize the overall transaction rate when splitting orders
In the early days, the order allocation of taxi-hailing apps mainly focused on the correlation algorithm between each order and each taxi driver. When a passenger places an order, the system will try to match and dispatch the nearest driver to minimize the pick-up time. However, at this time, it is often ignored whether these drivers are more suitable for other orders.
Previously, the industry has proposed a new model based on a multi-agent architecture, NTuCab, which aims to minimize the waiting time and pick-up distance of passengers. This model regards each agent as a computing unit, which will simultaneously calculate and process the matching of N orders and drivers, but one order will only match one taxi driver. If a taxi driver refuses the order, the system will forward it to the next driver.
However, the scheduling time of these methods is often long and the success rate is low. In response to this, Didi Chuxing proposed a new combination optimization method. In this model, an order will be broadcast to several taxi drivers. When multiple taxi drivers receive the same order, the one who grabs the order first will get the order. If the order is not answered, it will enter the next round of broadcasting until it is answered by a taxi driver or canceled by the passenger. The goal of the model is to maximize the order completion rate, thereby ensuring the travel experience of drivers and passengers. Experimental data also shows that the global success rate of taxi hailing under this model is 4% higher than that of similar models.
A major improvement in Didi's model is the use of the "holistic" concept, which considers the many-to-many matching problem of all drivers and order groups to be assigned at the current moment. With the transaction rate as the optimization goal, the overall transaction rate of passenger orders is improved by allocating drivers and passengers as a whole.
The mathematical form of the model is:
Among them, max(E) is the optimization goal of the entire model, that is, the transaction rate; g(a)≤0 is the constraint that the model must meet, which may be some business rules, such as a driver can only be assigned one order at a time; a is the solution of the model, that is, how to allocate the overall orders and the overall drivers.
Assuming that there are currently n orders to be assigned and m taxi drivers to be assigned, the overall matching result of the orders to be assigned and the drivers to be assigned can be defined as an m*n matrix A_m*n, and the meaning of its elements a_ij is as follows:
Among them, the subscript i represents the order and j represents the driver. Considering that each taxi driver can only broadcast one order at a time, each driver, that is, each j, can only broadcast one of the n orders at most. In the matrix, for each j column, there can be at most one "1" and the rest must be all "0". That is:
2. Logistics Regression model calculates driver acceptance probability
Although the model's objectives and solutions have been defined, there is still a key factor that needs to be considered: the driver's willingness to accept the order. The probability of a driver accepting an order often depends on many factors, such as the value of the order, the pick-up distance, the direction angle, the driving direction, etc. This information can be encoded into a feature vector x_ij.
The author uses p_ij to represent the probability of driver dj accepting order oi. Regarding the calculation of this probability, the author borrowed the CTR estimation method in computational advertising and used the logistics regression model to perform the calculation.
The author uses the data in the log to train the logistics regression, with the driver's acceptance as y and the other features as vector x, and obtains the weight vector w in the sigmoid function y = 1/(1+exp(-w*x)). The probability of the driver accepting the order is associated with the model, and the probability of the transaction of the i-th order is:
The entire combined optimization model is:
The researchers conducted a rigorous AB test in Beijing, comparing the model with two other commonly used models in the industry, using key business indicators such as transaction rate, average pick-up time, order response time, and cancellation rate as core evaluation indicators. The experimental results showed that the model had better performance and the overall transaction rate of orders increased by 4%.
3. Predicting the destination: Probability calculation under cyclic normal distribution
In the cold winter, it is not a good experience for users to input their destinations shiveringly. If the most likely destinations can be recommended to users before they place an order, it can often greatly reduce the time they spend operating the software.
Based on the massive historical data of the Didi platform, researchers found that people’s travel often follows certain patterns, and users tend to arrive at the same destination at similar times; and analyzing the location of orders can also help to accurately recommend users’ real-time destinations.
Based on this observation, the researchers used the Bayesian formula to build a probability distribution model of user goals:
Where T represents the current time, D represents the date, (lat, lng) represents the longitude and latitude, {y1,y2,…,yi,…,yn} represents the possibility of the destination, and X represents the time and longitude and latitude of the departure point. The remaining problem is to estimate the probability distribution of the departure time and location (longitude and latitude):
Historical data analysis shows that the frequency histogram of the departure time of the user's destination often presents the following normal distribution, so the researchers used the normal distribution to estimate the conditional distribution of the departure time T. However, how to estimate the expectation and standard deviation of this distribution has become a problem that needs to be considered.
Considering the cyclical nature of the distribution of time and longitude and latitude, the mean and variance cannot be estimated using traditional methods. Therefore, the researchers used a cyclic normal distribution and built an optimization model to obtain the expected mean and variance by solving it.
The whole algorithm process is as follows: first, according to the user's historical orders, the expectation and variance of the order time corresponding to each destination are calculated in turn; then, the intermediate data of the probability of each destination is calculated according to the current time; the third step is to calculate the probability of each destination using the Bayesian framework; finally, the threshold is determined, and the calculation result that meets the threshold is what the researchers want:
Step 1: Based on the user order history, estimate the mean and variance of the order time set for each destination;
Step 2: Calculate P(T|X_i) and frequency P(X_i) of each destination based on the current time;
Setp3: Calculate the probability of each destination P(X_i | T )
Step 4: Determine the support threshold s and probability threshold p, and display the items that meet the threshold on the first screen.
Experimental data show that this prediction model is significantly better than the baseline model. The estimation accuracy of this model reaches 93%, which is 4 percentage points higher than the baseline model.
Leifeng.com Note: Download address : http://www.kdd.org/kdd2017/papers/view/a-taxi-order-dispatch-model-based-on-combinatorial-optimization
The key to building intelligent driving | The third lecture on future cars
All five lessons of the third phase of the Future Cars course are now online. Long press to scan the QR code below (or read the original text and click the link) to watch online.