TODO list:
1. To go over the pandas API
2. To go over xgboost
3. To organize the defense of my research
4. To explain my current research regarding NNCF
5. To organize my self-introduction
6. To prepare questions for the fxxking company
7. To go through as many questions as I can
Well, what I did at T last year was designing and implementing a recommender system for one of T's video services. Have you heard of T? T is the biggest Internet company in China, providing services to hundreds of millions of users all over the world. There is a cooperation between Tencent and my lab, and that's why I was sent there to do my research. The video service I worked on involves ten million users. The training data was collected from these ten million users over the past 40 days, and the goal is to make recommendations to users on the 41st day. Since the data is very large, over 50 gigabytes, I implemented the algorithm with Spark, a distributed computation platform.

As for the algorithm, I used collaborative filtering, yet not the typical collaborative filtering. The reason collaborative filtering makes sense here is that video content, unlike text content, cannot be analyzed cheaply. With text you can parse the content sentence by sentence, extract the general information, and recommend according to the content itself, which is called content-based filtering; for video, however, it is usually too expensive to parse the content frame by frame. That's why we use collaborative filtering, a method that mines information from user behavior data alone. It is inspired by a very simple intuition: if lots of people watch the same two videos, a user who has watched one of them is very likely to watch the other, so we can make recommendations accordingly.

To implement CF, we used matrix factorization. Do you know MF? It's a technique to dig out latent factors of users' preferences and videos' attributes, such as being geared to serious or humorous content, or to male or female audiences, and the inner product of the two latent factor vectors approximates a user's preference for a video. But we also observed that a user's choices are determined not only by fixed factors but also by context information such as time; for example, a user might prefer news videos in the morning and TV series at night. To incorporate such context information, I used a Factorization Machine instead, an extended matrix factorization model that is capable of utilizing context information.
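If the interviewer wants details on the FM part, here is a minimal numpy sketch of the standard Factorization Machine scoring formula (linear terms plus pairwise interactions through latent factors). The feature layout in the comments is an assumption for illustration, not the exact feature set used in the project.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Standard Factorization Machine score for one feature vector x.

    x  : (n_features,) features, e.g. one-hot user id, one-hot video id,
         and context buckets such as hour of day (illustrative assumption).
    w0 : global bias;  w : (n_features,) linear weights.
    V  : (n_features, k) latent factors; pairwise interactions come from
         inner products of rows of V, which is what lets rarely co-occurring
         feature pairs still get a meaningful score.
    """
    linear = w0 + w @ x
    # O(k * n) trick for sum_{i<j} <v_i, v_j> x_i x_j
    s1 = (V.T @ x) ** 2            # (sum_i v_{i,f} x_i)^2 per factor f
    s2 = (V.T ** 2) @ (x ** 2)     # sum_i v_{i,f}^2 x_i^2 per factor f
    pairwise = 0.5 * np.sum(s1 - s2)
    return linear + pairwise
```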
Another problem is the one-class implicit feedback problem. You know the Netflix Prize? That dataset contains ratings on a scale from 1 to 5, with 5 for positive and 1 for negative. In practice, however, users are reluctant to give ratings; the only data we actually collected is users' watching behavior: which video was watched and how long it was watched. The problem is that watching behavior only reflects positive feedback; no matter how long you watched a video, at least you preferred it over most other videos. I used a hyperbolic function to transfer the implicit feedback, watching time, into explicit ratings from 0 to 1. But if we directly feed these ratings to the factorization model, we get a disaster of data shift: the training data are all positive cases, while the test cases are a mix of positive and negative cases, which leads to wrong recommendations. My contribution is addressing this problem by introducing the concept of a separating plane. Originally we modeled a user's preference towards a video; here, with only positive samples, we model the difference between a user's preference towards a watched video and all other videos, which means that if you chose to watch a video, you have shown at least some preference for it over all the others.

By this means, we have effectively imputed all unknown videos as negative, yet without assigning them a fixed value. And by introducing a three-phase SGD, we can execute the training procedure without introducing extra complexity. I evaluated it in an A/B testing framework, and the CTR of our algorithm outperforms other state-of-the-art algorithms by xx%. This work has been accepted as a full paper by X, a top conference with an acceptance rate of 19%.
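If asked how the training actually works, a safe fallback is to sketch a generic pairwise update that captures "a watched video should score higher than an unwatched one", plus a hypothetical watch-time transfer. To be clear in the interview: the tanh transfer below is only an illustrative guess at a hyperbolic function, and the epoch is a generic BPR-style update, not the three-phase SGD from the paper.

```python
import numpy as np

def implicit_to_rating(watch_seconds, video_length_seconds):
    """Hypothetical hyperbolic transfer: map watch time to a rating in (0, 1).
    The exact function and normalization used in the project may differ."""
    return np.tanh(watch_seconds / max(video_length_seconds, 1.0))

def pairwise_sgd_epoch(P, Q, watched, n_items, lr=0.05, reg=0.01, rng=None):
    """One epoch of a generic pairwise SGD (BPR-style), only to illustrate
    modeling "watched video preferred over any unwatched video".

    P : (n_users, k) user factors;  Q : (n_items, k) video factors.
    watched : dict mapping user id -> set of watched video ids
              (assumes every user still has some unwatched videos).
    """
    rng = rng or np.random.default_rng(0)
    for u, items in watched.items():
        for i in items:
            # sample a video the user has NOT watched as the negative side
            j = int(rng.integers(n_items))
            while j in items:
                j = int(rng.integers(n_items))
            pu, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
            margin = pu @ (qi - qj)                # preference of i over j
            g = 1.0 / (1.0 + np.exp(margin))       # gradient of -log sigmoid
            P[u] += lr * (g * (qi - qj) - reg * pu)
            Q[i] += lr * (g * pu - reg * qi)
            Q[j] += lr * (-g * pu - reg * qj)
```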
Right, this project is what I did when I participated in GC last year. GC is a program that bridges the open source community with students. Every year GC hires hundreds of students from all over the world and assigns them to different open source projects. Last year I was assigned to the kl project, which provides education resources such as videos and texts from ka to children in areas that don't have access to the Internet. Devices are deployed in these areas as servers; since most of them are donated, the hardware resources are very limited, so my task was to optimize the backend program in terms of response time and computation overhead.

The first thing I did was profile the backend program. Like black-box testing, I constructed an automated benchmarking cluster with Ansible and Selenium, profiled the server under stress tests, and tracked the response time, CPU consumption, and memory consumption. In the meantime, I also looked into the code to search for performance bottlenecks. In my observation, the response time and CPU consumption under high load were acceptable; however, no matter how light the traffic was, the memory consumption was always unnecessarily high. I figured out the cause: all the content data was loaded at startup, which wasted memory. I used sqldict to persist the content data structure on disk; it worked, and the memory consumption dropped by more than half. A good result, but the response time became unacceptable, since the server has to fetch content data from disk every time it is accessed. To address this dilemma, I observed users' visiting behavior and found that over 80% of visits go to less than 20% of the content. Hmm, an LRU cache might help here. So I implemented an LRU cache that keeps the most recently accessed N videos in memory, as sketched below. It turned out this solution also saved nearly half of the memory usage without degrading the response time. Of course there were other minor optimizations as well, but researching and implementing this one was my main contribution to the GC program.
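A whiteboard-friendly version of that caching layer: a minimal LRU cache in front of a dict-like on-disk store. The class name, the capacity, and the assumption that the store behaves like a persistent dict (sqlitedict-style) are all illustrative; this is not the production code.

```python
from collections import OrderedDict

class LRUContentCache:
    """Keep the N most recently accessed content items in memory and fall
    back to a dict-like on-disk store (sqlitedict-style, an assumption here)
    on a cache miss."""

    def __init__(self, disk_store, capacity=512):
        self.disk_store = disk_store        # any dict-like object backed by disk
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, content_id):
        if content_id in self._cache:
            self._cache.move_to_end(content_id)   # mark as most recently used
            return self._cache[content_id]
        value = self.disk_store[content_id]       # slow path: read from disk
        self._cache[content_id] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)       # evict the least recently used
        return value
```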
The thesis research I am doing now is about NN-based CF. There has been some work doing collaborative filtering with RBMs. Inspired by the rapid advance of deep neural networks, I planned to implement CF with a deep network. The first attempt I made was extending the RBM to a deep RBM that reconstructs CF scores. It helps a bit, but not enough, because the RBM in CF actually works as item-based CF: the trained network takes in a user's watch history and makes predictions accordingly. However, I also want to utilize the user-based method, or better, a square filter over the score matrix, similar to how we use deep networks for computer vision problems. The research is still ongoing; hopefully I can make some progress before my defense.
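To have something concrete to draw if asked, here is a minimal PyTorch-style sketch of the item-based setup described above: a feed-forward network that takes a user's binary watch-history vector and reconstructs scores for all videos. It is a plain autoencoder used only to illustrate the setup, not the deep RBM itself, and the framework choice and layer sizes are my own assumptions.

```python
import torch
import torch.nn as nn

class WatchHistoryReconstructor(nn.Module):
    """Item-based CF as reconstruction: input is a (batch, n_items) binary
    watch-history matrix, output is a score for every video."""

    def __init__(self, n_items, hidden=256, code=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_items, hidden), nn.ReLU(),
            nn.Linear(hidden, code), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code, hidden), nn.ReLU(),
            nn.Linear(hidden, n_items),   # unnormalized scores for every video
        )

    def forward(self, watch_history):
        return self.decoder(self.encoder(watch_history))
```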
The reason I like data mining is that it's tightly related to daily life and intuition. It's not like basic research, which is so abstract and so far from everyday concerns that one needs long-lasting motivation to pursue the goal; data mining is close to everyday life and concrete problems, where I can apply my skills and make a difference. That's inspiring. It's like games: why are people fond of gaming? In essence, games give us feedback immediately, and that feedback inspires our behavior in return. The same goes for research: data mining always gives me positive feedback and thus inspires me to keep doing it.