The Data Mining Lab was founded in December 2010, when Prof. Lee joined KAIST. Our lab investigates the issues and challenges of data mining, which is roughly defined as the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. We develop innovative data mining algorithms and creative knowledge services, and our research has been published at premier conferences and in premier journals.
Data mining techniques depend heavily on the type of data, and we are especially interested in stream (time-series) data and mobility (trajectory) data. Stream data is a continuous flow of data generated by various sources. Thanks to advances in sensor and communication technologies (e.g., IoT and 5G), a huge amount of stream data is being generated by smart factories, smart cities, data centers, and so on. Analyzing this stream data is essential for discovering interesting or previously unknown events in a given system. Mobility data measures how populations and goods move over time. Thanks to advances in mobile devices and the emergence of shared micro-mobility, a huge amount of fine-grained mobility data can be collected from location-based social networking services and ride-hailing services. Analyzing this mobility data is essential for revealing the intentions and behavior of people in the real world. Because both types of data arrive continuously, we enjoy developing real-time algorithms and services.
Focusing on these two types of data, our research goals are to improve the (1) performance and (2) quality of data mining. Toward the performance goal, we apply distributed processing or sampling/indexing techniques, mostly to well-known algorithms, in order to make a broader impact. Toward the quality goal, we mitigate data quality issues (e.g., noisy or missing labels) or incorporate recent deep learning techniques into big data analysis. Overall, our main research interests can be illustrated using the hashtags below. See the following sections for details.
Toward Performance Improvement
Distributed Deep Learning
We plan to develop a distributed deep learning platform that combines Apache Spark and TensorFlow to enable fast deep learning on big data. The platform is intended to support data parallelism, model parallelism, and hybrids of the two. To aggregate the parameters learned by each worker, we will consider both the parameter server and AllReduce mechanisms. This work is planned for the second phase (2024-2027) of the SW Star Lab project.
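The two aggregation mechanisms mentioned above can be contrasted with a minimal single-process sketch. It is an illustration only: it uses neither Apache Spark nor TensorFlow, and the function names `parameter_server_average` and `ring_allreduce_average` are hypothetical. Each "worker" is simply a list of gradient values, and both functions are assumed to compute the same element-wise average.

```python
def parameter_server_average(worker_grads):
    """Central aggregation: one server sums every worker's gradient and averages."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[j] for g in worker_grads) / n for j in range(dim)]


def ring_allreduce_average(worker_grads):
    """Simulated ring AllReduce: workers only exchange chunks with their neighbor.

    Returns one averaged gradient per worker; all of them end up identical.
    """
    n = len(worker_grads)
    dim = len(worker_grads[0])
    bufs = [list(g) for g in worker_grads]          # each worker's local buffer
    chunks = [range(c, dim, n) for c in range(n)]   # chunk c = indices c, c+n, ...

    # Phase 1 (reduce-scatter): after n-1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so the sends of one step are simultaneous.
        msgs = []
        for i in range(n):
            c = (i - step) % n
            msgs.append((c, [bufs[i][j] for j in chunks[c]]))
        for i, (c, vals) in enumerate(msgs):
            dst = (i + 1) % n
            for j, v in zip(chunks[c], vals):
                bufs[dst][j] += v

    # Phase 2 (all-gather): completed chunks are passed around the ring.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - step) % n
            msgs.append((c, [bufs[i][j] for j in chunks[c]]))
        for i, (c, vals) in enumerate(msgs):
            dst = (i + 1) % n
            for j, v in zip(chunks[c], vals):
                bufs[dst][j] = v

    return [[x / n for x in b] for b in bufs]
```

In the parameter server variant, all traffic flows through one node, whereas in the ring variant each worker sends roughly two times the gradient size in total regardless of the number of workers. That trade-off is one common reason for considering both mechanisms.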
MapReduce Distributed Processing
We have developed MapReduce versions of popular data analysis algorithms (e.g., DBSCAN) that run on Hadoop and Spark and dramatically improve scalability.
Real-Time Stream Processing
We have developed very fast stream processing algorithms, particularly for anomaly detection. The significant speed-up is achieved mainly by minimizing the number of updates required for newly arriving data.
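The general idea of keeping per-point updates small can be illustrated with a simple sliding-window detector. This is a generic sketch and not one of the lab's algorithms: the class name `SlidingZScoreDetector`, the z-score rule, and the default threshold are all assumptions chosen for illustration. Instead of recomputing the window statistics from scratch, it adjusts two running sums when a value enters or leaves the window, so each new point costs constant time.

```python
import math
from collections import deque


class SlidingZScoreDetector:
    """Flags a value whose z-score against the current window exceeds a threshold."""

    def __init__(self, window, threshold=3.0):
        self.window = window          # maximum number of recent values kept
        self.threshold = threshold    # z-score above which a value is flagged
        self.values = deque()
        self.total = 0.0              # running sum of the values in the window
        self.total_sq = 0.0           # running sum of their squares

    def update(self, x):
        """Score x against the current window, then slide the window by one."""
        is_anomaly = False
        n = len(self.values)
        if n >= 2:
            mean = self.total / n
            # Clamp at zero to guard against tiny negative rounding errors.
            var = max(self.total_sq / n - mean * mean, 0.0)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) / std > self.threshold:
                is_anomaly = True

        # Incremental update: add the new value, drop the expired one.
        self.values.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.values) > self.window:
            old = self.values.popleft()
            self.total -= old
            self.total_sq -= old * old
        return is_anomaly


detector = SlidingZScoreDetector(window=5)
flags = [detector.update(x) for x in [10, 10, 11, 9, 10, 11, 9, 50, 10]]
```

The sum-of-squares formulation keeps the code short but can lose precision on long streams with large values; a production detector would likely use a numerically safer update.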
Toward Quality Improvement
Deep Learning + Data Analysis
Deep Learning-Based Stream Analysis
We are applying recent deep learning techniques to stream data analysis, including anomaly detection, forecasting, and segmentation. In particular, we are developing anomaly detection methods that handle concept drift and auxiliary information. Seq2Seq autoencoders, attention mechanisms, and the Transformer are the main models being investigated for this purpose. Furthermore, to increase the practical usability of stream data analysis, we are also developing an AutoML framework for stream data (Stream AutoML).
Deep Learning-Based Mobility Analysis
We are applying recent deep learning techniques to mobility data analysis, including taxi demand prediction, traffic flow prediction, traffic accident prediction, and location recommendation.
Data Quality Challenges
Robust Learning for "Noisy" Labels
Real-world data inevitably contains label errors (noise) for various reasons, because thorough annotation of big data is practically infeasible. Thus, we have developed robust training methods that allow deep neural networks to cope with noisy labels in the training data.
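One widely used heuristic in this area is small-loss sample selection: within a mini-batch, the examples with the lowest loss are treated as likely clean and only those are used for the parameter update. The sketch below illustrates that generic idea only and is not presented as the lab's method. The function names and the `noise_rate` parameter (an assumed estimate of the fraction of mislabeled examples) are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Loss of one example, given predicted class probabilities and its (possibly noisy) label."""
    return -math.log(max(probs[label], 1e-12))


def select_small_loss(batch_probs, batch_labels, noise_rate):
    """Return indices of the examples kept for the update.

    Keeps the (1 - noise_rate) fraction of the batch with the smallest loss,
    on the assumption that mislabeled examples tend to incur larger loss.
    """
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
    keep = max(1, int(round(len(losses) * (1.0 - noise_rate))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:keep])


# Toy batch: the third example is confidently predicted as class 0
# but labeled 1, so it has by far the largest loss and is dropped.
probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.6, 0.4]]
labels = [0, 1, 1, 0]
kept = select_small_loss(probs, labels, noise_rate=0.25)
```

In a real training loop, the model's gradient step would then be computed on the kept examples only.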
Active or Self/Semi-Supervised Learning for "Insufficient" Labels
Because manual annotation is very costly, a large proportion of big data remains unlabeled, and the applicability of supervised learning is often hindered by the lack of ground-truth labels. Thus, we are developing active learning methods and self/semi-supervised learning methods to cope with insufficient labels in the real world.
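As a simple illustration of active learning, the sketch below implements entropy-based uncertainty sampling: given the model's predicted class probabilities for unlabeled examples, it picks the ones the model is least sure about and sends them for annotation. This is a textbook baseline shown under stated assumptions, not the lab's method; the function names and the `budget` parameter are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(unlabeled_probs, budget):
    """Return the indices of the `budget` most uncertain unlabeled examples."""
    ranked = sorted(
        range(len(unlabeled_probs)),
        key=lambda i: entropy(unlabeled_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Toy pool of four unlabeled examples with binary predictions.
pool = [[0.98, 0.02], [0.55, 0.45], [0.7, 0.3], [0.5, 0.5]]
queried = select_for_labeling(pool, budget=2)
```

The selected examples would be labeled by an annotator, added to the training set, and the model retrained before the next selection round.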
Solving Humanity's Problems
The COVID-19 outbreak has changed every aspect of our daily lives, and our lab is therefore working on data science techniques to help society recover from the crisis.
COVID-19 Spread Prediction
Applying deep learning to epidemiology, we have developed a data-driven method for predicting inbound COVID-19 confirmed cases and demonstrated high prediction accuracy.
COVID-19 Digital Contact Tracing
Applying deep learning to digital contact tracing, we are trying to improve the granularity of the routes of confirmed COVID-19 cases while keeping the intrusion on their privacy to a minimum.