Vol. 75, n° 9-10, September-October 2020
Content available on Springerlink
IoT data stream analytics
Regularized and incremental decision trees for data streams
Jean Paul Barddal1, Fabrício Enembreck1
(1) Graduate Program in Informatics (PPGIa), Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
Abstract Decision trees are a widely used family of methods for learning predictive models from both batch and streaming data. Despite achieving positive results in a multitude of applications, incremental decision trees grow continuously in terms of nodes as new data becomes available, i.e., they eventually split on all available features, and even multiple times on the same feature, leading to unnecessary complexity and overfitting. With this behavior, incremental trees lose the ability to generalize well, remain human-understandable, and stay computationally efficient. To tackle these issues, we proposed in a previous study a regularization scheme for Hoeffding decision trees that (i) uses a penalty factor to control the gain obtained by creating a new split node on a feature that has not been used thus far and (ii) uses information from previous splits in the current branch to determine whether the observed gain indeed justifies a new split. In this paper, we extend this analysis by applying the proposed regularization scheme to other types of incremental decision trees and report results in both synthetic and real-world scenarios. The main interest is to verify whether and how the proposed regularization scheme affects the different types of incremental trees. Results show that, in addition to the original Hoeffding Tree, the Adaptive Random Forest also benefits from regularization, whereas McDiarmid Trees and Extremely Fast Decision Trees suffer declines in accuracy.
Keywords Data stream mining · Classification · Decision tree · Random forest · Regularization
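The penalized split decision described in this abstract can be sketched as follows. This is a minimal illustration only: the multiplicative penalty form, the function names, and the default parameter values are assumptions for exposition, not the authors' exact formulation.

```python
import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a random variable with
    # the given range deviates from the observed mean by at most epsilon
    # after n observations.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def regularized_gain(gain, feature, used_features, penalty=0.9):
    # Penalize the gain of a feature already split on in the current
    # branch, discouraging repeated splits on the same feature.
    return gain * (penalty if feature in used_features else 1.0)

def should_split(gains, used_features, n, delta=1e-7, value_range=1.0, penalty=0.9):
    # Apply the penalty, then use the Hoeffding bound to decide whether
    # the best feature is reliably better than the runner-up.
    adjusted = {f: regularized_gain(g, f, used_features, penalty)
                for f, g in gains.items()}
    ranked = sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][0] if ranked else None
    best, second = ranked[0], ranked[1]
    eps = hoeffding_bound(value_range, delta, n)
    return best[0] if best[1] - second[1] > eps else None
```

A split is only created when the penalized gain gap exceeds the Hoeffding bound, which is what keeps the tree from splitting on marginal or already-used features.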
Discovering locations and habits from human mobility data
Thiago Andrade1,2, Brais Cancela1,2, João Gama1,3
(1) INESC TEC, Porto, Portugal
(2) Universidade da Coruña, Coruña, Spain
(3) University of Porto, Porto, Portugal
Abstract Human mobility patterns are associated with many aspects of our lives. With the increasing popularity and pervasiveness of smartphones and portable devices, the Internet of Things (IoT) is becoming a permanent part of our daily routines. Positioning technologies that serve these devices, such as cellular antennas (GSM networks), global navigation satellite systems (GPS), and more recently the WiFi positioning system (WPS), provide large amounts of spatio-temporal data in a continuous way (data streams). In order to understand human behavior, detecting important places and the movements between these places is a fundamental task. This work therefore proposes a method for discovering user habits over mobility data without any a priori or external knowledge. Our approach extends a density-based clustering method for spatio-temporal data to identify meaningful places that individuals visit. On top of that, a Gaussian mixture model (GMM) is employed over the movements between visits to automatically separate trajectories according to the key identifiers that may describe a habit. By regrouping trajectories that look alike by day of the week, length, and starting hour, we discover the individual's habits. The proposed method is evaluated on three real-world datasets: one containing high-density GPS data, one with GSM mobile phone data at a 15-min sampling rate, and one with Google Location History data at a variable sampling rate. The results show that the proposed pipeline is suitable for this task, as habits other than simply going from home to work and vice versa were found. This method can be used to understand individual behavior and create user profiles, revealing a panorama of human mobility patterns from raw mobility data.
Keywords Habits · Meaningful places · Gaussian mixture model · Pattern · Mobility · Spatio-Temporal clustering
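The regrouping step, i.e., collecting trajectories that look alike by day of the week, starting hour, and length, can be sketched as below. Here simple bucketing stands in for the paper's GMM-based separation; the bucket widths, field names, and `min_support` threshold are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime

def habit_key(trip, hour_bucket=2, length_bucket_km=5.0):
    # Group trajectories by day of week, coarse starting hour, and coarse
    # trip length -- the key identifiers used to regroup similar trips.
    start = datetime.fromisoformat(trip["start"])
    return (start.strftime("%A"),
            start.hour // hour_bucket * hour_bucket,
            int(trip["length_km"] // length_bucket_km))

def discover_habits(trips, min_support=2):
    # A habit here is a group of similar trips repeated at least
    # min_support times, e.g. the Monday-morning commute.
    groups = defaultdict(list)
    for trip in trips:
        groups[habit_key(trip)].append(trip)
    return {k: v for k, v in groups.items() if len(v) >= min_support}
```

Two Monday-morning commutes of similar length would land in the same group, while a one-off evening errand would be filtered out by the support threshold.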
Bi-directional online transfer learning: a framework
Helen McKay1, Nathan Griffiths1, Phillip Taylor1, Theo Damoulas1,2, Zhou Xu3
(1) Department of Computer Science, University of Warwick, Coventry, UK
(2) Department of Statistics, University of Warwick, Coventry, UK
(3) Jaguar Land Rover Research, Coventry, UK
Abstract Transfer learning uses knowledge learnt in source domains to aid predictions in a target domain. When source and target domains are online, they are susceptible to concept drift, which may alter the mapping of knowledge between them. Drifts in online environments can make additional information available in each domain, necessitating continuing knowledge transfer both from source to target and vice versa. To address this, we introduce the Bi-directional Online Transfer Learning (BOTL) framework, which uses knowledge learnt in each online domain to aid predictions in others. We introduce two variants of BOTL that incorporate model culling to minimise negative transfer in frameworks with high volumes of model transfer. We consider the theoretical loss of BOTL, which indicates that BOTL achieves a loss no worse than the underlying concept drift detection algorithm. We evaluate BOTL using two existing concept drift detection algorithms: RePro and ADWIN. Additionally, we present a concept drift detection algorithm, Adaptive Windowing with Proactive drift detection (AWPro), which reduces the computation and communication demands of BOTL. Empirical results are presented using two data stream generators: the drifting hyperplane emulator and the smart home heating simulator, and real-world data predicting Time To Collision (TTC) from vehicle telemetry. The evaluation shows BOTL and its variants outperform the concept drift detection strategies and the existing state-of-the-art online transfer learning technique.
Keywords Online learning · Transfer learning · Concept drift
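The windowed drift detection that underpins frameworks like this one can be sketched with a much-simplified fixed-split variant of adaptive windowing. This is not ADWIN or the AWPro algorithm from the paper, only an illustration of the idea; the window size and confidence parameter are assumptions.

```python
import math
from collections import deque

class SimpleWindowDriftDetector:
    # Compare the means of the older and newer halves of a sliding error
    # window and flag drift when they differ by more than a
    # Hoeffding-style bound.  A crude stand-in for adaptive windowing.
    def __init__(self, window=200, delta=0.002):
        self.window = deque(maxlen=window)
        self.delta = delta

    def add(self, error):
        self.window.append(error)
        n = len(self.window)
        if n < 20:          # too little evidence to decide
            return False
        half = n // 2
        items = list(self.window)
        old, new = items[:half], items[half:]
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * half))
        drift = abs(sum(new) / len(new) - sum(old) / len(old)) > eps
        if drift:
            self.window.clear()  # restart from the new concept
        return drift
```

In a transfer-learning setting, a detected drift is the point at which a domain would retrain its local model and re-transfer it to the other domains.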
Resource management for model learning at entity level
Christian Beyer1, Vishnu Unnikrishnan1, Robert Brüggemann1, Vincent Toulouse1, Hafez Kader Omar1, Eirini Ntoutsi2, Myra Spiliopoulou1
(1) Otto-von-Guericke University, Magdeburg, Germany
(2) Leibniz University, Hannover, Germany
Abstract Many current and future applications plan to provide entity-specific predictions. These range from individualized healthcare applications to user-specific purchase recommendations. In our previous stream-based work on Amazon review data, we showed that error-weighted ensembles combining entity-centric classifiers, which are trained only on reviews of one particular product (entity), with entity-ignorant classifiers, which are trained on all reviews irrespective of the product, can improve prediction quality. This came at the cost of storing multiple entity-centric models in primary memory, many of which would never be used again because their entities would not receive future instances in the stream. To overcome this drawback and make entity-centric learning viable in these scenarios, we investigated two different methods of reducing the primary memory requirement of our entity-centric approach. Our first method uses the lossy counting algorithm for data streams to identify entities whose instances make up a certain percentage of the total data stream, within an error margin. We then store all models that do not fulfil this requirement in secondary memory, from which they can be retrieved should future instances belonging to them arrive later in the stream. The second method replaces entity-centric models with a much more naive model that only stores the past labels and predicts the majority label seen so far. We applied our methods to the previously used Amazon data sets, which contain up to 1.4M reviews, and added two subsets of the Yelp data set, which contain up to 4.2M reviews. Both methods successfully reduced the primary memory requirements while still outperforming an entity-ignorant model.
Keywords Entity-centric learning · Stream classification · Document prediction · Memory reduction · Text-ignorant models
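The lossy counting algorithm mentioned in this abstract, which identifies entities whose instances exceed a given fraction of the stream within an error margin, can be sketched as follows. Parameter values are illustrative; the eviction rule follows the classic Manku and Motwani formulation.

```python
import math

class LossyCounter:
    # Lossy counting: approximate frequency counts over a stream with
    # bounded memory.  After every bucket of ceil(1/epsilon) items,
    # entries whose count plus error bound fall to or below the current
    # bucket id are evicted; surviving counts underestimate the true
    # frequency by at most epsilon * N.
    def __init__(self, epsilon=0.01):
        self.width = math.ceil(1.0 / epsilon)
        self.n = 0
        self.bucket = 1
        self.counts = {}  # item -> (count, max_error)

    def add(self, item):
        self.n += 1
        count, err = self.counts.get(item, (0, self.bucket - 1))
        self.counts[item] = (count + 1, err)
        if self.n % self.width == 0:
            self.counts = {k: (c, e) for k, (c, e) in self.counts.items()
                           if c + e > self.bucket}
            self.bucket += 1

    def frequent(self, support):
        # Items whose true frequency may reach `support` * N, allowing
        # for the epsilon * N undercount.
        threshold = support * self.n - self.n / self.width
        return {k for k, (c, _) in self.counts.items() if c >= threshold}
```

In the paper's setting, entities surviving in `counts` would keep their models in primary memory, while models for evicted entities would be swapped to secondary memory.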
Process mining on machine event logs for profiling abnormal behaviour and root cause analysis
Jonas Maeyens1, Annemie Vorstermans1, Mathias Verbeke2
(1) KU Leuven – Technology Campus Ghent, Gebroeders de Smetstraat 1, B-9000, Ghent, Belgium
(2) Sirris – Data and AI Competence Lab, Bd. A. Reyerslaan 80, B-1030, Brussels, Belgium
Abstract Process mining is a set of techniques in the field of process management that have primarily been used to analyse business processes, for example for the optimisation of enterprise resources. In this research, the feasibility of using process mining techniques for the analysis of event data from machine logs is investigated. A novel methodology, based on process mining, for profiling abnormal machine behaviour is proposed. Firstly, a process model is constructed from the event logs of the healthy machines. This model can subsequently be used as a benchmark to compare process models of other machines by means of conformance checking. This comparison results in a set of conformance scores related to the structure of the model and other more complex aspects such as the differences in duration of particular traces, the time spent in individual events, and the relative path frequency. The identified differences can subsequently be used as a basis for root cause analysis. The proposed approach is evaluated on a real-world industrial data set from the renewable energy domain, more specifically event logs of a fleet of inverters from several solar plants.
Keywords Process mining · Event logs · Industrial machinery · Irregular behaviour · Profiling · Root cause analysis
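The benchmark-and-compare scheme in this abstract, i.e., building a process model from healthy machines and conformance-checking other machines against it, can be sketched with a toy directly-follows model. Real conformance checking (token replay, alignments) is far richer; this sketch, with assumed function names, only illustrates the structural part of the score.

```python
def directly_follows(traces):
    # Build a reference "process model" as the set of directly-follows
    # relations observed in the healthy machines' event logs.
    model = set()
    for trace in traces:
        model.update(zip(trace, trace[1:]))
    return model

def conformance(trace, model):
    # Fraction of consecutive event pairs in the trace that the
    # reference model allows -- a crude structural conformance score.
    pairs = list(zip(trace, trace[1:]))
    if not pairs:
        return 1.0
    return sum(p in model for p in pairs) / len(pairs)
```

A machine whose traces score well below 1.0 deviates structurally from the healthy fleet, and the missing pairs point at where to start the root cause analysis.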
Profiling high leverage points for detecting anomalous users in telecom data networks
Shazia Tabassum1, Muhammad Ajmal Azad2, João Gama3
(1) INESC TEC, University of Porto, R. Dr. Roberto Frias, Porto, Portugal
(2) University of Derby, Derby, UK
(3) INESC TEC, Porto, Portugal
Abstract Fraud in telephony incurs huge revenue losses and poses a menace to both service providers and legitimate users. The problem is growing alongside advancing technologies, yet work in this area is hindered by the limited availability of data and the confidentiality of approaches. In this work, we deal with the problem of detecting different types of unsolicited users, from spammers to fraudsters, in a massive phone call network. Most malicious users in telecommunications share certain characteristics, which can be defined by a set of features whose values are uncommon for normal users. We used graph-based metrics to detect profiles that are significantly far from common user profiles in a real data log with millions of users. To achieve this, we looked for high leverage points in the 99.99th percentile, which identified a substantial number of users as extreme anomalous points. Furthermore, clustering these points helped distinguish malicious users efficiently and significantly reduced the problem space. Convincingly, the learned profiles of the detected users coincided with fraudulent behaviors.
Keywords Fraud detection · Unsolicited users · Anomaly detection
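The percentile cut-off described above, flagging users whose graph-based metric lies in the extreme tail, can be sketched as follows. The nearest-rank percentile, the metric name, and the profile layout are assumptions for illustration.

```python
def percentile(values, q):
    # Nearest-rank percentile over a list of numeric feature values.
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(q / 100.0 * (len(ordered) - 1))))
    return ordered[idx]

def extreme_points(profiles, metric, q=99.99):
    # Flag users whose graph-based metric (e.g. out-degree) lies at or
    # beyond the q-th percentile -- the "high leverage" tail.
    values = [p[metric] for p in profiles.values()]
    cutoff = percentile(values, q)
    return {user for user, p in profiles.items() if p[metric] >= cutoff}
```

On millions of users this shrinks the problem space to a small candidate set, which is then clustered to separate the different malicious profiles.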
Interconnect bypass fraud detection: a case study
Bruno Veloso1, Shazia Tabassum1, Carlos Martins2, Raphael Espanha2, Raul Azevedo2, João Gama1
(1) INESC TEC, Porto, Portugal
(2) Mobileum, Braga, Portugal
Abstract The high asymmetry of international termination rates is fertile ground for fraud in telecom companies. International calls have higher rates than national ones, which attracts fraudsters. In this paper, we present a solution to a real problem called interconnect bypass fraud, more specifically a newly identified distributed pattern that crosses different countries and keeps fraudsters from being tracked by almost all fraud detection techniques. This problem is one of the most significant in the telecommunications domain, and it exhibits abnormal behaviours such as bursts of calls from specific numbers. Based on this observation, we propose the adoption of a new fast forgetting technique that works together with the Lossy Counting algorithm, and we apply frequent set mining to capture distributed patterns from different countries. Our goal is to detect, as soon as possible, items with abnormal behaviours, e.g., bursts of calls, repetitions, mirrors, distributed behaviours, and small numbers of calls spread over a vast set of destination numbers. The results show that combining these techniques improves the detection ratio and not only complements the techniques used by the telecom company but also improves the performance of the Lossy Counting algorithm in terms of run-time, memory usage, and sensitivity in detecting abnormal behaviours. Additionally, frequent set mining allows us to capture distributed fraud patterns.
Keywords Fraud · Telecommunications · Lossy Counting · Forgetting
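The fast forgetting idea, letting old call activity fade so that sudden bursts stand out, can be sketched with exponentially decayed counters. The decay factor, burst threshold, and class interface are assumptions, not the values or design used by the authors.

```python
class ForgettingCounter:
    # Count calls per origin number with exponential forgetting: at the
    # end of each time window, counts are decayed and negligible entries
    # are dropped, so memory stays bounded and bursts stand out quickly.
    def __init__(self, decay=0.95, burst_threshold=50.0):
        self.decay = decay
        self.burst_threshold = burst_threshold
        self.counts = {}

    def tick(self):
        # Apply forgetting at the end of each time window.
        self.counts = {k: v * self.decay for k, v in self.counts.items()
                       if v * self.decay > 0.01}

    def observe(self, number, calls=1):
        # Record calls and return True when the decayed count indicates
        # a burst from this origin number.
        self.counts[number] = self.counts.get(number, 0.0) + calls
        return self.counts[number] > self.burst_threshold
```

A number with steady low traffic never alarms because forgetting keeps its count near equilibrium, while a burst pushes the count over the threshold within a single window.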
Active feature acquisition on data streams under feature drift
Christian Beyer1, Maik Büttner1, Vishnu Unnikrishnan1, Miro Schleicher1, Eirini Ntoutsi2, Myra Spiliopoulou1
(1) Otto-von-Guericke University, Magdeburg, Germany
(2) Leibniz University, Hannover, Germany
Abstract Traditional active learning tries to identify instances for which acquiring the label increases model performance under budget constraints. Less research has been devoted to actively acquiring feature values, where both the instance and the feature must be selected intelligently, and even less to a scenario where instances arrive in a stream with feature drift. We propose an active feature acquisition strategy for data streams with feature drift, as well as an active feature acquisition evaluation framework. We also implement a baseline that chooses features randomly and compare the random approach against eight different methods in a scenario where we can acquire at most one feature at a time per instance and where all features are considered to cost the same. Our initial experiments on 9 different data sets, with 7 different degrees of missing features and 8 different budgets, show that our methods outperform random acquisition on 7 data sets and perform comparably on the remaining two.
Keywords Active feature acquisition · Data streams · Feature drift
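The acquisition setting in this abstract, choosing at most one missing feature per instance under a budget where all features cost the same, can be sketched as follows. The merit-ranked strategy and all names here are illustrative assumptions; only the random baseline is taken from the abstract.

```python
import random

def acquire_feature(instance, missing, scores, budget, rng=None, strategy="best"):
    # Choose at most one missing feature to acquire for this instance,
    # provided budget remains.  `scores` maps feature -> estimated merit
    # (e.g. correlation with the label).  "random" is the baseline from
    # the abstract; "best" is a simple merit-ranked alternative.
    if budget <= 0 or not missing:
        return None, budget
    if strategy == "random":
        choice = (rng or random).choice(sorted(missing))
    else:
        choice = max(missing, key=lambda f: scores.get(f, 0.0))
    return choice, budget - 1  # every acquisition costs one budget unit
```

Under feature drift, the merit scores themselves would be maintained incrementally so that the ranking tracks which features are currently informative.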