M. Verleysen - Machine learning for high-dimensional data

Machine learning is used nowadays to build models for classification and regression tasks, among others. The learning principle consists in designing the models based on the information contained in the dataset, with as few a priori restrictions as possible on the class of models of interest.

While many paradigms exist and are widely used in machine learning, most of them suffer from the "curse of dimensionality". The curse of dimensionality means that strange phenomena appear when data are represented in a high-dimensional space. These phenomena are most often counter-intuitive: the conventional geometrical interpretation of data analysis in 2- or 3-dimensional spaces cannot be extended to much higher dimensions.
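One classical counter-intuitive phenomenon can be checked with a few lines of code: in high dimension, almost all of a hypercube's volume lies in a thin shell near its surface, so "uniformly distributed" points end up close to the boundary. A minimal sketch (the 5% shell width is an arbitrary illustrative choice):

```python
# Fraction of the unit cube [0, 1]^d occupied by the inner cube
# [0.05, 0.95]^d, i.e. the points more than 5% away from every face.
# The inner cube has side 0.9, hence volume 0.9**d.
for d in (2, 10, 100):
    inner = 0.9 ** d
    print(f"d={d:3d}: inner volume = {inner:.6f}, shell volume = {1 - inner:.6f}")
```

In dimension 2 the shell holds about 19% of the volume; in dimension 100 it holds virtually everything, even though the shell is only 5% thick along each axis.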

Among the problems related to the curse of dimensionality, feature redundancy and the concentration of the norm are probably those with the largest impact on data analysis tools. Feature redundancy means that models lose identifiability (for example, they oscillate between equivalent solutions) and become difficult to interpret; although redundancy is an advantage from the point of view of the information content of the data, it makes learning the model more difficult. The concentration of the norm is a more specific unfortunate property of high-dimensional vectors: as the dimension of the space increases, norms and distances concentrate, making the discrimination between data points more difficult.
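The concentration of the norm is easy to observe empirically: for points drawn uniformly at random, the relative contrast between the nearest and the farthest neighbour of a query point shrinks as the dimension grows. A small sketch (the sample size of 500 and the chosen dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Relative contrast (max distance - min distance) / min distance from one
# query point to 499 other points drawn uniformly in [0, 1]^d.
for d in (2, 10, 1000):
    x = rng.random((500, d))
    dist = np.linalg.norm(x[1:] - x[0], axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}: relative contrast = {contrast:.3f}")
```

As d grows, all distances cluster around a common value, so distance-based tools (nearest neighbours, kernels, clustering) lose their discriminating power.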

Most data analysis tools are not robust to these phenomena. Their performance collapses when the dimension of the data space increases, in particular when the number of data available for learning is limited.

After a discussion of the phenomena related to the curse of dimensionality, this lecture will cover feature selection and manifold learning, two approaches to fight the consequences of the curse of dimensionality. Feature selection consists in selecting some of the variables/features among those available in the dataset, according to a relevance criterion. Filter, wrapper and embedded methods will be covered. Manifold learning consists in mapping the high-dimensional data to a lower-dimensional representation, while preserving a topology, distance or information criterion. Such nonlinear projection methods may be used both for dimensionality reduction and for data visualization when the manifold dimension is restricted to 2 or 3. In addition to improving the performance of machine learning models, feature selection and manifold learning ease the task of model interpretation, by pointing out which features or variables are important for the task. The importance of feature selection in machine learning will be illustrated in the context of medical applications.
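As a concrete illustration of the filter approach, features can be ranked by a model-independent relevance criterion such as the absolute Pearson correlation with the target. The data below are synthetic and the criterion purely illustrative; real filter methods often use richer criteria such as mutual information:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data: among 6 features, only features 0 and 3
# actually influence the target.
n, d = 200, 6
X = rng.standard_normal((n, d))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(n)

# Filter method: score each feature by |correlation with y|, then rank.
# No model is trained; the criterion depends only on the data.
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
ranking = np.argsort(scores)[::-1]
print("relevance ranking (most relevant first):", ranking)
```

The two informative features should dominate the ranking; a wrapper method would instead evaluate subsets of features by training a model on each, and an embedded method would obtain the selection as a by-product of training (e.g. through sparsity-inducing penalties).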