Funding Scheme

RGC Faculty Development Scheme

Project Title

Model Selection with High Dimensional Incomplete Data

Project Team (HSMC Staff) 

Prof TANG Man Lai, Department of Mathematics & Statistics (PI)

Other Collaborating Parties


Project Period

1-1-2019 to 31-12-2020 (on-going)

Funding Amount (HKD)



High-dimensional data analysis has become increasingly common and important in diverse fields, for example genomics, health sciences, economics and machine learning. Model selection plays a pivotal role in contemporary scientific discovery, and there is a large body of work on model selection for complete data. However, complete data are often unavailable for every subject for many reasons, including unavailable covariate measurements and loss of data. The literature on model selection for high-dimensional data in the presence of missing or incomplete values is relatively sparse. Efficient methods and algorithms for model selection with incomplete data are therefore of great research interest and in practical demand.

For model selection, information criteria (e.g., the Akaike information criterion and the Bayesian information criterion) are commonly applied, and they can be readily combined with the well-known EM algorithm in the presence of missing values. A generalized EM algorithm has also been developed that updates both the model and the parameters under that model in each iteration. It alternates between an Expectation step and a Model Selection step, converges globally, and yields a consistent model. However, the Model Selection step may not always be numerically feasible, especially for high-dimensional data. A new method for model selection with high-dimensional incomplete data is therefore highly desirable. The algorithm proposed in this project is intended to yield a consistent model under general missing data patterns and to converge numerically. Moreover, the proposed method is expected to perform efficiently in variable selection for linear regression and generalized linear models and in model selection for graphical models.
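For reference, the two information criteria named above have the following standard forms. The notation is an assumption introduced here for illustration and does not come from the project description: \hat{\theta}_M is the maximum likelihood estimate under a candidate model M, \ell is the log-likelihood (with missing values, this would typically be the observed-data log-likelihood), k_M is the number of free parameters in M, and n is the sample size. The candidate with the smallest value is preferred.

```latex
\mathrm{AIC}(M) = -2\,\ell(\hat{\theta}_M) + 2\,k_M,
\qquad
\mathrm{BIC}(M) = -2\,\ell(\hat{\theta}_M) + k_M \log n .
```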

Because it is convenient to implement with standard software modules, multiple imputation is arguably the most widely used approach for handling missing data. It is straightforward to apply an existing model selection method to each imputed data set. However, it is challenging to combine model selection results across imputed data sets in a principled framework. To overcome this challenge, many advanced techniques have been developed for the variable selection problem, such as applying the group lasso penalty to the merged data sets of all imputations, stability selection within bootstrap imputation, and random lasso combined with multiple imputation. These techniques are feasible for high-dimensional data with complex missing patterns and have performed well in simulation studies and real data analyses. However, to the best of our knowledge, and somewhat surprisingly, there is no imputation method for graphical models. An imputation-based method for graphical model selection is therefore highly desirable. In this project, we investigate bootstrap multiple imputation with stability selection. We expect the proposed method to handle general missing data patterns.
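The general idea of stability selection within bootstrap imputation, in the variable selection setting mentioned above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method proposed in the project: missing entries are filled by a simple random hot-deck draw (a stand-in for a proper imputation model), selection uses a plain lasso fitted by coordinate descent, and the penalty `lam`, the number of bootstrap replicates `n_boot`, the selection `threshold` and all function names are hypothetical choices made for this example.

```python
import random
import statistics


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(col):
    """Centre a column and scale it to unit (population) standard deviation."""
    m = statistics.fmean(col)
    s = statistics.pstdev(col) or 1.0  # guard against constant columns
    return [(v - m) / s for v in col]


def impute_hot_deck(rows, rng):
    """Replace each None by a random observed value from the same column.
    Assumes every column has at least one observed value."""
    p = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(p)]
    return [[r[j] if r[j] is not None else rng.choice(observed[j])
             for j in range(p)] for r in rows]


def lasso_support(cols, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    `cols` are standardized columns and `y` is centred; returns the
    set of indices with nonzero coefficients."""
    n, p = len(y), len(cols)
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        for j in range(p):
            xj = cols[j]
            # x_j'x_j / n = 1 after standardization, so the update is a
            # soft-threshold of the partial-residual correlation.
            rho = sum(x * r for x, r in zip(xj, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - x * delta for r, x in zip(resid, xj)]
                beta[j] = new
    return {j for j in range(p) if beta[j] != 0.0}


def stability_selection(X, y, lam=0.25, n_boot=20, threshold=0.8, seed=0):
    """Bootstrap the rows, impute each bootstrap sample, run the lasso,
    and keep variables selected in at least `threshold` of the replicates."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = impute_hot_deck([X[i] for i in idx], rng)
        cols = [standardize([row[j] for row in Xb]) for j in range(p)]
        yb = standardize([y[i] for i in idx])
        for j in lasso_support(cols, yb, lam):
            counts[j] += 1
    freq = [c / n_boot for c in counts]
    return freq, [j for j in range(p) if freq[j] >= threshold]


def make_demo_data(n=200, p=6, miss_rate=0.15, seed=1):
    """Synthetic data: y depends on columns 0 and 1 only; entries of X
    are set to None completely at random."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(p)]
        y.append(2 * row[0] - 2 * row[1] + rng.gauss(0, 0.5))
        X.append([None if rng.random() < miss_rate else v for v in row])
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    freq, selected = stability_selection(X, y)
    print("selection frequencies:", freq)
    print("selected variables:", selected)
```

Aggregating selection frequencies across bootstrap replicates, rather than merging the selected sets directly, is what gives a single rule for combining results from many imputed data sets. For graphical models, the lasso step would be replaced by a graph estimator and the frequencies would be tracked per edge; that extension is not shown here.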