Are you an MSc student with a solid grounding in AI, machine learning, neural networks, and/or deep learning? Are you interested in applying your knowledge and expertise to solve real-life problems in science? Are you willing to step outside your comfort zone to learn about molecular absorption spectroscopy and its application in breath analysis and disease detection, and to face the data analysis challenges in this field? We have an offer that you can’t refuse: an MSc internship with an allowance.
The recent development of ultra-broadband, low-noise mid-infrared spectroscopy systems opens up great potential for sensitive, simultaneous trace detection of a long list of molecular species. This is particularly interesting for applications dealing with complex matrices, such as breath analysis and plasma diagnostics. Traditionally, the classical least squares (CLS) method [1] is used in spectral analysis to decompose the interfering spectra of different species and extract their concentrations accurately. In this method, the model spectra of all species in the sample matrix are calculated from existing databases and then fitted together to the measured spectra to retrieve their concentrations (a linear multiline fitting scheme). However, CLS has several weaknesses. Strong absorption features (close to 100%), baseline drifts (i.e. drifts in the spectral power of the laser during the measurement), etalon fringes (i.e. sinusoidal instrument-specific artifacts on the spectra), and unfitted absorption features can seriously affect its accuracy. Various procedures can be employed to minimize these effects, for example removing very strong absorption features from the overall fitting routine, or modeling and fitting the baseline drift and the etalon fringes with a low-order polynomial and a sum of low-frequency sine waves, respectively. Although often effective, these procedures are time-consuming and can sometimes degrade the accuracy of the fit; e.g. removing the baseline drift or etalon fringes can also remove some spectral features of the absorbing species. As a result, the quality of the spectral analysis degrades and falls short of the requirements of demanding applications, such as the analysis of medical data in breath analysis.
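To make the CLS idea concrete, here is a minimal sketch in Python with NumPy. It assumes absorbance is linear in concentration and fits the concentrations together with a low-order polynomial baseline in a single linear least-squares step. The function name `cls_fit`, the Gaussian "lines", and all numerical values are hypothetical stand-ins for illustration; they are not database spectra or our actual analysis code.

```python
import numpy as np

def cls_fit(measured, model_spectra, poly_order=2):
    """Least-squares fit of a measured absorbance spectrum.

    measured:      (n_points,) measured absorbance
    model_spectra: (n_points, n_species) unit-concentration model spectra
    Returns (concentrations, baseline polynomial coefficients).
    """
    n_points, n_species = model_spectra.shape
    # Polynomial baseline terms 1, x, x^2, ... on a normalized axis
    x = np.linspace(-1.0, 1.0, n_points)
    baseline = np.vander(x, poly_order + 1, increasing=True)
    # Fit species and baseline simultaneously
    design = np.hstack([model_spectra, baseline])
    coeffs, *_ = np.linalg.lstsq(design, measured, rcond=None)
    return coeffs[:n_species], coeffs[n_species:]

# Synthetic demonstration (assumed line shapes, not real molecules)
rng = np.random.default_rng(0)
nu = np.linspace(0.0, 1.0, 500)  # normalized wavenumber axis

def gauss(center, width):
    return np.exp(-0.5 * ((nu - center) / width) ** 2)

model_spectra = np.column_stack([
    gauss(0.30, 0.02) + 0.5 * gauss(0.60, 0.03),  # species A
    gauss(0.35, 0.02) + gauss(0.80, 0.02),        # species B, overlapping A
])
true_conc = np.array([2.0, 0.5])
drift = 0.05 + 0.1 * nu  # slow baseline drift
measured = model_spectra @ true_conc + drift + rng.normal(0.0, 1e-3, nu.size)

conc, baseline_coeffs = cls_fit(measured, model_spectra)
print("retrieved concentrations:", conc)
```

In this idealized setting the fit recovers the concentrations well, because the drift is exactly polynomial and every absorber is included in the model. The difficulties described above arise when those assumptions break down, for example with saturated lines, fringes, or species missing from the model.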
We have recently used a partial least squares (PLS) method with a novel hybrid dataset approach as an alternative. PLS is a purely statistical model and relies on calibration measurements as training datasets. However, constructing a real training dataset is far too time-consuming and costly, as it would require many different calibrated gas mixtures measured with high precision and accuracy. Our approach is to create a simulated dataset tailored to a specific instrument by combining simulated absorbance spectra with measured blank (featureless background) intensity spectra. The simulations provide the absorption spectra, while the blank measurements provide the realistic, instrument-specific features of the spectrometer, such as noise patterns, baseline drifts, and etalon fringes. Combining the two yields an affordable and scalable process, and we have achieved encouraging results with it. This workflow is not specific to PLS and can also be applied to other models, such as machine learning using neural networks (deep learning). We would therefore like to investigate this opportunity further. In particular, we know that PLS can be sensitive to outliers in real-world data, so investigating alternatives such as the so-called LASSO (least absolute shrinkage and selection operator), which is known to be less sensitive to outliers than least-squares methods, could be a promising approach. Alternatively, we could explicitly model the (noisy) spectrometer measurement features and include them as a separate penalty term in the overall metric. Your task will be to establish whether either of these approaches could alleviate some of the drawbacks of existing methods. Of course, you could also come up with another idea of your own!
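The hybrid dataset idea can be sketched as follows, again in Python with NumPy. This is only an illustration under stated assumptions, not our actual pipeline: the simulated absorbance is imposed on a blank intensity spectrum via the Beer-Lambert law, and a second blank serves as the reference when converting back to absorbance. The Gaussian line shapes, the synthetic blanks (in practice these would be loaded from real blank measurements), and the function name `make_hybrid_dataset` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_blanks, n_samples = 400, 20, 1000
nu = np.linspace(0.0, 1.0, n_points)  # normalized wavenumber axis

# Stand-in for database-derived unit-concentration absorbance spectra
centers = np.array([0.25, 0.50, 0.75])
unit_absorbance = np.exp(-0.5 * ((nu[None, :] - centers[:, None]) / 0.02) ** 2)

# Stand-in for measured blank intensity spectra (assumed synthetic here):
# each blank has its own drift, etalon fringe phase, and noise
def fake_blank():
    drift = 1.0 + 0.05 * rng.normal() * nu
    fringe = 0.01 * np.sin(2 * np.pi * 5 * nu + rng.uniform(0, 2 * np.pi))
    noise = rng.normal(0.0, 1e-3, n_points)
    return drift + fringe + noise

blanks = np.array([fake_blank() for _ in range(n_blanks)])

def make_hybrid_dataset(unit_absorbance, blanks, n_samples, max_conc, rng):
    """Return (features, concentrations) for training a regression model."""
    n_species = unit_absorbance.shape[0]
    conc = rng.uniform(0.0, max_conc, size=(n_samples, n_species))
    absorbance = conc @ unit_absorbance  # ideal, noise-free absorbance
    # Pick one blank as the sample background and one as the reference
    idx_sample = rng.integers(0, len(blanks), n_samples)
    idx_ref = rng.integers(0, len(blanks), n_samples)
    intensity = blanks[idx_sample] * np.exp(-absorbance)  # Beer-Lambert
    # Absorbance as the instrument would report it, artifacts included
    features = -np.log(intensity / blanks[idx_ref])
    return features, conc

X, y = make_hybrid_dataset(unit_absorbance, blanks, n_samples, max_conc=1.0, rng=rng)
print(X.shape, y.shape)
```

The resulting `X` and `y` arrays could then be passed to any regression model, for instance `PLSRegression` or `Lasso` from scikit-learn, or a neural network, so that different models can be compared on exactly the same instrument-specific data.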
You will work within the Data Science group, in close collaboration with the Life Science Trace Detection Laboratory (TDLab), part of the Institute for Molecules and Materials (IMM) of the Faculty of Science at Radboud University. We have a vibrant and enthusiastic group of young researchers working at the crossroads of physics, chemistry, and biology. The successful candidate will receive basic training in molecular absorption spectroscopy to understand the essential physics behind it and become familiar with the available molecular absorption databases. They will work closely with a final-year PhD candidate in our group and receive full support from other members of TDLab. The final goal is to lay the foundations for using deep learning to tackle the problem.
[1] P. R. Griffiths and J. A. de Haseth, Fourier Transform Infrared Spectrometry, Chapter 9: Quantitative Analysis (John Wiley & Sons, Inc., 2007).
Interested? Need more information? Please contact Simona Cristescu of the Trace Detection Laboratory (TDLab) and/or Tom Claassen of Data Science. We will be happy to talk to you!