Grouping of complex substances using analytical chemistry data: A framework for quantitative evaluation and visualization

Authors: Melis Onel aff001;  Burcu Beykal aff001;  Kyle Ferguson aff003;  Weihsueh A. Chiu aff003;  Thomas J. McDonald aff004;  Lan Zhou aff005;  John S. House aff006;  Fred A. Wright aff006;  David A. Sheen aff008;  Ivan Rusyn aff003;  Efstratios N. Pistikopoulos aff001
Author affiliations: Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX, United States of America aff001;  Texas A&M Energy Institute, Texas A&M University, College Station, TX, United States of America aff002;  Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, United States of America aff003;  Department of Environmental and Occupational Health, Texas A&M University, College Station, TX, United States of America aff004;  Department of Statistics, Texas A&M University, College Station, TX, United States of America aff005;  Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States of America aff006;  Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States of America aff007;  Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD, United States of America aff008
Published in: PLoS ONE 14(10)
Category: Research Article
doi: 10.1371/journal.pone.0223517


A detailed characterization of the chemical composition of complex substances, such as products of petroleum refining and environmental mixtures, is greatly needed in exposure assessment and manufacturing. The inherent complexity and variability in the composition of complex substances complicate the choice of methods for their detailed analytical characterization. Yet, in lieu of the exact chemical composition of complex substances, evaluating their degree of similarity is a sensible path toward decision-making in environmental health regulations. Grouping of similar complex substances is a challenge that can be addressed via advanced analytical methods and streamlined data analysis and visualization techniques. Here, we propose a framework with unsupervised and supervised analyses to optimally group complex substances based on their analytical features. We test two data sets of complex oil-derived substances. The first data set is from gas chromatography-mass spectrometry (GC-MS) analysis of 20 Standard Reference Materials representing crude oils and oil refining products. The second data set consists of 15 samples of various gas oils analyzed using three analytical techniques: GC-MS, GC×GC-flame ionization detection (FID), and ion mobility spectrometry-mass spectrometry (IM-MS). For the unsupervised analysis, we use hierarchical clustering with Pearson correlation as the similarity metric; for the supervised analysis, we build classification models using the Random Forest algorithm. We present a quantitative comparative assessment of clustering results via the Fowlkes–Mallows index, and of classification results via model accuracies in predicting the group of an unknown complex substance. We demonstrate the effect of (i) different grouping methodologies, (ii) data set size, and (iii) dimensionality reduction on the grouping quality, and (iv) different analytical techniques on the characterization of the complex substances.
While complexity and variability in chemical composition are inherent features of complex substances, we demonstrate how the choice of data analysis and visualization methods can affect how their characteristics are communicated when delineating sufficient similarity.
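The two analyses named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic stand-ins (three hypothetical groups of five samples), and the average-linkage rule, the number of trees, and the five-fold cross-validation are assumptions chosen only to make the example runnable with SciPy and scikit-learn.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fowlkes_mallows_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an analytical feature table (NOT the paper's data):
# 3 hypothetical groups, 5 samples per group, 50 features per sample.
rng = np.random.default_rng(0)
n_groups, n_per_group, n_features = 3, 5, 50
group_profiles = rng.normal(size=(n_groups, n_features))
true_groups = np.repeat(np.arange(n_groups), n_per_group)
noise = 0.1 * rng.normal(size=(n_groups * n_per_group, n_features))
X = group_profiles[true_groups] + noise

# Unsupervised analysis: hierarchical clustering on the correlation
# distance (1 - Pearson r). Average linkage is an assumption here.
distances = pdist(X, metric="correlation")
tree = linkage(distances, method="average")
cluster_labels = fcluster(tree, t=n_groups, criterion="maxclust")

# Fowlkes-Mallows index: agreement between the clustering and the
# reference grouping, between 0 and 1 (1 means identical partitions).
fm_index = fowlkes_mallows_score(true_groups, cluster_labels)

# Supervised analysis: Random Forest classifier, scored by the mean
# accuracy over stratified 5-fold cross-validation (assumed settings).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
accuracy = cross_val_score(forest, X, true_groups, cv=5).mean()

print(f"Fowlkes-Mallows index: {fm_index:.3f}")
print(f"Cross-validated accuracy: {accuracy:.3f}")
```

Reporting the Fowlkes–Mallows index and the cross-validated accuracy side by side mirrors the quantitative comparison described above: the first scores an unsupervised grouping against a reference grouping, the second scores how well a trained model assigns an unseen sample to its group.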

Keywords:

Analytical chemistry – Crude oil – Data processing – Data visualization – Fuels – Gas chromatography-mass spectrometry – Petroleum – Motor oil



Article published in the journal PLoS ONE, 2019, Issue 10