Measuring the diffusion of innovations with paragraph vector topic models

Authors: David Lenz; Peter Winker
Author affiliations: Department of Economics, Justus-Liebig-University, Gießen, Germany
Published in: PLoS ONE 15(1)
Category: Research Article
doi: 10.1371/journal.pone.0226685


Measuring the diffusion of innovations from textual data sources other than patent data has not been studied extensively. However, early and accurate indicators of innovation and the recognition of trends in innovation are essential for successfully promoting economic growth through technological progress via evidence-based policy making. In this study, we propose the Paragraph Vector Topic Model (PVTM) and apply it to technology-related news articles to analyze innovation-related topics over time and gain insights into their diffusion process. PVTM represents documents in a semantic space, which has been shown to capture latent variables of the underlying documents, e.g., the latent topics. Clusters of documents in the semantic space can then be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling. Using PVTM, we identify innovation-related topics from 170,000 technology news articles published over a span of 20 years and gather insights about their diffusion state by measuring topic importance in the corpus over time. Our results suggest that PVTM is a credible alternative to widely used topic models for the discovery of latent topics in (technology-related) news articles. An examination of three exemplary topics shows that innovation diffusion could be assessed using topic importance measures derived from PVTM. We also find that PVTM diffusion indicators for certain topics Granger-cause Google Trends indices with matching search terms.
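The two final steps described above, soft assignment of document vectors to Gaussian mixture components and aggregation of those assignments into a topic importance series, can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the article: topic importance is defined here as the mean posterior topic probability of all documents published in a given year, the mixture parameters are given rather than fitted with an EM algorithm, covariances are diagonal, and the document vectors are hypothetical 2-D points instead of learned paragraph vectors. All function and variable names are illustrative.

```python
import math
from collections import defaultdict


def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def topic_probabilities(x, weights, means, variances):
    """Posterior probability of each mixture component (topic) given x."""
    log_joint = [
        math.log(w) + gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def topic_importance_by_year(docs, weights, means, variances):
    """Average topic probability over all documents published in each year."""
    sums = defaultdict(lambda: [0.0] * len(weights))
    counts = defaultdict(int)
    for year, vector in docs:
        probs = topic_probabilities(vector, weights, means, variances)
        sums[year] = [s + p for s, p in zip(sums[year], probs)]
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sorted(sums)}


# Hypothetical (year, document vector) pairs; real paragraph vectors
# would have far more dimensions.
docs = [
    (2015, [0.1, 0.0]), (2015, [0.2, -0.1]), (2015, [3.9, 4.1]),
    (2016, [4.0, 4.0]), (2016, [4.2, 3.8]), (2016, [0.0, 0.1]),
]
# Assumed mixture parameters: two topics with equal weight and unit variance.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

importance = topic_importance_by_year(docs, weights, means, variances)
for year, shares in importance.items():
    print(year, [round(s, 3) for s in shares])
```

In this toy corpus the first topic dominates in 2015 and the second in 2016, which is the kind of shift a diffusion indicator would be expected to track. A series of this kind could then be compared with an external indicator such as a search-interest index.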

Keywords:

Algorithms – Covariance – Information retrieval – Online encyclopedias – Semantics – Vector spaces – Virtual reality – Visual inspection


This article was published in PLoS ONE, 2020, Issue 1