Automated content analysis across six languages

Authors: Leah Cathryn Windsor (1); James Grayson Cupit (1); Alistair James Windsor (2)
Author affiliations: (1) Institute for Intelligent Systems, The University of Memphis, Memphis, Tennessee, United States of America; (2) Department of Mathematical Sciences, The University of Memphis, Memphis, Tennessee, United States of America
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0224425


Corpus selection bias in international relations research presents an epistemological problem: how do we know what we know? Most social science research in text analytics relies on English-language corpora, which limits our ability to understand international phenomena. To address corpus selection bias, we present results suggesting that machine translation can be used to analyze non-English sources. We compare human translation and machine translation (Google Translate) on a collection of aligned sentences from United Nations documents extracted from the Multi-UN corpus, analyzed with a "bag of words" tool, Linguistic Inquiry and Word Count (LIWC). Overall, the LIWC indices proved relatively stable across machine-translated and human-translated sentences. We find that while there are statistically significant differences between the original and translated documents, the effect sizes are relatively small, especially for psychological processes.
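
The comparison described above can be illustrated with a minimal sketch: score each sentence by the share of its words that fall in a dictionary category, then summarize the gap between human and machine translations with an effect size. Everything here is an assumption made for illustration. The word list TOY_CATEGORY, the helper names, and the example sentences are hypothetical; the real LIWC dictionaries are proprietary and much larger; and Cohen's d is used only as one common effect-size measure, since the abstract does not say which measure the authors computed.

```python
import math
import re
from statistics import mean, stdev

# Hypothetical toy category standing in for one dictionary-based index.
# This is NOT a LIWC dictionary.
TOY_CATEGORY = {"we", "our", "us"}


def category_share(sentence, category):
    """Return the percentage of tokens in `sentence` found in `category`."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in category)
    return 100.0 * hits / len(tokens)


def cohens_d(xs, ys):
    """Cohen's d for two samples, using the pooled standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2) / (nx + ny - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples have zero variance")
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var)


# Invented aligned sentence pairs (not taken from the Multi-UN corpus).
human = [
    "We welcome the report of our committee.",
    "The council adopted the resolution.",
    "Our delegation thanks the members.",
]
machine = [
    "We welcome the report of the committee.",
    "The council adopted the resolution.",
    "Our delegation thanks its members.",
]

human_scores = [category_share(s, TOY_CATEGORY) for s in human]
machine_scores = [category_share(s, TOY_CATEGORY) for s in machine]
print("Cohen's d:", cohens_d(human_scores, machine_scores))
```

A small absolute value of d would indicate that the category score barely shifts between translation methods, which is the sense in which the indices are described as stable.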

Keywords:

Cognition – Grammar – Languages – Psycholinguistics – Semantics – Social sciences – Syntax – Computational linguistics



Article published in the journal issue: 2019, Issue 11