A hierarchical loss and its problems when classifying non-hierarchically

Authors: Cinna Wu (1); Mark Tygert (1); Yann LeCun (2)
Author affiliations: (1) Facebook, Menlo Park, CA, United States of America; (2) Facebook, New York, NY, United States of America
Published in: PLoS ONE 14(12)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0226222


Failing to distinguish between a sheepdog and a skyscraper should be worse and penalized more than failing to distinguish between a sheepdog and a poodle; after all, sheepdogs and poodles are both breeds of dogs. However, existing metrics of failure (so-called “loss” or “win”) used in textual or visual classification/recognition via neural networks seldom leverage a-priori information, such as a sheepdog being more similar to a poodle than to a skyscraper. We define a metric that, inter alia, can penalize failure to distinguish between a sheepdog and a skyscraper more than failure to distinguish between a sheepdog and a poodle. Unlike previously employed possibilities, this metric is based on an ultrametric tree associated with any given tree organization into a semantically meaningful hierarchy of a classifier’s classes. An ultrametric tree is a tree with a so-called ultrametric distance metric such that all leaves are at the same distance from the root. Unfortunately, extensive numerical experiments indicate that the standard practice of training neural networks via stochastic gradient descent with random starting points often drives down the hierarchical loss nearly as much when minimizing the standard cross-entropy loss as when trying to minimize the hierarchical loss directly. Thus, this hierarchical loss is unreliable as an objective for plain, randomly started stochastic gradient descent to minimize; the main value of the hierarchical loss may be merely as a meaningful metric of success of a classifier.
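To make the idea of an ultrametric on a class hierarchy concrete, here is a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' construction: the toy hierarchy in `PATHS`, the choice of distance `2 ** -depth` of the lowest common ancestor, and the helper names `common_depth`, `ultrametric_distance`, and `expected_penalty`. It shows only that a distance determined by the deepest shared ancestor places all leaves at equal depth, satisfies the strong triangle inequality, and penalizes confusing a sheepdog with a skyscraper more than confusing it with a poodle.

```python
# Hypothetical toy hierarchy: each class maps to its path from the root.
# All leaves sit at the same depth, so they are equidistant from the root.
PATHS = {
    "sheepdog": ("root", "animal", "dog", "sheepdog"),
    "poodle": ("root", "animal", "dog", "poodle"),
    "skyscraper": ("root", "structure", "building", "skyscraper"),
}


def common_depth(a, b):
    """Depth of the lowest common ancestor of leaves a and b (root = 0)."""
    shared = 0
    for x, y in zip(PATHS[a], PATHS[b]):
        if x != y:
            break
        shared += 1
    return shared - 1  # the root is always shared and counts as depth 0


def ultrametric_distance(a, b):
    """Illustrative ultrametric: halves with each extra shared level.

    The weighting 2 ** -depth is an assumption for this sketch, not
    necessarily the weighting used in the paper.
    """
    if a == b:
        return 0.0
    return 2.0 ** (-common_depth(a, b))


def expected_penalty(probs, true_label):
    """Expected distance from the true class under predicted probabilities.

    One possible hierarchy-aware penalty, shown only as an example.
    """
    return sum(p * ultrametric_distance(true_label, c) for c, p in probs.items())
```

With this toy hierarchy, `ultrametric_distance("sheepdog", "poodle")` evaluates to 0.25 and `ultrametric_distance("sheepdog", "skyscraper")` to 1.0, so a classifier that puts all its probability on "skyscraper" for a sheepdog image incurs four times the penalty of one that puts it on "poodle".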

Keywords:

Algorithms – Dogs – Leaves – Neural networks – Online encyclopedias – Phylogenetic analysis – Probability distribution – Taxonomy



Article published in journal issue:

2019, Issue 12