FlexGraph: Flexible partitioning and storage for scalable graph mining


Autoři: Chiwan Park aff001;  Ha-Myung Park aff001;  U. Kang aff001
Působiště autorů: Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea aff001
Vyšlo v časopise: PLoS ONE 15(1)
Kategorie: Research Article
doi: 10.1371/journal.pone.0227032

Souhrn

How can we analyze large graphs such as the Web, and social networks with hundreds of billions of vertices and edges? Although many graph mining systems have been proposed to perform various graph mining algorithms on such large graphs, they have difficulties in processing Web-scale graphs due to massive communication and I/O costs caused by communication between workers, and reading subgraphs repeatedly. In this paper, we propose FlexGraph, a scalable distributed graph mining method reducing the costs by exploiting properties of real-world graphs. FlexGraph significantly decreases the communication cost, which is the main bottleneck of distributed systems, by exploiting different edge placement policies based on types of vertices. Furthermore, we propose a flexible storage format to reduce I/O costs when reading input graph repeatedly. Experiments show that FlexGraph succeeds in processing up to 64× larger graphs than existing distributed memory-based graph mining methods, and consistently outperforms previous disk-based graph mining methods.

Klíčová slova:

Algorithms – Data processing – Memory – Network analysis – Social networks – Social systems – Statistical distributions – Twitter


Zdroje

1. Kang U, Tsourakakis CE, Faloutsos C. PEGASUS: A Peta-Scale Graph Mining System. In: ICDM; 2009. p. 229–238.

2. Kang U, Tsourakakis CE, Faloutsos C. PEGASUS: mining peta-scale graphs. Knowl Inf Syst. 2011;27(2):303–325. doi: 10.1007/s10115-010-0305-0

3. Park HM, Park C, Kang U. PegasusN: A Scalable and Versatile Graph Mining System. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, February 2-7, 2018, New Orleans, Louisiana, USA.; 2018.

4. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, et al. Pregel: a system for large-scale graph processing. In: SIGMOD; 2010. p. 135–146.

5. Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In: OSDI; 2012. p. 17–30.

6. Kang U, Tsourakakis CE, Appel AP, Faloutsos C, Leskovec J. Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations. In: SDM; 2010. p. 548–558.

7. Kang U, Tsourakakis CE, Appel AP, Faloutsos C, Leskovec J. HADI: Mining Radii of Large Graphs. ACM Trans Knowl Discov Data. 2011;5:8:1–8:24. http://doi.acm.org/10.1145/1921632.1921634.

8. Kang U, Meeder B, Faloutsos C. Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation. In: PAKDD; 2011. p. 13–25.

9. Kang U, Meeder B, Papalexakis E, Faloutsos C. HEigen: Spectral Analysis for Billion-Scale Graphs. Knowledge and Data Engineering, IEEE Transactions on. 2014;26(2):350–362. doi: 10.1109/TKDE.2012.244

10. Gao J, Zhou C, Zhou J, Yu JX. Continuous pattern detection over billion-edge graph using distributed framework. In: ICDE; 2014. p. 556–567.

11. Chen R, Shi J, Zang B, Guan H. Bipartite-oriented distributed graph partitioning for big learning. In: APSys; 2014. p. 14:1–14:7.

12. Kang U, McGlohon M, Akoglu L, Faloutsos C. Patterns on the Connected Components of Terabyte-Scale Graphs. In: ICDM; 2010. p. 875–880.

13. Yan D, Cheng J, Xing K, Lu Y, Ng W, Bu Y. Pregel Algorithms for Graph Connectivity Problems with Performance Guarantees. PVLDB. 2014;7(14):1821–1832.

14. Akoglu L, Chau DH, Kang U, Koutra D, Faloutsos C. OPAvion: mining and visualization in large graphs. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012; 2012. p. 717–720.

15. Kang U, Lee JY, Koutra D, Faloutsos C. Net-Ray: Visualizing and Mining Billion-Scale Graphs. In: Advances in Knowledge Discovery and Data Mining—18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part I; 2014. p. 348–361.

16. Quick L, Wilkinson P, Hardcastle D. Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis. In: ASONAM; 2012. p. 457–463.

17. Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I. GraphX: Graph Processing in a Distributed Dataflow Framework. In: OSDI; 2014. p. 599–613.

18. Seo S, Yoon EJ, Kim J, Jin S, Kim J, Maeng S. HAMA: An Efficient Matrix Computation with the MapReduce Framework. In: CloudCom; 2010. p. 721–726.

19. Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop Distributed File System. In: MSST; 2010. p. 1–10.

20. Lee H, Shao B, Kang U. Fast graph mining with HBase. Inf Sci. 2015;315:56–66. doi: 10.1016/j.ins.2015.04.016

21. Wang Z, Gu Y, Bao Y, Yu G, Yu JX. Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing. In: SIGMOD; 2016. p. 479–494.

22. Bu Y, Borkar VR, Jia J, Carey MJ, Condie T. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. PVLDB. 2014;8(2):161–172.

23. Kyrola A, Blelloch GE, Guestrin C. GraphChi: Large-Scale Graph Computation on Just a PC. In: OSDI; 2012. p. 31–46.

24. Han W, Lee S, Park K, Lee J, Kim M, Kim J, et al. TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: KDD; 2013. p. 77–85.

25. Lin Z, Kahng M, Sabrin KM, Chau DHP, Lee H, Kang U. MMap: Fast billion-scale graph computation on a PC via memory mapping. In: 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014; 2014. p. 159–164.

26. Gualdron H, Cordeiro RLF, Jr JFR, Chau DHP, Kahng M, Kang U. M-Flash: Fast Billion-Scale Graph Computation Using a Bimodal Block Processing Model. In: ECML PKDD, Proceedings, Part II; 2016. p. 623–640.

27. Seo H, Kim J, Kim M. GStream: a graph streaming processing method for large-scale graphs on GPUs. In: PPoPP, February 7-11, 2015; 2015. p. 253–254.

28. Ma L, Yang Z, Chen H, Xue J, Dai Y. Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication. In: ATC; 2017. p. 195–207.

29. Maass S, Min C, Kashyap S, Kang W, Kumar M, Kim T. Mosaic: Processing a Trillion-Edge Graph on a Single Machine. In: EuroSys; 2017. p. 527–543.

30. Venkataraman S, Bodzsar E, Roy I, AuYoung A, Schreiber RS. Presto: distributed machine learning and graph processing with sparse matrices. In: EuroSys; 2013. p. 197–210.

31. Chen R, Shi J, Chen Y, Chen H. PowerLyra: differentiated graph computation and partitioning on skewed graphs. In: EuroSys; 2015. p. 1:1–1:15.

32. Dave A, Jindal A, Li LE, Xin R, Gonzalez J, Zaharia M. GraphFrames: an integrated API for mixing graph and relational queries. In: GRADES; 2016. p. 2.

33. Fan W, Xu J, Wu Y, Yu W, Jiang J. GRAPE: Parallelizing Sequential Graph Computations. PVLDB. 2017;10(12):1889–1892.

34. Fan W, Xu J, Wu Y, Yu W, Jiang J, Zheng Z, et al. Parallelizing Sequential Graph Computations. In: SIGMOD; 2017. p. 495–510.

35. Sahu S, Mhedhbi A, Salihoglu S, Lin J, Özsu MT. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB. 2017;11(4):420–431.

36. Borkar VR, Carey MJ, Grover R, Onose N, Vernica R. Hyracks: A flexible and extensible foundation for data-intensive computing. In: ICDE; 2011. p. 1151–1162.

37. Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. In: OSDI; 2004. p. 137–150.

38. Kang U, Tong H, Sun J, Lin C, Faloutsos C. GBASE: a scalable and general graph management system. In: KDD; 2011. p. 1091–1099.

39. Kang U, Tong H, Sun J, Lin CY, Faloutsos C. GBASE: an efficient analysis platform for large graphs. VLDB J. 2012;21(5):637–650. doi: 10.1007/s00778-012-0283-9

40. Qin L, Yu JX, Chang L, Cheng H, Zhang C, Lin X. Scalable big graph processing in MapReduce. In: SIGMOD; 2014. p. 827–838.

41. Yan D, Huang Y, Liu M, Chen H, Cheng J, Wu H, et al. GraphD: Distributed Vertex-Centric Graph Processing Beyond the Memory Limit. IEEE Trans Parallel Distrib Syst. 2018;29(1):99–114. doi: 10.1109/TPDS.2017.2743708

42. Kang U, Faloutsos C. Beyond’Caveman Communities’: Hubs and Spokes for Graph Compression and Mining. In: 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011; 2011. p. 300–309.

43. Lim Y, Kang U, Faloutsos C. SlashBurn: Graph Compression and Mining beyond Caveman Communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–3089. doi: 10.1109/TKDE.2014.2320716

44. Elgohary A, Boehm M, Haas PJ, Reiss FR, Reinwald B. Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB. 2016;9(12):960–971.

45. Liakos P, Papakonstantinopoulou K, Delis A. Memory-Optimized Distributed Graph Processing through Novel Compression Techniques. In: CIKM; 2016. p. 2317–2322.

46. Liakos P, Papakonstantinopoulou K, Delis A. Realizing Memory-Optimized Distributed Graph Processing. IEEE Trans Knowl Data Eng. 2018;30(4):743–756. doi: 10.1109/TKDE.2017.2779797

47. Anderson MJ, Sundaram N, Satish N, Patwary MMA, Willke TL, Dubey P. GraphPad: Optimized Graph Primitives for Parallel and Distributed Platforms. In: IPDPS; 2016. p. 313–322.

48. Ahmad Y, Khattab O, Malik A, Musleh A, Hammoud M, Kutlu M, et al. LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System. PVLDB. 2018;11(8):920–933.

49. Stutz P, Bernstein A, Cohen WW. Signal/Collect: Graph Algorithms for the (Semantic) Web. In: ISWC; 2010. p. 764–780.

50. Yan D, Cheng J, Lu Y, Ng W. Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs. PVLDB. 2014;7(14):1981–1992.

51. Tian Y, Balmin A, Corsten SA, Tatikonda S, McPherson J. From “Think Like a Vertex” to “Think Like a Graph”. PVLDB. 2013;7(3):193–204.

52. Sundaram N, Satish N, Patwary MMA, Dulloor S, Anderson MJ, Vadlamudi SG, et al. GraphMat: High performance graph analytics made productive. PVLDB. 2015;8(11):1214–1225.

53. Kalavri V, Vlassov V, Haridi S. High-Level Programming Abstractions for Distributed Graph Processing. IEEE Trans Knowl Data Eng. 2018;30(2):305–324. doi: 10.1109/TKDE.2017.2762294

54. Andreev K, Räcke H. Balanced graph partitioning. In: SPAA; 2004. p. 120–124.

55. Bourse F, Lelarge M, Vojnovic M. Balanced graph edge partition. In: KDD; 2014. p. 1456–1465.

56. Hoque I, Gupta I. LFGraph: simple and fast distributed graph analytics. In: TRIOS@SOSP; 2013. p. 9:1–9:17.

57. Kwak H, Lee C, Park H, Moon SB. What is Twitter, a social network or a news media? In: WWW; 2010. p. 591–600.

58. Jeon B, Jeon I, Kang U. TeGViz: Distributed Tera-Scale Graph Generation and Visualization. In: ICDMW; 2015. p. 1620–1623.

59. Ching A, Edunov S, Kabiljo M, Logothetis D, Muthukrishnan S. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB. 2015;8(12):1804–1815.

60. Chakrabarti D, Zhan Y, Faloutsos C. R-MAT: A Recursive Model for Graph Mining. In: SDM; 2004. p. 442–446.

61. Zhang Y, Kiriansky V, Mendis C, Amarasinghe SP, Zaharia M. Making caches work for graph analytics. In: IEEE BigData; 2017. p. 293–302.

62. Mukkara A, Beckmann N, Abeydeera M, Ma X, Sánchez D. Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling. In: MICRO; 2018. p. 1–14.

63. Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams D, Kalnis P. Mizan: a system for dynamic load balancing in large-scale graph processing. In: EuroSys; 2013. p. 169–182.

64. Suri S, Vassilvitskii S. Counting triangles and the curse of the last reducer. In: WWW; 2011. p. 607–614.


Článek vyšel v časopise

PLOS One


2020 Číslo 1