Hadoop vs Spark Performance Comparison
source link: https://blog.51cto.com/u_2650279/6855138
Based on Spark-0.4 and Hadoop-0.20.2.
1. Kmeans
Data: self-generated 3-D points, scattered around the 8 vertices of a cube (a generator sketch follows the table below):
{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},
{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
Point number | 189,918,082 (about 190 million 3-D points)
Capacity | 10 GB
HDFS Location | /user/LijieXu/Kmeans/Square-10GB.txt
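The generator itself is not included in the post; the following is a minimal sketch of how such a dataset could be produced (the noise scale, the point count, the output file name, and the object name are all assumptions):

```scala
import java.io.PrintWriter
import scala.util.Random

object SquareDataGen {
  // The 8 cube vertices the points are scattered around.
  val vertices: Seq[Seq[Double]] = Seq(
    Seq(0.0, 0.0, 0.0), Seq(0.0, 10.0, 0.0), Seq(0.0, 0.0, 10.0), Seq(0.0, 10.0, 10.0),
    Seq(10.0, 0.0, 0.0), Seq(10.0, 0.0, 10.0), Seq(10.0, 10.0, 0.0), Seq(10.0, 10.0, 10.0)
  )

  def main(args: Array[String]): Unit = {
    val rand = new Random()
    val out = new PrintWriter("Square-sample.txt") // written locally, then uploaded to HDFS
    for (_ <- 1 to 1000000) {
      // Pick a vertex at random and add small Gaussian noise around it.
      val v = vertices(rand.nextInt(vertices.length))
      out.println(v.map(_ + rand.nextGaussian() * 0.01).mkString(" "))
    }
    out.close()
  }
}
```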
Program logic:
Read the blocks on HDFS into memory; each block becomes an RDD partition containing vectors. Then map over the RDD, determine the class number each vector (point) belongs to, and emit (K, V) = (class, (Point, 1)), forming a new RDD. Before the reduce, a combine is performed within each partition to compute the per-class center sums locally, so that each partition emits at most K key-value pairs. Finally, a reduce produces a new RDD whose keys are classes and whose values are the center sums; a final map yields the final centers.
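The driver code itself is not reproduced in the post. As a rough illustration of the map → per-partition combine → reduce flow described above, here is a minimal sketch against the current Spark Scala API (the package names differ from Spark 0.4, and the iteration count, sampling of initial centers, and object name are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KMeansSketch {
  // Squared Euclidean distance between two 3-D points.
  def distSq(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to point p.
  def closest(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy(i => distSq(p, centers(i)))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch"))
    // Each HDFS block becomes a partition; parse each line into a 3-D point and cache the RDD.
    val points = sc.textFile("/user/LijieXu/Kmeans/Square-10GB.txt")
      .map(_.split("\\s+").map(_.toDouble))
      .cache()

    // Start from 8 randomly sampled points as the initial centers.
    var centers = points.takeSample(withReplacement = false, num = 8)
    for (_ <- 1 to 6) {
      // Emit (class, (point, 1)); reduceByKey combines within each partition first,
      // so every partition sends at most K pairs into the shuffle.
      val sums = points
        .map(p => (closest(p, centers), (p, 1L)))
        .reduceByKey { case ((s1, c1), (s2, c2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
        }
        .collectAsMap()
      // New center = per-class sum divided by the per-class count.
      centers = sums.values.map { case (sum, cnt) => sum.map(_ / cnt) }.toArray
    }
    println("Final centers: " + centers.map(_.mkString("(", ", ", ")")).mkString("; "))
    sc.stop()
  }
}
```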
First upload the data to HDFS, then run the program on the Master.
Run the Kmeans algorithm iteratively.
160 tasks in total (160 × 64 MB = 10 GB).
32 CPU cores and 18.9 GB of memory were used.
Memory consumption is about 4.5 GB per machine (roughly 40 GB in total): the point data itself takes 10 GB × 2, and the intermediate (K, V) => (int, (vector, 1)) data produced by the map takes about another 10 GB.
Final result:
At 50 MB/s: 10 GB => 3.5 min
At 10 MB/s: 10 GB => 15 min
Test on 20 GB of data
Point number | 377,370,313 (about 370 million 3-D points)
Capacity | 20 GB
HDFS Location | /user/LijieXu/Kmeans/Square-20GB.txt
Run the test command:
Clustering result obtained:
The result is essentially the 8 expected center points.
Memory consumption: about 5.8 GB per node, roughly 50 GB in total.
Memory analysis:
20 GB of raw input data plus 20 GB of map output.
Per-iteration time once the RDD is cached: 0.93 s
12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition 2:302
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!
Test on 20 GB of data (more iterations)
Number of tasks: 320
First iteration (reading from HDFS) | 100.9 s
Subsequent iterations (RDD already cached) | 0.93 s
Effect of the number of iterations on memory usage:
Essentially none; the main memory consumers are the 20 GB input-data RDD and the 20 GB of intermediate data.
Final centers: Map(5 -> (-4.728089224526789E-5, 3.17334874733142E-5, -2.0605806380414582E-4), 8 -> (1.1841686358289191E-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 10.000199556926772, -2.0695123602840933E-4), 3 -> (-1.3506815993198176E-4, 9.999948270638338, 2.328148782609023E-5), 4 -> (3.2493629851483764E-4, -7.892413981250518E-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192E-6, 7.590402882208648E-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2 -> (9.999958673426654, -1.1917651103354863E-4, 9.99990217533504))
Result visualization
2. HdfsTest
Test logic:
First, read a text file from HDFS into file.
Then compute the character count of each line of file and keep it in the in-memory RDD mapped.
Then read every character count in mapped, add 2 to it, and measure the time taken for the read + add.
There is only a map, no reduce.
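The example program is not shown in the post; here is a minimal sketch of the logic described above, written against the current Spark Scala API (the input path, the number of passes, and the object name are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsTestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsTestSketch"))
    // Read the text file from HDFS, then keep the per-line character counts in memory.
    val file = sc.textFile("/user/LijieXu/Wiki-10GB.txt") // path is an assumption
    val mapped = file.map(_.length).cache()

    // Repeatedly scan the cached RDD: add 2 to every count and time each pass.
    for (pass <- 1 to 10) {
      val start = System.currentTimeMillis()
      mapped.map(_ + 2).count() // map only, no reduce; count() just forces the pass
      val elapsed = (System.currentTimeMillis() - start) / 1000.0
      println(s"Pass $pass: $elapsed s")
    }
    sc.stop()
  }
}
```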
Test on 10 GB of Wiki data.
What is actually being measured is RDD read performance.
Test results:
Memory consumption is 2.7 GB per node (9.4 GB × 3 in total).
Test on 90 GB of RandomText data
| Pass | Time |
| --- | --- |
| 1 | 111.905310882 s |
| 2 | 4.681715228 s |
| 3 | 4.469296148 s |
| 4 | 4.441203887 s |
| 5 | 1.999792125 s |
| 6 | 2.151376037 s |
| 7 | 1.889345699 s |
| 8 | 1.847487668 s |
| 9 | 1.827241743 s |
| 10 | 1.747547323 s |
Total memory consumption is about 30 GB.
Resource consumption of a single node:
3. WordCount test
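The contents of the jar are not reproduced in the post; the following is a minimal, unsorted WordCount sketch against the current Spark Scala API (the input and output paths and the object name are assumptions; Spark 0.4 used different package names):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch"))
    // Split each line into words, emit (word, 1), and sum the counts per word.
    val counts = sc.textFile("/user/LijieXu/Wiki-10GB.txt") // input path is an assumption
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
    // The counts are written out unsorted, matching the timing reported below.
    counts.saveAsTextFile("/user/LijieXu/WordCount-output") // output path is an assumption
    sc.stop()
  }
}
```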
Package it as mySpark.jar and upload it to /opt/spark/newProgram on the Master.
Run the program:
Mesos automatically copies the jar to the executor nodes and then runs it.
Memory consumption: 10 GB input file + 10 GB for the flatMap output + 15 GB for the map's intermediate (word, 1) results.
It is unclear where some of the remaining memory went.
Elapsed time: 50 sec (results not sorted).
Hadoop WordCount elapsed time: 120 to 140 sec.
Results are unsorted.
Single node:
Hadoop tests
Kmeans
Run the Kmeans implementation from Mahout.
Resource consumption of one slave while running "Canopy Driver running buildClusters over input: output/data" (320 map tasks, 1 reduce task):
Completed Jobs
| Jobid | Name | Map Total | Reduce Total | Time |
| --- | --- | --- | --- | --- |
| job_201206050916_0029 | Input Driver running over input: /user/LijieXu/Kmeans/Square-10GB.txt | | | |
| job_201206050916_0030 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | | | |
| job_201206050916_0031 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | | | |
| job_201206050916_0032 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | | | |
| job_201206050916_0033 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | | | |
| job_201206050916_0034 | KMeans Driver running runIteration over clustersIn: output/clusters-4 | | | |
| job_201206050916_0035 | KMeans Driver running runIteration over clustersIn: output/clusters-5 | | | |
| job_201206050916_0036 | KMeans Driver running clusterData over input: output/data | | | |
| job_201206050916_0037 | Input Driver running over input: /user/LijieXu/Kmeans/Square-20GB.txt | | | 1 min 31 s |
| job_201206050916_0038 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | | | 1 min 46 s |
| job_201206050916_0039 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | | | 1 min 46 s |
| job_201206050916_0040 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | | | 1 min 46 s |
| job_201206050916_0041 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | | | 1 min 47 s |
| job_201206050916_0042 | KMeans Driver running clusterData over input: output/data | | | 1 min 34 s |
Resource consumption from running Kmeans several times on the 10 GB and 20 GB data:
Hadoop WordCount test
Running Spark interactively
Go to /opt/spark on the Master.
Start the Mesos version of Spark.
The framework can then be seen at master:8080.
Active Frameworks
| ID | User | Name | Running Tasks | CPUs | MEM | Max Share | Connected |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 201206050924-0-0018 | | Spark shell | | | 0.0 MB | | 2012-06-06 21:12:56 |
Because of GC issues, very large datasets cannot be cached.
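The interactive session itself is shown only as screenshots in the original post; below is a minimal sketch of the kind of commands one might type at the spark-shell prompt to exercise the RDD cache (the path is an assumption):

```scala
// Typed at the spark-shell prompt; sc is the SparkContext the shell provides.
val file = sc.textFile("/user/LijieXu/Kmeans/Square-10GB.txt") // path is an assumption
val cached = file.cache()
cached.count() // first pass reads from HDFS and populates the in-memory cache
cached.count() // later passes read from the cached RDD and are much faster
```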