8

Java中大集合<Long>求交集的方法比较

 2 years ago
source link: https://my.oschina.net/cimu/blog/5382401
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Java中大集合求交集的方法比较 - spencer - OSCHINA - 中文开源技术交流社区

项目中使用到List求交集,很容易想到collecion.retainAll()方法,但是在数据量比较大时,这个方法效率并不高。本文研究了几种常用的方法,以供大家参考。

【首先】梳理下思路,List去重一般有几种方法。

  1. 『外层遍历+内层遍历』查找:

复杂度O(NM) ,一般使用contains()检查是否包含

  1. 『外层遍历+内层Hash』查找:

复杂度O(N),一般将内层List转化为HashSet实现

  1. 『外层遍历+内层bitMap』查找:

复杂度O(N),一般将内层List转化为字节映射实现

【其次】这里其实忽略了一个点,就是 『单层遍历』中,检查 元素不包含 时,需要将这个元素移除(即remove方法)。remove时,也会导致性能问题。

这里面我们使用Java8中java.util.AbstractCollection#retainAll方法来验证下我们的思路。

// Java8 中 方法:java.util.AbstractCollection#retainAll
public boolean retainAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;
    Iterator<E> it = iterator();
    
    // 1. 外层遍历
    while (it.hasNext()) {
        
        // 2. 内层查找『是否包含』
        if (!c.contains(it.next())) {
        
            // 3. 不包含时,移除外层元素
            it.remove();
            modified = true;
        }
    }
    return modified;
}

这里用图总结下,求交集的流程:

1638927520245-7b9f2b95-0c3c-45dc-a6bd-502d67369f97.png

1.『外层遍历+内层遍历』查找:

java中常用2种遍历查找的List:ArrayList、LinkedList,在内外层中测试。

// 外层:ArrayList,内层:ArrayList
private void outArrayListInnerArrayList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-ArrayList]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:ArrayList
private void outLinkedListInnerArrayList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}

// 外层:ArrayList,内层:LinkedList
private void outArrayListInnerLinkedList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:LinkedList
private void outLinkedListInnerLinkedList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-LinkedList]RetainAll耗时:" + (end - begin));
}

2.『外层遍历+内层Hash』查找:

java中常用HashSet,内层替换为HashSet查找。

// 外层:ArrayList,内层:HashSet
private void outArrayListInnerHashSet(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    HashSet<Long> setB = new HashSet<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-HashSet]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:HashSet
private void outLinkedListInnerHashSet(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    HashSet<Long> setB = new HashSet<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-HashSet]RetainAll耗时:" + (end - begin));
}

3.『外层遍历+内层bitMap』查找:

BitSet也称作BitMap,它是一种通用的快速数据结构,不幸的是它太费内存,所以通常我们使用压缩位图。RoaringBitmap是一种压缩位置,它提供更好的压缩效果,在某些情况下比其它压缩位图快好几百倍。

https://github.com/RoaringBitmap/RoaringBitmap

RoaringBitmap已经使用在很多知名的开源项目中:

  • Apache Spark
  • Apache Hive
  • Apache Tez
  • Apache Kylin
  • ... ...

Roaringbitmap中在Long类型中,提供了2种实现Roaring64NavigableMapRoaring64BitmapRoaring64NavigableMap基于红黑树实现,Roaring64Bitmap基于ART(The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases )数据结构实现。

那么,外层使用ArrayList、LinkedList,内层使用Roaring64NavigableMap、Roaring64Bitmap。 

// 外层:ArrayList,内层:Roaring64NavigableMap
private void outArrayListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    Roaring64NavigableMap ansB = new Roaring64NavigableMap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:Roaring64NavigableMap
private void outLinkedListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    Roaring64NavigableMap ansB = new Roaring64NavigableMap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}


// 外层:ArrayList,内层:Roaring64Bitmap
private void outArrayListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    Roaring64NavigableMap ansB = new Roaring64NavigableMap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:Roaring64Bitmap
private void outLinkedListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    Roaring64Bitmap ansB = new Roaring64Bitmap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}

使用Mac Pro 2021 M1 + JDK8测试,100万级的数据太慢,实行中没有太大参考意义,有兴趣可以自行测试。

说明:由于受数据,以及电脑本身负载影响,测试结果可能不一致,仅做量级参考。

 

查找方法(外层-内层)

1万(毫秒)

10万(毫秒)

20万(毫秒)

50万(毫秒)

『外层遍历+内层遍历』查找

ArrayList-ArrayList

26280

264133

LinkedList-ArrayList

27961

272362

LinkedList-ArrayList

21330

260976

LinkedList-LinkedList

23251

252472

『外层遍历+内层Hash』查找

ArrayList-HashSet

LinkedList-HashSet

『外层遍历+内层bitMap』查找

ArrayList-Roaring64NavigableMap

LinkedList-Roaring64NavigableMap

ArrayList-Roaring64Bitmap

LinkedList-Roaring64Bitmap

pom.xml

<dependency>
  <groupId>org.roaringbitmap</groupId>
  <artifactId>RoaringBitmap</artifactId>
  <version>0.9.23</version>
</dependency>

完整测试代码

import org.apache.commons.lang.math.RandomUtils;
import org.junit.Test;
import org.roaringbitmap.longlong.Roaring64Bitmap;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class SetOperation {

    /**
     * 集合的运算方法用时测试
     */
    @Test
    public void setOperation() {
        int size = 50_0000;
        List<Long> listA = new ArrayList<>(size);
        List<Long> listB = new ArrayList<>(size);

        initData(size, listA, listB);

        //『外层遍历+内层遍历』查找
        System.out.println("1. 『外层遍历+内层遍历』查找");
        outArrayListInnerArrayList(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerArrayList(new ArrayList<>(listA), new ArrayList<>(listB));
        outArrayListInnerLinkedList(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerLinkedList(new ArrayList<>(listA), new ArrayList<>(listB));

        //『外层遍历+内层Hash』查找:
        System.out.println("2.『外层遍历+内层Hash』查找:");
        outArrayListInnerHashSet(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerHashSet(new ArrayList<>(listA), new ArrayList<>(listB));

        //『外层遍历+内层bitMap』查找
        System.out.println("3.『外层遍历+内层bitMap』查找");
        outArrayListInnerRoaring64NavigableMap(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerRoaring64NavigableMap(new ArrayList<>(listA), new ArrayList<>(listB));
        outArrayListInnerRoaring64Bitmap(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerRoaring64Bitmap(new ArrayList<>(listA), new ArrayList<>(listB));
    }

    // 外层:ArrayList,内层:ArrayList
    private void outArrayListInnerArrayList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-ArrayList]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:ArrayList
    private void outLinkedListInnerArrayList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
    }

    // 外层:ArrayList,内层:LinkedList
    private void outArrayListInnerLinkedList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:LinkedList
    private void outLinkedListInnerLinkedList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-LinkedList]RetainAll耗时:" + (end - begin));
    }

    // 外层:ArrayList,内层:HashSet
    private void outArrayListInnerHashSet(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        HashSet<Long> setB = new HashSet<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-HashSet]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:HashSet
    private void outLinkedListInnerHashSet(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        HashSet<Long> setB = new HashSet<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-HashSet]RetainAll耗时:" + (end - begin));
    }


    // 外层:ArrayList,内层:Roaring64NavigableMap
    private void outArrayListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        Roaring64NavigableMap ansB = new Roaring64NavigableMap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:Roaring64NavigableMap
    private void outLinkedListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        Roaring64NavigableMap ansB = new Roaring64NavigableMap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
    }


    // 外层:ArrayList,内层:Roaring64Bitmap
    private void outArrayListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        Roaring64NavigableMap ansB = new Roaring64NavigableMap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:Roaring64Bitmap
    private void outLinkedListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        Roaring64Bitmap ansB = new Roaring64Bitmap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
    }

    private void initData(int size, List<Long> listA, List<Long> listB) {
        Random random = new Random();
        Random random2 = new Random();
        random.longs(size).forEach(e -> {
            listA.add(e);
            if (random2.nextFloat() > 0.5) {
                listB.add(e);
            } else {
                listB.add(RandomUtils.nextLong());
            }
        });
    }
}



About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK