HashMap 源码分析-白红宇

HashMap 源码分析

阅读量：5302 次

发布时间：2019-06-14

本文共 12000 字，大约阅读时间需要 40 分钟。

前言

　　关于 HashMap 源码分析本不想写，因为网上的源码分析已经烂大街了。后来想想还是写吧，原因是：看别人的东西一看就懂，但是凡事要用自己的理解，加上自己最认可的语言组织起来，等到下一次回来看是就不会显得那么难理解，这也是我坚持用自己的理解写博客的初衷。

首先来看一下 HashMap 的数据结构

图 1-1: HashMap 数据结构

我们知道在 Java 中最常用的两种结构是数组和模拟指针(引用)，几乎所有的数据结构都可以利用这两种来组合实现。数组的存储方式在内存的地址是连续的，大小固定，一旦分配不能被其他引用占用。它的特点是查询快，时间复杂度是 O(1)（具体的时间复杂度怎么计算参照维基百科），插入和删除的操作比较慢，时间复杂度是 O(n)，链表的存储方式是非连续的，大小不固定，特点与数组相反，插入和删除快，查询速度慢。HashMap 可以说是一种折中的方案吧（它是链表加数组）。

看源码之前先看几个对于 HashMap 来说很重要的变量

关于源码分析的顺序，我是按照使用顺序来进行排序的

首先

HashMap<> hashMap = new HashMap<>();

public HashMap() {　　this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);}

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

static final float DEFAULT_LOAD_FACTOR = 0.75f;

public HashMap(int initialCapacity, float loadFactor) {        if (initialCapacity < 0)            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity); if (initialCapacity > MAXIMUM_CAPACITY)            initialCapacity = MAXIMUM_CAPACITY; if (loadFactor <= 0 || Float.isNaN(loadFactor))            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);        this.loadFactor = loadFactor;        threshold = initialCapacity;        init();    }

static public boolean isNaN(float v) { // 就是判断这个数是不是 float 类型的        return (v != v);    }

// 附上自己的测试public class Test{    public static void main(String[] args)    {        int i = 1;        boolean flag = Float.isNaN(i);        System.out.println(flag);  // false    }}

int threshold; // 阈值是可更改的，所以不是 final 的

Put ( K,V ) 方法

1 public V put(K key, V value) { 2         if (table == EMPTY_TABLE) {
      3             inflateTable(threshold); 4         } 5         if (key == null) 6             return putForNullKey(value);  7         int hash = hash(key); 8         int i = indexFor(hash, table.length); 9         for (Entry
     
       e = table[i]; e != null; e = e.next) {10             Object k;11             if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {12                 V oldValue = e.value;13                 e.value = value;14                 e.recordAccess(this);15                 return oldValue;16             }17         }18 19         modCount++;20         addEntry(hash, key, value, i);21         return null;22     }

static final Entry
     [] EMPTY_TABLE = {}; transient Entry
     
      [] table = (Entry
      
       []) EMPTY_TABLE;

static class Entry
     
       implements Map.Entry
      
        {        final K key;        V value;        Entry
       
         next;        int hash;    }

/**  * Inflates the table.  */    private void inflateTable(int toSize) {        // Find a power of 2 >= toSize        int capacity = roundUpToPowerOf2(toSize); // toSize 传进来到是阈值 threshold，开始 threshold = 0        threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);        table = new Entry[capacity];        initHashSeedAsNeeded(capacity); // 暂时不分析    }

private static int roundUpToPowerOf2(int number) {        // assert number >= 0 : "number must be non-negative";        return number >= MAXIMUM_CAPACITY ? MAXIMUM_CAPACITY : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;}

//该函数实现获取指定 int 数的二进制数中 1 出现的最高位 public static int highestOneBit(int i) {        // HD, Figure 3-1        i |= (i >>  1);        i |= (i >>  2);        i |= (i >>  4);        i |= (i >>  8);        i |= (i >> 16);        return i - (i >>> 1);    }

截止到这里上面是 inflateTable(threshold) 方法的分析，其实里面就做了两件事：① 创建了一个长度为 2 的幂的 Entry 数组 ② 更改了 threshold 阈值（注意：这个阈值的计算只要是不超过 MAXIMUM_CAPACITY+1 的话就永远都是 capacity * 0.75）

threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);table = new Entry[capacity];

接下来我们看一下 put 方法的时候 key 为 null 的情况，对 table 数组做了哪些操作

private V putForNullKey(V value) {        for (Entry
     
       e = table[0]; e != null; e = e.next) {            if (e.key == null) {                V oldValue = e.value;                e.value = value;                e.recordAccess(this);                return oldValue;            }        }        modCount++;        addEntry(0, null, value, 0);        return null;    }

遍历 table[0] 这个链表，如果 table[0] 的位置上有 Entry 的 key 为 null 的话，则把新的 value 赋给这个 Entry。（在这里说明一个问题，整个 hashmap 中，即整个数组 + 链表这个结构中 key 为 null 的只有可能在 table[0] 这个位置，且只有一个 key 为 null 的 Entry；还要说明一点，千万不要以为 table[0] 只能存储 key 为 null 的 Entry，它也是可以存储正常的 Entry 的。）

如果 table[0] 中没有找到的话（不管是 table[0] 这个位置根本就是空的，还是 table[0] 这个下面的链表没有 key 为 null 的，反正是遍历 table[0] 这条链表没找到），在 table[0] 这个位置新添加一个 Entry

void addEntry(int hash, K key, V value, int bucketIndex) {        if ((size >= threshold) && (null != table[bucketIndex])) {  // 由这里可以看出并不是 hashMap 的 size 的个数达到数组             resize(2 * table.length);            hash = (null != key) ? hash(key) : 0;            bucketIndex = indexFor(hash, table.length);        }        createEntry(hash, key, value, bucketIndex);}

我们先单从添加一个普通的 Entry 来说

当存放元素的数量大于或等于阈值，该 bucketIndex 下的表位置不为空的时候就会扩容。也就是说即使存放元素个数的数量大于或等于阈值，只要本次准备插入的该 bucketIndex 下的表位置还为空，它就仍然不会进行扩容。（这里具体原因先放在这里。所以，可能情况下 size 的确有可能出现大于 threshold 的情况。但是这并不影响）

void resize(int newCapacity) {        Entry[] oldTable = table;        int oldCapacity = oldTable.length;        if (oldCapacity == MAXIMUM_CAPACITY) {            threshold = Integer.MAX_VALUE;            return;        }        Entry[] newTable = new Entry[newCapacity];  // 正常情况下以以前的数组长度的两倍进行扩容        transfer(newTable, initHashSeedAsNeeded(newCapacity));        table = newTable;        threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);    }

/**  * Transfers all entries from current table to newTable.  */    void transfer(Entry[] newTable, boolean rehash) {        int newCapacity = newTable.length;        for (Entry
     
       e : table) {　　// 遍历 table 中的 Entry            while(null != e) {　　　　　 // 遍历每一个单链 Entry                Entry
      
        next = e.next;                  if (rehash) {                    e.hash = null == e.key ? 0 : hash(e.key);                }                int i = indexFor(e.hash, newCapacity);　　// 重新计算索引                e.next = newTable[i];    // 拿到单链中的第一个 entry 的时候，将它的 next 置空，也就是把这个 Entry 和它的 next 链断开                newTable[i] = e;　　　　  // 把 entry 放到新数组的位置                e = next;                // 将 next 赋给 e，继续遍历            }        }    }

这里的后半部分的意思就是先把单链表中的每一个 Entry 抽出来，单独放到新的 table 数组中，目的就是想让以前的单链表中的 Entry 尽量单独存在于新的表中，而不是以多个链表的的形式存在，从而提高 HashMap 的性能。

这里我有一个疑问：难道把以前的链表中的 Entry 都以一个一个 table 中的元素存放到新的表中，新表中就没有链表吗？其实现在看来这主要还是取决于 indexFor 计算出来的索引，如果两个 Entry 计算出来的索引一样，那么就会出现问题。

void createEntry(int hash, K key, V value, int bucketIndex) {        Entry
     
       e = table[bucketIndex];  // 初始化索引为 bucketIndex 的表位置        table[bucketIndex] = new Entry<>(hash, key, value, e); // 初始化 Entry，可能会引发多线程并发问题        size++;}

到这里我们 put 方法所涉及到的基本操作分析的差不多了。下面来分析 get 方法。

public V get(Object key) {        if (key == null)            return getForNullKey();         Entry
     
       entry = getEntry(key);        return null == entry ? null : entry.getValue();}

private V getForNullKey() {        if (size == 0) {            return null;        }        for (Entry
     
       e = table[0]; e != null; e = e.next) {  // 遍历 table[0] 单链表            if (e.key == null)                return e.value;        }        return null;}

final Entry
     
       getEntry(Object key) {
             if (size == 0) {            return null;        }        int hash = (key == null) ? 0 : hash(key);　　// 检测 key 是否为空，是则返回 0；否则返回 key 的 hash 码                for (Entry
      
        e = table[indexFor(hash, table.length)]; e != null; e = e.next) { // 根据 hash 码和表长度获取索引，从 table 中取出 entry             Object k;

　　　　　　　//检测 hash 是否相同，key 的内存地址是否相等，key 是否为 null，key 的 equals 方法返回值是否为 true（之所以要比较这个是因为可以通过重写 equals 实现两个不同内存地址的对象返回 true 值）。             if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))                return e;        }        return null;    }

总结

到这里 hashMap 的源码分析就基本结束了。

hashMap 有几点问题要说明

1、hashMap 中的 table 数组为什么要是 2 的幂次方？

// 这个 hash 算法是经过海量的计算机算出来的 final int hash(Object k) {        int h = hashSeed;        if (0 != h && k instanceof String) {            return sun.misc.Hashing.stringHash32((String) k);        }        h ^= k.hashCode();                h ^= (h >>> 20) ^ (h >>> 12);        return h ^ (h >>> 7) ^ (h >>> 4);    }

indexFor 算出来的就是 Entry 要放的在 table 数组中的索引的位置

static int indexFor(int h, int length) {        // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";        return h & (length-1);}

其实就是 key 的 hash 算法的值和数组的长度减1进行相与。有人说这样能保证与出来的结果不会越界。结论没错，但是并不是数组的长度必须是 2 的次幂的原因。真正原因是它能够保证链表尽可能的少。2 的次幂一定是偶数，那么其 length - 1 一定是奇数。假设现在数组的长度 length 为 16，length - 1 就是 15，15 对应的二进制是：1111。现在假设有两个元素需要插入，一个哈希值是 8，二进制是 1000，一个哈希值是 9，二进制是1001。和 1111 “与”运算后，结果分别是 1000 和 1001，它们被分配在了数组的不同位置，没有产生链表。那么，如果数组长度是奇数呢？length - 1 就是偶数了，偶数对应的二进制最低位一定是 0，例如 14 二进制 1110。对上面两个数子分别“与”运算，得到 1000 和 1000。结果都是一样的值。那么，哈希值 8 和 9 的元素都被存储在数组同一个 bucketIndex 位置，就会产生链表。在操作的时候，链表中的元素越多，效率越低，因为要不停的对链表循环比较，这就是为什么数组长度要为 2 的次幂。

2、hashMap 是线程不安全的，那么究竟是哪里会出现不安全的地方呢？

第一处：

void createEntry(int hash, K key, V value, int bucketIndex) {        Entry
     
       e = table[bucketIndex];        table[bucketIndex] = new Entry<>(hash, key, value, e);        size++;}

如果多个线程同时进行 createEntry 的时候，如果恰好两个 key 的 hash 值相同，这个时候两个工作线程如果恰好都取到了对应位置的头结点，都分别创建了一个 Entry，并且这两个新创建的 Entry 的 next 都是头结点（注意这些操作都是在两个线程自己的工作内存中做的），然后再把创建的两个Entry 赋给主内存中的 table[bucketIndex]，很明显，没有同步，自然在工作内存和主内存进行交互的时候没有 lock 锁，所以，其中必然会有一个值丢失。

第二处：

void resize(int newCapacity) {        Entry[] oldTable = table;        int oldCapacity = oldTable.length;        if (oldCapacity == MAXIMUM_CAPACITY) {            threshold = Integer.MAX_VALUE;            return;        }        Entry[] newTable = new Entry[newCapacity];        transfer(newTable, initHashSeedAsNeeded(newCapacity));        table = newTable;        threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);    }    /**     * Transfers all entries from current table to newTable.     */    void transfer(Entry[] newTable, boolean rehash) {        int newCapacity = newTable.length;        for (Entry
     
       e : table) {            while(null != e) {                Entry
      
        next = e.next;                if (rehash) {                    e.hash = null == e.key ? 0 : hash(e.key);                }                int i = indexFor(e.hash, newCapacity);                e.next = newTable[i];                newTable[i] = e;                e = next;            }        }    }

当多个线程同时调用 resize 操作，各自生成新的数组并 rehash 后赋给该 map 底层的数组 table，结果最终只有最后一个线程生成的新数组被赋给 table 变量，其他线程的均会丢失。而且当某些线程已经完成赋值而其他线程刚开始的时候，就会使用用已经被赋值的 table 作为原始数组，这样也会有问题。

上面仅仅是 addEntry 的时候会造成线程不安全，那么相对应的删除的时候也会出现诸多线程不安全的情况。

3、hashMap 既然是线程不安全的，那么怎么解决这个问题呢？

1、使用 HashTable（其实就是把 hashMap 中的方法都加上了 synchronized）

2、外部包装：

Map
     
       map = Collections.synchronizedMap(new HashMap
      
       ());

我们看一下 Collections 部分源码

private static class SynchronizedMap
     
       implements Map
      
       , Serializable {        private static final long serialVersionUID = 1978198479659022715L;        private final Map
       
         m;       // Backing Map        final Object      mutex;        // Object on which to synchronize        SynchronizedMap(Map
        
          m) {            if (m==null)                throw new NullPointerException();            this.m = m;            mutex = this;        }        SynchronizedMap(Map
         
           m, Object mutex) { // 无论你在这里指定不指定锁，下面所有的方法使用的都是一个对象锁，所以和 hashtable 中所有方法加上锁并无区别            this.m = m;            this.mutex = mutex;        }　　　　 public V get(Object key) {            synchronized (mutex) {
    　　　　　　　　return m.get(key); 　　　　　　　}        }        public V put(K key, V value) {            synchronized (mutex) {
    　　　　　　　　return m.put(key, value); 　　　　　　　}        } }