初步了解HashMap

定义
工作原理
put和get
- put
- get
总结
参考

　　在工作中用到HashMap的地方还是特别多的，趁工作之余，初步了解一下HashMap的工作原理，可能了解的也不够深入，需要慢慢研究。

定义

　　先看源码中的Java7.0中的HashMap类，它继承了AbstractMap，实现了Map、Cloneable和Serializable接口。

public class HashMap<K,V>
    extends AbstractMap<K,V>
    implements Map<K,V>, Cloneable, Serializable

属性

　　列举HashMap中的几个参数，比较重要的是初始容量DEFAULT_INITIAL_CAPACITY和加载因子DEFAULT_LOAD_FACTOR。

/**
     * The default initial capacity - MUST be a power of two.默认初始容量为16，必须为2的幂
     */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.最大容量为2的30次方
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /**
     * The load factor used when none specified in constructor.默认加载因子0.75
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * An empty table instance to share when the table is not inflated.
     */
    static final Entry<?,?>[] EMPTY_TABLE = {};

    /**
     * The table, resized as necessary. Length MUST Always be a power of two.
     */
    transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

    /**
     * The number of key-value mappings contained in this map.map中元素个数
     */
    transient int size;

    /**
     * The next size value at which to resize (capacity * load factor).
     * @serial
     */
    // If table == EMPTY_TABLE then this is the initial capacity at which the
    // table will be created when inflated.扩容临界值
    int threshold;

　　先提一句，transient关键字表示不需要持久化的变量，即不会随类进行序列化。然后暂时没想明白EMPTY_TABLE和table关系，看注释和源码，貌似table == EMPTY_TABLE就会进行扩容。但是这是比较什么，没看明白(后来再阅读源码才明白：EMPTY_TABLE就是一个空的Entry数组，table == EMPTY_TABLE的判断出现在put和putAll方法里，作用是判断HashMap目前的table是否为长度为0的空数组，构造函数除了HashMap(Map)这个以外，都是在调用put或者putAll方法时才构建初始容量的table)。最后，默认初始容量表示调用默认构造函数时，HashMap的数组的初始容量。加载因子表示需要扩容的HashMap的数组装载程度，举例来说就是能装16个元素的数组在装了12个元素时，就该扩容了。为什么设置这样一个值，就需要了解一下HashMap的数据结构。

数据结构

　　HashMap的底层实现还是数组，它是一种桶结构，由数组和链表组成，数组的每一项都是一个链表。这样做的理由是：数组虽然查询效率很高，但是插入元素的效率很低。通过key进行hash运算，在数组中找到相应的项，然后遍历对应的链表，比较key值进行插入元素。这样能提高元素插入的效率，同时也将查找的效率维持在O(1+a)，a为链表长度。转一张数据结构图，

HashMap数据结构图

构造函数

　　HashMap有4个构造函数，源码如下：

/**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);

        this.loadFactor = loadFactor;
        threshold = initialCapacity;
        init();
    }

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and the default load factor (0.75).
     *
     * @param  initialCapacity the initial capacity.
     * @throws IllegalArgumentException if the initial capacity is negative.
     */
    public HashMap(int initialCapacity) {
        this(initialCapacity, DEFAULT_LOAD_FACTOR);
    }

    /**
     * Constructs an empty <tt>HashMap</tt> with the default initial capacity
     * (16) and the default load factor (0.75).
     */
    public HashMap() {
        this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
    }

    /**
     * Constructs a new <tt>HashMap</tt> with the same mappings as the
     * specified <tt>Map</tt>.  The <tt>HashMap</tt> is created with
     * default load factor (0.75) and an initial capacity sufficient to
     * hold the mappings in the specified <tt>Map</tt>.
     *
     * @param   m the map whose mappings are to be placed in this map
     * @throws  NullPointerException if the specified map is null
     */
    public HashMap(Map<? extends K, ? extends V> m) {
        this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1,
                      DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
        inflateTable(threshold);

        putAllForCreate(m);
    }

　　根据参数能很容易明白该构造函数的意思，这里要提一下就是：因为HashMap每次扩容都需要重新计算每个元素的hash值然后重新插入，效率十分低。所以在知道元素个数的情况下，请尽量在构造的时候传入初始容量，传入的数会被转化为2的幂。为什么一定要是2的幂呢，那这个要求就和HashMap的工作原理相关了。

工作原理

　　因为HashMap的底层实现是数组，而数组的每一项都是链表。在HashMap中，存放的元素的类型为Entry<K, V>，它是HashMap的内部类。其源码如下：

static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        int hash;

        /**
         * Creates new entry.
         */
        Entry(int h, K k, V v, Entry<K,V> n) {
            value = v;
            next = n;
            key = k;
            hash = h;
        }

        public final K getKey() {
            return key;
        }

        public final V getValue() {
            return value;
        }

        public final V setValue(V newValue) {
            V oldValue = value;
            value = newValue;
            return oldValue;
        }

        public final boolean equals(Object o) {
            if (!(o instanceof Map.Entry))
                return false;
            Map.Entry e = (Map.Entry)o;
            Object k1 = getKey();
            Object k2 = e.getKey();
            if (k1 == k2 || (k1 != null && k1.equals(k2))) {
                Object v1 = getValue();
                Object v2 = e.getValue();
                if (v1 == v2 || (v1 != null && v1.equals(v2)))
                    return true;
            }
            return false;
        }

        public final int hashCode() {
            return Objects.hashCode(getKey()) ^ Objects.hashCode(getValue());
        }

        public final String toString() {
            return getKey() + "=" + getValue();
        }

        /**
         * This method is invoked whenever the value in an entry is
         * overwritten by an invocation of put(k,v) for a key k that's already
         * in the HashMap.
         */
        void recordAccess(HashMap<K,V> m) {
        }

        /**
         * This method is invoked whenever the entry is
         * removed from the table.
         */
        void recordRemoval(HashMap<K,V> m) {
        }
    }

　　重要的东西为该类里的四个属性：key、value、next和hash。暂时不太清楚hash是干嘛用的，key和value就是键和值，next这个属性指的是下一个元素。在使用put方法往HashMap里面插入元素的时候，会先计算key的hash值，然后通过该hash值与数组的长度进行&运算，获得将要存放该Entry的数组元素下标。最后遍历该数组元素的链表，如果存在相同的key，则直接用value值覆盖原value，否则插入该Entry到链表头，将next指向原链表头，没有则指向null。而使用get方法从HashMap取元素时，也会先计算key的hash值，再与数组的长度进行&运算找到数组元素下标，然后遍历对应链表去寻找key对应的Entry。但hash值与数组的长度进行&运算有可能使所有的元素都插入到某一个数组元素对应的链表上，这样就会造成查询的效率十分低，所以key的hash值的计算数组的长度的取值就非常关键。

　　为了使HashMap中的元素尽量散列，HashMap实现了一个“扰动函数”，源码如下：

final int hash(Object k) {
        int h = hashSeed;
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }

        h ^= k.hashCode();

        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

　　再提一下，这里查看的是Java7的源码，Java8已经变了很多，但是原理都是一致的。至于为什么要使用这个“扰动函数”和数组的长度为什么一定要是2的幂，我解释不清楚，只晓得是为了让元素散列，具体的解释可以参考。每次扩容时，数组的长度都会被定义为大于当前capacity的最小2的幂，举例说明就是想定义为15，则capacity定义到16，想定义为17，capacity就定义到32。源码如下：

private static int roundUpToPowerOf2(int number) {
        // assert number >= 0 : "number must be non-negative";
        return number >= MAXIMUM_CAPACITY
                ? MAXIMUM_CAPACITY
                : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    /**
     * Inflates the table.
     */
    private void inflateTable(int toSize) {
        // Find a power of 2 >= toSize
        int capacity = roundUpToPowerOf2(toSize);

        threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
        table = new Entry[capacity];
        initHashSeedAsNeeded(capacity);
    }

　　inflateTable(int)就是扩容方法，通过roundUpToPowerOf2(int)方法找到合适的数组长度，然后修改扩容临界值，再创建一个长度为capacity的Entry数组。　

put和get

　　put和get方法是HashMap最重要的两个方法，前面已经提到过这两个方法的实现思想，最后看一看源码。

put

　　put方法的源码如下：

public V put(K key, V value) {
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key);
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

　　put方法中有几个方法，但我不是都清楚。inflateTable(threshold)方法应该是对HashMap扩容用的，threshold也是当前HashMap的加载临界值，因为有些地方没搞懂，这里就不多介绍了。下面一个方法是putForNullKey(value)，这个方法说明HashMap是支持key为null的。hash(key)就是扰动函数，前面已经提过了。下面看看indexFor方法，这个是求数组下标的函数，非常简单，源码如下：

static int indexFor(int h, int length) {
        // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
        return h & (length-1);
    }

　　可以看到就是key的hash值与数组长度-1的&运算。然后是for循环遍历该数组元素上的链表，插入元素。如果存在hash值相同并且key相同，那么就是相同的元素，只需要用新的value覆盖原来的value即可。如果不存在，则调用addEntry(hash, key, value, i)方法。该方法的源码如下：

void addEntry(int hash, K key, V value, int bucketIndex) {
        if ((size >= threshold) && (null != table[bucketIndex])) {
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);
        }

        createEntry(hash, key, value, bucketIndex);
    }

　　从该方法可以看出，如果加入该元素使得数组中的元素数量大于加载临界值，就会先进行扩容。调用createEntry(hash, key, value, bucketIndex)方法就能增加一个元素，源码如下：

void createEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<>(hash, key, value, e);
        size++;
    }

　　从上方法可以看出，先获取table数组上的链表头，然后将创建新元素，将链表头的元素作为新元素的next参数传入，最后把新元素作为链表头。

get

　　相较于put方法，get方法则简单得多，源码如下：

public V get(Object key) {
        if (key == null)
            return getForNullKey();
        Entry<K,V> entry = getEntry(key);

        return null == entry ? null : entry.getValue();
    }

    final Entry<K,V> getEntry(Object key) {
        if (size == 0) {
            return null;
        }

        int hash = (key == null) ? 0 : hash(key);
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k))))
                return e;
        }
        return null;
    }

　　代码不多，就把两个方法放一起了，实现很简单，key为null就返回getForNullKey()，否则就计算hash值，然后找到对应的数组元素链表进行遍历，返回hash值，key都相同的元素。

总结

　　算是简单了解了一下HashMap的实现原理，但是简单了解一下就有这么多的内容，而且有些东西还要继续查阅一下资料才能搞懂，真是不容易。简而言之的话，记住HashMap的底层实现是数组，桶结构，由数组和链表组成，可以理解为“散列链表”，至于扰动函数和hash()方法，知道它们是为了让元素散列就行了。然后，记住put和get方法都是先通过key的hash值和数组长度-1进行&运算确定数组下标，再遍历对应的链表。最后，如果想要提高HashMap的使用效率，就要减少HashMap扩容的次数，最后能在构造HashMap的时候就传入初始容量。