Calculating cryptographic hash functions in Java

A cryptographic hash function is an algorithm that takes a block of data of arbitrary length as input and outputs a short, fixed-length sequence of bits (usually 128 to 512 bits). Ideally, it should have the following properties:

it is very easy to compute the hash value for any message
it is infeasible to generate a message that has a given hash
it is infeasible to modify a message without changing the hash
it is infeasible to generate two different messages with the same hash
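
The fixed output length and the sensitivity to any change in the input can be observed with a minimal sketch like the one below. The class name HashPropertiesDemo and the helper sha256Hex are illustrative names chosen for this example, not part of any library; only MessageDigest from the standard Java API is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch; the class and method names are assumptions made for this example.
class HashPropertiesDemo {

    // Returns the SHA-256 hash of the given text (encoded as UTF-8) as a lowercase hex string.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                // Mask to 0..255 so that negative bytes are printed as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Inputs of different lengths give hashes of the same length (64 hex digits = 256 bits).
        System.out.println(sha256Hex(""));
        System.out.println(sha256Hex("abc"));
        // Changing a single character of the message yields a different hash.
        System.out.println(sha256Hex("abd"));
    }
}
```

Comparing the second and third printed lines shows that a one-character change in the message leaves no visible similarity between the two hashes.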

Cryptographic hash functions are said to be one-way functions: computing the hash value is easy, but recovering the original message from it (or even generating good candidates for it) is computationally infeasible.

Oracle's Java implementation supports the most popular cryptographic hash functions, including:

MD5
SHA-1
SHA-256
SHA-384
SHA-512
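
Which algorithms are actually available depends on the Java version and the installed security providers, so it can be worth checking at run time. The sketch below uses Security.getAlgorithms from the standard API; the class name ListDigestAlgorithms and the helper availableDigests are illustrative names only.

```java
import java.security.Security;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch; the class and method names are assumptions made for this example.
class ListDigestAlgorithms {

    // Returns the names of all MessageDigest algorithms offered by the installed providers, sorted.
    static Set<String> availableDigests() {
        return new TreeSet<>(Security.getAlgorithms("MessageDigest"));
    }

    public static void main(String[] args) {
        // The printed list differs between Java versions and provider configurations.
        for (String name : availableDigests()) {
            System.out.println(name);
        }
    }
}
```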

Using the standard Java API

To calculate a hash, you first have to obtain an instance of the MessageDigest class for the chosen algorithm:

MessageDigest digest = MessageDigest.getInstance("SHA-256");

The available algorithms are listed above; if you request an unsupported one, getInstance throws NoSuchAlgorithmException.
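
One possible way to deal with NoSuchAlgorithmException is to try a list of algorithm names in order of preference and use the first one that is available. This is only a sketch of that idea: the class DigestLookup and the method firstSupported are hypothetical names, and a real application may prefer to fail immediately instead of falling back.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch; the class and method names are assumptions made for this example.
class DigestLookup {

    // Returns a MessageDigest for the first supported name, trying the names in the given order.
    static MessageDigest firstSupported(String... names) {
        for (String name : names) {
            try {
                return MessageDigest.getInstance(name);
            } catch (NoSuchAlgorithmException e) {
                // This algorithm is not available; try the next candidate.
            }
        }
        throw new IllegalArgumentException("None of the requested algorithms is available");
    }

    public static void main(String[] args) {
        // The first name is deliberately bogus, so the lookup falls through to SHA-256.
        MessageDigest digest = firstSupported("NO-SUCH-ALGORITHM", "SHA-256");
        System.out.println("Using " + digest.getAlgorithm());
    }
}
```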

Once you have the instance, you can feed binary data into it using the various update methods, which accept a single byte, a ByteBuffer, an array of bytes or a portion of such an array:

digest.update(buffer, 0, readBytes);

When you are done, call the digest method (optionally passing some final data) to complete the computation and obtain the calculated hash value as an array of bytes.

Alternatively, if your data is short and available in a single array of bytes, you can skip the calls to update and pass the data directly to the digest(byte[]) convenience method. It updates the message digest with the provided data and returns the calculated hash value. Note that there is no digest overload that accepts a ByteBuffer; for a ByteBuffer, call update followed by the no-argument digest.
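
The incremental and the one-shot styles described above produce the same result, which the following minimal sketch illustrates. The class IncrementalDigestDemo and its methods are illustrative names; the chunk size of 4 bytes is an arbitrary value chosen for the demonstration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative sketch; the class and method names are assumptions made for this example.
class IncrementalDigestDemo {

    private static MessageDigest sha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // One-shot style: digest(byte[]) updates the digest and completes it in a single call.
    static byte[] oneShot(byte[] data) {
        return sha256().digest(data);
    }

    // Incremental style: feed the data in chunks (chunkSize must be positive), then call digest().
    static byte[] inChunks(byte[] data, int chunkSize) {
        MessageDigest digest = sha256();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int length = Math.min(chunkSize, data.length - offset);
            digest.update(data, offset, length);
        }
        return digest.digest();
    }

    // ByteBuffer style: update(ByteBuffer) consumes the remaining bytes of the buffer.
    static byte[] viaByteBuffer(byte[] data) {
        MessageDigest digest = sha256();
        digest.update(ByteBuffer.wrap(data));
        return digest.digest();
    }

    public static void main(String[] args) {
        byte[] data = "The quick brown fox".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(oneShot(data), inChunks(data, 4)));
        System.out.println(Arrays.equals(oneShot(data), viaByteBuffer(data)));
    }
}
```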

The complete code that calculates the SHA-256 hash of a file is shown below:

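// Requires: java.io.IOException, java.io.InputStream, java.nio.file.Files,
// java.nio.file.Path, java.security.MessageDigest, java.security.NoSuchAlgorithmException

// Assumed value; the original definition of this constant is not shown in the article.
private static final int BUFFER_SIZE = 8192;

private static void calculateSha256(Path path) throws IOException, NoSuchAlgorithmException {
    byte[] buffer = new byte[BUFFER_SIZE];
    MessageDigest digest = MessageDigest.getInstance("SHA-256");

    try (InputStream is = Files.newInputStream(path)) {
        int readBytes;
        // read() returns -1 once the end of the stream has been reached
        while ((readBytes = is.read(buffer)) != -1) {
            // Feed only the bytes that were actually read in this pass
            digest.update(buffer, 0, readBytes);
        }
    }

    // Complete the computation and print the result in hexadecimal form
    byte[] hashValue = digest.digest();
    System.out.printf("SHA256(%s) = %s%n", path, bytesToHex(hashValue));
}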

The only important thing to notice is that the returned hash value cannot be printed directly; it first has to be converted to a human-readable form, here by the bytesToHex method:

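// Requires: java.util.Formatter
private static String bytesToHex(byte[] hashValue) {
    // Formatter is Closeable, so try-with-resources releases it
    try (Formatter form = new Formatter()) {
        for (byte b : hashValue) {
            // %02x prints each byte as two lowercase hex digits
            form.format("%02x", b);
        }
        return form.toString();
    }
}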

Using the Commons Codec API

As you may have noticed, most of the code above deals with reading the file rather than calculating the hash. In most cases it can be greatly simplified using the DigestUtils class from the Apache Commons Codec library. With it, our code becomes:

private static void calculateSha256CommonsIO(Path path) throws IOException {
    try (InputStream is = Files.newInputStream(path)) {
        byte[] hashValue = DigestUtils.sha256(is);
        System.out.printf("SHA256(%s) = %s%n", path, bytesToHex(hashValue));
    }
}

or even:

private static void calculateSha256CommonsIO2(Path path) throws IOException {
    try (InputStream is = Files.newInputStream(path)) {
        String hashString = DigestUtils.sha256Hex(is);
        System.out.printf("SHA256(%s) = %s%n", path, hashString);
    }
}

There are also overloaded versions of the sha*/md* methods that accept a String or an array of bytes.

Using hash functions on strings

You may wonder how DigestUtils calculates a hash for a string when hash functions operate on bytes rather than characters. The answer is simple: it converts the string to its byte representation using UTF-8 encoding and then calculates the hash of those bytes.

Generally, this works fine, but you may run into problems if some other code (another part of your application or another application you exchange data with) uses a different character encoding or, worse, the platform's default character encoding. Therefore, string hashing has to be used carefully, and the encoding used on both sides should be double-checked.
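
The effect of the character encoding can be shown with the standard API alone. In the sketch below, the class EncodingMattersDemo and the helper sha256Hex are illustrative names; the sample word and the two charsets are arbitrary choices for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch; the class and method names are assumptions made for this example.
class EncodingMattersDemo {

    // Hashes the text after converting it to bytes with the given charset.
    static String sha256Hex(String text, Charset charset) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The word "cafe" with an acute accent on the last letter, written as a Unicode escape.
        String text = "caf\u00e9";
        // The accented letter is two bytes in UTF-8 but one byte in ISO-8859-1,
        // so the same string produces two different hashes.
        System.out.println(sha256Hex(text, StandardCharsets.UTF_8));
        System.out.println(sha256Hex(text, StandardCharsets.ISO_8859_1));
    }
}
```

For pure ASCII text both encodings yield identical bytes, which is why such mismatches often go unnoticed until the first non-ASCII character appears.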

Security considerations

Several cryptographic hash algorithms that were considered good in the past are now discouraged. The list includes MD2 and MD5, whose security was compromised some time ago, and also SHA-1, which has a design similar to MD5 and for which practical collision attacks have since been demonstrated. They should not be used in new code; the only exception is when you are communicating with applications or using protocols that do not support any stronger algorithm.

In practice, new code should use the SHA-2 family of algorithms or something of similar or higher strength.

Conclusion

Cryptographic hash functions are used very often in practice, even in areas that don't fully utilize their power, such as checking for data transmission errors. Computing them in Java is fairly easy and requires only a few lines of code.

The sample code for this article can be found on GitHub.