Implementing AES-128-ECB (School Homework)

Notice

This is an old article and only available in Chinese. If you need a translation, please leave a comment and I will do my best to provide it as soon as possible.

首先呢这份代码是这学期选修课的大作业，前后也花了点心思，于是还是贴到这儿，顺便给博客除除草 ww

作业的题目有两大类，一类是使用常见的对称/非对称加密算法（DES/AES/RC5/RSA/Blowfish/blabla）实现一个可用的加解密工具，另一个则是实现一个 HASH 程序，类似于 md5sum 或者是 sha256sum 这种。我选择了做 AES 主要是有几个原因：

老师说了，杂凑算法于加解密算法比较起来，实现难度低，因此基础成绩就会相对低一点（然而我觉得两个都蛮简单的啊）。
在上课之前就已经接触过 AES 这个算法了（归功于 Shadowsocks 啦），用了这么久的东西，也算是对它的实现比较感兴趣的（然而怎么想都知道自己写的破代码性能会被 OpenSSL 吊着打）。

当然按照老师的说法，简单的实现只是最基本的要求，所以我这里打算做三分实现，分别是 CPU 单线程，CPU 多线程和 GPGPU。其实这里面核心代码都是完全一样的，无非就是分组加密和密钥更新上有点区别，但是听起来就高大上了很多（没毛病）。

AES 本身一次只能使用 128bit 的密钥对 128bit 的数据进行加密，所以在实际中必须使用分组加密算法来完成大文件的处理。往简单里讲，分组加密不过是把数据分成等长的分片，然后每次对一个分片进行加密，然后在将加密后的分片组合成文件。解密过程与之类似。这种最简单粗暴地分组加密被称为电子密码本（ECB）模式。然而实际上，由于相同分片数据使用相同的密钥进行加密后，输出必然也是相同的，这样也就会使加密的内容带有特征，存在潜在的隐患，于是人们提出一些会修改后续分组的密钥的分组加密算法，比如密码分组链接模式（CBC）、密文反馈模式（CFB）和计数器模式（CTR）。CBC 和 CFB 在设计上，下一分片的密钥受到上一分片的影响，也就意味着这个算法必须线性的对密码进行加密，不利于多线程的实现。那么可选的就是 CTR 和 ECB 模式了。二选一的话，选择 ECB 而不是 CTR 主要是因为我懒，ECB 只要一个函数改变传入指针就能跑了（

分区模式敲定之后，就轮到 AES 算法的了解和实现了。整个算法主要由两部分组成：数据变换和密钥拓展。

首先实现一份密钥拓展函数，其算法参考：Rijndael key schedule – Wikipedia，使用上很简单，直接将密钥指针传入即可，返回的是堆上分配的结果，用完记得要 free（（

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45


extern uint8_t round_const[];
uint8_t *key_schedule(const uint8_t *input_key) {
    uint8_t *output = (uint8_t *)malloc((10 + 1) * 16 * sizeof(uint8_t));
    uint8_t temp[4], *cursor_4_before, *cursor_16_before, *cursor_current;
    size_t counter, i, j;

    // Copy the first 16 bytes
    memcpy(output, input_key, 16 * sizeof(uint8_t));

    // Initialize cursors.
    cursor_current = output + 16;
    cursor_4_before = cursor_current - 4;
    cursor_16_before = cursor_current - 16;

    // Rock and roll.

    for (counter = 0; counter != 10; counter++) {
        // RotWord
        temp[3] = cursor_4_before[0];
        temp[0] = cursor_4_before[1];
        temp[1] = cursor_4_before[2];
        temp[2] = cursor_4_before[3];

        // SubBytes and add
        for (i = 0; i != 4; ++i) {
            cursor_current[i] = sbox[temp[i]] ^ cursor_16_before[i];
            if (i == 0) {
                cursor_current[i] ^= round_const[counter];
            }
        }

        // Forward all cursors
        cursor_current += 4, cursor_4_before += 4, cursor_16_before += 4;

        // Three step remaining is simple.
        for (i = 0; i != 3; ++i) {
            for (j = 0; j != 4; ++j) {
                cursor_current[j] = cursor_4_before[j] ^ cursor_16_before[j];
            }
            cursor_current += 4, cursor_4_before += 4, cursor_16_before += 4;
        }
    }

    return output;
}

AES 128 会对输入数据进行十轮处理，其中 AddRoundKey 步骤使用的密钥就是由上面函数生成的。

AES 128 的十轮处理中，必然会有的四个步骤：字节替换、行移位、列混淆和密钥相加（XOR）。

字节替换提供了加密法非线性的变换能力，实现起来相当容易：将每一个元素代入 S 盒中即可：

1
2
3
4
5
6
7


extern uint8_t sbox[];
void sub_bytes(uint8_t *input) {
    int i = 15;
    do {
        input[i] = sbox[input[i]];
    } while (--i >= 0);
}

这里补充一下，AES 对输入的明文和密钥的处理有一点反人类，比如明文 {a0, a1, a2, a3, …, a15}，我们要把它看成这样的一个矩阵：

1
2
3
4


a0  a4  a8  a12
a1  a5  a9  a13
a2  a6  a10 a14
a3  a7  a11 a15

所以上面给出的 OFFSET 宏乍一看比较反人类。。对应的，行移位后的矩阵应该是这样的：

1
2
3
4


a0  a4  a8  a12
a5  a9  a13 a1
a10 a14 a2  a6
a15 a3  a7  a11

在内存里也就是这样的数组： {a0, a5, a10, a15, a4, a9, a14, a3, …}

剩下两个函数就是列混淆和密钥相加。其中列混淆可以视为 Rijndael 有限域之下的矩阵乘法，详情可以看 Wikipedia，或者是翻翻书。这里我提前打了个表，来加速完成。变量名叫 cache 是本来想实时计算缓存结果的，最后还是干脆直接打全了。。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35


static inline uint8_t mul2(uint8_t a) {
    static uint8_t cache[256] = {
        0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x10, 0x12, 0x14, 0x16, 0x18, 0x1A, 0x1C, 0x1E,
        0x20, 0x22, 0x24, 0x26, 0x28, 0x2A, 0x2C, 0x2E, 0x30, 0x32, 0x34, 0x36, 0x38, 0x3A, 0x3C, 0x3E,
        0x40, 0x42, 0x44, 0x46, 0x48, 0x4A, 0x4C, 0x4E, 0x50, 0x52, 0x54, 0x56, 0x58, 0x5A, 0x5C, 0x5E,
        0x60, 0x62, 0x64, 0x66, 0x68, 0x6A, 0x6C, 0x6E, 0x70, 0x72, 0x74, 0x76, 0x78, 0x7A, 0x7C, 0x7E,
        0x80, 0x82, 0x84, 0x86, 0x88, 0x8A, 0x8C, 0x8E, 0x90, 0x92, 0x94, 0x96, 0x98, 0x9A, 0x9C, 0x9E,
        0xA0, 0xA2, 0xA4, 0xA6, 0xA8, 0xAA, 0xAC, 0xAE, 0xB0, 0xB2, 0xB4, 0xB6, 0xB8, 0xBA, 0xBC, 0xBE,
        0xC0, 0xC2, 0xC4, 0xC6, 0xC8, 0xCA, 0xCC, 0xCE, 0xD0, 0xD2, 0xD4, 0xD6, 0xD8, 0xDA, 0xDC, 0xDE,
        0xE0, 0xE2, 0xE4, 0xE6, 0xE8, 0xEA, 0xEC, 0xEE, 0xF0, 0xF2, 0xF4, 0xF6, 0xF8, 0xFA, 0xFC, 0xFE,
        0x1B, 0x19, 0x1F, 0x1D, 0x13, 0x11, 0x17, 0x15, 0x0B, 0x09, 0x0F, 0x0D, 0x03, 0x01, 0x07, 0x05,
        0x3B, 0x39, 0x3F, 0x3D, 0x33, 0x31, 0x37, 0x35, 0x2B, 0x29, 0x2F, 0x2D, 0x23, 0x21, 0x27, 0x25,
        0x5B, 0x59, 0x5F, 0x5D, 0x53, 0x51, 0x57, 0x55, 0x4B, 0x49, 0x4F, 0x4D, 0x43, 0x41, 0x47, 0x45,
        0x7B, 0x79, 0x7F, 0x7D, 0x73, 0x71, 0x77, 0x75, 0x6B, 0x69, 0x6F, 0x6D, 0x63, 0x61, 0x67, 0x65,
        0x9B, 0x99, 0x9F, 0x9D, 0x93, 0x91, 0x97, 0x95, 0x8B, 0x89, 0x8F, 0x8D, 0x83, 0x81, 0x87, 0x85,
        0xBB, 0xB9, 0xBF, 0xBD, 0xB3, 0xB1, 0xB7, 0xB5, 0xAB, 0xA9, 0xAF, 0xAD, 0xA3, 0xA1, 0xA7, 0xA5,
        0xDB, 0xD9, 0xDF, 0xDD, 0xD3, 0xD1, 0xD7, 0xD5, 0xCB, 0xC9, 0xCF, 0xCD, 0xC3, 0xC1, 0xC7, 0xC5,
        0xFB, 0xF9, 0xFF, 0xFD, 0xF3, 0xF1, 0xF7, 0xF5, 0xEB, 0xE9, 0xEF, 0xED, 0xE3, 0xE1, 0xE7, 0xE5,
    };
    return cache[a];
}
void mix_columns(uint8_t *input) {
    size_t i;
    uint8_t temp[16], tmp;

    memcpy(temp, input, 16);

    for (i = 0; i != 16; i += 4) {
        tmp = temp[i] ^ temp[i + 1] ^ temp[i + 2] ^ temp[i + 3];
        input[i + 0] = mul2(temp[i + 0] ^ temp[i + 1]) ^ temp[i + 0] ^ tmp;
        input[i + 1] = mul2(temp[i + 1] ^ temp[i + 2]) ^ temp[i + 1] ^ tmp;
        input[i + 2] = mul2(temp[i + 2] ^ temp[i + 3]) ^ temp[i + 2] ^ tmp;
        input[i + 3] = mul2(temp[i + 3] ^ temp[i + 0]) ^ temp[i + 3] ^ tmp;
    }
}

最后是密钥相加。太简单了对应位置 Xor 就行了，甚至不用单独写个函数。将这以上四部分组合起来之后，只要写一个函数对明文进行十轮加密，函数大概这样：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


void aes_128_single_block(const uint8_t *input, uint8_t *round_keys, uint8_t *output) {
    size_t i, j;
    uint8_t *result, *cursor_round_key;

    // initialize cursors.
    result = output;
    cursor_round_key = round_keys;

    // Copy plain text.
    for (i = 0; i != 16; ++i) {
        result[i] = input[i] ^ cursor_round_key[i];
    }
    cursor_round_key += 16;

    // Run rounds excluding the last round.
    for (i = 0; i != 9; ++i) {
        sub_bytes(result);
        shift_rows(result);
        mix_columns(result);
        // Add round
        for (j = 0; j != 16; ++j) {
            result[j] = result[j] ^ cursor_round_key[j];
        }
        cursor_round_key += 16;
    }

    // Now the last round.
    sub_bytes(result);
    shift_rows(result);
    for (i = 0; i != 16; ++i) {
        result[i] = result[i] ^ cursor_round_key[i];
    }
}

其中 round_keys 就是最开头给的函数生成的带拓展的密钥。

完成以上的加密相关代码后，只要小幅改动部分代码就能实现密文的解密了。比如 AES 里四个步骤的对应函数，除了轮密钥相加以外，其他三个函数都需要编写对应的反函数（姑且这么叫吧）。至于轮密钥相加以外，本质工作就是将轮密钥和体（state）进行异或，根据异或的性质，两次异或直接能就是本身，所以没有再去折腾了。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48


void aes::inverse_sub_bytes(byte *input) {
    int i = 15;
    do {
        input[i] = inverse_sbox[input[i]];
    } while (--i >= 0);
}

void inverse_shift_rows(byte *input) {
    int index, inner_index;
#define OFFSET(i, j) ((j) * 4 + (i))
    for (index = 0; index != 4; ++index) {
        byte temp;
        inner_index = index;
        while (inner_index-- > 0) {
            temp = input[OFFSET(index, 3)];
            input[OFFSET(index, 3)] = input[OFFSET(index, 2)];
            input[OFFSET(index, 2)] = input[OFFSET(index, 1)];
            input[OFFSET(index, 1)] = input[OFFSET(index, 0)];
            input[OFFSET(index, 0)] = temp;
        }
    }
#undef OFFSET
}

void aes::inverse_mix_columns(byte *input) {
    size_t i;
    byte temp[16], tmp;

    memcpy(temp, input, 16);

    for (i = 0; i != 16; i += 4) {
        tmp = temp[i] ^ temp[i + 1] ^ temp[i + 2] ^ temp[i + 3];
        input[i + 0] = mul2(temp[i + 0] ^ temp[i + 1]) ^ temp[i + 0] ^ tmp;
        input[i + 1] = mul2(temp[i + 1] ^ temp[i + 2]) ^ temp[i + 1] ^ tmp;
        input[i + 2] = mul2(temp[i + 2] ^ temp[i + 3]) ^ temp[i + 2] ^ tmp;
        input[i + 3] = mul2(temp[i + 3] ^ temp[i + 0]) ^ temp[i + 3] ^ tmp;

        byte u, v;
        u = mul2(mul2(temp[i + 0] ^ temp[i + 2]));
        v = mul2(mul2(temp[i + 1] ^ temp[i + 3]));
        tmp = mul2(v ^ u);

        input[i + 0] ^= tmp ^ u;
        input[i + 2] ^= tmp ^ u;
        input[i + 1] ^= tmp ^ v;
        input[i + 3] ^= tmp ^ v;
    }
}

最后的解密过程就是加密过程反着来，写起来也没什么困难：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48


void aes::aes_decrypt_single_block(const byte *input, byte * const round_keys, byte *output) {
    size_t i, j;
    byte *result, *cursor_round_key;

    // initialize cursors.
    result = output;
    cursor_round_key = round_keys + 160;

    for (i = 0; i != 16; ++i) {
        result[i] = input[i];
    }

    // The last round
    for (i = 0; i != 16; ++i) {
        result[i] = input[i] ^ cursor_round_key[i];
    }

    inverse_shift_rows(result);

    inverse_sub_bytes(result);

    cursor_round_key -= 16;

#ifdef DEBUG
    printf("# InvRound %d\n", 10);
    print_key(result);
#endif

    // Run rounds excluding the last round.
    for (i = 0; i != 9; ++i) {
        for (j = 0; j != 16; ++j) {
            result[j] = result[j] ^ cursor_round_key[j];
        }
        inverse_mix_columns(result);
        inverse_shift_rows(result);
        inverse_sub_bytes(result);
        cursor_round_key -= 16;
#ifdef DEBUG
        printf("# InvRound %d\n", 10 - i);
        print_key(result);
#endif
    }

    // Now the last addRoundKey.
    for (i = 0; i != 16; ++i) {
        result[i] = result[i] ^ cursor_round_key[i];
    }
}

以上就是 AES 的所有核心代码了，剩下的就是如何去调用这些函数。多线程方面，依靠 C++ 的 STL 库能很容易地实现线程的创建和锁控制。首先准备一个函数，将 key_schedule 和 aes 加密/解密函数包裹在一起。再准备一个函数按照指定步长去对输入指针内容进行加密/解密操作，代码很简单：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


void process_unit(uint8_t *data_in, uint8_t *data_out, uint8_t *key, bool isEncrypt) {
    aes_cpu::byte *rkey = aes_cpu::aes::key_schedule(key);
    if (isEncrypt) {
        aes_cpu::aes::aes_128_single_block(data_in, rkey, data_out);
    }
    else {
        aes_cpu::aes::aes_decrypt_single_block(data_in, rkey, data_out);
    }
    delete[] rkey;
}

void thread_entry(uint8_t *data_in, uint8_t *data_out, uint8_t *key, size_t file_size, size_t current, size_t total, bool isEncrypt) {
    for (uint32_t i = current; i < file_size / 16; i += total) {
        if (i % 10000 == 0 && current == 0) {
            printf("process: %.2lf\r", (double)i * 100 / (file_size / 16));
        }
        process_unit(data_in + 16 * i, data_out + 16 * i, key, isEncrypt);
    }
}

现在只需要根据用户传入的线程数，创建若干线程即可完成工作：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


bool isEncrypt = true;
std::vector<std::thread> threads(thread_count);

// Construct all threads, and let them go.
for (int i = 0; i != threads.size(); ++i) {
    threads[i] = std::thread(thread_entry, data_in, data_out, key, full_block_size, i, threads.size(), isEncrypt);
}

// Sync for all threads.
for (int i = 0; i != threads.size(); ++i) {
    threads[i].join();
}

涉及到任意文件加的时候就会遇到一个问题：文件大小不一定是 128 比特的整数倍。这时候就需要使用一些密码学上的填充算法。我简单的看了一下 PKCS#7 下的填充方法，简单说来就是：

若文件恰好为 16 字节的整数倍，后面强行补十六位；
填充内容为当前大小与填充后大小的差，比如当前是 126 字节，就补充两个 0x02。对于上方的特例，就补充 16 个 0x16；

这里就不把函数列出来了，一个 realloc 一个 for 就能解决。这样我们的多线程 AES 加解密就已经全部完成了，接下来就是将以上代码迁移到 CUDA 平台。

从整体上看，CUDA 和 CPU 的代码基本一致（废话都是 AES 能不一样有鬼了），主要的区别就在于调用 GPU 是一个类似于 Master 和 Slave 通讯的过程。在完成 PKCS#7 的填充后，我们要把明文，密钥等复制到显存，同时在显存里分配一块缓冲区来保存密文，代码大致如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29


uint8_t *round_keys = key_schedule(keys);
uint8_t *gpu_input = nullptr;
uint8_t *gpu_round_keys = nullptr;
uint8_t *gpu_output = nullptr;
cudaError_t cudaStatus;

#define ASSERT(s) \
    if (cudaStatus != cudaSuccess) { \
        fprintf(stderr, s"\n"); \
        goto Error; \
    }

// Select which GPU to work with.
cudaStatus = cudaSetDevice(0);
ASSERT("cudaSetDevice");

// Prepare GPU memory block
cudaStatus = cudaMalloc((void**)&gpu_input, size * sizeof(uint8_t));
ASSERT("cudaMalloc");
cudaStatus = cudaMalloc((void**)&gpu_round_keys, (11 * 16) * sizeof(uint8_t));
ASSERT("cudaMalloc");
cudaStatus = cudaMalloc((void**)&gpu_output, size * sizeof(uint8_t));
ASSERT("cudaMalloc");

// Copy data from memory to GPU memory.
cudaStatus = cudaMemcpy(gpu_input, input, size * sizeof(uint8_t), cudaMemcpyHostToDevice);
ASSERT("cudaMemcpy: gpu_input");
cudaStatus = cudaMemcpy(gpu_round_keys, round_keys, (11 * 16) * sizeof(uint8_t), cudaMemcpyHostToDevice);
ASSERT("cudaMemcpy: gpu_round_keys");

然后就可以启动 GPU 上的函数了：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


size_t offset = 0;
const size_t threadsPerBlock = 512;
const size_t blockSize = 16;
while (offset * threadsPerBlock * blockSize * 16 < size) {
    if (offset % 100 == 0) {
        printf("progress: %lf\r", offset * (double)threadsPerBlock * blockSize * 1600 / size);
    }
    if (isEncrypt) {
        aes_128_single_block <<<blockSize, threadsPerBlock >>>(gpu_input, gpu_round_keys, gpu_output, size / 16, offset, threadsPerBlock * blockSize);
    }
    else {
        aes_decrypt_single_block <<<blockSize, threadsPerBlock >>>(gpu_input, gpu_round_keys, gpu_output, size / 16, offset, threadsPerBlock * blockSize);
    }
    offset++;
}

最后，把显存上的密文复制回内存，再写入到文件，就完成了整个加密过程：

1
2
3
4
5
6


cudaStatus = cudaGetLastError();
ASSERT("cudaGetLastError");
cudaStatus = cudaDeviceSynchronize();
ASSERT("cudaDeviceSynchronize");
cudaStatus = cudaMemcpy(output, gpu_output, size * sizeof(uint8_t), cudaMemcpyDeviceToHost);
ASSERT("cudaMemcpy");