Tuesday, July 05, 2016

Chess Move generation benchmark : from mobile phones to supercomputers

Introduction

Writing computer programs that can play chess has been hobby of mine since a very long time. I probably started thinking about it more than 15 years back. I still remember watching matches between the fritz chess program running on Pentium 1 (100Mhz, 8 MB RAM) and my father (who is a chess enthusiast). 
I wrote my first successful program when I was in college (about 12 years ago). By successful I mean it was competitive against my friends and dad but not too strong. Relatively good chess players among my friends and my dad could beat it about half of the games.
Since then I have attempted multiple times to improve or re-write it without much success. Most of the times I start something but then get too lazy and abandon the effort. 

About 3 years ago, I decided to do a complete re-write: i.e,: without copying single line of code from my original program. One of the shortcomings of my original chess program was that it didn't handle special moves like en-passent and promotion to pieces other than queen correctly. I was not too sure if it had more bugs in move-generation. This time I followed advice from chess programming wiki (CPW) and talkchess.com forums and decided to implement bug-free move generator first. 
I started with a 0x88 based board representation (https://chessprogramming.wikispaces.com/0x88), made the move generation almost bug-free but I wasn't very happy with it's speed. Bitboard representation (https://chessprogramming.wikispaces.com/Bitboards) seemed promising so I again rewrote the move generator with Bitboards. 
I spent quite a lot of time optimizing the perft routine (a test for move generation correctness and speed: https://chessprogramming.wikispaces.com/perft) and finally my perft routine became one of the fastest available. For the want of more speed, I spent another big chunk of time porting the perft routine to run on Nvidia GPUs with very promising results. AFAIK, it's currently the fastest brute-force perft counter in the world (without using hash tables. I have a version with hash table support too but speeds are not easily comparable due to the way different programs manage or use hash tables). 
Writing rest of the chess engine got side-tracked as I spent nearly a year (not full time) in optimizing perft routine for CPUs and then GPUs. I did get back to writing rest of the chess program (only the CPU version). At present it plays chess but I don't call it complete yet. There are a few things I still want to implement before releasing it. My new program is a big improvement over my original one. It easily beats all my friends and dad - even when running on a phone (so I don't have much motivation for improving it further :-/). Against other chess engines it's still very weak. It's probably about 2200 ELO on CCRL scale - top engines being close to 3400 at the time of writing (http://www.computerchess.org.uk/ccrl/404/).

The benchmarks

This article is about comparing speed of move generation (perft routine without hash) across a wide range of devices - phones, tablets, laptops, PCs and GPUs. I also have some numbers for the complete chess engine (chess search nps) too but as nobody has figured out a way to implement efficient chess search algorithm on GPU yet, the numbers only compare some phones and PCs.

In the tables/charts below you will see results for two benchmarks: 'chess' and 'perft'.
  1. perft is no. of counted moves per second in brute force move-generation for the second test position of CPW: "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -". For depth 5 (for phones and CPU results) or depth 7 (for GPU results)
  2. chess is no. of nodes evaluated per second for starting position searching till depth 10 or 11.
The programs are available here:

For chess_cpu, I use the following commands for benchmarking:
uci
ucinewgame
position startpos
go infinite
Which will start the search and print out nps numbers (among other stats).

If you run the perft programs you will get output like this:
...
Perft 5: 193690690,   Time taken: 0.216556 seconds, nps: 894414087
...
GPU Perft 7: 374190009323,   Time taken: 4.19305 seconds, nps: 89240608095

Notes on 32 vs 64 bit, CPU Cores, compilers, hardware bit-manipulation instructions etc:

As my program uses bitboard representation where pieces of different types are represented by a 64 bit integers and bitwise operations on 64 bit numbers are heavily used for move generation. This means a 64 bit machine / build will perform much faster than the 32 bit one. I have observed slightly more than 2x speed up just going from 32 bit compile to 64 bit compile.

Apart from bitwise operations, two additional operations that are heavily used are:
  1. popCount (count no of set bits in the 64 bit number). For example this can be to find no. pieces of a type (e.g: numKnights = popCount(knights); numMajorPieces=popCount(queens | rooks);)
  2. clz (count leading zeros - find the position of first set bit). This is used at many places to enumerate all pieces in a bitboard (e,g, to loop over all bishops in the bitboard)
Most modern hardware (arm, intel and even nvidia gpus) have native instructions to either directly support or speed-up these operations. It sometimes depends on the compiler if it's able to use them effectively. e.g For linux builds I simply use __builtin_popcountll() / __builtin_ffsll() routines and hope the compiler will replace it with native instruction (or efficient sequences). 
Although I have tried making the comparison as fair as possible, I have noticed that the compiler used, the kind of optimizations it can perform and whether it can make use of hardware 'clz' and 'popCount' instructions makes a very big difference in the results of perft benchmark. 

Note that both perft_cpu and chess_cpu programs are single-threaded. You won't see any benefit on CPUs with more no. of cores/threads. perft_cpu uses very little memory (<3.5 MB total for some lookup-tables) and chess_cpu uses about 260MB (256 MB hash table in addition to the lookup tables). Having more memory capicity isn't going to help these benchmarks. These benchmarks are purely single-threaded CPU performance benchmarks but might benefit from bigger caches or lower latency RAM. 
perft_gpu obvisouly is multi-threaded and uses GPU video memory to store portions of the the search tree (in addition to the lookup tables that are present in constant or texture memories).

Results

Device / processor
Perft (nps)
chess (nps)
notes
Lg Optimus 2X (Tegra 2, 2xA9 @ 1Ghz, 512MB RAM)
17044859
gcc 4.8 build used.
gcc 5.3 build doesn't work on this device (kernel too old!)
Coolpad Dazen1 (Snapdragon 410, 4xA53 @ 1.2GHz, 2 GB RAM)
39928436
230479
32 bit (limited by OS). 
With gcc 4.8 build: 20M perft and 160k chess.
Tegra Note 7 (Tegra 4, 4xA15 @1.8Ghz, 1 GB RAM)
63811018
274913
32 bit (CPU doesn't support 64 bit)
Xiaomi Mipad (Tegra K1, 4xA15 @2.2Ghz, 2 GB RAM)
93773557
543880
32 bit (neither of CPU, OS support 64bit)
Le Eco Le 2 (Snapdragon 652, 4xA72@1.8Ghz+4xA53@1.6Ghz, 3GB)
211000000
936920
64 bit. With 32 bit: perft: 95M, chess: 570k
Home Laptop (intel core i5 3210m, 2C/4T, @~3.1 Ghz, 6GB RAM)
495492997
MSVC compile, 64 bit
Home PC (intel core i7 4790k, 4C/8T @~4.4 Ghz, 8GB RAM)
723352641
3487225
gcc compile. 64 bit. Without intrinsics, perft: 220MNps
GTX 780Ti stock (GK110 15 SM)
33318412188
GTX 970 (GM204 13 SM, @~1400/3600)
47374624070
Home PC, GPU slightly over-clocked.
GTX Titan X stock (GM200, 24 SM, @~1200/3500)
66021489568
GTX 1080 stock (GP104, 20 SM, @~1880/5000)
89240608095
P100 (GP100 pre-preduction card, unknown clocks/config)
118658287137

Lot of numbers! Here is a chart for perft scores:


GPUs are so much faster that the mobile and CPU results got squashed. We see healthy increase in performance with newer GPU generations.

Maybe logarithmic scale would help?


Much better than before. Phone to laptops and desktop CPUs show gradual increase. Going to GPUs has a big jump and after that we again have gradual increases with different GPU generations. However it still doesn't show big jump among phones or going from phones to PC. Here are CPU only scores in linear scale:


The results here might not be apples to apples due to differences in compilers and builds used. Few things to note:
  • The Lg optimus 2X phone (cortex A9 based tegra 2 @ 1GHz) was running the program compiled with older gcc compiler (gcc 4.8) because the gcc 5.3.1 build (which was used for other results) won't work. it kept complaining that kernel is too old. With gcc 4.8 build, I get ~20M nps on coolpad dazen 1 (snapdragon 410) instead of the ~40m score with gcc 5.3.1. My guess is gcc 4.8 is not utilizing the clz and popCount hardware instructions efficiently. I am not sure if arm A9 even supports them. If it does, clock per clock, for 32 bit apps A53 is probably only as fast as A9 (actually a tiny bit slower).
  • Coolpad Dazen 1 phone (A53 based Snapdragon 410) is capable of running 64 bit but it's running 32 bit build as the OS (Android 4.4.4) doesn't support 64 bit :-/. With 64 bit OS it could have been quite a bit faster (I would estimate nearly 100 Mnps instead of 40). 
  • LeEco Le 2 (Cortex A72 based Snapdragon 652) is the only phone that is running 64 bit build. With 32 bit build, it scores 95Mnps in perft - less than half of 64 bit score.
  • Tegra note 7 and Xiaomi Mipad have A15 based processors that doesn't support 64 bit - although 32 bit performance is quite decent.
  • The laptop was running a build on windows compiled using MSVC - which is slightly faster than the GCC 5.3.1 build used for other systems. Here is a comparison of performance of various compilers (taken on home PC - 4790k):
Intel core i7 4790k
perft
GCC fastest build (plain magics)
723352641
MSVC fastest build (fancy magics + __forceinline)
815811111
Intel compiler fastest build (plain magics)
1069033463

It's surprising to see almost 50% performance gain just by using a different compiler. The ICC compiler is not free - I downloaded 30 day trial version for my benchmarking.

Finally here is a graph of chess_cpu results on some of the devices:


One result that stands out is the comparatively low performance of the tegra note 7, and the big jump between tegra note 7 and the tegra k1 which can't be explained just by the CPU frequency. Maybe differences in cache sizes or memory?

Conclusion

Looking at all results, we can say that today's mobile chips are catching up quickly with desktop CPUs - at least for my perft and chess benchmarks. The fastest desktop CPU (i7 4790k) is only about 3.5x faster than the fastest mobile processor (snapdragon 652) that I was able to test. Some might argue that 4790k isn't the fastest CPU - but actually it is for single threaded performance. I have tested 6700k too and it scores exactly same as my 4790k (~1.1 Bnps in perft with intel build). If you compare clock-per-clock performance, it's even close : skylake has < 50% per clock advantage over arm A72!. 
Snapdragon 652 is definitely not the fastest mobile cpu - I am eager to get my hands on devices based on snapdragon 820/821, Exynos 8890 or Kirin 950 processors which might have better single threaded performance. Apple A9 and A9X processors are probably even faster.

In my benchmarks you might see GPUs being disproportionately faster. Keep in mind that CPU version of my perft benchmark is single threaded. Although the current benchmark results show fastest GPU being >100X faster than fastest CPU, once the benchmark is modified to utilize all CPU cores, the speedup might drop to only ~5-10x (e.g: when comparing Nvidia Tesla P100 with Intel Xeon E7-8890 v4). It's still surprising to see GPUs performing so well given that they don't have native support for 64 bit integers.

We have compared a wide range of devices - the difference between the slowest phone and the fastest system (GP100 based personal supercomputer :-) is nearly 7000X!

Monday, July 04, 2016

Running a command line C/C++ program on an android phone or tablet

Introduction:

In the times of Web, Android Apps, Facebook games, etc one might wonder why anyone would want to write and run a command line app on a phone? Just because you can - there is absolutely no commercial use.

One reason is simplicity. Almost everyone (any programmer) knows how to write a command line app in C - so there is no learning curve. As a programmer, the first thing I think when buying a device is ways to program it. Having a way to run simple C programs makes me feel happy about the device.

Another reason is performance - C code gets compiled to native machine code and doesn't incur performance penalty like java.

My main reason was to compare performance of phones/tablets by running an existing (my own) benchmark written in native cpp code (discussions on benchmark results is subject for another post).
The other way (which real android benchmarks probably use) is to write a wrapper Android app (in Java!) and use NDK to call native c/cpp code. There is a significant learning curve for android and it's too much work to make wrapper UI just for displaying statistics/benchmark results. Although previously for another project I had gone this way, I am too lazy to do it again.

This article is for people using ARM based android devices. Which is almost all android phones. Popularity of android on intel is almost as high as windows on arm.
For people using intel based phones, 'part A' isn't needed but you can still follow 'part B'. I don't have any device to try, but I believe you can directly use the same binary you use on your Linux PC (gcc hello.c).

Part A: Building the ARM binary/executable:

Write c/c++ program in your favourite editor. Some tips to make your code portable:
  1. Don't use non-standard or OS specific libraries. 
  2. You can use standard C library, STL, std::chrono (for accurate timing), std::thread (for multi-threading) etc. 
E.g:
#include <stdio.h>
int main() {
    printf("hello phone\n");
    return 0;
}
You need to set up cross compilation tool-chain to be able to compile your code for ARM. Tool chain is nothing but a compiler, libraries, linker etc that will take cpp code and generate the binary. Cross compilation simply means that you are generating binary in arm instruction set on your x86 machine. 
There are many ways to set this up: there are some commercial options and some free ones. The only ones I tried needed linux and I used latest ubuntu (16.04). 

i) Ubuntu comes with it's own cross-platform gcc/g++ tool chain which is quite easy to install and use. 

For arm 32 bit:
sudo apt-get install g++-arm-linux-gnueabihf

and for arm 64 bit:
sudo apt-get install g++-aarch64-linux-gnu

On my system, the above commands installed latest gcc 5.3.1 based tool chain.

To compile use arm-linux-gnueabihf-gcc or aarch64-linux-gnu-gcc instead of gcc. Eg:

aarch64-linux-gnu-gcc -static hello.c

the "-static" option ensures that c run-time library is statically linked otherwise there is very little chance that the binary will run on your phone (I was getting 'file not found' error when trying to run the binary).

ii) https://launchpad.net/linaro-toolchain has alternate cross-compile tool-chains. I had to use this because the default ubuntu tool-chain was creating a binary that was too new formy old phone - lg optimus 2X (causing "Fatal: Kernel too old" error).
I used a slightly older gcc 4.8 based build: gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux. 
It didn't work out of the box, I had to install something called zlib using:
sudo apt-get install zlib1g
sudo apt-get install zlib1g:i386
Once you have zlib installed, just extract the archive, and use "./arm-linux-gnueabihf-gcc" in the bin sub-folder to compile your C program.

E.g:
/home/ankan/gcc-arm/gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux/bin/./arm-linux-gnueabihf-gcc -static hello.c 

I am sure there might be ways to avoid writing the whole path, but I like it separate to avoid it interfering with my default ubuntu toolchain.


Part B: Deploying/Running the binary on your phone or tablet:


At first I thought this should be simple, but I ended up wasting almost as much time in just running the binary as I spent in compiling it. The sad part is that there is no clear instructions anywhere on the internet on how to run a compiled command line c program on android.
Some suggestions I found on the internet required rooting but I don't like rooting my new devices it voids warranty. After trying out random things, I finally figured out a solution:

1. Firstly you will need a terminal application. Here are a few that I used (install any one you like):
https://play.google.com/store/apps/details?id=com.termux&hl=en
https://play.google.com/store/apps/details?id=yarolegovich.materialterminal&hl=en
https://play.google.com/store/apps/details?id=jackpal.androidterm&hl=en

2. Copy your binary to the phone/tablet. Preferably copy it to phone's internal storage. If not feasible, copy it to external sd card. You can do it in many ways - easiest is to connect your device to the computer, enable USB or MTP mode to access the device's internal storage / sd card.

3. Open your terminal application and search for the path where you copied your binary (in your phone's internal memory or external sdcard). You can use linux commands like "ls", "cd" in the terminal application. 
Tips:
  1. Usually the phone's internal memory is at "/sdcard" but the external sd card could be anywhere (in one of my devices it was at "/external_sd" and on another device it was at "/storage/sdcard1"). 
  2. You can also use some GUI based file explorer ES file explorer) to figure out path of your sdcard/storage locations.

4. You can't directly run the binary from sdcard because it won't have the 'execute' permission and most devices won't allow setting 'execute' permission on files on SD card (or even internal storage that is visible to end-users). Unlike Windows where the file extension (.exe) determines if the file can be executed, Linux (and hence android which is basically Linux with Google's fancy UI) uses a more advanced permission system: read/write/execute and the extension of the file doesn't matter.

5. You need to copy the binary somewhere where it will allow you setting execute permission. Unfortunately the way android apps and user accounts work, each app can write only either to the SD card / phone storage (where you can't set execute permission) - or to a specific (well hidden!) path which is accessible only to the app. 
E.g, the terminal app can write only to a hidden path which is /data/data/<id of the application>. The id of the application is what you can see in google play store website listing of the application. E.g: /data/data/yarolegovich.materialterminal or /data/data/jackpal.androidterm.

6. After knowing the app where you have access and the path where you have your binary it pretty simple to copy and run it, e.g:

cd /data/data/yarolegovich.materialterminal
cp /sdcard/arm_test/a.out .
chmod +x a.out
./a.out

here "a.out" is your binary present inside the folder "arm_test" in your phone's storage.

Notes: 

 i) On some devices "cp" command doesn't work, but you can use "cat". E.g:
    cat /sdcard/arm_test/a.out >a.out

ii) On some devices chmod +x doesn't work, you can use:
    chmod 777 a.out