I have been working on En Pyssant for a while, a Free Software chess engine written in Python. The engine part of the program is not (yet) complete, but the rules of chess are fully implemented. My primary goal in writing this engine is to use everything and the kitchen sink to learn new things: maybe the program does not need a full testing and benchmarking suite, but I have implemented one anyway.
Though there is a benchmarking suite, I have not really prioritised performance at all during development. But when it takes 10.4 seconds to parse the following game blurb, I figured it would be a good time to take a look at performance:
[Event "F/S Return Match"]
[Site "Belgrade, Serbia JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]
1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5
7. Bb3 d6 8. c3 O-O 9. h3 Nb8 10. d4 Nbd7 11. c4 c6 12. cxb5 axb5
13. Nc3 Bb7 14. Bg5 b4 15. Nb1 h6 16. Bh4 c5 17. dxe5 Nxe4
18. Bxe7 Qxe7 19. exd6 Qf6 20. Nbd2 Nxd6 21. Nc4 Nxc4 22. Bxc4 Nb6
23. Ne5 Rae8 24. Bxf7+ Rxf7 25. Nxf7 Rxe1+ 26. Qxe1 Kxf7 27. Qe3
Qg5 28. Qxg5 hxg5 29. b3 Ke6 30. a3 Kd6 31. axb4 cxb4 32. Ra5 Nd5
33. f3 Bc8 34. Kf2 Bf5 35. Ra7 g6 36. Ra6+ Kc5 37. Ke1 Nf4 38. g3
Nxh3 39. Kd2 Kb5 40. Rd6 Kc5 41. Ra6 Nf2 42. g4 Bd3 43. Re6 1/2-1/2
So let’s dive deeper. First, let’s see what hardware I’m running this on:
> inxi -c0 -v3 -z
System: Host: carmen-desktop Kernel: 4.15.13-1-default x86_64
bits: 64 gcc: 7.3.1
Desktop: Gnome 3.28.0 (Gtk 3.22.29)
Distro: openSUSE Tumbleweed 20180403
Machine: Device: desktop Mobo: ASUSTeK model: M5A97 R2.0 v: Rev 1.xx serial: N/A
UEFI: American Megatrends v: 0601 date: 07/17/2012
Battery hidpp__0: charge: N/A condition: NA/NA Wh
model: Logitech Performance MX status: Discharging
CPU: Octa core AMD FX-8350 Eight-Core (-MCP-)
arch: Bulldozer rev.0 cache: 16384 KB
flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm) bmips: 64217
clock speeds: max: 4000 MHz 1: 1403 MHz 2: 1403 MHz 3: 1547 MHz
4: 1493 MHz 5: 1396 MHz 6: 1402 MHz 7: 1477 MHz 8: 1531 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Tonga PRO [Radeon R9 285/380]
bus-ID: 01:00.0
Display Server: x11 (X.Org 1.19.6 ) driver: amdgpu
Resolution: 1920x1080@59.93hz
OpenGL: renderer: AMD Radeon R9 380 Series (TONGA / DRM 3.23.0 / 4.15.13-1-default, LLVM 6.0.0)
version: 4.5 Mesa 18.0.0 Direct Render: Yes
Network: Card: Realtek RTL8111/8168/8411 PCIE Gigabit Ethernet Controller
driver: r8169 v: 2.3LK-NAPI port: d000 bus-ID: 02:00.0
IF: enp2s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives: HDD Total Size: 1750.4GB (13.0% used)
ID-1: model: Crucial_CT750MX3
ID-2: model: ST1000DX001
Info: Processes: 352 Uptime: 1 day 15:24 Memory: 9945.2/15968.2MB
Init: systemd runlevel: 5 Gcc sys: 7.3.1
Client: Shell (bash 4.4.191) inxi: 2.3.40
An AMD FX-8350, basically. I did my development on a Core 2 Duo, though, which is quite a bit slower.
Then, let’s take a look at the profiler graph generated by running cProfile on Game.from_pgn(...) on the above game blurb, using e0630c98 as starting point. See here for the code that generates the graph. I will be looking at the BitBoard implementation exclusively for this article.
How to read the graph: every square has a function name, a cumulative percentage, a total percentage, and the number of times it has been called. The arrows show cumulatively how much time one function spends calling another, and how often one function calls another. The colours are there to help you read the graph: red means “spending a lot of (cumulative) time in this function”, and blue means “barely spending any time in this function”.
As a short addendum: “cumulative time” means how much time a function takes altogether. “Total time” (deceptively) means how much time goes by while inside the function itself, excluding time spent in calls to sub-functions. Thus, logically, the top function has a cumulative time of 100%, but very little total time, because it doesn’t do a whole lot of computation on its own.
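The distinction shows up in miniature when profiling a toy program in which a top-level function does all of its work through a helper. This is a standalone sketch, not En Pyssant code:

```python
import cProfile
import io
import pstats

def helper():
    # all the actual computation happens here
    return sum(i * i for i in range(100_000))

def top():
    # top() has a high cumulative time (it waits on helper()),
    # but almost no total time of its own
    return [helper() for _ in range(5)]

profiler = cProfile.Profile()
profiler.enable()
top()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

In the printed table, `top` sits near the top when sorted by cumtime, while nearly all of the tottime is attributed to `helper` and the generator expression it drives.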
The graph seems more or less normal. One thing that sticks out is that it spends a lot of time in Enum.__hash__. That’s a bit of a pain, because a lot of my core data types (Type, Side, Direction, MoveFlag) are Enums. And I’ve grown quite attached to the Enums. They are very pretty, you can do some nifty things with them, and removing them would mean breaking backward compatibility. But let’s see what happens if I replace the Enums with regular classes, and fix all the other code to do away with Enum-specific magic.
In 45aa9ed4, I have reduced the time it takes to run the program to 6.7 seconds. Great success! It only cost us our precious Enums. But we can do better than that, probably.
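To illustrate the kind of replacement involved (a sketch with made-up members, not the actual Side type), an Enum can be swapped for a plain class whose members are module-level singletons:

```python
from enum import Enum

class EnumSide(Enum):
    # the Enum version: hashing goes through the Python-level
    # Enum.__hash__, which is comparatively slow
    WHITE = 0
    BLACK = 1

class PlainSide:
    # the plain-class version: instances use the default C-level
    # object hash; identity comparison still works because the
    # two members below are module-level singletons
    __slots__ = ('value',)

    def __init__(self, value):
        self.value = value

WHITE = PlainSide(0)
BLACK = PlainSide(1)
```

The price is exactly the magic the post mentions: no iteration over members, no `Side('white')` lookup by value, no nice repr, and any pickled data referencing the old Enum breaks.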
If we look at the graph generated from running the code without Enums, we notice that over 40% of our time is spent in Board.__getitem__, which wastes 10% just calling Board.get (because function calls are expensive in Python). Why not just call Board.get directly instead? So I go through my codebase, and I replace (almost) all instances of board[square] with board.get(square) (5279dd45).
As a result, the program now takes 6.0 seconds. This is predictable, because we have shaved off the 10% overhead. But we can do better again.
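The overhead is easy to reproduce with a toy class (names are illustrative, not the real Board): subscripting goes through `__getitem__`, which is one extra Python-level call on top of `get`:

```python
import timeit

class Board:
    # toy stand-in: get() is the real accessor, and the subscript
    # syntax adds one extra Python-level function call on top of it
    def get(self, square):
        return square

    def __getitem__(self, square):
        return self.get(square)

board = Board()
print('board[...]:   ', timeit.timeit(lambda: board['e4'], number=1_000_000))
print('board.get(...):', timeit.timeit(lambda: board.get('e4'), number=1_000_000))
```

Both spellings return the same value; the subscripting version just pays for two frames instead of one.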
If we look at the graph again, we notice that Move.from_san calls Move.san a lot, and Move.san performs the lion’s share of all expensive calls. Maybe we can improve this. What if we didn’t depend on Move.san?
To understand what’s going on, it’s probably a good idea to illustrate what these functions do first. Move.san turns a Move object (an object containing an origin, a destination, and some other information about a given move) into Standard Algebraic Notation (SAN). For instance, ‘Na3’ is SAN for moving a kNight from ‘b1’ to ‘a3’. Where does ‘b1’ come from? Well, that’s implied, because the only knight that could possibly move to ‘a3’ is on ‘b1’. If there were another knight that could move to ‘a3’, you would have to disambiguate the SAN by adding more coordinates (for instance, ‘N1a3’ if there is another knight on ‘c4’. Where did the ‘c’ go? Don’t ask). As such, generating a SAN string requires you to generate all moves that can land on a certain destination square and perform a bit of voodoo magic.
Generating all legal moves is expensive, but not outrageous. What is outrageous is that Move.san is called over a thousand times, even though there aren’t that many notations in the game blurb that is being parsed. What gives?
Laziness gives. Having performed the SAN voodoo once, I did not want to do it in reverse for Move.from_san: generating a Move object from a SAN string. So I created a lazy implementation. Given san and position, just create all legal moves for position, and generate the SAN of each legal move. Then compare each generated SAN string against san. If it’s a match, then obviously we have found the correct move, and we can return that. This looks a little like this (57c6f1a7):
```python
@classmethod
def from_san(cls, position, ruleset, san):
    for move in ruleset.moves(position):
        move_san = move.san(position, ruleset)
        if move_san == san:
            return move
    raise ValueError('{!r} is not a valid move'.format(san))
```
We can do better than that… And it turns out, I did do better than that. I changed that function some time later to be more relaxed about its input. ‘Nb1a3’ and ‘Na3’ are the same move; only one of them is needlessly disambiguated. I wanted a way to also accept the verbose SAN strings, so I got to work and ended up with this:
```python
SAN_END_PATTERN = re.compile(
    r'[+#]?[?!]*$')
SAN_PATTERN = re.compile(
    r'([PNBRQK])?([a-h])?([1-8])?x?-?([a-h][1-8])([QRBN])?', re.IGNORECASE)

def stripped_san(san: str) -> str:
    """Remove superfluous symbols from *san*."""
    return re.sub(SAN_END_PATTERN, '', san)

...

@classmethod
def from_san(
        cls,
        san: str,
        position,
        ruleset,
        strict: bool = False) -> 'Move':
    if not strict:
        san = stripped_san(san)
    match = SAN_PATTERN.match(san)
    castling = 'O-O' in san
    if not match:
        if not castling:
            raise ValueError('{!r} is not a valid notation'.format(san))
    else:
        piece, file, rank, destination, promotion = match.groups()
    for move in ruleset.moves(position):
        move_san = move.san(position, ruleset)
        if not strict:
            move_san = stripped_san(move_san)
        if move_san == san:
            return move
        # pylint: disable=too-many-boolean-expressions
        if (
                strict
                or castling
                or move.destination != destination
                or (file and move.origin.file != file)
                or (rank and str(move.origin.rank) != rank)
                or (piece and move.piece.type != piece.lower())
                or (promotion
                    and move.promotion != promotion.lower())
                or (not piece and move.piece.type != Type.PAWN)):
            continue
        if not promotion and move.promotion:
            break
        return move
    raise ValueError('{!r} is not a valid move'.format(san))
```
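The pattern can be exercised on its own to see how the optional groups carve up a notation. This standalone snippet uses the same regex as above:

```python
import re

SAN_PATTERN = re.compile(
    r'([PNBRQK])?([a-h])?([1-8])?x?-?([a-h][1-8])([QRBN])?', re.IGNORECASE)

# groups are (piece, file, rank, destination, promotion)
print(SAN_PATTERN.match('Na3').groups())
print(SAN_PATTERN.match('N1a3').groups())
print(SAN_PATTERN.match('exd5').groups())
print(SAN_PATTERN.match('e8Q').groups())
```

‘Na3’ yields only a piece and a destination; ‘N1a3’ additionally captures the disambiguating rank; a pawn capture like ‘exd5’ captures a file but no piece; and ‘e8Q’ captures a promotion.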
Regexes! This solution is almost perfect. The only mistake I’ve made here is that I keep generating SANs anyway, even though I have all the scaffolding in place to return a move without generating the SAN. So after a bit of fiddling (e262997c), Move.san is no longer called nearly as often. It still uses Move.san when resolving castling moves or when being strict about what notations are allowed, but sorting that out is left as an exercise for the reader.
Having changed the innards of Move.from_san, the benchmark really flies. It now takes only 0.46 seconds to run. At this point, I am basically satisfied. Well, almost satisfied. You see, I had to sacrifice my Enums for this performance gain, and I really want my Enums back. So that is basically what I did. I merged the improvements sans removing the Enums, which resulted in 0.65 seconds. Not quite as good, but workable. It still irks me that the Enums are slower, though, so I take another look at the profiler graph:
What is evident from this graph is that attacked and is_check are responsible for most calls to Enum.__hash__. So it seems obvious to check there first. What am I looking for? Anything that looks like dict[Enum] or Enum in set. And sure enough, I find a line that looks like this:
```python
to_check = {  # type: Dict[Type, Set[Type]]
    _KNIGHT: {_KNIGHT},
    _BISHOP: {_BISHOP, _QUEEN, _KING},
    _ROOK: {_ROOK, _QUEEN, _KING},
}
for type_, targets in to_check.items():
    ...
```
Ah-ha! A dictionary of sets, sure to be the culprit. You need a hash to put a key in a dictionary, and a hash to put a value in a set. So I do the sensible thing, turn the dictionary into a global constant (66aa89c9), and run the benchmark again. 0.62 seconds… Not quite the improvement I had hoped for, perhaps even negligible.
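The shape of the fix, in a self-contained toy (strings stand in for the Type enum members; names are illustrative): hoisting the table to module level means the keys and set members are hashed once at import time rather than on every call.

```python
import timeit

# stand-ins for the Type enum members
_KNIGHT, _BISHOP, _ROOK, _QUEEN, _KING = 'n', 'b', 'r', 'q', 'k'

# hoisted: built (and hashed) once, at import time
TO_CHECK = {
    _KNIGHT: {_KNIGHT},
    _BISHOP: {_BISHOP, _QUEEN, _KING},
    _ROOK: {_ROOK, _QUEEN, _KING},
}

def local_table():
    # the original shape: the literal, and all its hashing,
    # is rebuilt on every call
    to_check = {
        _KNIGHT: {_KNIGHT},
        _BISHOP: {_BISHOP, _QUEEN, _KING},
        _ROOK: {_ROOK, _QUEEN, _KING},
    }
    return sum(len(targets) for targets in to_check.values())

def global_table():
    return sum(len(targets) for targets in TO_CHECK.values())

print(timeit.timeit(local_table, number=50_000))
print(timeit.timeit(global_table, number=50_000))
```

Both variants compute the same thing; only the construction cost moves, which is why the win in the real code was small: the lookups inside the loop still hash on every iteration.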
I do find some other variables that can be turned into global constants, but they don’t yield much, either. Not wanting to give up hope just yet, I turn to pdb. I call pdb.run('list(Game().moves())') in the interactive interpreter, set a breakpoint at Enum.__hash__, and I spam ‘c’ and ‘w’ to get the traceback for each call to Enum.__hash__.
It turns out: Piece(type, side) shows up reasonably often. Why does it show up? This is not a hash operation. Except … it is. To improve performance and memory usage, I memoised Piece.__new__ with lru_cache, which is basically a fancy dictionary that remembers parameters as keys and return values as values. Removing the cache would send performance down the drain, but perhaps we can go hunting for places where pieces are instantiated.
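A minimal sketch of that memoisation pattern (a toy class, not the real Piece, which has more to it): every construction becomes a cache lookup, and therefore a hash of the arguments.

```python
from functools import lru_cache

class Piece:
    # memoising __new__ makes construction return the same object
    # for the same arguments, but every call must hash those
    # arguments to look them up in the cache
    @lru_cache(maxsize=None)
    def __new__(cls, type_, side):
        obj = super().__new__(cls)
        obj.type = type_
        obj.side = side
        return obj

white_knight = Piece('n', 'white')
```

So `Piece('n', 'white')` is cheap in memory (one object per distinct piece) but never free in time, which is exactly why it kept tripping the breakpoint.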
Sure enough, there is one in attacked, and one in is_check. They both look a bit like this:
```python
for square, piece in position.board.all_pieces():
    if piece == Piece(_KING, side):
        ...
```
This is easily fixed. Just instantiate the piece outside of the loop (d66497be), and run the benchmark again. Sure enough, the timing is now improved to 0.51 seconds. Almost as fast as the benchmark without Enums! As a matter of fact, what would happen if we merged these improvements with the Enumless branch (136245f4)? We get 0.35 seconds! That’s all well and good, but I don’t want to give up my Enums just yet.
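The shape of that fix, in a self-contained toy (a memoised function stands in for the Piece constructor): hoist the constructed value out of the loop, so the cache lookup and its hashing happen once per call instead of once per square.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def make_piece(type_, side):
    # stand-in for the memoised Piece constructor: every call
    # hashes its arguments to find the cached object
    return (type_, side)

board = [('k', 'white'), ('p', 'black'), (None, None)] * 100

# before: one cache lookup (and hash) per square
before = sum(1 for piece in board if piece == make_piece('k', 'white'))

# after: the lookup is hoisted out of the loop
king = make_piece('k', 'white')
after = sum(1 for piece in board if piece == king)
```

The result is identical either way; only the number of hashed lookups changes.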
So we look at the graph again, and we notice that attacked still performs a lot of calls to Enum.__hash__, and so does Board.get.
In attacked, the offending line is Square.in_bounds(path). This function is also memoised with lru_cache and would be extremely slow without the cache. But perhaps it might be possible to turn path (which is a sequence of Direction Enums) into something that is more easily hashed. And sure enough, upon turning path into a sequence of (int, int) tuples (cfe128dc), the performance is increased to 0.46 seconds, and the number of calls from attacked to Enum.__hash__ is greatly reduced.
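The difference is visible in a micro-comparison (a toy Direction with made-up members, not the real one): hashing a tuple of Enum members goes through the Python-level Enum.__hash__ once per member, while a tuple of plain (int, int) tuples hashes entirely in C.

```python
import timeit
from enum import Enum

class Direction(Enum):
    # toy members; the real type has all eight compass directions
    UP = (0, 1)
    RIGHT = (1, 0)

enum_path = (Direction.UP, Direction.RIGHT, Direction.UP)
tuple_path = ((0, 1), (1, 0), (0, 1))

t_enum = timeit.timeit(lambda: hash(enum_path), number=200_000)
t_tuple = timeit.timeit(lambda: hash(tuple_path), number=200_000)
print('enum path:', t_enum, 'tuple path:', t_tuple)
```

Since these hashes are computed every time the path is used as an lru_cache key, the cheaper representation pays off on every single Square.in_bounds call.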
The last function that incurs a heavy penalty to Enum.__hash__, then, is Board.get. Or more specifically, BitBoard.get. BitBoard has a couple of integers as internal representation rather than storing the pieces. Thus when BitBoard.get is called, it creates a Piece(type, side) on the fly, which is memoised and therefore requires hashing, causing a slowdown. The solution, therefore, is to use a board that internally stores the pieces, avoiding unnecessary Piece.__new__ calls. ListBoard is that board.
Using ListBoard, our performance increases to 0.21 seconds, and the calls to Enum.__hash__ are just short of 7% of cumulative time:
Meanwhile, on the Enumless branch, ListBoard’s performance is 0.18 seconds. This is extremely close: close enough for me to keep my precious Enums.
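The trade-off between the two representations, in a toy sketch (illustrative classes, not the real En Pyssant boards): the bitboard reconstructs a piece value on every get, while the list board just returns the stored object.

```python
class ToyBitBoard:
    # stores occupancy as an integer bitmask; get() has to build
    # the piece value on the fly, which in the real code meant a
    # memoised (hence hashing) Piece(type, side) call per access
    def __init__(self):
        self._white_knights = 0b10  # a knight on square 1

    def get(self, index):
        if (self._white_knights >> index) & 1:
            return ('n', 'white')
        return None

class ToyListBoard:
    # stores the piece objects themselves; get() is a plain index
    # with no construction and no hashing
    def __init__(self):
        self._squares = [None] * 64
        self._squares[1] = ('n', 'white')

    def get(self, index):
        return self._squares[index]
```

The list board trades memory for speed: it keeps a reference per square alive, but every access skips the memoised constructor entirely.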
edit 1 (2018-04-08): Adding a simple tweak to the way castling moves are parsed (f07478f7) revs up the performance to 0.11 seconds. That’s 100x faster than what we started with!
edit 2 (2018-04-08): Changing the interpreter to PyPy 3 instead of CPython 3.6 and allowing the interpreter to warm up its JIT results in a performance of 0.05 seconds. That is 200x faster than what we started with.