{"id":9,"date":"2012-07-24T15:49:03","date_gmt":"2012-07-24T18:49:03","guid":{"rendered":"http:\/\/mdqinc.com\/?p=9"},"modified":"2012-11-29T16:41:22","modified_gmt":"2012-11-29T19:41:22","slug":"how-to-make-a-chicken-fly","status":"publish","type":"post","link":"https:\/\/mdqinc.com\/blog\/2012\/07\/how-to-make-a-chicken-fly\/","title":{"rendered":"How to make a chicken fly"},"content":{"rendered":"<p>In the last few months I\u2019ve been hard at work on my Python\/Cython based game engine, <a href=\"http:\/\/ignifuga.org\" target=\"_blank\">Ignifuga<\/a>, which I\u2019m developing alongside a mysterious game project (by the way, do you know <a href=\"http:\/\/whoisthegaucho.com\" target=\"_blank\">who is the gaucho?<\/a>). The rationale behind my choice of language had several aspects to it. I was already familiar with Python, it\u2019s a widespread language consistently in the top 5 of the most popular languages with a huge community behind it, and I felt there was no engine that did what I wanted my engine to have, that is be cross platform (at least Win\/OS X\/Linux and iOS \/ Android), data driven, simple to experiment and iterate changes. The cautionary tales about Python speed were written all across the web, but I figured that the Cython\/Python combo will let me develop the engine first, and fine tune it later.<\/p>\n<p>A couple of weeks ago I stumbled upon <a href=\"http:\/\/haxe.org\/\" target=\"_blank\">Haxe<\/a> and more importantly across <a href=\"http:\/\/www.haxenme.org\/\" target=\"_blank\">Haxe NME<\/a>. Basically Haxe NME provides most of what I wanted to do, but it is centered on the Haxe language, which is very similar to Action Script 3. The Haxe compiler outputs (among others) CPP and Javascript code, and being mostly statically typed, the apps it produces are very fast, as proven by <a href=\"http:\/\/philippe.elsass.me\/2011\/11\/nme-ready-for-the-show\/\" target=\"_blank\">this benchmark<\/a>, a variation of the <a href=\"http:\/\/blog.iainlobb.com\/2010\/11\/display-list-vs-blitting-results.html\" target=\"_blank\">BunnyMark<\/a><\/p>\n<p>Those numbers wet my appetite to see what sort of results I would get with my own engine, so I made <a href=\"https:\/\/bitbucket.org\/gabomdq\/ignifuga\/src\/c0f5a30ab843\/tests\/bunnymark\/bunnies.py\" target=\"_blank\">my version in pure Python<\/a> in a couple of hours. Since Ignifuga is entity\/component based, I created a number of entities with a sprite component attached, and in every update loop I moved them around. The results came, on my Ubuntu 12.04 i5 2500K system with a Nvidia GTX 570 and 302.17 binary drivers, at 30 FPS I got 1200 bunnies. So, what I had basically built was a low end smartphone emulator, seeing that I was getting worse figures than the Haxe BunnyMark ran on a LG Optimus One.<\/p>\n<p>Remember when I said that I chose Python\/Cython for the ease of implementation while retaining the opportunity to improve performance later on\u2026well, it felt \u201clater on\u201d had arrived. Ignifuga has two big loops ran in succession, the entities update loop and the renderer loop. The first one is ran using <a href=\"http:\/\/pypi.python.org\/pypi\/greenlet\/\" target=\"_blank\">greenlets<\/a>, which is a sort of cooperative threading, and lets entities and components update themselves. The renderer loop iterates over the existing sprite components, intersecting their position with the current visible area, and those that fall in that zone are renderer to the screen using SDL. Almost every operation I just mentioned was done in Python, so this was a good starting point as any to start improving. To confirm my suspicions, I activated the basic profiling abilities that <a href=\"http:\/\/docs.python.org\/library\/profile.html\" target=\"_blank\">Python<\/a> and <a href=\"http:\/\/docs.cython.org\/src\/tutorial\/profiling_tutorial.html\" target=\"_blank\">Cython<\/a> provide, and indeed I confirmed that every rendering loop was taking about 33 ms with 1200 bunnies on screen (that is, the entire alloted time for the frame).<\/p>\n<p>You can see what the renderer update loop looked like before the optimization<a href=\"https:\/\/bitbucket.org\/gabomdq\/ignifuga\/changeset\/2c341a89ab04#chg-Renderer.pyx\" target=\"_blank\"> here<\/a>. Thanks to Cython\u2019s seamless <a href=\"http:\/\/docs.cython.org\/src\/userguide\/wrapping_CPlusPlus.html\" target=\"_blank\">integration with CPP<\/a>, I quickly replaced all the Python dict\/list iteration based code (which I believe was what was eating up most of the CPU time), with fast <a href=\"http:\/\/www.cplusplus.com\/reference\/stl\/\" target=\"_blank\">STL containers<\/a>. I also added some extra intelligence to the code, to avoid intersecting rectangles excessively when the corresponding sprites hadn\u2019t moved (which is useless for the BunnyMark as all sprites move every frame, but it\u2019s useful for real apps). With this optimizations in place, I ran the benchmark again\u2026and got some glorious 1400 bunnies on screen.<\/p>\n<p>Sigh\u2026time to profile the code again. This time, the renderer loop was at less than 1ms per frame, and the update loop was now eating all the CPU bandwidth (we are talking about A LOT of CPU bandwith!). Time to fix up the update loop using the same strategy, I removed dicts\/lists and added STL containers, I called the Greenlet functions directly, etc. The benchmark now gave me an overwhelming 1600 bunnies\u2026but now the CPU cycles went almost entirely to the bunnies.py update loop itself.<\/p>\n<p>Being the competitive type of person that I am, I could not stop there. So, I made a Cython version of the Bunnymark. And to top that, I cheated. Yes, I confess\u2026I cheated. As the Haxe NME manipulates the sprites positions directly, I decided to cut through all of the nice scaffolding I had built to make development easier and cleaner (each entity and its components have their own update loop, it\u2019s all neat and cleanly separated, etc). <a href=\"https:\/\/bitbucket.org\/gabomdq\/ignifuga\/src\/c0f5a30ab843\/tests\/bunnymark\/bunniesx.pyx\" target=\"_blank\">This is the result<\/a>, it creates a single entity, and then it clones the sprite component as many times as needed, and then it moves thos sprites around directly without bothering to go through the \u201cproper channels\u201d. The result now looked promising, 57.000 bunnies (versus the 1600 I had with the pure Python benchmark).<\/p>\n<p>I then tested this benchmark on my other systems. On an ASUS Transformer TF101 I\u2019m getting 2600 bunnies at 30 FPS (Haxe NME gets 4000), on an iPad 2 1600 bunnies at 30 FPS (extrapolating from the results published Haxe NME gets about 6000). The reason for this difference may lie in several places, I\u2019m still going through some parts of Python code in time sensitive parts of my code, Ignifuga being an engine, I may have more scaffolding running at any given time than what the Haxe version has, etc. But overall I feel this is a very good result (specially seeing what I started with).<\/p>\n<p>A nice advantage Cython includes is the ability to apply <a href=\"http:\/\/docs.cython.org\/src\/userguide\/parallelism.html\" target=\"_blank\">seamless parallelization<\/a> where OpenMP is available (which seems to be Linux only). I\u2019ve added a bit of this on my renderer loop (the part which intersects the sprite rectangles with the screen rectangles), and on the bunnies update loop (<a href=\"https:\/\/bitbucket.org\/gabomdq\/ignifuga\/src\/c0f5a30ab843\/tests\/bunnymark\/bunniesp.pyx\" target=\"_blank\">this is the OpenMP version of the Bunnymark<\/a>). As a result, I got 60000 bunnies on my i5 2500k Linux system (3000 more than the non parallel version), to understand why I did not get more, it\u2019s time to see a few profile dumps\u2026<\/p>\n<p>Ubuntu 12.04, i5 2500k, GTX 570, non parallel (1 CPU at 100% load), 57k bunnies<\/p>\n<pre class=\"brush: bash;\">ncalls  tottime  percall  cumtime  percall filename:lineno(function)\r\n        1   73.544   73.544  146.260  146.260 GameLoop.pyx:27(run)\r\n     2223   62.774    0.028   62.838    0.028 Renderer.pyx:280(update)\r\n     2271    9.398    0.004    9.796    0.004 GameLoopBase.pyx:263(_doSwitch)\r\n        1    0.122    0.122    0.122    0.122 SDL.pyx:13(initializeSDL)\r\n        1    0.120    0.120    0.120    0.120 Renderer.pyx:43(__init__)\r\n      569    0.106    0.000    0.107    0.000 Canvas.pyx:228(text)\r\n        3    0.065    0.022    0.065    0.022 Canvas.pyx:26(__init__)\r\n     2223    0.064    0.000    0.064    0.000 Renderer.pyx:664(flip)\r\n    57771    0.047    0.000    0.054    0.000 Renderer.pyx:341(addSprite)\r\n      570    0.030    0.000    0.030    0.000 Renderer.pyx:371(removeSprite)\r\n        1    0.018    0.018  146.524  146.524 {bunniesx.run}\r\n   172229    0.018    0.000    0.018    0.000 Renderer.pyx:437(spriteDst)\r\n   172229    0.015    0.000    0.015    0.000 Renderer.pyx:441(spriteRot)\r\n   172229    0.014    0.000    0.014    0.000 Renderer.pyx:445(spriteColor)\r\n     2268    0.011    0.000    9.872    0.004 GameLoopBase.pyx:291(_processTask)\r\n    57771    0.008    0.000    0.008    0.000 Renderer.pyx:316(_indexSprite)\r\n     2223    0.005    0.000    9.877    0.004 GameLoopBase.pyx:169(update)<\/pre>\n<p>Ubuntu 12.04, i5 2500k, GTX 570, OpenMP parallel (4 CPUs at about 25% load each), 60k bunnies<\/p>\n<pre class=\"brush: bash;\">ncalls  tottime  percall  cumtime  percall filename:lineno(function)\r\n        1   63.751   63.751  118.381  118.381 GameLoop.pyx:27(run)\r\n     1763   48.222    0.027   48.268    0.027 Renderer.pyx:280(update)\r\n     1810    5.868    0.003    6.278    0.003 GameLoopBase.pyx:263(_doSwitch)\r\n        1    0.126    0.126    0.127    0.127 Renderer.pyx:43(__init__)\r\n      593    0.112    0.000    0.113    0.000 Canvas.pyx:228(text)\r\n        1    0.106    0.106    0.106    0.106 SDL.pyx:13(initializeSDL)\r\n        3    0.067    0.022    0.067    0.022 Canvas.pyx:26(__init__)\r\n    60195    0.048    0.000    0.056    0.000 Renderer.pyx:341(addSprite)\r\n     1763    0.046    0.000    0.046    0.000 Renderer.pyx:664(flip)\r\n      594    0.032    0.000    0.032    0.000 Renderer.pyx:371(removeSprite)\r\n        1    0.019    0.019  118.637  118.637 {bunniesp.run}\r\n   179451    0.017    0.000    0.017    0.000 Renderer.pyx:437(spriteDst)\r\n   179451    0.016    0.000    0.016    0.000 Renderer.pyx:441(spriteRot)\r\n   179451    0.014    0.000    0.014    0.000 Renderer.pyx:445(spriteColor)\r\n     1807    0.011    0.000    6.356    0.004 GameLoopBase.pyx:291(_processTask)\r\n    60195    0.008    0.000    0.008    0.000 Renderer.pyx:316(_indexSprite)\r\n     1763    0.004    0.000    6.360    0.004 GameLoopBase.pyx:169(update)<\/pre>\n<p>As you can see, in both cases the renderer loop still eats up most of the frame time (27\/28 ms out of the 33ms available for 30 FPS). There\u2019s a key renderer detail to consider: the OpenGL function calls need to be serialized, they can\u2019t be parallelized, so we\u2019ve basically hit a bottleneck, the one brought forth the graphics card\/driver performance. As a sidenote, this is the profile output of a much slower computer running Windows with a GTX 285 GPU and Nvidia drivers:<\/p>\n<p>Windows 7, Q6600, GTX 285, no parallel, 61k bunnies (using SDL\u2019s OpenGL backend)<\/p>\n<pre class=\"brush: bash;\">ncalls  tottime  percall  cumtime  percall filename:lineno(function)\r\n        1   50.177   50.177   90.285   90.285 GameLoop.pyx:27(run)\r\n     1466   30.128    0.021   30.222    0.021 Renderer.pyx:280(update)\r\n     1514    9.110    0.006    9.728    0.006 GameLoopBase.pyx:263(_doSwitch)\r\n        1    0.225    0.225    0.225    0.225 Renderer.pyx:43(__init__)\r\n      583    0.203    0.000    0.204    0.000 Canvas.pyx:228(text)\r\n        3    0.140    0.047    0.140    0.047 Canvas.pyx:26(__init__)\r\n     1466    0.094    0.000    0.094    0.000 Renderer.pyx:664(flip)\r\n   176443    0.075    0.000    0.096    0.000 Renderer.pyx:437(spriteDst)\r\n   176443    0.071    0.000    0.092    0.000 Renderer.pyx:441(spriteRot)\r\n   176443    0.066    0.000    0.092    0.000 Renderer.pyx:445(spriteColor)\r\n        1    0.061    0.061    0.061    0.061 SDL.pyx:20(terminateSDL)\r\n    59185    0.027    0.000    0.096    0.000 Renderer.pyx:341(addSprite)\r\n        1    0.023    0.023   90.644   90.644 {bunniesx.run}\r\n      112    0.020    0.000    0.020    0.000 Log.pyx:46(log)\r\n     1511    0.010    0.000    9.879    0.007 GameLoopBase.pyx:291(_processTask)\r\n\r\n    59185    0.009    0.000    0.009    0.000 Renderer.pyx:316(_indexSprite)\r\n     1466    0.004    0.000    9.882    0.007 GameLoopBase.pyx:169(update)<\/pre>\n<p>This clearly shows the sorry state of nVidia\u2019s binary drivers for Linux. The Windows version of the drivers are <strong>much faster<\/strong>, evidenced by the fact that on a much slower computer, the non parallelized renderer loop takes 21ms and it shows 1K more bunnies than the parallel version under Linux. In this slower system, the CPU is now the limiting factor at 61k bunnies. On my faster i5-2500K system, using Windows (non parallel version) I get 127300 bunnies (more than double the amount of bunnies using the exact same hardware). I don\u2019t have the profiling output for that one, but I believe the limit is again on the CPU.<\/p>\n<p>In the near future I plan on experimenting with <a href=\"http:\/\/zeromq.org\" target=\"_blank\">ZeroMQ<\/a> for paralellization, as OpenMP\u2019s support in Cython is seamless and very neat but it suffers from a two big problems\u2026there\u2019s very little support for it outside of Linux, and the runtime library is GPL licensed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the last few months I\u2019ve been hard at work on my Python\/Cython based game engine, Ignifuga, which I\u2019m developing alongside a mysterious game project (by the way, do you know who is the gaucho?). The rationale behind my choice of language had several aspects to it. I was already familiar with Python, it\u2019s a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[3,4,6],"tags":[],"_links":{"self":[{"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/posts\/9"}],"collection":[{"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/comments?post=9"}],"version-history":[{"count":7,"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/posts\/9\/revisions"}],"predecessor-version":[{"id":59,"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/posts\/9\/revisions\/59"}],"wp:attachment":[{"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/media?parent=9"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/categories?post=9"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mdqinc.com\/blog\/wp-json\/wp\/v2\/tags?post=9"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}