Wednesday, December 26, 2007

Serious != Professional

Compare those two code snippets:

bool operator==(const AClass& first, const AClass& second) {
return (first.weight == second.weight) && (first.value == second.value);


bool operator==(const AClass& apple, const AClass& orange) {
return (apple.weight == orange.weight) && (apple.value == orange.value);

The meaning is the same, but the second one actually compares oranges to apples!!! Geek humor, weeee!!!

Or consider a description of those two use cases in design document:

The user should be able to save custom records in the system and retrieve them at any time. Records should be kept on non-volatile, persistent storage. Retrieval rates should be less than one second for the first item. If a number of items were saved together, then subsequent items within a set should be retrieved within 0.1 sec.


Our prospect customer, Moses (Let My People Go), wishes to keep 10 commandments persistent. Since his last accident with stone scrolls, he wishes to give a shot to our system. Apart from making his record persistent, he wishes to be able to retrieve any commandment in less than one second, in case he forgets it and is wondering whether he is allowed to do something. If he saved a set of commandments, then the subsequent ones should be retrieved within 0.1 sec. To keep the flow.

Granted, the second one has more words: 58 in the first version vs 86 in the second one. On the other hand it is easier to read and it keeps you awake, while the first version actually puts the reader mind in a dormant state. I know this state. This is when my eyes read the document, the words pass through my brain without stopping for a cup of coffee or a pee. If the document is important, then I have to read it a number of times to get meaning to stay.

The second version on the other hand, is so ridiculous that it is an easy read and something stays in your mind. I think the reason something stays is because it is much easier to imagine Moses trying to save 10 commandments, rather than a "customer" wishing to save his "record". God knows what this records hold.

Another comparison:


They communicate the same message, but the second one is funnier. In a not-funny-joke way. The important thing to notice here is that the task of understanding the design/code/diagram is hard enough by itself. If the means can be made lighter, then the reader can concentrate on the task at hand.

The point I'm getting to is that things might be funny, but it does not make them content-empty or unprofessional. Many people feel that they have to be serious to be taken seriously. This is not the case. One should enjoy his work and have fun doing it. This is the way to do it better than good, this is a way to do it great.

You should have fun while doing your job and the best and easiest way to have fun is to crack a joke once in a while.

Saturday, November 17, 2007

Sure sigh of a scalable program

I work for a company that develops scalable, high-performance, clustered server.
Last month I have found myself struggling to explain what is the difference between scalable, high-performance programs and programs written "just to work".

Developing a small program is like building a tree house, while creating huge, multi-modules and gazillion lines monster is like constructing a skyscraper. The tasks are so different, that it is hard to even compare them. The same goes, so I feel, to the comparison of a high-performance, clustering system and small, stand-alone utility. At the same time, explaining why this is the case was just like explaining why the water is wet. Something that you know, but cannot put into word. In other words, I was a feeling stupid.

This got me thinking. If I have to put one single test that the program is good enough to work under pressure. What test will it be?

Since Intel architecture rules supreme now in dekstop/server rooms, let me go over a little bit of history here.
DOS and 8086 days. Computer memory is wide open before each running program. It starts from 0 and goes to whatever you bought. The upper memory (384KB) is reserved for system usage, however, your program still can access it and crash the computer, of course. What is "memory management" in those days? OS (DOS) doesn't really manage memory. It just gives all available memory to the currently executing program. Once program A has finished, program B will be loaded into exactly same location where A has been just a moment ago. There are no restrictions on the memory programs can access. They can change DOS code or do whatever they want. Accessing other's memory is considered rude thought. Kind of like peeing in a pool.
From program point of view, once a memory is allocated it is always there. It's access is fast and constant. How is malloc()/free() functions implemented? Easy. All you have to do is to find the beginning of the system memory and that's it. The memory from the end of the program to the start of system memory is yours to manage. OS doesn't care what memory manager does with the memory.

Time went by and 640KB is not enough for everybody anymore. Each program now wants lots of memory all for itself and forever. 80386 had exactly the thing: virtual memory + memory protection. Each program can now run in its own memory space. You want to zero all your memory? Go wild, nobody else will be hurt. As a matter of fact, 80286 had preliminary support for protected mode. However, due to a number of mistakes, such as inability to switch it on and off without loosing memory contents, and inability to access BIOS, rendered the mode unusable.
80386 provided everything needed to put each program in its own luxurious apartment from a memory point of view:
  • 32-bit virtual memory space
  • paging
Virtual memory is simple. Each program has a memory chunk starting from 0 and going until 2^32. Neat. There is one "problem" though: what if a computer with 4MB RAM runs a dozen of program each one consuming 1MB of RAM. Obviously, there is not enough RAM for all of them. Who gets the memory and who doesn't? This is where paging comes into play. Paging basically says: "I have a limited amount of physical memory, much less than I give my programs to allocate. I will put some of the memory on a disk (it can be any secondary storage, but usually it is a disk) and once it is needed I will bring it back. If more RAM memory is needed, I will transfer (page out) some of the memory to the disk." This is what OS is saying to itself. Good thing, the processor is backing it up. If the program is trying to access a memory that is currently on disk, the OS should get a notification that something is wrong. Hardware is doing it and the notification is called "page fault" interrupt. So, your program is accessing some variable that at the time happens to be on a disk. Hardware recognize this situation and raises a special interrupt. OS gets the interrupt and loads the requested page from a disk to a memory. Your program is dormant at the moment and unaware of this process. The disk might be busy and serving other requests, but your program doesn't care about it. It is in a better place right now. Probably drooling or in Tumbolia. At the end, the page is loaded to the memory and OS continues your program from the same spot. From it's point of view, the variable was accessed just like any other variable in memory. One thing is missing from this description: how the hardware "knows" that this address is not in the memory? It has to be some sort of dialog between OS and hardware on what addresses in memory and what addresses on a disk. This is a good question and will be answered later.

As you can see, just accessing a memory during a normal execution can have unpredictable latency. You just never know when the accessed variable is on disk and when it is in the memory. Most of the programs live happily with this uncertainty. They want an abstraction of 4GB of flat memory and do not care what is going on under the hood. Take a web server for example. It has to be as fast as possible, but does the memory paging in and out is really a bottle neck here? Here are some ping results:

> ping PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=54 time=158 ms 64 bytes from ( icmp_seq=2 ttl=54 time=158 ms 64 bytes from ( icmp_seq=3 ttl=54 time=168 ms 64 bytes from ( icmp_seq=4 ttl=54 time=161 ms

> ping
PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=244 time=94.1 ms
64 bytes from ( icmp_seq=2 ttl=244 time=92.6 ms

64 bytes from ( icmp_seq=3 ttl=244 time=97.4 ms

> ping
PING ( 56(84) bytes of data. 64 bytes from ( icmp_seq=1 ttl=244 time=218 ms 64 bytes from ( icmp_seq=2 ttl=244 time=207 ms 64 bytes from ( icmp_seq=3 ttl=244 time=206 ms

This is only a ping trip time. Disk access is a matter of a few ms. Do you think that anyone will notice if cnet access is 230ms instead of 208ms? Moreover, the times will go up, as each page access has images, javascript and sometimes, flash to load. All this takes a lot of time, dwarfing paging latencies to a minuscule size.

BTW, did you notice that Google has a constantly lower response times than Yahoo and their page is much "lighter". Yahoo has to remedy this situation if they want to stay in the game.

However, if you are writing an application that has to be blazing fast and that serves machines on LAN, not on the Internet. Don't talk to me about bandwidth. Latency means responsiveness. If your desktop-server application works slow, it does not matter why it happens. Your computer might be obediently trying to compress DVD quality movie (a task you gave it by mistake, but its too stupid to even ask you: "Are you sure you want to compress this movie? It is incompressible you know...") or the server is paging like mad right now. It does not matter why your application isn't responding. It is annoying like hell. LAN client-server applications have to be fast. Mind you, not all of them, but there are some.

Take for example, NAS server. Basically, you ask your NAS server to read or write some data from or to a disk. If your LAN round trip time is less than 1ms, which is usual, then the time it takes for your server to read the data is the latency. Lets say it is about 5ms. Therefore, if in the process of reading your data, server software hits page fault a couple of times and OS reads pages from disk to the memory a couple of times. Then instead of 6ms, your request latency will be around 16ms. This is the good case, where disk queues were empty and OS didn't have to write other processes pages in order to bring NAS server pages.

Once again. Your application latency tripled itself just because two memory pages were on disk and your server software didn't even know about it!!! This problem is not disk specific: go and read Joel Spolsky article about The Law of Leaky Abstractions.

This server software cannot allow itself to use standard abstraction. It has to know exactly when and where its memory is. Now it is time to talk a little bit about malloc, free and how your application (C and C++) gets its memory.

Remember the question: how the hardware "knows" that this address is not in the memory? Well, here is how. OS maps allocated pages using TLB (translation lookaside buffer). Since now all programs use 4GB of memory and if your's does then you should really check it for memory leaks, OS negotiates with programs on amount of memory they are consuming. At every point of a program life cycle, OS knows exactly how much memory the program uses and thus, can map the appropriate amount of pages. These mappings will explain to hardware what pages are present and what pages are missing (on a secondary storage). Thus, hardware will know when to raise "page fault" interrupt. But how does OS negotiates with programs regarding the amount of used memory?

This is where malloc and free come in handy. From OS point of view, each program has its memory allocated from zero up to some limit. Like this:

where "=" specifies a total memory that can be used by this program, "-" specifies memory currently used by the program (and mapped by operating system), "0" is a beginning of a memory and "v" points to the largest address used currently by the program.
If no "free" was called, i.e. all program memory is used then another malloc will move "v" to the right. If some memory was freed ("_"):

Then the next malloc might use freed memory instead of taking new memory from OS. If a new memory has to be taken from operating system a special system call is invoked: sbrk. In human it means system break or move the system break or i need more memory! map more memory for me! NOW!!! move it move it move it move it!!!

So, depending on the current allocation situation, a new call to malloc might be translated into:
  1. Search for a suitable memory chunk (done by malloc rather efficiently). However, sometimes you know that your program always (or almost always) allocates memory in 4KB chunks. For example, because this is the chunk size you chose to move data. In those cases, one can write much more efficient memory allocator than general-case malloc.
  2. Do a sbrk call, which will cause operating system to map more pages. This is problematic, since you have already wasted some time looking for a suitable page. Doing a system call now is gonna add more to malloc latency.
This means that if you must control the latency of your program in general and memory allocations in particular, then you have to:
  1. "Pin" all your memory such that it wont be paged out. This is the most important step.
  2. Pre-allocate all the memory that you will ever need. This step is tricky. Don't allocate too much, as you will bring other programs to their knees, but also don't allocate too little, as it wont be enough.
  3. Write your own, customized memory allocator. Sounds like a lot of work that already has been done by other people. True. It is a lot of work. However, you will be surprised what performance gain you can get from this thing. Something completely unrelated to your program domain. Sadly I cannot put numbers here, since they are confidential, but let me say that they are tens of percents. And the behavior smoother. Everybody loves smooth behavior.
Back to beginning. If I had to give one the sign of scalable, high-performance program what it will be? If the program has its custom-made, crafted memory allocator, then it is scalable, high-performance program. Or at the very least, programmers seriously considered making one. If no thoughts were given to memory allocation, then the program is written "just to work". This is not a bad thing by the way.