Last month I have found myself struggling to explain what is the difference between scalable, high-performance programs and programs written "just to work".
Developing a small program is like building a tree house, while creating huge, multi-modules and gazillion lines monster is like constructing a skyscraper. The tasks are so different, that it is hard to even compare them. The same goes, so I feel, to the comparison of a high-performance, clustering system and small, stand-alone utility. At the same time, explaining why this is the case was just like explaining why the water is wet. Something that you know, but cannot put into word. In other words, I was a feeling stupid.
This got me thinking. If I have to put one single test that the program is good enough to work under pressure. What test will it be?
Since Intel architecture rules supreme now in dekstop/server rooms, let me go over a little bit of history here.
DOS and 8086 days. Computer memory is wide open before each running program. It starts from 0 and goes to whatever you bought. The upper memory (384KB) is reserved for system usage, however, your program still can access it and crash the computer, of course. What is "memory management" in those days? OS (DOS) doesn't really manage memory. It just gives all available memory to the currently executing program. Once program A has finished, program B will be loaded into exactly same location where A has been just a moment ago. There are no restrictions on the memory programs can access. They can change DOS code or do whatever they want. Accessing other's memory is considered rude thought. Kind of like peeing in a pool.
From program point of view, once a memory is allocated it is always there. It's access is fast and constant. How is malloc()/free() functions implemented? Easy. All you have to do is to find the beginning of the system memory and that's it. The memory from the end of the program to the start of system memory is yours to manage. OS doesn't care what memory manager does with the memory.
Time went by and 640KB is not enough for everybody anymore. Each program now wants lots of memory all for itself and forever. 80386 had exactly the thing: virtual memory + memory protection. Each program can now run in its own memory space. You want to zero all your memory? Go wild, nobody else will be hurt. As a matter of fact, 80286 had preliminary support for protected mode. However, due to a number of mistakes, such as inability to switch it on and off without loosing memory contents, and inability to access BIOS, rendered the mode unusable.
80386 provided everything needed to put each program in its own luxurious apartment from a memory point of view:
- 32-bit virtual memory space
As you can see, just accessing a memory during a normal execution can have unpredictable latency. You just never know when the accessed variable is on disk and when it is in the memory. Most of the programs live happily with this uncertainty. They want an abstraction of 4GB of flat memory and do not care what is going on under the hood. Take a web server for example. It has to be as fast as possible, but does the memory paging in and out is really a bottle neck here? Here are some ping results:
> ping www.yahoo.com PING www.yahoo-ht3.akadns.net (188.8.131.52) 56(84) bytes of data.
64 bytes from f1.us.www.vip.ird.yahoo.com (184.108.40.206): icmp_seq=1 ttl=54 time=158 ms 64 bytes from f1.us.www.vip.ird.yahoo.com (220.127.116.11): icmp_seq=2 ttl=54 time=158 ms 64 bytes from f1.us.www.vip.ird.yahoo.com (18.104.22.168): icmp_seq=3 ttl=54 time=168 ms 64 bytes from f1.us.www.vip.ird.yahoo.com (22.214.171.124): icmp_seq=4 ttl=54 time=161 ms
> ping www.google.com PING www.l.google.com (126.96.36.199) 56(84) bytes of data.
64 bytes from nf-in-f99.google.com (188.8.131.52): icmp_seq=1 ttl=244 time=94.1 ms
64 bytes from nf-in-f99.google.com (184.108.40.206): icmp_seq=2 ttl=244 time=92.6 ms
64 bytes from nf-in-f99.google.com (220.127.116.11): icmp_seq=3 ttl=244 time=97.4 ms
> ping www.cnet.com PING c18-ssa-xw-lb.cnet.com (18.104.22.168) 56(84) bytes of data. 64 bytes from c18-ssa-xw-lb.cnet.com (22.214.171.124): icmp_seq=1 ttl=244 time=218 ms 64 bytes from c18-ssa-xw-lb.cnet.com (126.96.36.199): icmp_seq=2 ttl=244 time=207 ms 64 bytes from c18-ssa-xw-lb.cnet.com (188.8.131.52): icmp_seq=3 ttl=244 time=206 ms
BTW, did you notice that Google has a constantly lower response times than Yahoo and their page is much "lighter". Yahoo has to remedy this situation if they want to stay in the game.
However, if you are writing an application that has to be blazing fast and that serves machines on LAN, not on the Internet. Don't talk to me about bandwidth. Latency means responsiveness. If your desktop-server application works slow, it does not matter why it happens. Your computer might be obediently trying to compress DVD quality movie (a task you gave it by mistake, but its too stupid to even ask you: "Are you sure you want to compress this movie? It is incompressible you know...") or the server is paging like mad right now. It does not matter why your application isn't responding. It is annoying like hell. LAN client-server applications have to be fast. Mind you, not all of them, but there are some.
Take for example, NAS server. Basically, you ask your NAS server to read or write some data from or to a disk. If your LAN round trip time is less than 1ms, which is usual, then the time it takes for your server to read the data is the latency. Lets say it is about 5ms. Therefore, if in the process of reading your data, server software hits page fault a couple of times and OS reads pages from disk to the memory a couple of times. Then instead of 6ms, your request latency will be around 16ms. This is the good case, where disk queues were empty and OS didn't have to write other processes pages in order to bring NAS server pages.
Once again. Your application latency tripled itself just because two memory pages were on disk and your server software didn't even know about it!!! This problem is not disk specific: go and read Joel Spolsky article about The Law of Leaky Abstractions.
This server software cannot allow itself to use standard abstraction. It has to know exactly when and where its memory is. Now it is time to talk a little bit about malloc, free and how your application (C and C++) gets its memory.
Remember the question: how the hardware "knows" that this address is not in the memory? Well, here is how. OS maps allocated pages using TLB (translation lookaside buffer). Since now all programs use 4GB of memory and if your's does then you should really check it for memory leaks, OS negotiates with programs on amount of memory they are consuming. At every point of a program life cycle, OS knows exactly how much memory the program uses and thus, can map the appropriate amount of pages. These mappings will explain to hardware what pages are present and what pages are missing (on a secondary storage). Thus, hardware will know when to raise "page fault" interrupt. But how does OS negotiates with programs regarding the amount of used memory?
This is where malloc and free come in handy. From OS point of view, each program has its memory allocated from zero up to some limit. Like this:
where "=" specifies a total memory that can be used by this program, "-" specifies memory currently used by the program (and mapped by operating system), "0" is a beginning of a memory and "v" points to the largest address used currently by the program.
If no "free" was called, i.e. all program memory is used then another malloc will move "v" to the right. If some memory was freed ("_"):
Then the next malloc might use freed memory instead of taking new memory from OS. If a new memory has to be taken from operating system a special system call is invoked: sbrk. In human it means system break or move the system break or i need more memory! map more memory for me! NOW!!! move it move it move it move it!!!
So, depending on the current allocation situation, a new call to malloc might be translated into:
- Search for a suitable memory chunk (done by malloc rather efficiently). However, sometimes you know that your program always (or almost always) allocates memory in 4KB chunks. For example, because this is the chunk size you chose to move data. In those cases, one can write much more efficient memory allocator than general-case malloc.
- Do a sbrk call, which will cause operating system to map more pages. This is problematic, since you have already wasted some time looking for a suitable page. Doing a system call now is gonna add more to malloc latency.
- "Pin" all your memory such that it wont be paged out. This is the most important step.
- Pre-allocate all the memory that you will ever need. This step is tricky. Don't allocate too much, as you will bring other programs to their knees, but also don't allocate too little, as it wont be enough.
- Write your own, customized memory allocator. Sounds like a lot of work that already has been done by other people. True. It is a lot of work. However, you will be surprised what performance gain you can get from this thing. Something completely unrelated to your program domain. Sadly I cannot put numbers here, since they are confidential, but let me say that they are tens of percents. And the behavior smoother. Everybody loves smooth behavior.