Saturday, November 17, 2007

Sure sign of a scalable program

I work for a company that develops scalable, high-performance, clustered servers.
Last month I found myself struggling to explain the difference between scalable, high-performance programs and programs written "just to work".

Developing a small program is like building a tree house, while creating a huge, multi-module, gazillion-line monster is like constructing a skyscraper. The tasks are so different that it is hard to even compare them. The same goes, I feel, for the comparison of a high-performance, clustering system and a small, stand-alone utility. At the same time, explaining why this is the case was just like explaining why water is wet. Something that you know, but cannot put into words. In other words, I was feeling stupid.

This got me thinking. If I had to name one single test that tells whether a program is good enough to work under pressure, what test would it be?

Since the Intel architecture reigns supreme these days in desktop/server rooms, let me go over a little bit of history here.
DOS and 8086 days. Computer memory lies wide open before each running program. It starts from 0 and goes up to whatever you bought. The upper memory (384KB, between 640KB and 1MB) is reserved for system usage; however, your program can still access it, and crash the computer, of course. What did "memory management" mean in those days? The OS (DOS) doesn't really manage memory. It just gives all available memory to the currently executing program. Once program A has finished, program B will be loaded into exactly the same location where A was just a moment ago. There are no restrictions on the memory programs can access. They can change DOS code or do whatever they want. Accessing another program's memory is considered rude, though. Kind of like peeing in a pool.
From the program's point of view, once memory is allocated it is always there. Its access time is fast and constant. How are the malloc()/free() functions implemented? Easy. All you have to do is find the beginning of the system memory, and that's it. The memory from the end of the program to the start of the system memory is yours to manage. The OS doesn't care what the memory manager does with it.
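To illustrate just how trivial such an allocator can be, here is a toy sketch in C of the "bump the pointer" scheme described above. The arena is faked with a static array so the sketch actually compiles today; on real DOS the bounds would be the end of the program image and the start of the reserved system area.

    #include <stddef.h>

    /* A toy bump allocator in the spirit of the DOS description above.
     * On DOS the arena would be everything between the end of the program
     * and the start of the system area; here we fake it with a static array. */
    static char arena[64 * 1024];   /* stand-in for "all memory up to 640KB" */
    static size_t used = 0;         /* current end of used memory */

    void *dos_malloc(size_t size)
    {
        if (used + size > sizeof(arena))
            return NULL;            /* out of memory: no paging to save us */
        void *p = &arena[used];
        used += size;
        return p;
    }

    /* free() in this world can be a no-op, or roll "used" back when freeing
     * the most recent allocation. DOS didn't care either way. */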

Time went by, and 640KB is not enough for everybody anymore. Each program now wants lots of memory, all for itself and forever. The 80386 had exactly the thing: virtual memory + memory protection. Each program can now run in its own memory space. You want to zero all your memory? Go wild, nobody else will be hurt. As a matter of fact, the 80286 already had preliminary support for protected mode. However, a number of mistakes, such as the inability to switch the mode on and off without losing memory contents and the inability to access the BIOS, rendered it unusable.
The 80386 provided everything needed to put each program in its own luxurious apartment, from a memory point of view:
  • 32-bit virtual memory space
  • paging
Virtual memory is simple. Each program has a memory chunk starting from 0 and going up to 2^32. Neat. There is one "problem" though: what if a computer with 4MB of RAM runs a dozen programs, each one consuming 1MB of RAM? Obviously, there is not enough RAM for all of them. Who gets the memory and who doesn't? This is where paging comes into play. Paging basically says: "I have a limited amount of physical memory, much less than I let my programs allocate. I will put some of the memory on a disk (it can be any secondary storage, but usually it is a disk) and bring it back once it is needed. If more RAM is needed, I will transfer (page out) some of the memory to the disk." This is what the OS is saying to itself. Good thing the processor is backing it up.

If a program tries to access memory that is currently on disk, the OS should get a notification that something is wrong. The hardware does exactly that, and the notification is called a "page fault" interrupt. So, your program accesses some variable that at the time happens to be on disk. The hardware recognizes this situation and raises a special interrupt. The OS gets the interrupt and loads the requested page from disk into memory. Your program is dormant at the moment and unaware of this process. The disk might be busy serving other requests, but your program doesn't care. It is in a better place right now. Probably drooling, or in Tumbolia. At the end, the page is loaded into memory and the OS continues your program from the same spot. From its point of view, the variable was accessed just like any other variable in memory.

One thing is missing from this description: how does the hardware "know" that a given address is not in memory? There has to be some sort of dialog between the OS and the hardware about which addresses are in memory and which are on disk. This is a good question, and it will be answered later.
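You can actually watch this machinery at work. The sketch below is my own little demo: it mmap()s 64MB of anonymous memory on Linux and times the first touch of every page against the second touch. These are the cheap zero-fill kind of faults, not disk reads, but the mechanism is the same one described above: the hardware faults, the OS fixes up the mapping, and the program never notices.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    static long long now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000000LL + tv.tv_usec;
    }

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;  /* 64MB, anonymous, no RAM behind it yet */
        volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        long long t0 = now_us();
        for (size_t i = 0; i < len; i += 4096)
            p[i] = 1;                   /* first touch: a page fault on every page */
        long long t1 = now_us();
        for (size_t i = 0; i < len; i += 4096)
            p[i] = 2;                   /* second pass: pages are already mapped */
        long long t2 = now_us();

        printf("first touch : %lld us (the OS handled a fault per page)\n", t1 - t0);
        printf("second touch: %lld us\n", t2 - t1);
        return 0;
    }

The first pass will typically come out several times slower than the second, even though the program contains not a single line that asks for any of this.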

As you can see, just accessing memory during normal execution can have unpredictable latency. You never know when the accessed variable is on disk and when it is in memory. Most programs live happily with this uncertainty. They want an abstraction of 4GB of flat memory and do not care what is going on under the hood. Take a web server, for example. It has to be as fast as possible, but is memory paging in and out really a bottleneck here? Here are some ping results:

> ping www.yahoo.com
PING www.yahoo-ht3.akadns.net (87.248.113.14) 56(84) bytes of data.
64 bytes from f1.us.www.vip.ird.yahoo.com (87.248.113.14): icmp_seq=1 ttl=54 time=158 ms
64 bytes from f1.us.www.vip.ird.yahoo.com (87.248.113.14): icmp_seq=2 ttl=54 time=158 ms
64 bytes from f1.us.www.vip.ird.yahoo.com (87.248.113.14): icmp_seq=3 ttl=54 time=168 ms
64 bytes from f1.us.www.vip.ird.yahoo.com (87.248.113.14): icmp_seq=4 ttl=54 time=161 ms

> ping www.google.com
PING www.l.google.com (64.233.183.99) 56(84) bytes of data.
64 bytes from nf-in-f99.google.com (64.233.183.99): icmp_seq=1 ttl=244 time=94.1 ms
64 bytes from nf-in-f99.google.com (64.233.183.99): icmp_seq=2 ttl=244 time=92.6 ms
64 bytes from nf-in-f99.google.com (64.233.183.99): icmp_seq=3 ttl=244 time=97.4 ms

> ping www.cnet.com
PING c18-ssa-xw-lb.cnet.com (216.239.122.220) 56(84) bytes of data.
64 bytes from c18-ssa-xw-lb.cnet.com (216.239.122.220): icmp_seq=1 ttl=244 time=218 ms
64 bytes from c18-ssa-xw-lb.cnet.com (216.239.122.220): icmp_seq=2 ttl=244 time=207 ms
64 bytes from c18-ssa-xw-lb.cnet.com (216.239.122.220): icmp_seq=3 ttl=244 time=206 ms

This is only the ping round-trip time. Disk access is a matter of a few ms. Do you think anyone will notice if cnet access takes 230ms instead of 208ms? Moreover, the times will only go up, as each page access has images, JavaScript and, sometimes, Flash to load. All this takes a lot of time, dwarfing paging latencies to a minuscule size.

BTW, did you notice that Google has consistently lower response times than Yahoo, and that their page is much "lighter"? Yahoo has to remedy this situation if they want to stay in the game.

However, suppose you are writing an application that has to be blazingly fast and that serves machines on a LAN, not on the Internet. Don't talk to me about bandwidth. Latency means responsiveness. If your desktop-server application is slow, it does not matter why. Your computer might be obediently trying to compress a DVD-quality movie (a task you gave it by mistake, but it's too stupid to even ask you: "Are you sure you want to compress this movie? It is incompressible, you know...") or the server might be paging like mad right now. It does not matter why your application isn't responding. It is annoying as hell. LAN client-server applications have to be fast. Mind you, not all of them, but there are some.

Take, for example, a NAS server. Basically, you ask your NAS server to read or write some data from or to a disk. If your LAN round-trip time is less than 1ms, which is usual, then the time it takes for your server to read the data dominates the latency. Let's say that is about 5ms, so a request costs roughly 6ms in total. Now suppose that in the process of reading your data the server software hits a page fault a couple of times, and the OS has to read those pages from disk into memory at about 5ms apiece. Then instead of 6ms, your request latency will be around 16ms. And this is the good case, where the disk queues were empty and the OS didn't have to write out other processes' pages first in order to bring in the NAS server's pages.

Once again: your application latency tripled just because two memory pages happened to be on disk, and your server software didn't even know about it! This problem is not disk-specific: go and read Joel Spolsky's article about The Law of Leaky Abstractions.

Such server software cannot allow itself to use the standard abstraction. It has to know exactly when and where its memory is. Now it is time to talk a little bit about malloc, free, and how your (C or C++) application gets its memory.

Remember the question: how does the hardware "know" that an address is not in memory? Well, here is how. The OS maps allocated pages using per-process page tables, and the CPU caches recently used translations in the TLB (translation lookaside buffer). Since no program really uses all 4GB of its address space (and if yours does, you should really check it for memory leaks), the OS negotiates with each program on the amount of memory it is consuming. At every point of a program's life cycle, the OS knows exactly how much memory the program uses and thus can map the appropriate number of pages. These mappings tell the hardware which pages are present and which are missing (sitting on secondary storage), so the hardware knows when to raise the "page fault" interrupt. But how does the OS negotiate with programs on the amount of used memory?
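As a small aside before we get to malloc: on Linux a program can even ask the OS which of its pages are resident right now, using the mincore(2) system call. A minimal sketch (Linux-specific):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t len = 16 * (size_t)page;

        /* Map 16 pages of anonymous, demand-paged memory. */
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        buf[0] = 1;                     /* touch only the first page */

        unsigned char vec[16];          /* one status byte per page */
        if (mincore(buf, len, vec) != 0) { perror("mincore"); return 1; }

        for (int i = 0; i < 16; i++)
            printf("page %2d: %s\n", i, (vec[i] & 1) ? "resident" : "not resident");

        munmap(buf, len);
        return 0;
    }

Typically only the touched page comes back "resident"; the other fifteen exist, as far as the program is concerned, but no RAM backs them yet.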

This is where malloc and free come in handy. From the OS point of view, each program has its memory allocated from zero up to some limit. Like this:

0--------------------------------------------v
================================================================
where "=" specifies a total memory that can be used by this program, "-" specifies memory currently used by the program (and mapped by operating system), "0" is a beginning of a memory and "v" points to the largest address used currently by the program.
If no "free" was called, i.e. all program memory is used then another malloc will move "v" to the right. If some memory was freed ("_"):

0----____---------------_____------------v
================================================================
Then the next malloc might reuse the freed memory instead of taking new memory from the OS. If new memory does have to be taken from the operating system, a special system call is invoked: sbrk. In human language it means "system break", or "move the system break", or "I need more memory! Map more memory for me! NOW!!! Move it move it move it!!!"
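You can watch the break move. A tiny sketch (Unix/Linux; note that a modern malloc may use mmap instead of sbrk for large allocations, so treat this as an illustration, not a promise):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        void *before = sbrk(0);     /* sbrk(0) just reports the current break */
        sbrk(4096);                 /* ask the OS to move "v" 4KB to the right */
        void *after = sbrk(0);

        printf("break before: %p\n", before);
        printf("break after:  %p (+%ld bytes)\n", after,
               (long)((char *)after - (char *)before));
        return 0;
    }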

So, depending on the current allocation situation, a new call to malloc might be translated into:
  1. A search for a suitable memory chunk (done by malloc rather efficiently). However, sometimes you know that your program always (or almost always) allocates memory in 4KB chunks, for example because this is the chunk size you chose for moving data. In such cases, one can write a much more efficient memory allocator than the general-case malloc (see the sketch after this list).
  2. An sbrk call, which will cause the operating system to map more pages. This is problematic, since you have already wasted some time looking for a suitable chunk, and doing a system call now adds even more to the malloc latency.
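For the curious, here is roughly what such a special-purpose allocator could look like: a free list of fixed 4KB chunks, where both allocation and release are O(1) pointer moves with no searching at all. This is a sketch only; it is not thread-safe, and the pool_* names are mine, not any real API.

    #include <stddef.h>
    #include <stdlib.h>

    #define CHUNK_SIZE  4096    /* the fixed allocation size we optimize for */
    #define POOL_CHUNKS 1024    /* grabbed from the OS once, tuned per workload */

    /* Free chunks are chained through their own first bytes: no per-chunk
     * metadata and no searching, just a couple of pointer moves. */
    typedef union chunk {
        union chunk *next;
        char data[CHUNK_SIZE];
    } chunk_t;

    static chunk_t *free_list;

    int pool_init(void)
    {
        chunk_t *pool = malloc(POOL_CHUNKS * sizeof(chunk_t)); /* one big request */
        if (!pool)
            return -1;
        for (size_t i = 0; i < POOL_CHUNKS; i++) {  /* thread all chunks together */
            pool[i].next = free_list;
            free_list = &pool[i];
        }
        return 0;
    }

    void *pool_alloc(void)      /* O(1): pop the head of the free list */
    {
        chunk_t *c = free_list;
        if (c)
            free_list = c->next;
        return c;
    }

    void pool_free(void *p)     /* O(1): push the chunk back on the list */
    {
        chunk_t *c = p;
        c->next = free_list;
        free_list = c;
    }

The point is that the allocator never goes to the OS on the hot path: all the sbrk/mmap traffic happens once, in pool_init.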
This means that if you must control the latency of your program in general, and of memory allocations in particular, then you have to:
  1. "Pin" all your memory so that it won't be paged out (see the sketch after this list). This is the most important step.
  2. Pre-allocate all the memory that you will ever need. This step is tricky. Don't allocate too much, as you will bring other programs to their knees, but don't allocate too little either, as it won't be enough.
  3. Write your own, customized memory allocator. Sounds like a lot of work that has already been done by other people? True, it is a lot of work. However, you will be surprised what performance gain you can get from this thing, something completely unrelated to your program's domain. Sadly, I cannot put numbers here, since they are confidential, but let me say that they are tens of percents. And the behavior becomes smoother. Everybody loves smooth behavior.
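To make steps 1 and 2 concrete, here is a minimal Linux sketch using mlockall(2). The 256MB budget is a number I made up for the example; a real server would size it from its configuration. Note that locking memory requires privileges or a generous RLIMIT_MEMLOCK.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Step 2: pre-allocate everything we will ever need, up front. */
        size_t budget = 256 * 1024 * 1024;
        char *arena = malloc(budget);
        if (!arena) { perror("malloc"); return 1; }

        /* Touch every page so it is actually backed by RAM, not merely promised. */
        memset(arena, 0, budget);

        /* Step 1: pin it all. MCL_CURRENT locks what we have now,
         * MCL_FUTURE locks anything we map later. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* Step 3: hand "arena" to a custom allocator (like the pool sketch
         * above) instead of calling malloc on the hot path. */
        puts("memory pre-allocated and pinned; no page faults on the hot path");
        return 0;
    }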
Back to the beginning. If I had to name the one sure sign of a scalable, high-performance program, what would it be? If the program has its own custom-made, carefully crafted memory allocator, then it is a scalable, high-performance program. Or at the very least, the programmers seriously considered making one. If no thought was given to memory allocation, then the program was written "just to work". This is not a bad thing, by the way.
