P3L2 : Memory Management #


Goals #

  • In order to manage physical memory, the operating system must be able to allocate physical memory and also arbitrate how it is accessed.
    • Notice that the virtual address space visible to processes can be larger than the actual physical memory; the OS can swap memory out to disk and read it back when needed.
  • Allocation
    • track how memory is used and what memory is free;
    • replace the contents of physical memory with contents from disk.
  • Arbitration
    • quickly translate and validate virtual addresses.
      • This relies on a combination of hardware support and software implementation.

Important Terms #

  • Logical Versus Physical Address Space
    • An address generated by the CPU is commonly referred to as a logical address, whereas an address seen by the memory unit (the one loaded into the memory-address register) is referred to as a physical address.
    • Binding addresses at either compile or load time generates identical logical and physical addresses, whereas the execution-time address-binding scheme results in differing logical and physical addresses. In that case we usually refer to the logical address as a virtual address, and the two terms are used interchangeably here.
    • The set of all logical addresses generated by a program is a logical address space; the set of all physical addresses corresponding to these logical addresses is a physical address space. Thus, in the execution-time address-binding scheme, the logical and physical address spaces differ.

Page-based memory management #

  • The virtual address space is subdivided into fixed-size chunks called pages. The physical memory is divided into page frames of the same size.
  • Allocation: mapping pages -> page frames;
  • Arbitration: page tables.
  • Another way to decouple virtual/physical memory is segmentation: segment-based memory management.
    • Allocation: flexibly-sized segments mapping to physical memory
    • Arbitration: segment registers supported by hardware

Hardware support for page-based memory management #

  • In each CPU package, there is a memory management unit (MMU) which is responsible for
    • translating virtual addresses to physical addresses;
    • reporting faults, such as illegal accesses: an attempt to access unallocated memory, insufficient permission to perform a certain access, or a requested page that is not present in memory and needs to be fetched from disk.
  • Hardware also supports memory management with designated registers that assist translation, e.g., a register pointing to the page table.
    • In segment-based management, registers can hold the base address and size of a segment, as well as the total number of segments.
  • Most MMUs integrate a small cache of valid virtual-to-physical address translations, called the translation lookaside buffer (TLB), to make translation faster.
  • Finally, the actual translation is done by hardware. The OS maintains the page table, but the hardware performs the translation. This also implies that the hardware dictates which memory management modes are supported.

Page Tables #

  • The page table tells the system where to find the physical address associated with the virtual address.
  • The sizes of the pages in virtual memory are identical to the sizes of the page frames in physical memory. By keeping these sizes the same, the operating system does not need to manage a translation for every single virtual address within a page. Instead, it only needs to translate the first virtual address in a page to the first physical address in a page frame; the rest of the addresses in the page map to the corresponding offsets in the page frame. As a result, the number of entries in the page table is reduced.
  • What this means is that only the first portion of the virtual address is used to index into the page table. We call this part of the address the virtual page number (VPN), and the rest of the virtual address is the offset. The VPN is used as an index into the page table, which produces the physical frame number (PFN), the first physical address of the frame in DRAM. To complete the full translation, the PFN is combined with the offset in the latter part of the virtual address to produce the actual physical address (see the sketch below).
    • VPN + offset is the address we have in the program, and PFN + offset is the actual address in physical memory.
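
A minimal sketch of this split-and-combine translation, assuming 32-bit virtual addresses, 4KB pages, and a hypothetical single-level page table kept as a flat array:

```c
#include <stdint.h>

#define PAGE_SHIFT  12                        /* 4KB pages -> 12 offset bits */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

/* hypothetical flat page table: page_table[vpn] holds the PFN */
static uint32_t page_table[1u << 20];         /* 2^20 entries for 32-bit addresses */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;    /* upper 20 bits index the table */
    uint32_t offset = vaddr & OFFSET_MASK;    /* lower 12 bits pass through unchanged */
    uint32_t pfn    = page_table[vpn];
    return (pfn << PAGE_SHIFT) | offset;      /* frame base address + offset */
}
```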

An example #

  • Let’s say we want to initialize an array for the very first time. We have already allocated the memory for that array into the virtual address space for the process, we have just never accessed it before. Since this portion of the address space has not been accessed before, the operating system has not yet allocated memory for it.
  • What will happen the first time we access this memory is that the operating system will realize that there isn’t physical memory that corresponds to this range of virtual memory addresses, so it will take a free page of physical memory, and create a page table entry linking the two.
  • Physical memory is only allocated when the process first tries to access it. This is called allocation on first touch. The point is to make sure that physical memory is allocated only when it is really needed; programmers sometimes create data structures that they never actually use.
  • If a process hasn’t used some of its memory pages for a long time, it is possible that those pages will be reclaimed. The contents will no longer be present in physical memory. They will be swapped out to disk, and some other content will end up in the corresponding physical memory slot.
  • In order to detect this, page table entries have a valid bit that indicates whether the mapping is valid: 1 is valid, 0 is not. When the MMU sees that the bit is 0 on an access, it raises a fault. If the OS then decides to allow the access, the mapping is re-established; however, the physical address will likely be different, so the page table must be updated to map the virtual address to the new physical address.
  • In summary, the operating system creates a page table for every process it runs. Whenever a context switch is performed, the operating system swaps in the page table associated with the new process. Hardware assists with page table accesses by maintaining a register that points to the active page table (see the sketch below).
    • On x86 platforms, this register is the CR3 register.
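
A minimal sketch of that context-switch step on x86 (ring 0). The `mov %cr3` instruction is real; the surrounding function name and calling convention are illustrative:

```c
#include <stdint.h>

/* point CR3 at the physical address of the incoming process's page table;
   writing CR3 also flushes non-global TLB entries as a side effect */
static inline void load_page_table(uint32_t page_table_phys) {
    __asm__ volatile("mov %0, %%cr3" : : "r"(page_table_phys) : "memory");
}
```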

Page Table Entry(PTE) #

  • 32 bits in total for each entry.

    • A present bit (P) indicates whether this page is in physical memory or on disk (i.e., it has been swapped out).
    • A protection bit (R/W) indicates whether the page can be written to, or only read from.
    • A dirty bit (D) indicates whether the page has been modified since it was brought into memory.
    • A reference bit, a.k.a. accessed bit (A), tracks whether a page has been accessed; it is useful in determining which pages are popular and thus should be kept in memory.
    • A user/supervisor bit (U/S) determines whether user-mode processes can access the page.
    • A few bits (PWT, PCD, PAT, and G) determine how hardware caching works for these pages.
    • Finally, the page frame number (PFN) itself.
  • The MMU uses the page table entry not just to perform the address translation; it also relies on these bits to determine the validity of the access. If the hardware determines that a physical memory access cannot be performed, it causes a page fault (see the sketch at the end of this section).

  • If this happens, then the CPU will place an error code on the stack of the kernel, and it will generate a trap into the OS kernel, which will invoke the page fault handler. This handler determines the action to take based on the error code and the faulting address.

  • The error code carries information such as whether the page was not present and needs to be brought in from disk, or whether some permission protection was violated, which is why the page access is forbidden.

  • On x86 platforms, the error code is generated from some of the flags in the page table entry and the faulting address is stored in the CR2 register.
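
The flag positions below are the actual x86 32-bit PTE layout; the `access_ok` helper is only a sketch of the kind of check the MMU performs in hardware:

```c
#include <stdint.h>

#define PTE_P   (1u << 0)          /* present */
#define PTE_RW  (1u << 1)          /* read/write */
#define PTE_US  (1u << 2)          /* user/supervisor */
#define PTE_PWT (1u << 3)          /* page-level write-through */
#define PTE_PCD (1u << 4)          /* page-level cache disable */
#define PTE_A   (1u << 5)          /* accessed */
#define PTE_D   (1u << 6)          /* dirty */
#define PTE_PAT (1u << 7)          /* page attribute table index */
#define PTE_G   (1u << 8)          /* global */
#define PTE_PFN_MASK 0xFFFFF000u   /* top 20 bits hold the PFN */

/* sketch of the validity check performed before translating an access */
static int access_ok(uint32_t pte, int is_write, int is_user) {
    if (!(pte & PTE_P))              return 0;   /* not present: page fault */
    if (is_write && !(pte & PTE_RW)) return 0;   /* write to a read-only page */
    if (is_user && !(pte & PTE_US))  return 0;   /* user access to a kernel page */
    return 1;
}
```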

Page Table Size #

  • For a 32-bit architecture, each PTE is 4 bytes (32 bits); let's use the common page size of 4KB (2^12 bytes). Notice that a 2^12-byte page means the page offset in the virtual address is 12 bits, which addresses the 2^12 bytes within a page.
    • To represent a 4GB(2^32 bytes) memory, we need 2^32 / 2^12 = 2^20 entries. Each PTE is 4 bytes. In total, we need 2^22 bytes = 4MB memory.
  • For a 64-bit architecture, each PTE is 8 bytes (64 bits); again use the common 4KB (2^12-byte) page size.
    • To cover the full 2^64-byte address space, we need 2^64 / 2^12 = 2^52 entries. Each PTE is 8 bytes. In total, we need 2^55 bytes = 32PB of memory (worked through in the sketch below).
  • Remember that page tables are a per-process allocation.
  • It is important to know that a process will not use all of the theoretically available virtual memory. Even on 32-bit architectures, not all of the 4GB is used by every type of process. The problem is that the page table assumes that there is an entry for every VPN, regardless of whether the VPN is needed by the process or not. This is unnecessarily large.
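
The arithmetic above, worked out in a few lines of C (the page size and PTE widths are the assumed common values from the text):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 32-bit: 2^(32-12) = 2^20 entries x 4 bytes = 4MB per process */
    uint64_t entries32 = 1ull << (32 - 12);
    printf("32-bit: %llu entries, %llu MB\n",
           (unsigned long long)entries32,
           (unsigned long long)((entries32 * 4) >> 20));

    /* 64-bit: 2^(64-12) = 2^52 entries x 8 bytes = 2^55 bytes = 32PB per process */
    uint64_t entries64 = 1ull << (64 - 12);
    printf("64-bit: 2^52 entries, %llu PB\n",
           (unsigned long long)((entries64 * 8) >> 50));
    return 0;
}
```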

Multi-level Page Tables #

  • One way to solve the problem is to use a two-level paging algorithm (hierarchical paging), in which the page table itself is also paged.

    • An internal page table only exists for virtual memory regions that are valid.
    • When a process requests more memory via malloc, the OS will check for or create another internal page table, add a new entry in the outer page table, and map that entry to the new virtual memory region through the internal page table.
  • For example, consider again the system with a 32-bit logical address space and a page size of 4KB. A logical address is divided into a page number consisting of 20 bits and a page offset consisting of 12 bits. Because we page the page table, the page number is further divided into a 10-bit page number and a 10-bit page offset. Thus, a logical address is split as p_1 (10 bits) | p_2 (10 bits) | d (12 bits):

    • Different ways to look at this:

      1. From p_1's perspective, p_2 + d is the offset, meaning each outer entry covers 2^22 bytes = 4MB and there are 2^10 = 1K entries in the outer table.
      2. From p_2's perspective, d is the offset, meaning each inner entry covers 2^12 bytes = 4KB, and there are 2^10 x 2^10 = 1M potential inner entries.
      3. The 1K outer entries exist no matter what, but the 1M entries in the internal tables don't have to.
    • where p_1 is an index into the outer page table and p_2 is the displacement within the page of the inner page table (the full walk is sketched in code at the end of this section).

      • Because address translation works from the outer page table inward, this scheme is also known as a forward-mapped page table.
    • Each inner page table covers 2^10 (entries) x 2^12 (page size) = 2^22 bytes = 4MB of the address space, matching the 4MB covered by each outer entry.

  • For a system with a 64-bit logical address space, a two-level paging scheme is no longer appropriate. Assuming a 4KB (2^12-byte) page size, the outer page table would have 2^42 entries, so we need to divide further. Even if we divide the page table into three levels, the outer page table is still 2^34 bytes (16GB) in size. The 64-bit UltraSPARC would require seven levels of paging, which is why hierarchical page tables are generally considered inappropriate for 64-bit architectures.

    • More levels also mean more memory accesses are required per translation (a walk sketch follows).
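
A sketch of the forward-mapped walk for the 10/10/12 split above, assuming the hypothetical structures below (a real MMU does this in hardware):

```c
#include <stddef.h>
#include <stdint.h>

#define P1_BITS     10
#define P2_BITS     10
#define OFFSET_BITS 12

typedef struct { uint32_t pfn; int present; } pte_t;

/* outer table: 1K pointers to inner tables; NULL means the 4MB region is unmapped */
static pte_t *outer_table[1u << P1_BITS];

/* returns 0 and fills *paddr on success; -1 stands in for a page fault */
int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t p1 = vaddr >> (P2_BITS + OFFSET_BITS);
    uint32_t p2 = (vaddr >> OFFSET_BITS) & ((1u << P2_BITS) - 1);
    uint32_t d  = vaddr & ((1u << OFFSET_BITS) - 1);

    pte_t *inner = outer_table[p1];
    if (inner == NULL || !inner[p2].present)
        return -1;                         /* no inner table, or page not mapped */
    *paddr = (inner[p2].pfn << OFFSET_BITS) | d;
    return 0;
}
```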

Page Table Cache (TLB) #

  • A single memory reference requires one access to the page table entry (four accesses if the table is four-level) plus one access to the actual memory location. To avoid this slowdown, the standard technique is caching.
  • In most architectures, the MMU integrates a hardware cache of translations called the TLB, the Translation Lookaside Buffer (sketched below).
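
A minimal sketch of a direct-mapped TLB; the size and structure are illustrative, as real TLBs are typically set-associative and managed in hardware:

```c
#include <stdint.h>

#define TLB_SLOTS 64   /* hypothetical capacity */

struct tlb_entry { uint32_t vpn; uint32_t pfn; int valid; };
static struct tlb_entry tlb[TLB_SLOTS];

/* on a hit, fill *pfn and skip the page table walk entirely */
int tlb_lookup(uint32_t vpn, uint32_t *pfn) {
    struct tlb_entry *e = &tlb[vpn % TLB_SLOTS];
    if (e->valid && e->vpn == vpn) { *pfn = e->pfn; return 1; }
    return 0;   /* miss: walk the page table, then call tlb_fill() */
}

void tlb_fill(uint32_t vpn, uint32_t pfn) {
    struct tlb_entry *e = &tlb[vpn % TLB_SLOTS];
    e->vpn = vpn; e->pfn = pfn; e->valid = 1;
}
```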

Inverted Page Tables #

  • One of the drawbacks of the multi-level method is that each page table may consist of millions of entries. These tables may consume large amounts of physical memory just to keep track of how the rest of physical memory is being used.

  • To solve this problem, we can use an inverted page table. An inverted page table has one entry for each real page (or frame) of memory. Each entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns the page. Thus, only one page table is in the system, and it has only one entry for each page of physical memory.

  • To illustrate this method, we describe a simplified version of the inverted page table used in the IBM RT. Each virtual address in the system consists of a triple: <process-id, page-number, offset>.

  • Each inverted page-table entry is a pair <process-id, page-number> where the process-id assumes the role of the address-space identifier. When a memory reference occurs, part of the virtual address, consisting of <process-id, page-number>, is presented to the memory subsystem. The inverted page table is then searched for a match. If a match is found (say, at entry i), then the physical address <i, offset> is generated. If no match is found, then an illegal address access has been attempted (see the search sketch at the end of this section).

  • Although this scheme decreases the amount of memory needed to store each page table, it increases the time needed to search the table when a page reference occurs. Because the inverted page table is sorted by physical address but lookups occur on virtual addresses, the whole table might need to be searched before a match is found. This search would take far too long. To alleviate this problem, we use a hash table to limit the search to one, or at most a few, page-table entries.
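
The naive linear search might look like the sketch below (sizes and types are illustrative); the hashed variant in the next section exists precisely to avoid this scan:

```c
#include <stdint.h>

#define NFRAMES 4096   /* one entry per physical frame */

/* inverted table: entry i describes which <pid, vpn> currently owns frame i */
struct ipt_entry { uint32_t pid; uint32_t vpn; };
static struct ipt_entry ipt[NFRAMES];

/* the matching index IS the frame number; -1 stands in for an illegal access */
int ipt_lookup(uint32_t pid, uint32_t vpn) {
    for (int i = 0; i < NFRAMES; i++)
        if (ipt[i].pid == pid && ipt[i].vpn == vpn)
            return i;   /* physical address = <i, offset> */
    return -1;
}
```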

Hashed Page Tables #

  • One approach for quick search is to use a hashed page table, with the hash value being the virtual page number. Each entry in the hash table contains a linked list of elements that hash to the same location (to handle collisions). Each element consists of three fields: (1) the virtual page number, (2) the value of the mapped page frame, and (3) a pointer to the next element in the linked list.
  • The algorithm works as follows: The virtual page number in the virtual address is hashed into the hash table. The virtual page number is compared with field 1 in the first element in the linked list. If there is a match, the corresponding page frame (field 2) is used to form the desired physical address. If there is no match, subsequent entries in the linked list are searched for a matching virtual page number.
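
A sketch of that chained lookup, with an assumed bucket count and the three fields named in the text:

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 1024   /* hypothetical table size */

struct hpt_entry {
    uint32_t vpn;             /* (1) virtual page number */
    uint32_t pfn;             /* (2) value of the mapped page frame */
    struct hpt_entry *next;   /* (3) pointer to the next element in the chain */
};
static struct hpt_entry *buckets[NBUCKETS];

/* returns 1 and fills *pfn when the VPN is mapped, 0 otherwise */
int hpt_lookup(uint32_t vpn, uint32_t *pfn) {
    for (struct hpt_entry *e = buckets[vpn % NBUCKETS]; e != NULL; e = e->next)
        if (e->vpn == vpn) { *pfn = e->pfn; return 1; }
    return 0;
}
```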

IA-32 Segmentation #

  • Memory management in IA-32 systems is divided into two components, segmentation and paging, and works as follows: The CPU generates logical addresses, which are given to the segmentation unit. The segmentation unit produces a linear address for each logical address. The linear address is then given to the paging unit, which in turn generates the physical address in main memory. Thus, the segmentation and paging units form the equivalent of the memory-management unit (MMU).

    • Segments can correspond to code, heap, data, stack, etc.
  • The logical address space of a process is divided into two partitions. The first partition consists of up to 8 K segments that are private to that process. The second partition consists of up to 8 K segments that are shared among all the processes. Information about the first partition is kept in the local descriptor table (LDT); information about the second partition is kept in the global descriptor table (GDT). Each entry in the LDT and GDT consists of an 8-byte segment descriptor with detailed information about a particular segment, including the base location and limit of that segment.

    • The logical address is a pair (selector, offset), where the selector is a 16-bit number consisting of a 13-bit segment number s, a 1-bit flag g that indicates whether the segment is in the GDT or LDT, and 2 bits p that deal with protection.

Page Size #

  • 10-bit offset -> 1KB page size
    • A 10-bit offset means it can be used to address 2^10 bytes within the page, so the page size is 1KB.
  • Common page size in
    • Linux/x86: 4KB, 2MB, 1GB
    • Solaris/SPARC: 8KB, 4MB, 2GB
  • Size comparison:

    |                                                       | large   | huge           |
    | ----------------------------------------------------- | ------- | -------------- |
    | page size                                             | 2MB     | 1GB            |
    | offset bits                                           | 21 bits | 30 bits        |
    | page table size reduction factor (vs. 12-bit offset)  | x512    | x262144 (2^18) |
  • Conclusion:
    • Larger pages mean fewer page table entries, smaller page tables, and more TLB hits. But they can also waste memory: large unused gaps within a page are called internal fragmentation.

Memory Allocation #

  • Memory allocators decide which physical pages will be allocated to a particular virtual memory region. They can exist at the kernel level as well as at the user level.
  • Kernel level allocators are responsible for allocating pages for the kernel and also for certain static state of processes when they are created - the code, the stack and so forth. In addition, the kernel level allocators are responsible for keeping track of the free memory that is available in the system.
  • User level allocators are used for dynamic process state - the heap. This is memory that is dynamically allocated during the process's execution. The basic interface for these allocators includes malloc and free. These calls request some amount of memory from the kernel's free pages and ultimately release it when done.
    • Once the kernel allocates some memory through a malloc call, the kernel is no longer involved in the management of that memory. That memory is now under the purview of the user level allocator.

Memory Allocation Challenges #

  • Initially, all memory is available for user processes and is considered one large block of available memory, a hole. Eventually, as you will see, memory contains a set of holes of various sizes.
  • Consider the following sequence:
    • Initially, the memory is fully utilized, containing processes 5, 8, and 2. After process 8 leaves, there is one contiguous hole. Later on, process 9 arrives and is allocated memory. Then process 5 departs, resulting in two noncontiguous holes.
  • The problem that memory allocation suffers from is called external fragmentation. As processes are loaded into and removed from memory, the free memory space is broken into little pieces. External fragmentation exists when there is enough total memory space to satisfy a request but the available spaces are not contiguous.

Solutions in Linux Kernel #

  • The Linux kernel relies on two main allocators: the buddy allocator and the slab allocator.
  • The buddy allocator starts with a 2^x-sized area.
    • On a request, the area is repeatedly subdivided into smaller 2^x chunks until the smallest 2^x chunk that can satisfy the request is found.
      • Some fragmentation still remains, but the power-of-two granularity makes the next step cheap.
    • On free, the allocator checks the neighboring buddy to see whether the two can be aggregated into a larger chunk; aggregation can continue up the levels (see the sketch after this list).
    • Because the buddy allocator has a granularity of powers of two, there will be some internal fragmentation. This is a problem because many data structures in the Linux kernel are not close to a power of 2 in size. For example, the task struct used to represent processes/threads is about 1.7KB. To fix this issue, Linux also uses the slab allocator.
  • The slab allocator builds caches for common object types/sizes on top of contiguous memory, which
    • solves internal fragmentation (objects exactly fit their slab elements) and avoids external fragmentation (freed slots are reused for objects of the same type).
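
A minimal sketch of the two buddy computations described above, assuming a configurable minimum block size; real implementations keep per-order free lists on top of this:

```c
#include <stddef.h>
#include <stdint.h>

/* smallest order k such that a block of (min_block << k) bytes covers `size` */
static unsigned order_for(size_t size, size_t min_block) {
    unsigned order = 0;
    size_t block = min_block;
    while (block < size) { block <<= 1; order++; }
    return order;
}

/* two buddies of size 2^k differ only in one bit of their offset, so the buddy
   of a block is found by flipping that single bit - this is what makes
   checking "can I aggregate on free?" cheap */
static uintptr_t buddy_of(uintptr_t offset, unsigned order, size_t min_block) {
    return offset ^ ((uintptr_t)min_block << order);
}
```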

Demand Paging #

  • Consider how an executable program might be loaded from secondary storage into memory. One option is to load the entire program in physical memory at program execution time. However, a problem with this approach is that we may not initially need the entire program in memory.

  • An alternative strategy is to load pages only as they are needed. This technique is known as demand paging and is commonly used in virtual memory systems. With demand-paged virtual memory, pages are loaded only when they are demanded during program execution. Pages that are never accessed are thus never loaded into physical memory.

  • What happens if the process tries to access a page that was not brought into memory? Access to a page marked invalid causes a page fault. The paging hardware, in translating the address through the page table, will notice that the invalid bit is set, causing a trap to the operating system. This trap is the result of the operating system's failure to bring the desired page into memory.

  • The procedure for handling this page fault is (a code sketch follows the steps):

    1. We check an internal table (usually kept with the process control block) for this process to determine whether the reference was a valid or an invalid memory access.
    2. If the reference was invalid, we terminate the process. If it was valid but we have not yet brought in that page, we now page it in.
    3. We find a free frame (by taking one from the free-frame list, for example).
    4. We schedule a secondary storage operation to read the desired page into the newly allocated frame.
    5. When the storage read is complete, we modify the internal table kept with the process and the page table to indicate that the page is now in memory.
    6. We restart the instruction that was interrupted by the trap. The process can now access the page as though it had always been in memory.
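
The steps above as a hedged C sketch; every helper here (valid_vma, free_frame, and so on) is assumed, not a real kernel API:

```c
#include <stdint.h>

extern int      valid_vma(uint32_t vaddr);        /* step 1: is the access legal? */
extern void     terminate_process(void);          /* step 2, invalid case */
extern unsigned free_frame(void);                 /* step 3: take from free-frame list */
extern void     read_from_backing_store(uint32_t vaddr, unsigned frame); /* step 4 */
extern void     map_page(uint32_t vaddr, unsigned frame); /* step 5: mark PTE valid */
extern void     restart_faulting_instruction(void);       /* step 6 */

void page_fault_handler(uint32_t vaddr) {
    if (!valid_vma(vaddr)) {
        terminate_process();                /* invalid reference */
        return;
    }
    unsigned frame = free_frame();
    read_from_backing_store(vaddr, frame);  /* blocks until the disk read completes */
    map_page(vaddr, frame);
    restart_faulting_instruction();         /* as though the page was always there */
}
```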

Page Replacement #

  • when should pages be swapped out?
    • when memory usage rises above a threshold (high watermark)
    • when CPU usage falls below a threshold (low watermark)
  • which pages should be swapped out?
    • A common algorithm is Least-Recently Used (LRU). LRU replacement associates with each page the time of that page’s last use. When a page must be replaced, LRU chooses the page that has not been used for the longest period of time.
    • This policy uses the access bit that is available on modern hardware to keep track of whether or not the page has been referenced.
    • Other candidates for pages that can be freed from physical memory are pages that don't need to be written out to disk. Writing pages out to secondary storage takes time, and we would like to avoid this overhead. To assist in making this decision, the OS can inspect the dirty bit maintained by the MMU hardware, which tracks whether a given page has been modified.
  • Avoid non-swappable pages.
    • In Linux, a number of parameters are available to help configure the swapping nature of the system. This includes the threshold page count that determines when pages start getting swapped out.
    • We can configure how many pages should be replaced during a given period of time.
    • Linux also categorizes the pages into different types, such as claimable and swappable, which helps inform the swapping algorithm as to which pages can be replaced.
    • Linux uses a variation of the LRU policy that performs two scans before determining which pages should be swapped out, essentially a second-chance scheme (a sketch follows).
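
A minimal second-chance ("clock") sketch of that idea; `accessed[]` stands in for the hardware reference bit in each frame's PTE, and the frame count is illustrative:

```c
#include <stdbool.h>

#define NFRAMES 256   /* hypothetical number of physical frames */

static bool accessed[NFRAMES];   /* mirrors the PTE access bit */
static int hand;                 /* the clock hand */

/* sweep frames, clearing access bits; evict the first frame whose bit was
   already clear, i.e., not referenced since the previous pass over it */
int pick_victim(void) {
    for (;;) {
        if (!accessed[hand]) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        accessed[hand] = false;          /* give it a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```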

Copy On Write(COW) #

  • When we need to create a new process, we need to re-create the entire parent process by copying over its entire address space. However, many of the pages in the parent address space are static - they won’t change - so it’s unclear why we have to incur the copying cost.
  • In order to avoid unnecessary copying, a new process's address space, entirely or in part, will just point to the address space of its parent. The same physical address may be referred to by two completely different virtual addresses belonging to the two processes. We have to make sure to write-protect the page as a way to track accesses to it.
  • If the page is only going to be read, we save memory and we also save on the CPU cycles we would waste performing the unnecessary copy.
  • If a write request is issued for the physical address via either one of the virtual addresses, the MMU will detect that the page is write protected and will issue a page fault.
  • At this point, the operating system will finally create the copy of the memory page and will update the page table of the faulting process to point to the newly allocated physical memory, as sketched below.
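
A hedged sketch of the write-fault path, assuming a simple PTE format (PFN in the high 20 bits, flags in the low 12) and an assumed `alloc_free_frame` helper; a real kernel would also flush the stale TLB entry:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define PTE_W     0x2u                    /* writable flag, x86-style */
#define NFRAMES   1024

static uint8_t  frames[NFRAMES][PAGE_SIZE];
static unsigned refcount[NFRAMES];        /* how many processes share each frame */

extern unsigned alloc_free_frame(void);   /* assumed helper: returns an unused frame */

void handle_cow_fault(uint32_t *pte) {
    unsigned old = *pte >> 12;
    if (refcount[old] == 1) {
        *pte |= PTE_W;                    /* last sharer: just re-enable writes */
        return;
    }
    unsigned new_frame = alloc_free_frame();
    memcpy(frames[new_frame], frames[old], PAGE_SIZE);  /* the deferred copy */
    refcount[old]--;
    refcount[new_frame] = 1;
    *pte = (new_frame << 12) | (*pte & 0xFFFu) | PTE_W; /* repoint, make writable */
}
```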

Failure Management: Checkpointing #

  • Checkpointing is a failure and recovery management technique. The idea behind checkpointing is to periodically save process state.
  • A simple approach to checkpointing would be to pause the execution of the process and copy its entire state.
  • A better approach takes advantage of the hardware support for memory management and tries to minimize the disruption checkpointing causes to the execution of the process: we can write-protect the process state and try to copy everything once.
  • However, since the process continues executing, it will continue dirtying pages. We can track the dirty pages - again using MMU support - and we will copy only the diffs on the pages that have been modified. That will allow us to provide for incremental checkpoints.
  • Checkpointing can also be used in other services.
    • Debugging:
      • rewind-replay (RR)
        • rewind = restart from a checkpoint
        • gradually go back to older checkpoints until the error is found
    • Migration:
      • continue the work on another machine
      • during disaster recovery, or during consolidation efforts when we try to condense computing into as few machines as possible
      • repeated checkpoints in a fast loop until pause-and-copy becomes acceptable