Linux Address Space

Gaurav Sarma
6 min read · Mar 2, 2018

Linux processes interact with virtual memory, not physical memory. Every process runs under the illusion that it is the only process in the system and therefore has all of the system's memory to itself.

Different processes may use the same virtual addresses without colliding, because the kernel maintains a separate virtual-to-physical mapping for each of them. A process does share its virtual address space when it spawns threads of execution.

A process does not have permission to access certain parts of the address space, such as the range reserved for the kernel; it may touch a memory address only if that address lies inside a valid memory area. Each memory area also carries permissions that the process must respect. If a process violates either rule, the kernel delivers a segmentation fault signal (SIGSEGV), whose default action kills the process.
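
As a minimal user-space illustration (a toy sketch, not kernel code), the program below deliberately writes to an address that is almost certainly outside every valid memory area of the process; the kernel responds with SIGSEGV and terminates it:

#include <stdio.h>

int main(void)
{
        int *bad = (int *)16;           /* very unlikely to fall inside any valid area */

        printf("about to write to %p\n", (void *)bad);
        *bad = 42;                      /* the kernel answers with SIGSEGV here */

        printf("never reached\n");      /* the default SIGSEGV action kills us first */
        return 0;
}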

Memory areas may have the following content (a small sketch after this list shows how to inspect them for a running process):

  • Executable file’s code, which is known as the text section
  • Executable file’s initialized global variables, which is known as the data section
  • Uninitialized global variables, known as the bss (block started by symbol) section
  • Stack
  • Heap
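
Linux exposes these areas for a live process through /proc/<pid>/maps. The small program below (a sketch using that standard interface) dumps its own map; each output line describes one memory area with its address range, permissions, offset, device, inode and backing file:

#include <stdio.h>

int main(void)
{
        FILE *maps = fopen("/proc/self/maps", "r");  /* this process's own memory map */
        char line[512];

        if (!maps) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), maps))      /* one line per memory area */
                fputs(line, stdout);
        fclose(maps);
        return 0;
}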

Memory descriptor

In the Linux kernel, a process's address space is described by the following data structure, the memory descriptor:

struct mm_struct {
        struct vm_area_struct *mmap;       /* list of memory areas */
        struct rb_root mm_rb;              /* red-black tree of VMAs */
        struct vm_area_struct *mmap_cache; /* last used memory area */
        unsigned long free_area_cache;     /* 1st address space hole */
        pgd_t *pgd;                        /* page global directory */
        atomic_t mm_users;                 /* address space users */
        atomic_t mm_count;                 /* primary usage counter */
        int map_count;                     /* number of memory areas */
        struct rw_semaphore mmap_sem;      /* memory area semaphore */
        spinlock_t page_table_lock;        /* page table lock */
        struct list_head mmlist;           /* list of all mm_structs */
        unsigned long start_code;          /* start address of code */
        unsigned long end_code;            /* final address of code */
        unsigned long start_data;          /* start address of data */
        unsigned long end_data;            /* final address of data */
        unsigned long start_brk;           /* start address of heap */
        unsigned long brk;                 /* final address of heap */
        unsigned long start_stack;         /* start address of stack */
        unsigned long arg_start;           /* start of arguments */
        unsigned long arg_end;             /* end of arguments */
        unsigned long env_start;           /* start of environment */
        unsigned long env_end;             /* end of environment */
        unsigned long rss;                 /* pages allocated */
        unsigned long total_vm;            /* total number of pages */
        unsigned long locked_vm;           /* number of locked pages */
        unsigned long def_flags;           /* default access flags */
        unsigned long cpu_vm_mask;         /* lazy TLB switch mask */
        unsigned long swap_address;        /* last scanned address */
        unsigned dumpable:1;               /* can this mm core dump? */
        int used_hugetlb;                  /* used hugetlb pages? */
        mm_context_t context;              /* arch-specific data */
        int core_waiters;                  /* thread core dump waiters */
        struct completion *core_startup_done; /* core start completion */
        struct completion core_done;       /* core end completion */
        rwlock_t ioctx_list_lock;          /* AIO I/O list lock */
        struct kioctx *ioctx_list;         /* AIO I/O list */
        struct kioctx default_kioctx;      /* AIO default I/O context */
};

The number of processes (or threads) using an address space is tracked by the mm_users field. The mmap and mm_rb fields both refer to the memory areas in the address space, just in different representations: mmap is a linked list, convenient for simple traversal of every area, while mm_rb is a red-black tree, convenient for quickly finding a specific area.
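
As a rough kernel-style sketch (not a standalone program, and assuming the 2.6-era fields shown above; the function name print_vmas is made up for illustration), this is the kind of loop kernel code uses to traverse every area via the mmap list while holding mmap_sem for reading:

/* Kernel-style sketch: walk every VMA in an address space via the mmap
 * linked list and print its range and flags. */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>

static void print_vmas(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        down_read(&mm->mmap_sem);               /* the VMA list is protected by mmap_sem */
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                printk(KERN_INFO "area %#lx-%#lx flags %#lx\n",
                       vma->vm_start, vma->vm_end, vma->vm_flags);
        up_read(&mm->mmap_sem);
}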

The kernel thus represents the process address space via the memory descriptor, which is reached through the mm field of the process's task_struct:

struct task_struct {
        volatile long state;        /* -1 unrunnable, 0 runnable, >0 stopped */
        long counter;
        long priority;
        unsigned long signal;
        unsigned long blocked;      /* bitmap of masked signals */
        unsigned long flags;        /* per process flags, defined below */
        int errno;
        long debugreg[8];           /* Hardware debugging registers */
        struct exec_domain *exec_domain;
        struct linux_binfmt *binfmt;
        struct task_struct *next_task, *prev_task;
        struct task_struct *next_run, *prev_run;
        unsigned long saved_kernel_stack;
        unsigned long kernel_stack_page;
        int exit_code, exit_signal;
        unsigned long personality;
        int dumpable:1;
        int did_exec:1;
        int pid;
        int pgrp;
        int tty_old_pgrp;
        int session;
        int leader;                 /* boolean value for session group leader */
        int groups[NGROUPS];
        struct task_struct *p_opptr, *p_pptr, *p_cptr,
                           *p_ysptr, *p_osptr;
        struct wait_queue *wait_chldexit;
        unsigned short uid, euid, suid, fsuid;
        unsigned short gid, egid, sgid, fsgid;
        unsigned long timeout, policy, rt_priority;
        unsigned long it_real_value, it_prof_value, it_virt_value;
        unsigned long it_real_incr, it_prof_incr, it_virt_incr;
        struct timer_list real_timer;
        long utime, stime, cutime, cstime, start_time;
        unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
        int swappable:1;
        unsigned long swap_address;
        unsigned long old_maj_flt;  /* old value of maj_flt */
        unsigned long dec_flt;      /* page fault count of the last time */
        unsigned long swap_cnt;     /* number of pages to swap on next pass */
        struct rlimit rlim[RLIM_NLIMITS];
        unsigned short used_math;
        char comm[16];
        int link_count;
        struct tty_struct *tty;
        struct sem_undo *semundo;
        struct sem_queue *semsleeping;
        struct desc_struct *ldt;
        struct thread_struct tss;
        struct fs_struct *fs;
        struct files_struct *files;
        struct mm_struct *mm;
        struct signal_struct *sig;
#ifdef __SMP__
        int processor;
        int last_processor;
        int lock_depth;
#endif
};

current->mm points to the memory descriptor of the current process. During fork(), copy_mm() copies the parent's memory descriptor to the child, so each process normally receives a unique mm_struct and hence a unique address space. When an address space is shared by multiple processes, those processes are threads; they are created by calling clone() with the CLONE_VM flag set. This is why, to the Linux kernel, threads are just processes that happen to share their address space, i.e. some of their resources, with another process.
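
A small user-space sketch of the difference (the names value and thread_fn are made up for this demo; build with gcc demo.c -pthread): after fork() the child writes to its own copy of the variable, while a thread created with pthread_create(), which Linux implements on top of clone() with CLONE_VM, writes to the shared copy:

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 0;                   /* one global within one address space */

static void *thread_fn(void *arg)
{
        value = 2;                      /* shared address space: visible to the creator */
        return NULL;
}

int main(void)
{
        pthread_t tid;

        if (fork() == 0) {              /* child gets its own copy of the address space */
                value = 1;              /* modifies the child's copy only */
                _exit(0);
        }
        wait(NULL);
        printf("after fork  : value = %d (parent still sees 0)\n", value);

        pthread_create(&tid, NULL, thread_fn, NULL);
        pthread_join(tid, NULL);
        printf("after thread: value = %d (now 2)\n", value);
        return 0;
}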

When a process exits, the kernel calls exit_mm(), which does some housekeeping and statistics updates and drops the process's reference on the memory descriptor; once no users of the address space remain, free_mm() is eventually called to release the mm_struct.

Virtual memory areas

Memory areas are represented in the kernel by vm_area_struct and are also called virtual memory areas, or VMAs.

struct vm_area_struct {
        struct mm_struct *vm_mm;             /* associated mm_struct */
        unsigned long vm_start;              /* VMA start, inclusive */
        unsigned long vm_end;                /* VMA end, exclusive */
        struct vm_area_struct *vm_next;      /* list of VMAs */
        pgprot_t vm_page_prot;               /* access permissions */
        unsigned long vm_flags;              /* flags */
        struct rb_node vm_rb;                /* VMA's node in the tree */
        union {          /* links to address_space->i_mmap or i_mmap_nonlinear */
                struct {
                        struct list_head list;
                        void *parent;
                        struct vm_area_struct *head;
                } vm_set;
                struct prio_tree_node prio_tree_node;
        } shared;
        struct list_head anon_vma_node;      /* anon_vma entry */
        struct anon_vma *anon_vma;           /* anonymous VMA object */
        struct vm_operations_struct *vm_ops; /* associated ops */
        unsigned long vm_pgoff;              /* offset within file */
        struct file *vm_file;                /* mapped file, if any */
        void *vm_private_data;               /* private data */
};

A vm_area_struct describes a single memory area covering a contiguous interval of addresses. Each memory area has associated permissions and flags that denote its type, for example a memory-mapped file or the process's user-space stack.

The vm_mm field points back to the mm_struct the area belongs to, which makes each VMA unique to its address space: even if two processes map the same file, each gets its own vm_area_struct.
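
As another kernel-style sketch (not a user-space program; the helper addr_is_writable is made up, while find_vma(), VM_WRITE and mmap_sem are real kernel names), kernel code typically locates the VMA containing a given address with find_vma() and then inspects its flags. Note that find_vma() returns the first area that ends after the address, so the start must still be checked:

#include <linux/mm.h>
#include <linux/sched.h>

static int addr_is_writable(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma;
        int writable = 0;

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, addr);                    /* first area with vm_end > addr */
        if (vma && vma->vm_start <= addr)            /* addr really lies inside it */
                writable = !!(vma->vm_flags & VM_WRITE);
        up_read(&mm->mmap_sem);

        return writable;
}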

Although applications operate on virtual addresses, the processor ultimately works with physical memory. Whenever an application accesses a virtual memory address, it must first be translated into the physical address where the data actually resides. This translation is performed via page tables: the virtual address is split into chunks, and each chunk is used as an index into a table whose entries point either to another table or to the physical page.

Linux maintains three levels of page tables by default. Even on architectures whose hardware does not support three levels, the kernel keeps the three-level model in software, so the same indexed, multi-level lookup scheme works everywhere.

The top page table is known as the Page Global Directory (PGD) and contains an array of unsigned long entries. Entries in the PGD point to entries in the PMD.

The second page table is known as the Page Middle Directory (PMD), whose entries in turn point to the PTEs.

The Page Table Entries (PTE) point to the actual physical pages.

Every process has its own page tables; the pgd field of the memory descriptor points to the process's page global directory.
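
The split of a virtual address into per-level indices plus a page offset can be illustrated with a small user-space program. The bit widths below (a 12-bit page offset and 9-bit indices per level) are purely illustrative and vary by architecture:

#include <stdio.h>

#define PAGE_SHIFT 12                   /* 4 KB pages: 12-bit offset */
#define LEVEL_BITS 9                    /* illustrative: 9-bit index per level */
#define LEVEL_MASK ((1UL << LEVEL_BITS) - 1)

int main(void)
{
        unsigned long vaddr = 0x12345678UL;          /* an arbitrary example address */

        unsigned long offset = vaddr & ((1UL << PAGE_SHIFT) - 1);
        unsigned long pte_i  = (vaddr >> PAGE_SHIFT) & LEVEL_MASK;
        unsigned long pmd_i  = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK;
        unsigned long pgd_i  =  vaddr >> (PAGE_SHIFT + 2 * LEVEL_BITS);

        printf("virtual address: %#lx\n", vaddr);
        printf("  PGD index    : %#lx\n", pgd_i);
        printf("  PMD index    : %#lx\n", pmd_i);
        printf("  PTE index    : %#lx\n", pte_i);
        printf("  page offset  : %#lx\n", offset);
        return 0;
}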

Even with three levels of page tables, each translation still requires several memory lookups, which is slow. To speed this up, most processors implement a Translation Lookaside Buffer (TLB), a hardware cache of recently used virtual-to-physical mappings. If the requested address is in the cache, the physical address comes straight from the TLB; otherwise, the page tables are walked to resolve the mapping.
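
A conceptual model of that hit-or-miss path is sketched below. All the names here (translate, walk_page_tables, tlb_entry) are hypothetical, and a real TLB is hardware, not C; the placeholder walk simply pretends virtual page n maps to physical frame n:

#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

struct tlb_entry {
        unsigned long vpn;              /* virtual page number */
        unsigned long pfn;              /* physical frame number */
        bool valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Placeholder for the multi-level page-table walk described above. */
static unsigned long walk_page_tables(unsigned long vpn)
{
        return vpn;                     /* identity mapping, for illustration only */
}

static unsigned long translate(unsigned long vaddr)
{
        unsigned long vpn = vaddr >> PAGE_SHIFT;
        unsigned long off = vaddr & ((1UL << PAGE_SHIFT) - 1);
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        if (!e->valid || e->vpn != vpn) {       /* TLB miss: walk, then cache */
                e->pfn = walk_page_tables(vpn);
                e->vpn = vpn;
                e->valid = true;
        }                                       /* TLB hit: skip the walk entirely */
        return (e->pfn << PAGE_SHIFT) | off;
}

int main(void)
{
        printf("first access : %#lx\n", translate(0x12345678UL)); /* miss */
        printf("second access: %#lx\n", translate(0x12345678UL)); /* hit */
        return 0;
}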

Most of the material in this article is drawn from the book Linux Kernel Development by Robert Love, a must-read for anybody who wants to understand the inner workings of the Linux kernel.
