Agenda
1. Background
2. Swap device initialization
3. Swap slot allocation
4. The relationship between the swap cache, swap slots, and pages
5. Swap-out
6. Swap-in
7. Reclaiming swap slots
1. Background
When free physical memory in the system runs low, some memory that is currently in use has to be reclaimed. Reclaim mainly targets the page cache and the anonymous memory of user processes. The page cache holds file contents from storage devices, while anonymous memory backs the heap, the stack, and similar mappings. When free memory is scarce, page-cache pages can simply be dropped or written back to storage, which frees memory. Anonymous memory, however, has no backing file on disk, so a swap device is needed: the contents of anonymous pages are moved out to the swap device to free up memory. A swap device can be a file or a disk partition. This article tries to explain how swap is managed. Since swap is part of memory reclaim, an introduction to the reclaim subsystem can be found in the earlier posts 认识物理内存回收机制 and How to swap out the anonymous page?. In fact, besides anonymous memory, two other classes of pages are handled by swap: a) dirty pages that belong to a private memory mapping of a process; b) pages that belong to an IPC shared memory region [1].
After a page has been swapped out to the swap device, the page table entry for the corresponding virtual address is rewritten to record where the contents now live. On a 32-bit system this value has 32 bits: bits 2 through 6 (the upper 5 of the low 7 bits) identify which swap device the page was written to, and bits 7 through 31 hold the offset within that swap device. In practice, if a physical page is 4KB and a disk block is also 4KB, only the top 20 bits (bits 12 through 31) end up being used, since everything is 4KB aligned. When the MMU reads a page table entry and finds that bit 0 (the present bit) is 0, it raises a fault; the kernel handles the fault, and if the remaining 31 bits of the entry are non-zero it knows the contents of the missing page are on swap and performs a swap-in.
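To make the bit layout above concrete, the following is a minimal, self-contained sketch of how such a software swap entry could be packed and unpacked. It only mirrors the layout described in the text (5 bits of swap type starting at bit 2, the offset above that); the names are made up for illustration and are not the kernel's arch-specific __swp_entry() helpers.

#include <stdio.h>

/* Illustrative layout only: bit 0 = present (0 for a swapped-out PTE),
 * bits 2..6 = swap type (which swap device), bits 7..31 = offset. */
#define DEMO_TYPE_SHIFT   2
#define DEMO_TYPE_BITS    5
#define DEMO_OFFSET_SHIFT 7

static unsigned long demo_swp_entry(unsigned long type, unsigned long offset)
{
	return (type << DEMO_TYPE_SHIFT) | (offset << DEMO_OFFSET_SHIFT);
}

static unsigned long demo_swp_type(unsigned long entry)
{
	return (entry >> DEMO_TYPE_SHIFT) & ((1UL << DEMO_TYPE_BITS) - 1);
}

static unsigned long demo_swp_offset(unsigned long entry)
{
	return entry >> DEMO_OFFSET_SHIFT;
}

int main(void)
{
	unsigned long e = demo_swp_entry(1, 0x1234); /* device 1, slot 0x1234 */

	/* Bit 0 stays 0, so the MMU treats the PTE as not present. */
	printf("entry=%#lx type=%lu offset=%#lx\n",
	       e, demo_swp_type(e), demo_swp_offset(e));
	return 0;
}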
When space on the swap device gets tight (i.e. a certain amount of anonymous memory has already been swapped out), swap slots need to be reclaimed so that more space is available for future swap-outs. Reclaim mainly targets slots whose contents were swapped out earlier and have since been swapped back in by some process, while the swap slot itself was never freed.
2. Swap device initialization
A swap device can be a disk partition or a regular file; either is enabled through the swapon system call. For a file, the disk blocks backing it are not necessarily contiguous, so a file-backed swap area is slightly more involved to manage. Each swap device is described by a swap_info_struct, and by default the system supports up to 32 swap devices. Initializing a swap device essentially boils down to initializing its swap_info_struct.
Since several swap devices can coexist, which one should a swap-out go to? Every swap device is assigned a priority; the larger the value, the higher the priority, and swap-out always picks the device with the highest priority.
The data structure that represents a swap device in the kernel is:
178 /*
179 * The in-memory structure used to track swap areas.
180 */
181 struct swap_info_struct {
182 unsigned long flags; /* SWP_USED etc: see above */
183 signed short prio; /* swap priority of this type */
184 signed char type; /* strange name for an index */
185 signed char next; /* next type on the swap list */
186 unsigned int max; /* extent of the swap_map */
187 unsigned char *swap_map; /* vmalloc'ed array of usage counts */
188 unsigned int lowest_bit; /* index of first free in swap_map */
189 unsigned int highest_bit; /* index of last free in swap_map */
190 unsigned int pages; /* total of usable pages of swap */
191 unsigned int inuse_pages; /* number of those currently in use */
192 unsigned int cluster_next; /* likely index for next allocation */
193 unsigned int cluster_nr; /* countdown to next cluster search */
194 unsigned int lowest_alloc; /* while preparing discard cluster */
195 unsigned int highest_alloc; /* while preparing discard cluster */
196 struct swap_extent *curr_swap_extent;
197 struct swap_extent first_swap_extent;
198 struct block_device *bdev; /* swap device or bdev of swap file */
199 struct file *swap_file; /* seldom referenced */
200 unsigned int old_block_size; /* seldom referenced */
201 #ifdef CONFIG_FRONTSWAP
202 unsigned long *frontswap_map; /* frontswap in-use, one bit per page */
203 atomic_t frontswap_pages; /* frontswap pages in-use counter */
204 #endif
205 spinlock_t lock; /*
206 * protect map scan related fields like
207 * swap_map, lowest_bit, highest_bit,
208 * inuse_pages, cluster_next,
209 * cluster_nr, lowest_alloc and
210 * highest_alloc. other fields are only
211 * changed at swapon/swapoff, so are
212 * protected by swap_lock. changing
213 * flags need hold this lock and
214 * swap_lock. If both locks need hold,
215 * hold swap_lock first.
216 */
217 };
Each field of the structure is explained by its comment. The flags field deserves special attention: it describes various properties of the swap area, and its possible values are listed below.
146 enum {
147 SWP_USED = (1 << 0), /* is slot in swap_info[] used? */
148 SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */
149 SWP_DISCARDABLE = (1 << 2), /* swapon+blkdev support discard */
150 SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */
151 SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */
152 SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */
153 SWP_BLKDEV = (1 << 6), /* its a block device */
154 SWP_FILE = (1 << 7), /* set after swap_activate success */
155 /* add others here before... */
156 SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
157 };
SWP_USED means the slot in swap_info[] is in use, i.e. the swap device is available.
SWP_WRITEOK means the swap device may be written to; every swap device becomes writable once it has been enabled.
SWP_DISCARDABLE indicates whether the device supports discard (see the --discard option in man swapon and reference [2]). It normally matters only for SSDs, and performance can improve if the SSD supports TRIM/discard.
SWP_SOLIDSTATE is set for swap areas that live on an SSD; it is consulted when searching for a free swap page.
SWP_CONTINUED is explained in the section on reclaiming swap slots below.
SWP_BLKDEV and SWP_FILE record whether the swap area is a block device (partition) or a file.
SWP_SCANNING is set while the kernel is scanning this swap area for a free page slot, and cleared when the scan finishes.
In addition, swap_map is an array with one byte per swap slot, used as the usage count of that slot.
3. Swap slot allocation
When an anonymous page is about to be swapped out, a free slot must first be allocated on a swap device. Allocation is roughly a two-step process: first pick a swap device, then find a free slot on it.
To help pick a device, the kernel defines a swap_list structure with two members, head and next. Whichever path the search takes, it always starts from swap_list.next.
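For reference, the structure behind swap_list looks roughly like this in kernels of this generation (a simplified excerpt quoted from memory; check include/linux/swap.h of your tree for the exact definition):

/* Priority-ordered bookkeeping for enabled swap areas. */
struct swap_list_t {
	int head;	/* index (type) of the highest-priority swap area */
	int next;	/* index (type) of the swap area to try next */
};

get_swap_page(), quoted next, uses it to pick the device before scanning for a free slot: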
416 swp_entry_t get_swap_page(void)
417 {
418 struct swap_info_struct *si;
419 pgoff_t offset;
420 int type, next;
421 int wrapped = 0;
422 int hp_index;
423
424 spin_lock(&swap_lock);
425 if (atomic_long_read(&nr_swap_pages) <= 0)
426 goto noswap;
427 atomic_long_dec(&nr_swap_pages);//Available Pages in swap device.
428
429 for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
430 hp_index = atomic_xchg(&highest_priority_index, -1);//Store -1, and return previous value.
431 /*
432 * highest_priority_index records current highest priority swap
433 * type which just frees swap entries. If its priority is
434 * higher than that of swap_list.next swap type, we use it. It
435 * isn't protected by swap_lock, so it can be an invalid value
436 * if the corresponding swap type is swapoff. We double check
437 * the flags here. It's even possible the swap type is swapoff
438 * and swapon again and its priority is changed. In such rare
439 * case, low prority swap type might be used, but eventually
440 * high priority swap will be used after several rounds of
441 * swap.
442 */
443 if (hp_index != -1 && hp_index != type &&
444 swap_info[type]->prio < swap_info[hp_index]->prio &&
445 (swap_info[hp_index]->flags & SWP_WRITEOK)) {
446 type = hp_index;
447 swap_list.next = type;
448 }
449
450 si = swap_info[type];
451 next = si->next;
452 if (next < 0 ||
453 (!wrapped && si->prio != swap_info[next]->prio)) {
454 next = swap_list.head;
455 wrapped++;
456 }
457
458 spin_lock(&si->lock);
459 if (!si->highest_bit) {
460 spin_unlock(&si->lock);
461 continue;
462 }
463 if (!(si->flags & SWP_WRITEOK)) {
464 spin_unlock(&si->lock);
465 continue;
466 }
467
468 swap_list.next = next;
469
470 spin_unlock(&swap_lock);
471 /* This is called for allocating swap entry for cache */
472 offset = scan_swap_map(si, SWAP_HAS_CACHE);
473 spin_unlock(&si->lock);
474 if (offset)
475 return swp_entry(type, offset);
476 spin_lock(&swap_lock);
477 next = swap_list.next;
478 }
479
480 atomic_long_inc(&nr_swap_pages);
481 noswap:
482 spin_unlock(&swap_lock);
483 return (swp_entry_t) {0};
484 }
The first step is to pick a suitable swap file or swap device. All enabled swap areas are kept in the swap_info[] array, and the index of the area to try first is recorded in swap_list.next; that is why the for loop at line 429 starts from swap_list.next. There is also a global variable, highest_priority_index, which records the highest-priority swap area that has recently freed entries. So which should be used, swap_list.next or highest_priority_index? Simply whichever has the higher priority (see the comparison at lines 443~445).
The chosen swap area is then validated further: it must still have space to allocate (lines 459~462) and must be writable (lines 463~466). swap_list.next is also advanced so that allocations are spread as evenly as possible across swap areas.
Once a swap area has been selected, scan_swap_map() allocates a free slot on it. Its usage parameter is set to SWAP_HAS_CACHE, recording that the physical page backing this swap slot is also present in the corresponding address_space (the swap cache). If the slot later stops being referenced by the address_space, this flag is cleared.
get_swap_page-> scan_swap_map:
187 static unsigned long scan_swap_map(struct swap_info_struct *si,
188 unsigned char usage)
189 {
190 unsigned long offset;
191 unsigned long scan_base;
192 unsigned long last_in_cluster = 0;
193 int latency_ration = LATENCY_LIMIT;
194 int found_free_cluster = 0;
195
196 /*
197 * We try to cluster swap pages by allocating them sequentially
198 * in swap. Once we've allocated SWAPFILE_CLUSTER pages this
199 * way, however, we resort to first-free allocation, starting
200 * a new cluster. This prevents us from scattering swap pages
201 * all over the entire swap partition, so that we reduce
202 * overall disk seek times between swap pages. -- sct
203 * But we do now try to find an empty cluster. -Andrea
204 * And we let swap pages go all over an SSD partition. Hugh
205 */
206
207 si->flags += SWP_SCANNING;
208 scan_base = offset = si->cluster_next;
209
210 if (unlikely(!si->cluster_nr--)) {
211 if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
212 si->cluster_nr = SWAPFILE_CLUSTER - 1;
213 goto checks;
214 }
215 if (si->flags & SWP_DISCARDABLE) {
216 /*
217 * Start range check on racing allocations, in case
218 * they overlap the cluster we eventually decide on
219 * (we scan without swap_lock to allow preemption).
220 * It's hardly conceivable that cluster_nr could be
221 * wrapped during our scan, but don't depend on it.
222 */
223 if (si->lowest_alloc)
224 goto checks;
225 si->lowest_alloc = si->max;
226 si->highest_alloc = 0;
227 }
228 spin_unlock(&si->lock);
229
230 /*
231 * If seek is expensive, start searching for new cluster from
232 * start of partition, to minimize the span of allocated swap.
233 * But if seek is cheap, search from our current position, so
234 * that swap is allocated from all over the partition: if the
235 * Flash Translation Layer only remaps within limited zones,
236 * we don't want to wear out the first zone too quickly.
237 */
238 if (!(si->flags & SWP_SOLIDSTATE))
239 scan_base = offset = si->lowest_bit;
240 last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
241
242 /* Locate the first empty (unaligned) cluster */
243 for (; last_in_cluster <= si->highest_bit; offset++) {
244 if (si->swap_map[offset])
245 last_in_cluster = offset + SWAPFILE_CLUSTER;
246 else if (offset == last_in_cluster) {
247 spin_lock(&si->lock);
248 offset -= SWAPFILE_CLUSTER - 1;
249 si->cluster_next = offset;
250 si->cluster_nr = SWAPFILE_CLUSTER - 1;
251 found_free_cluster = 1;
252 goto checks;
253 }
254 if (unlikely(--latency_ration < 0)) {
255 cond_resched();
256 latency_ration = LATENCY_LIMIT;
257 }
258 }
259
260 offset = si->lowest_bit;
261 last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
262
263 /* Locate the first empty (unaligned) cluster */
264 for (; last_in_cluster < scan_base; offset++) {
265 if (si->swap_map[offset])
266 last_in_cluster = offset + SWAPFILE_CLUSTER;
267 else if (offset == last_in_cluster) {
268 spin_lock(&si->lock);
269 offset -= SWAPFILE_CLUSTER - 1;
270 si->cluster_next = offset;
271 si->cluster_nr = SWAPFILE_CLUSTER - 1;
272 found_free_cluster = 1;
273 goto checks;
274 }
275 if (unlikely(--latency_ration < 0)) {
276 cond_resched();
277 latency_ration = LATENCY_LIMIT;
278 }
279 }
280
281 offset = scan_base;
282 spin_lock(&si->lock);
283 si->cluster_nr = SWAPFILE_CLUSTER - 1;
284 si->lowest_alloc = 0;
285 }
286
287 checks:
288 if (!(si->flags & SWP_WRITEOK))
289 goto no_page;
290 if (!si->highest_bit)
291 goto no_page;
292 if (offset > si->highest_bit)
293 scan_base = offset = si->lowest_bit;
294
295 /* reuse swap entry of cache-only swap if not busy. */
296 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
297 int swap_was_freed;
298 spin_unlock(&si->lock);
299 swap_was_freed = __try_to_reclaim_swap(si, offset);
300 spin_lock(&si->lock);
301 /* entry was freed successfully, try to use this again */
302 if (swap_was_freed)
303 goto checks;
304 goto scan; /* check next one */
305 }
306
307 if (si->swap_map[offset])
308 goto scan;
309
310 if (offset == si->lowest_bit)
311 si->lowest_bit++;
312 if (offset == si->highest_bit)
313 si->highest_bit--;
314 si->inuse_pages++;
315 if (si->inuse_pages == si->pages) {
316 si->lowest_bit = si->max;
317 si->highest_bit = 0;
318 }
319 si->swap_map[offset] = usage;
320 si->cluster_next = offset + 1;
321 si->flags -= SWP_SCANNING;
322
323 if (si->lowest_alloc) {
324 /*
325 * Only set when SWP_DISCARDABLE, and there's a scan
326 * for a free cluster in progress or just completed.
327 */
328 if (found_free_cluster) {
329 /*
330 * To optimize wear-levelling, discard the
331 * old data of the cluster, taking care not to
332 * discard any of its pages that have already
333 * been allocated by racing tasks (offset has
334 * already stepped over any at the beginning).
335 */
336 if (offset < si->highest_alloc &&
337 si->lowest_alloc <= last_in_cluster)
338 last_in_cluster = si->lowest_alloc - 1;
339 si->flags |= SWP_DISCARDING;
340 spin_unlock(&si->lock);
341
342 if (offset < last_in_cluster)
343 discard_swap_cluster(si, offset,
344 last_in_cluster - offset + 1);
345
346 spin_lock(&si->lock);
347 si->lowest_alloc = 0;
348 si->flags &= ~SWP_DISCARDING;
349
350 smp_mb(); /* wake_up_bit advises this */
351 wake_up_bit(&si->flags, ilog2(SWP_DISCARDING));
352
353 } else if (si->flags & SWP_DISCARDING) {
354 /*
355 * Delay using pages allocated by racing tasks
356 * until the whole discard has been issued. We
357 * could defer that delay until swap_writepage,
358 * but it's easier to keep this self-contained.
359 */
360 spin_unlock(&si->lock);
361 wait_on_bit(&si->flags, ilog2(SWP_DISCARDING),
362 wait_for_discard, TASK_UNINTERRUPTIBLE);
363 spin_lock(&si->lock);
364 } else {
365 /*
366 * Note pages allocated by racing tasks while
367 * scan for a free cluster is in progress, so
368 * that its final discard can exclude them.
369 */
370 if (offset < si->lowest_alloc)
371 si->lowest_alloc = offset;
372 if (offset > si->highest_alloc)
373 si->highest_alloc = offset;
374 }
375 }
376 return offset;
377
378 scan:
379 spin_unlock(&si->lock);
380 while (++offset <= si->highest_bit) {
381 if (!si->swap_map[offset]) {
382 spin_lock(&si->lock);
383 goto checks;
384 }
385 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
386 spin_lock(&si->lock);
387 goto checks;
388 }
389 if (unlikely(--latency_ration < 0)) {
390 cond_resched();
391 latency_ration = LATENCY_LIMIT;
392 }
393 }
394 offset = si->lowest_bit;
395 while (++offset < scan_base) {
396 if (!si->swap_map[offset]) {
397 spin_lock(&si->lock);
398 goto checks;
399 }
400 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
401 spin_lock(&si->lock);
402 goto checks;
403 }
404 if (unlikely(--latency_ration < 0)) {
405 cond_resched();
406 latency_ration = LATENCY_LIMIT;
407 }
408 }
409 spin_lock(&si->lock);
410
411 no_page:
412 si->flags -= SWP_SCANNING;
413 return 0;
414 }
Free slots are searched for within a cluster; a cluster covers SWAPFILE_CLUSTER consecutive slots. Once all slots of the current cluster have been handed out, a new empty cluster has to be found; lines 210~285 implement that search. Lines 243~258 scan up to highest_bit for an empty cluster; if none is found, the scan restarts from lowest_bit (lines 263~279). If that also fails, the code falls through to the scan label and sweeps the whole area slot by slot (lines 380~393), reclaiming slots that are only held by the swap cache (SWAP_HAS_CACHE) along the way (line 385).
Lines 243~258 therefore try to find a cluster of SWAPFILE_CLUSTER free slots so that subsequent allocations can stay inside that cluster. Lines 323~376 add an optimization for SSDs that support TRIM: when a fresh empty cluster is found, the old data of that cluster is discarded.
In short, whenever an entry of swap_map is zero, the corresponding slot is free and can be allocated; a minimal sketch of that idea follows.
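The following is a deliberately simplified, self-contained illustration of that invariant (a zero byte in swap_map means the slot is free). It ignores clusters, locking, SWAP_HAS_CACHE and everything else scan_swap_map() actually handles, and the function name is made up.

#include <stdio.h>

/* Toy model: return the first free slot in [lowest, highest], or 0 if none. */
static unsigned long demo_find_free_slot(const unsigned char *swap_map,
					  unsigned long lowest,
					  unsigned long highest)
{
	unsigned long offset;

	for (offset = lowest; offset <= highest; offset++) {
		if (swap_map[offset] == 0)	/* a zero usage count means the slot is free */
			return offset;		/* the caller would then store its usage, e.g. SWAP_HAS_CACHE */
	}
	return 0;	/* no free slot (slot 0 is never handed out, as in the kernel) */
}

int main(void)
{
	unsigned char swap_map[8] = { 0, 2, 1, 0, 1, 0, 0, 1 };	/* per-slot usage counts */

	printf("first free slot: %lu\n", demo_find_free_slot(swap_map, 1, 7));
	return 0;
}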
4. The relationship between the swap cache, swap slots, and pages
The swap cache is not really a cache; it is more of a bookkeeping mechanism. Why is it needed? Consider this case: a physical page is referenced by several processes and gets swapped out, so the page table entries in all of those processes are rewritten to point at the same swap slot. Later, one of the processes touches the page again: a new physical page is allocated and the contents are copied back in from swap. At that moment, the page table entries of the other processes still hold the swap slot address. What should happen when one of them accesses the page? Obviously it should share the physical page that has already been swapped in, and the reference count of the swap slot should just be decremented. How do those other processes find out that the slot's contents already have a copy in memory? Through the swap cache.
(For a more detailed explanation, see Understanding the Linux Kernel, 3rd edition, section 17.4.6, "The Swap Cache".) A toy illustration of the idea follows.
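Conceptually, the swap cache maps a swap slot to the struct page that currently holds its contents (this is exactly what __add_to_swap_cache() below inserts into the radix tree). The sketch models the cache as a tiny array keyed by slot number; all names are made up and none of this is ART of the kernel's real lookup machinery.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of one swap area's swap cache: slot -> in-memory page. */
#define DEMO_SLOTS 16

struct demo_page { int data; int refcount; };

static struct demo_page *swap_cache[DEMO_SLOTS];	/* NULL = not cached */

/* Swap-in for a given slot: reuse the cached page if another process
 * already brought it in, otherwise "read" it from swap and cache it. */
static struct demo_page *demo_swapin(unsigned int slot)
{
	struct demo_page *page = swap_cache[slot];

	if (page) {			/* already swapped in by someone else */
		page->refcount++;
		return page;
	}
	page = calloc(1, sizeof(*page));
	if (!page)
		return NULL;
	page->refcount = 1;
	/* ...here the kernel would issue the real read from the swap device... */
	swap_cache[slot] = page;	/* record it so later faults can share it */
	return page;
}

int main(void)
{
	struct demo_page *a = demo_swapin(3);	/* first faulting process: reads from swap */
	struct demo_page *b = demo_swapin(3);	/* second process: shares the same page */

	printf("shared=%d refcount=%d\n", a == b, a->refcount);
	return 0;
}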
5. Swap-out
Swapping out means that, during memory reclaim, anonymous pages which are not expected to be needed soon are released to gain free physical memory, with their contents parked on the swap device/file. A swap-out therefore has two parts: first find anonymous pages that are not currently needed, then write them out.
Candidate anonymous pages are found with an approximate LRU algorithm; see the earlier posts on memory reclaim for the details.
To swap a page out, a free slot must first be found on the swap device/file; the page is then marked dirty and queued to the block layer, and the disk driver writes it out. Finding a free slot was covered in the previous section, and writing the page is ordinary block I/O, so neither is analysed further here. One point is worth noting: if a page is mapped by several processes, how do we guarantee that it is written out only once, and to a single swap entry? Conversely, on swap-in, if process A has already read the page back into memory while the page table entries of process B still point at the swap entry, how do we avoid reading the same slot from disk again when B touches it? Both problems are solved by the address_space: each swap device/file has one, implemented as a radix tree whose key is the swap entry and whose value is a pointer to the physical page. For the relationship between swap entries, physical pages and the address_space, see section 4 above.
Once a free swap slot has been found, the page is added to the swap cache and written out:
77 /*
78 * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
79 * but sets SwapCache flag and private instead of mapping and index.
80 */
81 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
82 {
83 int error;
84 struct address_space *address_space;
85
86 VM_BUG_ON(!PageLocked(page));
87 VM_BUG_ON(PageSwapCache(page));
88 VM_BUG_ON(!PageSwapBacked(page));
89
90 page_cache_get(page);
91 SetPageSwapCache(page);
92 set_page_private(page, entry.val);
93
94 address_space = swap_address_space(entry);
95 spin_lock_irq(&address_space->tree_lock);
96 error = radix_tree_insert(&address_space->page_tree,
97 entry.val, page);
98 if (likely(!error)) {
99 address_space->nrpages++;
100 __inc_zone_page_state(page, NR_FILE_PAGES);
101 INC_CACHE_INFO(add_total);
102 }
103 spin_unlock_irq(&address_space->tree_lock);
104
105 if (unlikely(error)) {
106 /*
107 * Only the context which have set SWAP_HAS_CACHE flag
108 * would call add_to_swap_cache().
109 * So add_to_swap_cache() doesn't returns -EEXIST.
110 */
111 VM_BUG_ON(error == -EEXIST);
112 set_page_private(page, 0UL);
113 ClearPageSwapCache(page);
114 page_cache_release(page);
115 }
116
117 return error;
118 }
For a page being swapped out, its reference count is first raised (line 90), the PG_swapcache flag is set, and page->private is set to the value of the swap entry (i.e. the slot offset within the swap device/file). Finally the page is inserted into the radix tree of the address_space.
6. Swap-in
After a physical page of a process has been swapped out to the swap area, the next access to it triggers a page fault. If the fault handler finds that the page table entry is non-zero but not valid (not present), it concludes that a swap-in is needed. The corresponding handler is do_swap_page().
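The classification the fault handler performs can be illustrated with a toy model (this is not the kernel's handle_pte_fault(); it only mimics the decision described above, using the bit layout from the earlier sketch):

#include <stdbool.h>
#include <stdio.h>

/* Toy PTE: bit 0 = present; a non-present, non-zero PTE encodes a swap entry. */
static const char *demo_classify_pte(unsigned long pte)
{
	bool present = pte & 0x1;

	if (present)
		return "resident page, no swap-in needed";
	if (pte == 0)
		return "never mapped: allocate a fresh page";
	return "swapped out: contents live on swap, call do_swap_page()";
}

int main(void)
{
	printf("%s\n", demo_classify_pte(0));
	printf("%s\n", demo_classify_pte(0x1));
	printf("%s\n", demo_classify_pte(0x1234UL << 7));	/* swap entry from the earlier sketch */
	return 0;
}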
handle_pte_fault->do_swap_page:
2994 /*
2995 * We enter with non-exclusive mmap_sem (to exclude vma changes,
2996 * but allow concurrent faults), and pte mapped but not yet locked.
2997 * We return with mmap_sem still held, but pte unmapped and unlocked.
2998 */
2999 static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
3000 unsigned long address, pte_t *page_table, pmd_t *pmd,
3001 unsigned int flags, pte_t orig_pte)
3002 {
3003 spinlock_t *ptl;
3004 struct page *page, *swapcache;
3005 swp_entry_t entry;
3006 pte_t pte;
3007 int locked;
3008 struct mem_cgroup *ptr;
3009 int exclusive = 0;
3010 int ret = 0;
3011
3012 if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
3013 goto out;
3014
3015 entry = pte_to_swp_entry(orig_pte);
3016 if (unlikely(non_swap_entry(entry))) {
3017 if (is_migration_entry(entry)) {
3018 migration_entry_wait(mm, pmd, address);
3019 } else if (is_hwpoison_entry(entry)) {
3020 ret = VM_FAULT_HWPOISON;
3021 } else {
3022 print_bad_pte(vma, address, orig_pte, NULL);
3023 ret = VM_FAULT_SIGBUS;
3024 }
3025 goto out;
3026 }
3027 delayacct_set_flag(DELAYACCT_PF_SWAPIN);
3028 page = lookup_swap_cache(entry);
3029 if (!page) {
3030 page = swapin_readahead(entry,
3031 GFP_HIGHUSER_MOVABLE, vma, address);
…...
3117 pte = mk_pte(page, vma->vm_page_prot);
3118 if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
3119 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
3120 flags &= ~FAULT_FLAG_WRITE;
3121 ret |= VM_FAULT_WRITE;
3122 exclusive = 1;
3123 }
3124 flush_icache_page(vma, page);
3125 if (pte_swp_soft_dirty(orig_pte))
3126 pte = pte_mksoft_dirty(pte);
3127 set_pte_at(mm, address, page_table, pte);
3128 if (page == swapcache)
3129 do_page_add_anon_rmap(page, vma, address, exclusive);
3130 else /* ksm created a completely new copy */
3131 page_add_new_anon_rmap(page, vma, address);
3132 /* It's better to call commit-charge after rmap is established */
3133 mem_cgroup_commit_charge_swapin(page, ptr);
3134
3135 swap_free(entry);
3136 if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3137 try_to_free_swap(page);
3138 unlock_page(page);
3139 if (page != swapcache) {
3140 /*
3141 * Hold the lock to avoid the swap entry to be reused
3142 * until we take the PT lock for the pte_same() check
3143 * (to avoid false positives from pte_same). For
3144 * further safety release the lock after the swap_free
3145 * so that the swap count won't change under a
3146 * parallel locked swapcache.
3147 */
3148 unlock_page(swapcache);
3149 page_cache_release(swapcache);
3150 }
3151
3152 if (flags & FAULT_FLAG_WRITE) {
3153 ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
3154 if (ret & VM_FAULT_ERROR)
3155 ret &= VM_FAULT_ERROR;
3156 goto out;
3157 }
3158
3159 /* No need to invalidate - it was non-present before */
3160 update_mmu_cache(vma, address, page_table);
3161 unlock:
3162 pte_unmap_unlock(page_table, ptl);
3163 out:
3164 return ret;
3165 out_nomap:
3166 mem_cgroup_cancel_charge_swapin(ptr);
3167 pte_unmap_unlock(page_table, ptl);
3168 out_page:
3169 unlock_page(page);
3170 out_release:
3171 page_cache_release(page);
3172 if (page != swapcache) {
3173 unlock_page(swapcache);
3174 page_cache_release(swapcache);
3175 }
3176 return ret;
3177 }
Line 3012 avoids duplicated work: if the PTE has already been handled by another process or another core, there is nothing left to do.
Line 3015 converts the PTE into the swap entry (slot address) it encodes.
Lines 3016~3026 handle errors and other special (non-swap) entries.
Line 3027 sets the delay-accounting flag (task_delay_info.flags in the task's struct task_struct), announcing "I am doing a swap-in"; the matching call later in the function (in the portion elided above) clears the flag again once the swap-in is done.
Since a physical page may be referenced by several processes, if another process has already copied it back into memory, i.e. the page can be found in the swap cache of this swap area (line 3028), there is no need to read it again. If it is not in the swap cache, the storage controller is driven to read the corresponding contents of the swap area into memory (line 3030). Once that is done, the page table is updated (lines 3117 and 3127), and the reference count of the corresponding swap slot is maintained (line 3135).
7. Reclaiming swap slots
When the swap device/file is getting full, some of its space has to be reclaimed. The criterion for "full" is:
/* Swap 50% full? Release swapcache more aggressively.. */
408 static inline bool vm_swap_full(void)
409 {
410 return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
411 }
As the code shows, once 50% of the swap space is in use, the swap area is considered worth trimming. So which allocated swap slots can be reclaimed? Those that satisfy all of the following conditions (see the sketch after this list):
1. The swap entry is present in the swap cache (PG_swapcache is set), i.e. physical memory also holds a copy of the slot's contents.
2. The page is not under writeback (PG_writeback is clear), which means the in-memory copy was swapped back in earlier rather than being in the middle of a swap-out.
3. The swap entry's reference count is 0, i.e. no process references the slot any more. This may look like it contradicts point 1; note, however, that before the swap space is freed the entry is removed from the address_space, while PG_swapcache remains set.
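These three conditions map almost one-to-one onto the checks the kernel performs before dropping a slot (try_to_free_swap() and friends). The following is a paraphrased, self-contained sketch of that check, with the page flags and swap count folded into a toy structure; it is not a copy of the kernel function.

#include <stdbool.h>
#include <stdio.h>

/* Toy view of the state a try_to_free_swap()-style check inspects. */
struct demo_page_state {
	bool swapcache;		/* PG_swapcache: slot contents also cached in RAM */
	bool writeback;		/* PG_writeback: swap-out still in flight */
	int  swap_count;	/* processes still referencing the swap slot */
};

/* Returns true when the swap slot backing this page may be released. */
static bool demo_can_free_swap_slot(const struct demo_page_state *p)
{
	if (!p->swapcache)	/* condition 1 */
		return false;
	if (p->writeback)	/* condition 2 */
		return false;
	if (p->swap_count)	/* condition 3 */
		return false;
	return true;
}

int main(void)
{
	struct demo_page_state s = { .swapcache = true, .writeback = false, .swap_count = 0 };

	printf("reclaimable: %d\n", demo_can_free_swap_slot(&s));
	return 0;
}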
Once a swap entry has been removed from the address_space (i.e. from the swap cache), swapcache_free() is called to drop the reference the cache held on it:
862 /*
863 * Called after dropping swapcache to decrease refcnt to swap entries.
864 */
865 void swapcache_free(swp_entry_t entry, struct page *page)
866 {
867 struct swap_info_struct *p;
868 unsigned char count;
869
870 p = swap_info_get(entry);
871 if (p) {
872 count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
873 if (page)
874 mem_cgroup_uncharge_swapcache(page, entry, count != 0);
875 spin_unlock(&p->lock);
876 }
877 }
swapcache_free->swap_entry_free:
The swap entry has already been removed from the swap cache's radix tree, so its usage count in swap_map[] must now be decremented. Because the entry used to live in the swap cache, the SWAP_HAS_CACHE bit of the corresponding swap_map[] count is set; that is why swapcache_free() passes SWAP_HAS_CACHE into swap_entry_free(). As the code below shows, when SWAP_HAS_CACHE is passed in, the SWAP_HAS_CACHE bit of the entry's swap_map count must indeed be set (lines 800~803).
789 static unsigned char swap_entry_free(struct swap_info_struct *p,
790 swp_entry_t entry, unsigned char usage)
791 {
792 unsigned long offset = swp_offset(entry);
793 unsigned char count;
794 unsigned char has_cache;
795
796 count = p->swap_map[offset];
797 has_cache = count & SWAP_HAS_CACHE;
798 count &= ~SWAP_HAS_CACHE;
799
800 if (usage == SWAP_HAS_CACHE) {
801 VM_BUG_ON(!has_cache);
802 has_cache = 0;
803 } else if (count == SWAP_MAP_SHMEM) {
804 /*
805 * Or we could insist on shmem.c using a special
806 * swap_shmem_free() and free_shmem_swap_and_cache()...
807 */
808 count = 0;
809 } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
810 if (count == COUNT_CONTINUED) {
811 if (swap_count_continued(p, offset, count))
812 count = SWAP_MAP_MAX | COUNT_CONTINUED;
813 else
814 count = SWAP_MAP_MAX;
815 } else
816 count--;
817 }
818
819 if (!count)
820 mem_cgroup_uncharge_swap(entry);
821
822 usage = count | has_cache;
823 p->swap_map[offset] = usage;
824
825 /* free if no reference */
826 if (!usage) {
827 dec_cluster_info_page(p, p->cluster_info, offset);
828 if (offset < p->lowest_bit)
829 p->lowest_bit = offset;
830 if (offset > p->highest_bit)
831 p->highest_bit = offset;
832 set_highest_priority_index(p->type);
833 atomic_long_inc(&nr_swap_pages);
834 p->inuse_pages--;
835 frontswap_invalidate_page(p->type, offset);
836 if (p->flags & SWP_BLKDEV) {
837 struct gendisk *disk = p->bdev->bd_disk;
838 if (disk->fops->swap_slot_free_notify)
839 disk->fops->swap_slot_free_notify(p->bdev,
840 offset);
841 }
842 }
843
844 return usage;
845 }
Line 826 shows that a swap slot is freed once its reference count in the swap device/file drops to 0. As described earlier, every swap device or swap file has a swap_map, an array of unsigned char with one element per page of the swap area, recording that page's usage count: if a swap entry is mapped by N PTEs, its count needs to be N. The count that fits in the first-level swap_map byte is limited (SWAP_MAP_MAX); when a count would exceed it, the kernel sets COUNT_CONTINUED (0x80) in that byte and allocates an extra page as a continuation of swap_map, where counting for that offset carries on. The patch that introduced this scheme (570a335b8) is quoted below, after a short reference block of the relevant constants.
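For reference, the count encodings in include/linux/swap.h of kernels from this era look roughly as follows (quoted from memory for illustration; check your tree for the authoritative values):

/* Per-slot usage count encoding in swap_map[] (first-level byte). */
#define SWAP_HAS_CACHE	0x40	/* slot also present in the swap cache */
#define COUNT_CONTINUED	0x80	/* count continues in a continuation page */
#define SWAP_MAP_MAX	0x3e	/* max map count held in the first byte */
#define SWAP_MAP_BAD	0x3f	/* slot marked bad */
#define SWAP_MAP_SHMEM	0xbf	/* slot owned by shmem/tmpfs */
#define SWAP_CONT_MAX	0x7f	/* max count per continuation byte */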
+/*
+ * add_swap_count_continuation - called when a swap count is duplicated
+ * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
+ * page of the original vmalloc'ed swap_map, to hold the continuation count
+ * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
+ * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
+ *
+ * These continuation pages are seldom referenced: the common paths all work
+ * on the original swap_map, only referring to a continuation page when the
+ * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
+ *
+ * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
+ * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
+ * can be called after dropping locks.
+ */
+int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+{
+ struct swap_info_struct *si;
+ struct page *head;
+ struct page *page;
+ struct page *list_page;
+ pgoff_t offset;
+ unsigned char count;
+
+ /*
+ * When debugging, it's easier to use __GFP_ZERO here; but it's better
+ * for latency not to zero a page while GFP_ATOMIC and holding locks.
+ */
+ page = alloc_page(gfp_mask | __GFP_HIGHMEM);
+
+ si = swap_info_get(entry);
+ if (!si) {
+ /*
+ * An acceptable race has occurred since the failing
+ * __swap_duplicate(): the swap entry has been freed,
+ * perhaps even the whole swap_map cleared for swapoff.
+ */
+ goto outer;
+ }
+
+ offset = swp_offset(entry);
+ count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
+
+ if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
+ /*
+ * The higher the swap count, the more likely it is that tasks
+ * will race to add swap count continuation: we need to avoid
+ * over-provisioning.
+ */
+ goto out;
+ }
+
+ if (!page) {
+ spin_unlock(&swap_lock);
+ return -ENOMEM;
+ }
+
+ /*
+ * We are fortunate that although vmalloc_to_page uses pte_offset_map,
+ * no architecture is using highmem pages for kernel pagetables: so it
+ * will not corrupt the GFP_ATOMIC caller's atomic pagetable kmaps.
+ */
+ head = vmalloc_to_page(si->swap_map + offset);
+ offset &= ~PAGE_MASK;
+
+ /*
+ * Page allocation does not initialize the page's lru field,
+ * but it does always reset its private field.
+ */
+ if (!page_private(head)) {
+ BUG_ON(count & COUNT_CONTINUED);
+ INIT_LIST_HEAD(&head->lru);
+ set_page_private(head, SWP_CONTINUED);
+ si->flags |= SWP_CONTINUED;
+ }
+
+ list_for_each_entry(list_page, &head->lru, lru) {
+ unsigned char *map;
+
+ /*
+ * If the previous map said no continuation, but we've found
+ * a continuation page, free our allocation and use this one.
+ */
+ if (!(count & COUNT_CONTINUED))
+ goto out;
+
+ map = kmap_atomic(list_page, KM_USER0) + offset;
+ count = *map;
+ kunmap_atomic(map, KM_USER0);
+
+ /*
+ * If this continuation count now has some space in it,
+ * free our allocation and use this one.
+ */
+ if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
+ goto out;
+ }
+
+ list_add_tail(&page->lru, &head->lru);
+ page = NULL; /* now it's attached, don't free it */
+out:
+ spin_unlock(&swap_lock);
+outer:
+ if (page)
+ __free_page(page);
+ return 0;
+}
References:
1. Understanding the Linux Kernel, 3rd edition.
2. Why do SSDs need TRIM? (in Chinese) http://www.guokr.com/blog/483475/
Contents
1. Overview
2. Initialization
3. Allocation
4. Freeing
5. Summary
1. Overview
CMA stands for Contiguous Memory Allocator. Its working principle is: a region of memory is reserved for drivers, but while the drivers are not using it, the page allocator (the buddy system) may hand it out to user processes as anonymous memory or page cache. When a driver does need it, the memory occupied by those processes is reclaimed or migrated elsewhere so that the reserved region becomes available to the driver again. This article walks through the source of CMA initialization, allocation and freeing (kernel v3.10).
2. Initialization
CMA must be initialized before the buddy allocator is set up and after the memblock early allocator has been initialized (see the comment on dma_contiguous_reserve(): "This function reserves memory from early allocator. It should be called by arch specific code once the early allocator (memblock or bootmem) has been activated and all other subsystems have already allocated/reserved memory.").
On ARM, the entry point for CMA initialization is dma_contiguous_reserve(phys_addr_t limit), where the limit parameter is the upper bound of the CMA region.
setup_arch->arm_memblock_init->dma_contiguous_reserve:
107 void __init dma_contiguous_reserve(phys_addr_t limit)
108 {
109 phys_addr_t selected_size = 0;
110
111 pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);
112
113 if (size_cmdline != -1) {
114 selected_size = size_cmdline;
115 } else {
116 #ifdef CONFIG_CMA_SIZE_SEL_MBYTES
117 selected_size = size_bytes;
118 #elif defined(CONFIG_CMA_SIZE_SEL_PERCENTAGE)
119 selected_size = cma_early_percent_memory();
120 #elif defined(CONFIG_CMA_SIZE_SEL_MIN)
121 selected_size = min(size_bytes, cma_early_percent_memory());
122 #elif defined(CONFIG_CMA_SIZE_SEL_MAX)
123 selected_size = max(size_bytes, cma_early_percent_memory());
124 #endif
125 }
126
127 if (selected_size) {
128 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
129 (unsigned long)selected_size / SZ_1M);
130
131 dma_declare_contiguous(NULL, selected_size, 0, limit);
132 }
133 };
Two values matter in this function: selected_size and limit. selected_size is the size of the CMA region being declared, and limit constrains how high in physical memory the region may be placed.
selected_size is obtained as follows: if the kernel command line contains cma=<size>, that value is used (line 114). Otherwise, the value comes from CONFIG_CMA_SIZE_SEL_MBYTES (line 117) or CONFIG_CMA_SIZE_SEL_PERCENTAGE (line 119) if either is configured; failing that, the minimum (line 121) or maximum (line 123) of CONFIG_CMA_SIZE_MBYTES and CONFIG_CMA_SIZE_PERCENTAGE is chosen.
With the CMA size computed and the limit known, the code enters dma_declare_contiguous().
setup_arch->arm_memblock_init->dma_contiguous_reserve->dma_declare_contiguous:
218 /**
219 * dma_declare_contiguous() - reserve area for contiguous memory handling
220 * for particular device
221 * @dev: Pointer to device structure.
222 * @size: Size of the reserved memory.
223 * @base: Start address of the reserved memory (optional, 0 for any).
224 * @limit: End address of the reserved memory (optional, 0 for any).
225 *
226 * This function reserves memory for specified device. It should be
227 * called by board specific code when early allocator (memblock or bootmem)
228 * is still activate.
229 */
230 int __init dma_declare_contiguous(struct device *dev, phys_addr_t size,
231 phys_addr_t base, phys_addr_t limit)
232 {
233 struct cma_reserved *r = &cma_reserved[cma_reserved_count];
234 phys_addr_t alignment;
235
…
248
249 /* Sanitise input arguments */
250 alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
251 base = ALIGN(base, alignment);
252 size = ALIGN(size, alignment);
253 limit &= ~(alignment - 1);
254
255 /* Reserve memory */
256 if (base) {
257 if (memblock_is_region_reserved(base, size) ||
258 memblock_reserve(base, size) < 0) {
259 base = -EBUSY;
260 goto err;
261 }
262 } else {
263 /*
264 * Use __memblock_alloc_base() since
265 * memblock_alloc_base() panic()s.
266 */
267 phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
268 if (!addr) {
269 base = -ENOMEM;
270 goto err;
271 } else {
272 base = addr;
273 }
274 }
275
276 /*
277 * Each reserved area must be initialised later, when more kernel
278 * subsystems (like slab allocator) are available.
279 */
280 r->start = base;
281 r->size = size;
282 r->dev = dev;
283 cma_reserved_count++;
284 pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
285 (unsigned long)base);
286
287 /* Architecture specific contiguous memory fixup. */
288 dma_contiguous_early_fixup(base, size);
289 return 0;
290 err:
291 pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
292 return base;
293 }
This function first derives the base and size of the CMA region from the size and limit arguments. If no base was specified (it is 0 in this initialization path), the region is allocated with the early allocator. The base and size are aligned, normally to 4MB (line 250: MAX_ORDER is 11 and pageblock_order is 10, so PAGE_SIZE << 10 = 4MB). A CMA region allocated by the early allocator is taken from the high end of lowmem.
The base and size of the CMA region are then stored in the global cma_reserved[] array (lines 280~282), and cma_reserved_count tracks how many CMA regions have been recorded there (line 283).
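For orientation, the bookkeeping structures referenced here and in the code below look roughly like this in drivers/base/dma-contiguous.c of this kernel generation (fields reconstructed from the code shown in this article, so treat it as a sketch rather than an authoritative excerpt):

/* One entry per CMA region recorded at early-boot time. */
struct cma_reserved {
	phys_addr_t	start;	/* physical base of the reserved range */
	unsigned long	size;	/* size in bytes */
	struct device	*dev;	/* owning device, or NULL for the global area */
};

/* Runtime descriptor created later by cma_create_area(). */
struct cma {
	unsigned long	base_pfn;	/* first page frame of the region */
	unsigned long	count;		/* number of pages */
	unsigned long	*bitmap;	/* one bit per page: set = allocated */
};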
In the ARM kernel code, the CMA region obtained here is also recorded in the dma_mmu_remap array (this structure only records base and size; see lines 396~410 below).
396 struct dma_contig_early_reserve {
397 phys_addr_t base;
398 unsigned long size;
399 };
400
401 static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
402
403 static int dma_mmu_remap_num __initdata;
404
405 void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
406 {
407 dma_mmu_remap[dma_mmu_remap_num].base = base;
408 dma_mmu_remap[dma_mmu_remap_num].size = size;
409 dma_mmu_remap_num++;
410 }
Everything so far only reserves the CMA region and records it in the arrays above. After the buddy system has finished initializing, the reserved CMA regions are processed further:
200 static int __init cma_init_reserved_areas(void)
201 {
202 struct cma_reserved *r = cma_reserved;
203 unsigned i = cma_reserved_count;
204
205 pr_debug("%s()\n", __func__);
206
207 for (; i; --i, ++r) {
208 struct cma *cma;
209 cma = cma_create_area(PFN_DOWN(r->start),
210 r->size >> PAGE_SHIFT);
211 if (!IS_ERR(cma))
212 dev_set_cma_area(r->dev, cma);
213 }
214 return 0;
215 }
216 core_initcall(cma_init_reserved_areas);
At line 212, the struct cma that was created is stored in dev->cma_area. When a device later performs a CMA allocation, the range recorded in its dev->cma_area is the one the CMA buffer is carved out of.
cma_init_reserved_areas->cma_create_area:
159 static __init struct cma *cma_create_area(unsigned long base_pfn,
160 unsigned long count)
161 {
162 int bitmap_size = BITS_TO_LONGS(count) * sizeof(long);
163 struct cma *cma;
164 int ret = -ENOMEM;
165
166 pr_debug("%s(base %08lx, count %lx)\n", __func__, base_pfn, count);
167
168 cma = kmalloc(sizeof *cma, GFP_KERNEL);
169 if (!cma)
170 return ERR_PTR(-ENOMEM);
171
172 cma->base_pfn = base_pfn;
173 cma->count = count;
174 cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
175
176 if (!cma->bitmap)
177 goto no_mem;
178
179 ret = cma_activate_area(base_pfn, count);
180 if (ret)
181 goto error;
182
183 pr_debug("%s: returned %p\n", __func__, (void *)cma);
184 return cma;
185
186 error:
187 kfree(cma->bitmap);
188 no_mem:
189 kfree(cma);
190 return ERR_PTR(ret);
191 }
cma_init_reserved_areas->cma_create_area->cma_activate_area:
137 static __init int cma_activate_area(unsigned long base_pfn, unsigned long count)
138 {
139 unsigned long pfn = base_pfn;
140 unsigned i = count >> pageblock_order;
141 struct zone *zone;
142
143 WARN_ON_ONCE(!pfn_valid(pfn));
144 zone = page_zone(pfn_to_page(pfn));
145
146 do {
147 unsigned j;
148 base_pfn = pfn;
149 for (j = pageblock_nr_pages; j; --j, pfn++) {
150 WARN_ON_ONCE(!pfn_valid(pfn));
151 if (page_zone(pfn_to_page(pfn)) != zone)
152 return -EINVAL;
153 }
154 init_cma_reserved_pageblock(pfn_to_page(base_pfn));
155 } while (--i);
156 return 0;
157 }
As seen during CMA initialization, both the base and size of the region are aligned to pageblock boundaries. pageblock_order is 10, so one pageblock represents a 4MB block of memory (2^10 * PAGE_SIZE). cma_activate_area() therefore initializes the region one pageblock at a time (lines 140 and 155). CMA requires all pages of a region to belong to a single zone, so lines 149~153 verify that every page is in the same zone before each pageblock is initialized.
cma_init_reserved_areas->cma_create_area->cma_activate_area->init_cma_reserved_pageblock:
769 #ifdef CONFIG_CMA
770 /* Free whole pageblock and set it's migration type to MIGRATE_CMA. */
771 void __init init_cma_reserved_pageblock(struct page *page)
772 {
773 unsigned i = pageblock_nr_pages;
774 struct page *p = page;
775
776 do {
777 __ClearPageReserved(p);
778 set_page_count(p, 0);
779 } while (++p, --i);
780
781 set_page_refcounted(page);
782 set_pageblock_migratetype(page, MIGRATE_CMA);
783 __free_pages(page, pageblock_order);
784 totalram_pages += pageblock_nr_pages;
785 #ifdef CONFIG_HIGHMEM
786 if (PageHighMem(page))
787 totalhigh_pages += pageblock_nr_pages;
788 #endif
789 }
Free pages entering the buddy system must have page->_count equal to 0, so line 778 sets the usage count of every page in the pageblock to 0. Line 781 then sets _count of the first page of the pageblock to 1, because the __free_pages() call at line 783 will drop it again via put_page_testzero(). The migratetype of the pageblock is also set to MIGRATE_CMA; migratetypes are recorded per pageblock in zone->pageblock_flags, 3 bits each, so it is the pageblock (identified by its first page) that carries the type rather than the individual pages.
After this per-page initialization, __free_pages() hands the block over to the buddy system (line 783).
So right after initialization, all CMA memory sits on the order-10 buddy lists, specifically on zone->free_area[10].free_list[MIGRATE_CMA] of the corresponding zone.
3. Allocation
CMA is not exposed to driver writers directly. A developer simply calls the usual DMA allocation functions, such as dma_alloc_coherent(), when a DMA buffer is needed; the DMA layer eventually lands in CMA's allocation function, dma_alloc_from_contiguous(), whose code is quoted after the short example below.
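As an illustration of the driver-facing side, a hypothetical driver might obtain a DMA buffer like this (the helper names are made up; on a CMA-enabled kernel this coherent allocation can end up being satisfied from the CMA region via dma_alloc_from_contiguous()):

#include <linux/dma-mapping.h>
#include <linux/device.h>

/* Hypothetical helper inside some driver's probe path. */
static void *demo_alloc_dma_buffer(struct device *dev, size_t size,
				   dma_addr_t *dma_handle)
{
	/* Coherent allocation; may be backed by CMA when available. */
	void *cpu_addr = dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);

	if (!cpu_addr)
		return NULL;	/* fall back or fail the probe */
	return cpu_addr;
}

static void demo_free_dma_buffer(struct device *dev, size_t size,
				 void *cpu_addr, dma_addr_t dma_handle)
{
	dma_free_coherent(dev, size, cpu_addr, dma_handle);
}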
295 /**
296 * dma_alloc_from_contiguous() - allocate pages from contiguous area
297 * @dev: Pointer to device for which the allocation is performed.
298 * @count: Requested number of pages.
299 * @align: Requested alignment of pages (in PAGE_SIZE order).
300 *
301 * This function allocates memory buffer for specified device. It uses
302 * device specific contiguous memory area if available or the default
303 * global one. Requires architecture specific get_dev_cma_area() helper
304 * function.
305 */
306 struct page *dma_alloc_from_contiguous(struct device *dev, int count,
307 unsigned int align)
308 {
309 unsigned long mask, pfn, pageno, start = 0;
310 struct cma *cma = dev_get_cma_area(dev);
311 struct page *page = NULL;
312 int ret;
313
314 if (!cma || !cma->count)
315 return NULL;
316
317 if (align > CONFIG_CMA_ALIGNMENT)
318 align = CONFIG_CMA_ALIGNMENT;
319
320 pr_debug("%s(cma %p, count %d, align %d)\n", __func__, (void *)cma,
321 count, align);
322
323 if (!count)
324 return NULL;
325
326 mask = (1 << align) - 1;
327
328 mutex_lock(&cma_mutex);
329
330 for (;;) {
331 pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
332 start, count, mask);
333 if (pageno >= cma->count)
334 break;
335
336 pfn = cma->base_pfn + pageno;
337 ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
338 if (ret == 0) {
339 bitmap_set(cma->bitmap, pageno, count);
340 page = pfn_to_page(pfn);
341 break;
342 } else if (ret != -EBUSY) {
343 break;
344 }
345 pr_debug("%s(): memory range at %p is busy, retrying\n",
346 __func__, pfn_to_page(pfn));
347 /* try again with a bit different memory target */
348 start = pageno + mask + 1;
349 }
350
351 mutex_unlock(&cma_mutex);
352 pr_debug("%s(): returned %p\n", __func__, page);
353 return page;
354 }
The comment at lines 296~304 states the purpose: allocate a buffer from the device-specific CMA area, or from the default global one. Line 310 fetches the CMA area of the given device; if the device has none, the system default area is used. Every CMA area keeps a bitmap (struct cma->bitmap) recording which of its pages are already in use, so finding N contiguous free pages means finding N consecutive zero bits in cma->bitmap. If the returned index lies beyond the end of the area, the search gives up (line 333); otherwise alloc_contig_range() is attempted, and on success the corresponding bits in cma->bitmap are set (line 339) and the first page is returned.
dma_alloc_from_contiguous->alloc_contig_range:
5908 /**
5909 * alloc_contig_range() -- tries to allocate given range of pages
5910 * @start: start PFN to allocate
5911 * @end: one-past-the-last PFN to allocate
5912 * @migratetype: migratetype of the underlaying pageblocks (either
5913 * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks
5914 * in range must have the same migratetype and it must
5915 * be either of the two.
5916 *
5917 * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
5918 * aligned, however it's the caller's responsibility to guarantee that
5919 * we are the only thread that changes migrate type of pageblocks the
5920 * pages fall in.
5921 *
5922 * The PFN range must belong to a single zone.
5923 *
5924 * Returns zero on success or negative error code. On success all
5925 * pages which PFN is in [start, end) are allocated for the caller and
5926 * need to be freed with free_contig_range().
5927 */
5928 int alloc_contig_range(unsigned long start, unsigned long end,
5929 unsigned migratetype)
5930 {
5931 unsigned long outer_start, outer_end;
5932 int ret = 0, order;
5933
5934 struct compact_control cc = {
5935 .nr_migratepages = 0,
5936 .order = -1,
5937 .zone = page_zone(pfn_to_page(start)),
5938 .sync = true,
5939 .ignore_skip_hint = true,
5940 };
5941 INIT_LIST_HEAD(&cc.migratepages);
5942
5943 /*
5944 * What we do here is we mark all pageblocks in range as
5945 * MIGRATE_ISOLATE. Because pageblock and max order pages may
5946 * have different sizes, and due to the way page allocator
5947 * work, we align the range to biggest of the two pages so
5948 * that page allocator won't try to merge buddies from
5949 * different pageblocks and change MIGRATE_ISOLATE to some
5950 * other migration type.
5951 *
5952 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
5953 * migrate the pages from an unaligned range (ie. pages that
5954 * we are interested in). This will put all the pages in
5955 * range back to page allocator as MIGRATE_ISOLATE.
5956 *
5957 * When this is done, we take the pages in range from page
5958 * allocator removing them from the buddy system. This way
5959 * page allocator will never consider using them.
5960 *
5961 * This lets us mark the pageblocks back as
5962 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
5963 * aligned range but not in the unaligned, original range are
5964 * put back to page allocator so that buddy can use them.
5965 */
5966
5967 ret = start_isolate_page_range(pfn_max_align_down(start),
5968 pfn_max_align_up(end), migratetype,
5969 false);
5970 if (ret)
5971 return ret;
5972
5973 ret = __alloc_contig_migrate_range(&cc, start, end);
5974 if (ret)
5975 goto done;
5976
5977 /*
5978 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
5979 * aligned blocks that are marked as MIGRATE_ISOLATE. What's
5980 * more, all pages in [start, end) are free in page allocator.
5981 * What we are going to do is to allocate all pages from
5982 * [start, end) (that is remove them from page allocator).
5983 *
5984 * The only problem is that pages at the beginning and at the
5985 * end of interesting range may be not aligned with pages that
5986 * page allocator holds, ie. they can be part of higher order
5987 * pages. Because of this, we reserve the bigger range and
5988 * once this is done free the pages we are not interested in.
5989 *
5990 * We don't have to hold zone->lock here because the pages are
5991 * isolated thus they won't get removed from buddy.
5992 */
5993
5994 lru_add_drain_all();
5995 drain_all_pages();
5996
5997 order = 0;
5998 outer_start = start;
5999 while (!PageBuddy(pfn_to_page(outer_start))) {
6000 if (++order >= MAX_ORDER) {
6001 ret = -EBUSY;
6002 goto done;
6003 }
6004 outer_start &= ~0UL << order;
6005 }
6006
6007 /* Make sure the range is really isolated. */
6008 if (test_pages_isolated(outer_start, end, false)) {
6009 pr_warn("alloc_contig_range test_pages_isolated(%lx, %lx) failed\n",
6010 outer_start, end);
6011 ret = -EBUSY;
6012 goto done;
6013 }
6014
6015
6016 /* Grab isolated pages from freelists. */
6017 outer_end = isolate_freepages_range(&cc, outer_start, end);
6018 if (!outer_end) {
6019 ret = -EBUSY;
6020 goto done;
6021 }
6022
6023 /* Free head and tail (if any) */
6024 if (start != outer_start)
6025 free_contig_range(outer_start, start - outer_start);
6026 if (end != outer_end)
6027 free_contig_range(end, outer_end - end);
6028
6029 done:
6030 undo_isolate_page_range(pfn_max_align_down(start),
6031 pfn_max_align_up(end), migratetype);
6032 return ret;
6033 }
The comment block of this function spells out the caller's obligations: alignment, all pages must belong to a single zone, and the range must be freed with free_contig_range().
At line 5967 the start and end are first aligned outwards, and start_isolate_page_range() checks that the range contains no unmovable pages; an unmovable page cannot be migrated, and its presence would make the whole range unusable for CMA (start_isolate_page_range->set_migratetype_isolate->has_unmovable_pages). Once that is confirmed, the pageblocks of the range are marked MIGRATE_ISOLATE, the corresponding free pages in the buddy system are moved onto free_area[].free_list[MIGRATE_ISOLATE], and drain_all_pages() is called (from start_isolate_page_range->set_migratetype_isolate) so that free pages still sitting in the per-CPU pagesets (the per-CPU caches of free hot pages) are given back to the buddy system, since some of them may lie inside the range being allocated. Then __alloc_contig_migrate_range() at line 5973 takes the pages within the isolated range that the buddy system has already handed out and migrates their contents elsewhere, freeing those physical pages for CMA. Once the range is entirely free, isolate_freepages_range() at line 6017 pulls the contiguous free pages off the buddy system.
The code of __alloc_contig_migrate_range() is as follows:
dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range:
5862 /* [start, end) must belong to a single zone. */
5863 static int __alloc_contig_migrate_range(struct compact_control *cc,
5864 unsigned long start, unsigned long end)
5865 {
5866 /* This function is based on compact_zone() from compaction.c. */
5867 unsigned long nr_reclaimed;
5868 unsigned long pfn = start;
5869 unsigned int tries = 0;
5870 int ret = 0;
5871
5872 migrate_prep();
5873
5874 while (pfn < end || !list_empty(&cc->migratepages)) {
5875 if (fatal_signal_pending(current)) {
5876 ret = -EINTR;
5877 break;
5878 }
5879
5880 if (list_empty(&cc->migratepages)) {
5881 cc->nr_migratepages = 0;
5882 pfn = isolate_migratepages_range(cc->zone, cc,
5883 pfn, end, true);
5884 if (!pfn) {
5885 ret = -EINTR;
5886 break;
5887 }
5888 tries = 0;
5889 } else if (++tries == 5) {
5890 ret = ret < 0 ? ret : -EBUSY;
5891 break;
5892 }
5893
5894 nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
5895 &cc->migratepages);
5896 cc->nr_migratepages -= nr_reclaimed;
5897
5898 ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
5899 0, MIGRATE_SYNC, MR_CMA);
5900 }
5901 if (ret < 0) {
5902 putback_movable_pages(&cc->migratepages);
5903 return ret;
5904 }
5905 return 0;
5906 }
This function does the migration work. Pages in a CMA region may be handed out by the buddy system as movable allocations, so a page may already have been given to someone else even though cma->bitmap still says it is available for CMA; in that case its contents must be migrated elsewhere to free the page for CMA.
Line 5882 collects the pages that the buddy system has handed out and hangs them on the cc->migratepages list. reclaim_clean_pages_from_list() at line 5894 then reclaims outright any pages that are clean, and migrate_pages() at line 5898 moves the contents of the remaining, not immediately reclaimable pages to other physical memory.
The function that isolates the pages to be migrated or reclaimed, isolate_migratepages_range(), is:
dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->isolate_migratepages_range:
427 /**
428 * isolate_migratepages_range() - isolate all migrate-able pages in range.
429 * @zone: Zone pages are in.
430 * @cc: Compaction control structure.
431 * @low_pfn: The first PFN of the range.
432 * @end_pfn: The one-past-the-last PFN of the range.
433 * @unevictable: true if it allows to isolate unevictable pages
434 *
435 * Isolate all pages that can be migrated from the range specified by
436 * [low_pfn, end_pfn). Returns zero if there is a fatal signal
437 * pending), otherwise PFN of the first page that was not scanned
438 * (which may be both less, equal to or more then end_pfn).
439 *
440 * Assumes that cc->migratepages is empty and cc->nr_migratepages is
441 * zero.
442 *
443 * Apart from cc->migratepages and cc->nr_migratetypes this function
444 * does not modify any cc's fields, in particular it does not modify
445 * (or read for that matter) cc->migrate_pfn.
446 */
447 unsigned long
448 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
449 unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
450 {
451 unsigned long last_pageblock_nr = 0, pageblock_nr;
452 unsigned long nr_scanned = 0, nr_isolated = 0;
453 struct list_head *migratelist = &cc->migratepages;
454 isolate_mode_t mode = 0;
455 struct lruvec *lruvec;
456 unsigned long flags;
457 bool locked = false;
458 struct page *page = NULL, *valid_page = NULL;
459
460 /*
461 * Ensure that there are not too many pages isolated from the LRU
462 * list by either parallel reclaimers or compaction. If there are,
463 * delay for some time until fewer pages are isolated
464 */
465 while (unlikely(too_many_isolated(zone))) {
466 /* async migration should just abort */
467 if (!cc->sync)
468 return 0;
469
470 congestion_wait(BLK_RW_ASYNC, HZ/10);
471
472 if (fatal_signal_pending(current))
473 return 0;
474 }
475
476 /* Time to isolate some pages for migration */
477 cond_resched();
478 for (; low_pfn < end_pfn; low_pfn++) {
479 /* give a chance to irqs before checking need_resched() */
480 if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
481 if (should_release_lock(&zone->lru_lock)) {
482 spin_unlock_irqrestore(&zone->lru_lock, flags);
483 locked = false;
484 }
485 }
486
487 /*
488 * migrate_pfn does not necessarily start aligned to a
489 * pageblock. Ensure that pfn_valid is called when moving
490 * into a new MAX_ORDER_NR_PAGES range in case of large
491 * memory holes within the zone
492 */
493 if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
494 if (!pfn_valid(low_pfn)) {
495 low_pfn += MAX_ORDER_NR_PAGES - 1;
496 continue;
497 }
498 }
499
500 if (!pfn_valid_within(low_pfn))
501 continue;
502 nr_scanned++;
503
504 /*
505 * Get the page and ensure the page is within the same zone.
506 * See the comment in isolate_freepages about overlapping
507 * nodes. It is deliberate that the new zone lock is not taken
508 * as memory compaction should not move pages between nodes.
509 */
510 page = pfn_to_page(low_pfn);
511 if (page_zone(page) != zone)
512 continue;
513
514 if (!valid_page)
515 valid_page = page;
516
517 /* If isolation recently failed, do not retry */
518 pageblock_nr = low_pfn >> pageblock_order;
519 if (!isolation_suitable(cc, page))
520 goto next_pageblock;
521
522 /* Skip if free */
523 if (PageBuddy(page))
524 continue;
525
526 /*
527 * For async migration, also only scan in MOVABLE blocks. Async
528 * migration is optimistic to see if the minimum amount of work
529 * satisfies the allocation
530 */
531 if (!cc->sync && last_pageblock_nr != pageblock_nr &&
532 !migrate_async_suitable(get_pageblock_migratetype(page))) {
533 cc->finished_update_migrate = true;
534 goto next_pageblock;
535 }
536
537 /*
538 * Check may be lockless but that's ok as we recheck later.
539 * It's possible to migrate LRU pages and balloon pages
540 * Skip any other type of page
541 */
542 if (!PageLRU(page)) {
543 if (unlikely(balloon_page_movable(page))) {
544 if (locked && balloon_page_isolate(page)) {
545 /* Successfully isolated */
546 cc->finished_update_migrate = true;
547 list_add(&page->lru, migratelist);
548 cc->nr_migratepages++;
549 nr_isolated++;
550 goto check_compact_cluster;
551 }
552 }
553 continue;
554 }
555
556 /*
557 * PageLRU is set. lru_lock normally excludes isolation
558 * splitting and collapsing (collapsing has already happened
559 * if PageLRU is set) but the lock is not necessarily taken
560 * here and it is wasteful to take it just to check transhuge.
561 * Check TransHuge without lock and skip the whole pageblock if
562 * it's either a transhuge or hugetlbfs page, as calling
563 * compound_order() without preventing THP from splitting the
564 * page underneath us may return surprising results.
565 */
566 if (PageTransHuge(page)) {
567 if (!locked)
568 goto next_pageblock;
569 low_pfn += (1 << compound_order(page)) - 1;
570 continue;
571 }
572
573 /* Check if it is ok to still hold the lock */
574 locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
575 locked, cc);
576 if (!locked || fatal_signal_pending(current))
577 break;
578
579 /* Recheck PageLRU and PageTransHuge under lock */
580 if (!PageLRU(page))
581 continue;
582 if (PageTransHuge(page)) {
583 low_pfn += (1 << compound_order(page)) - 1;
584 continue;
585 }
586
587 if (!cc->sync)
588 mode |= ISOLATE_ASYNC_MIGRATE;
589
590 if (unevictable)
591 mode |= ISOLATE_UNEVICTABLE;
592
593 lruvec = mem_cgroup_page_lruvec(page, zone);
594
595 /* Try isolate the page */
596 if (__isolate_lru_page(page, mode) != 0)
597 continue;
598
599 VM_BUG_ON(PageTransCompound(page));
600
601 /* Successfully isolated */
602 cc->finished_update_migrate = true;
603 del_page_from_lru_list(page, lruvec, page_lru(page));
604 list_add(&page->lru, migratelist);
605 cc->nr_migratepages++;
606 nr_isolated++;
607
608 check_compact_cluster:
609 /* Avoid isolating too much */
610 if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
611 ++low_pfn;
612 break;
613 }
614
615 continue;
616
617 next_pageblock:
618 low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
619 last_pageblock_nr = pageblock_nr;
620 }
621
622 acct_isolated(zone, locked, cc);
623
624 if (locked)
625 spin_unlock_irqrestore(&zone->lru_lock, flags);
626
627 /* Update the pageblock-skip if the whole pageblock was scanned */
628 if (low_pfn == end_pfn)
629 update_pageblock_skip(cc, valid_page, nr_isolated, true);
630
631 trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
632
633 count_compact_events(COMPACTMIGRATE_SCANNED, nr_scanned);
634 if (nr_isolated)
635 count_compact_events(COMPACTISOLATED, nr_isolated);
636
637 return low_pfn;
638 }
This function walks the requested CMA range and isolates the pages that are actually in use (PageLRU(page) is set), hanging them on cc->migratepages for later migration. The pages to be dealt with fall into two classes: those that can simply be reclaimed (for example clean page cache) and those that cannot be reclaimed right now, whose contents must be migrated elsewhere. The reclaimable case goes through:
dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->reclaim_clean_pages_from_list:
968 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
969 struct list_head *page_list)
970 {
971 struct scan_control sc = {
972 .gfp_mask = GFP_KERNEL,
973 .priority = DEF_PRIORITY,
974 .may_unmap = 1,
975 };
976 unsigned long ret, dummy1, dummy2;
977 struct page *page, *next;
978 LIST_HEAD(clean_pages);
979
980 list_for_each_entry_safe(page, next, page_list, lru) {
981 if (page_is_file_cache(page) && !PageDirty(page)) {
982 ClearPageActive(page);
983 list_move(&page->lru, &clean_pages);
984 }
985 }
986
987 ret = shrink_page_list(&clean_pages, zone, &sc,
988 TTU_UNMAP|TTU_IGNORE_ACCESS,
989 &dummy1, &dummy2, true);
990 list_splice(&clean_pages, page_list);
991 __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
992 return ret;
993 }
Here, pages that are file cache and clean are reclaimed directly (lines 981~989); if their contents are needed again later they can simply be re-read from the file. Pages that could not be reclaimed for whatever reason are spliced back onto cc->migratepages to be migrated (line 990).
The migration path is:
dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->migrate_pages:
988 /*
989 * migrate_pages - migrate the pages specified in a list, to the free pages
990 * supplied as the target for the page migration
991 *
992 * @from: The list of pages to be migrated.
993 * @get_new_page: The function used to allocate free pages to be used
994 * as the target of the page migration.
995 * @private: Private data to be passed on to get_new_page()
996 * @mode: The migration mode that specifies the constraints for
997 * page migration, if any.
998 * @reason: The reason for page migration.
999 *
1000 * The function returns after 10 attempts or if no pages are movable any more
1001 * because the list has become empty or no retryable pages exist any more.
1002 * The caller should call putback_lru_pages() to return pages to the LRU
1003 * or free list only if ret != 0.
1004 *
1005 * Returns the number of pages that were not migrated, or an error code.
1006 */
1007 int migrate_pages(struct list_head *from, new_page_t get_new_page,
1008 unsigned long private, enum migrate_mode mode, int reason)
1009 {
1010 int retry = 1;
1011 int nr_failed = 0;
1012 int nr_succeeded = 0;
1013 int pass = 0;
1014 struct page *page;
1015 struct page *page2;
1016 int swapwrite = current->flags & PF_SWAPWRITE;
1017 int rc;
1018
1019 if (!swapwrite)
1020 current->flags |= PF_SWAPWRITE;
1021
1022 for(pass = 0; pass < 10 && retry; pass++) {
1023 retry = 0;
1024
1025 list_for_each_entry_safe(page, page2, from, lru) {
1026 cond_resched();
1027
1028 rc = unmap_and_move(get_new_page, private,
1029 page, pass > 2, mode);
1030
1031 switch(rc) {
1032 case -ENOMEM:
1033 goto out;
1034 case -EAGAIN:
1035 retry++;
1036 break;
1037 case MIGRATEPAGE_SUCCESS:
1038 nr_succeeded++;
1039 break;
1040 default:
1041 /* Permanent failure */
1042 nr_failed++;
1043 break;
1044 }
1045 }
1046 }
1047 rc = nr_failed + retry;
1048 out:
1049 if (nr_succeeded)
1050 count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
1051 if (nr_failed)
1052 count_vm_events(PGMIGRATE_FAIL, nr_failed);
1053 trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
1054
1055 if (!swapwrite)
1056 current->flags &= ~PF_SWAPWRITE;
1057
1058 return rc;
1059 }
The most important line of this function is the unmap_and_move() call at line 1028: a new page is allocated through get_new_page, the contents are moved over, and the old mappings are unmapped.
dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->migrate_pages->unmap_and_move:
858 /*
859 * Obtain the lock on page, remove all ptes and migrate the page
860 * to the newly allocated page in newpage.
861 */
862 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
863 struct page *page, int force, enum migrate_mode mode)
864 {
865 int rc = 0;
866 int *result = NULL;
867 struct page *newpage = get_new_page(page, private, &result);
868
869 if (!newpage)
870 return -ENOMEM;
871
872 if (page_count(page) == 1) {
873 /* page was freed from under us. So we are done. */
874 goto out;
875 }
876
877 if (unlikely(PageTransHuge(page)))
878 if (unlikely(split_huge_page(page)))
879 goto out;
880
881 rc = __unmap_and_move(page, newpage, force, mode);
882
883 if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
884 /*
885 * A ballooned page has been migrated already.
886 * Now, it's the time to wrap-up counters,
887 * handle the page back to Buddy and return.
888 */
889 dec_zone_page_state(page, NR_ISOLATED_ANON +
890 page_is_file_cache(page));
891 balloon_page_free(page);
892 return MIGRATEPAGE_SUCCESS;
893 }
894 out:
895 if (rc != -EAGAIN) {
896 /*
897 * A page that has been migrated has all references
898 * removed and will be freed. A page that has not been
899 * migrated will have kepts its references and be
900 * restored.
901 */
902 list_del(&page->lru);
903 dec_zone_page_state(page, NR_ISOLATED_ANON +
904 page_is_file_cache(page));
905 putback_lru_page(page);
906 }
907 /*
908 * Move the new page to the LRU. If migration was not successful
909 * then this will free the page.
910 */
911 putback_lru_page(newpage);
912 if (result) {
913 if (rc)
914 *result = rc;
915 else
916 *result = page_to_nid(newpage);
917 }
918 return rc;
919 }
该函数是先进行unmap,然后再进行move(881行),这个move实际上是一个copy动作(__unmap_and_move->move_to_new_page->migrate_page->migrate_page_copy)。随后将迁移成功的老页释放回buddy系统(905行)。
至此,CMA分配一段连续空闲物理内存的准备工作已经做完了(连续的空闲物理内存已经挂在cc->freepages这张链表上了)。但这些物理页还在buddy系统上,因此需要把它们从buddy系统上摘除下来。摘除的操作并不走通用的alloc_pages流程,而是手工处理(dma_alloc_from_contiguous-> alloc_contig_range-> isolate_freepages_range)。在处理的时候,需要将连续的物理块打散(order为N->order为0),并把物理块打头的页的page->lru从buddy的链表上取下,然后设置连续物理页块中每一个物理页的struct page结构(split_free_page函数),并把其所在pageblock在zone->pageblock_flags中的迁移属性设置为MIGRATE_CMA。
4. 释放
释放的流程比较简单。同分配一样,CMA的释放接口也是提供给DMA子系统使用的,比如dma_free_coherent,它最终会调用到CMA中的释放接口free_contig_range。
6035 void free_contig_range(unsigned long pfn, unsigned nr_pages)
6036 {
6037 unsigned int count = 0;
6038
6039 for (; nr_pages--; pfn++) {
6040 struct page *page = pfn_to_page(pfn);
6041
6042 count += page_count(page) != 1;
6043 __free_page(page);
6044 }
6045 WARN(count != 0, "%d pages are still in use!\n", count);
6046 }
直接遍历每一个物理页将其释放到buddy系统中便是了(6039~6044行)。
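结合上面的分配与释放流程,下面给出一个驱动侧使用CMA的最小示意(假设性代码,仅说明调用方式:dev为具体驱动持有的struct device指针,4MB的大小也只是举例):

#include <linux/dma-mapping.h>

/* 示意:驱动通过DMA一致性接口申请/释放一段物理连续内存。
 * 若该设备走CMA路径,分配/释放最终会落到上文的alloc_contig_range/free_contig_range。 */
static int cma_buffer_demo(struct device *dev)
{
	dma_addr_t dma_handle;
	size_t size = 4 << 20;		/* 假设申请4MB */
	void *cpu_addr;

	cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
	if (!cpu_addr)
		return -ENOMEM;

	/* ... 将dma_handle交给设备做DMA传输 ... */

	dma_free_coherent(dev, size, cpu_addr, dma_handle);
	return 0;
}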
5. 小结
CMA避免了为特定驱动预留内存而导致系统可用内存减少的缺点。CMA区域的内存在驱动不用的时候可以分配给用户进程使用;而当驱动需要用它做DMA传输时,再把之前分配给用户进程的内存通过回收或者迁移的方式腾出来给驱动使用。
参考:
1. Linux内核最新的连续内存分配器(CMA)——避免预留大块内存, http://blog.csdn.net/21cnbao/article/details/7309757
2. A deep dive into CMA, http://lwn.net/Articles/486301/
大纲
- 前言
- Mark-Sweep Collector
- Mark Stage
- Sweep Stage
- Finish Stage
- Semi-Space Collector
前言
ART目前采取了垃圾追踪的回收策略,即若某个对象没有被引用,则被认为是可以回收的垃圾,释放垃圾占用的内存。ART实现了两种垃圾回收技术,一种是Mark-Sweep, 另一种是Semi-Space.
Mark-Sweep:它的大致思想是,将所有对象在内存中的位置记录在位图A中。然后从根出发,扫描根对象的所有引用、再扫描这些引用的引用,一层层逐级扫描,直到叶子节点的对象。在这个逐级扫描的过程中,把涉及到的对象的位置都记录在位图B中。扫描结束后,对比两张位图:凡在A中置位、却没有在B中置位的位置被视为垃圾,根据位图中的索引检索得到对象,然后释放该对象占用的内存。
Semi-Space: 它的特色是,有两个space空间,其中一个备用。另一个被系统拿来分配对象。在做垃圾扫描的时候,将所有在空间A中被有效引用的对象移动到空间B中。那么在空间A中剩余的对象就是垃圾了,直接释放垃圾对象所在的整个空间就可以了。这个回收速度很快。
Mark-Sweep Collector
要了解Android的垃圾回收策略,需要首先弄清楚对于每一个Android的Heap,有哪些Space.对于普通的Android VM进程来说,其Heap都是从Zygote继承过来。所以,每一个VM 进程Heap的虚拟地址区间基本上都是一样的(区间的末端范围会有些不同)。对于一个普通的VM进程Heap的区间一般是这样划分的。
图1. Heap中的空间
Image Space的容量取决于boot.art和boot.oat的容量大小(具体可以到/data/dalvik-cache/下查看这两个文件的大小)。在Image Space后面紧接着的就是Zygote Space。顾名思义,由于Zygote是所有VM进程的祖先,因此在普通的VM进程孵化出之前,Zygote是需要一些object的,这些object都被压缩放在Zygote Space中(为什么是压缩,后面讲Semi-Space时会有涉及)。紧挨着Zygote Space的是Main Space, Main Space的limit被设置为了176MB. 其实从Zygote Space的起始到Non Moving Space的结束位置是有512MB大小的。
Mark-Sweep Collector只针对Heap中的Zygote Space、Main Space和Non Moving Space做垃圾回收处理. Temp Space和Bump pointer Space的垃圾回收归属于Semi-Space Collector来处理。
Mark-Sweep Collector根据轻重程度不同,可以分为三类:Sticky, Partial, Full. 又根据做GC的过程中是否暂停所有的mutator线程分为Concurrent和Non-Concurrent. 所以,ART系统一共定义了6种MS的垃圾回收器。即,
333 // Create our garbage collectors.
334 for (size_t i = 0; i < 2; ++i) {
335 const bool concurrent = i != 0;
336 garbage_collectors_.push_back(new collector::MarkSweep(this, concurrent));
337 garbage_collectors_.push_back(new collector::PartialMarkSweep(this, concurrent));
338 garbage_collectors_.push_back(new collector::StickyMarkSweep(this, concurrent));
339 }
根据情况的不同,系统选择不同的垃圾回收器进行垃圾回收工作。当然,对系统工作线程影响最小的是Concurrent Sticky Mark Sweep Collector,影响最大的是Full Mark Sweep Collector,而最复杂的是Concurrent Mark Sweep Collector。如果理解了最繁琐的Concurrent Mark Sweep(下文称为CMS)算法,其他5种垃圾回收器的工作原理也就理解了。CMS的工作从整体上可以划分为两个大的阶段:Mark和Sweep。
1) Mark阶段
做标记工作,即从根对象出发,借助标记栈递归地标记所有被引用的对象。一个被引用的对象在标记的过程中,先被标记,然后放入栈中;等其父对象的引用全部标记完后,再依次弹出栈中的每一个对象,标记该对象的引用,并把这些引用再放入栈中……。在整个Mark的过程中,可以形象地用黑、灰、白三种颜色来描述处于不同状态的对象:未进行标记前,所有的对象都是白色的;进过栈并已出栈(即其引用也已被扫描过)的对象是黑色的;而还在栈中的对象是灰色的。当Mark阶段完成后,所有的对象要么是黑色的,要么是白色的,白色的就是可以被回收的垃圾。
Mark是从所有对象的根出发,这些根会有很多,最简单的根比如在当前函数栈里面的对象对其他对象的引用。由于是Concurrent,因此Mark的线程是和当前进程中的工作mutator线程[1]一起工作的,因此在Mark的过程中,mutator线程也会新分配对象或者修改对象的引用。因此,必须有一个阶段是需要Stop-The-World,暂停所有的mutator线程,以方便GC进行mark操作。如果整个Mark阶段都Stop-The-World,显然对mutator工作线程的影响太大了。ART的做法是,先Concurrent标记两遍,然后再Non-Concurrent标记一遍。两遍的原因是尽可能多的把需要标记的对象都在不暂停mutator的情况下标记了。具体的标记过程和代码罗列如下:
1.1 Initialize Phase
118 void MarkSweep::InitializePhase() {
…
121 mark_stack_ = heap_->mark_stack_.get();
…
123 immune_region_.Reset();
140 }
在初始预备阶段,为后面的mark准备好mark_stack_和immune_region.
1.1.1 mark_stack_是做什么的呢?mark_stack_是一个LIFO的栈(后进先出)。由于标记是按照深度优先的方式递归进行的,mark_stack_栈用来配合完成这种深度递归。理论上,递归并不需要辅助栈,利用函数本身的栈就可以实现;但采用mark_stack_栈的好处是:1. 可以并行多线程来做,当mark_stack_栈中的对象数量达到一定数目(128个)的时候,就可以将栈分段,由多个线程共同对栈中的数据递归标记。2. 另外一个显而易见的好处是,可以避免函数调用栈过深。标记的大致过程见下面的例子以及例子后给出的示意代码。
例如根据上图,假设在栈中存在B,C。
a). 弹出C,标记C的引用F并压入F,标记G压入G。然后弹出G
b). G没有其他的引用,弹出F,F没有其他的引用。弹出B
c). 标记B的引用D,压入D. 标记B的引用E,压入E。弹出E
d). E没有任何引用。弹出D,D没有任何引用。栈空。结束。
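下面是一段独立的C示意代码,模拟上述借助mark_stack_的标记循环(对象结构与函数名均是为说明而假设的简化模型,不是ART源码):

#include <stdbool.h>
#include <stddef.h>

/* 示意:借助显式栈的标记循环(单线程简化模型) */
#define MAX_REFS  4
#define STACK_SZ  64

struct obj {
	bool marked;
	struct obj *refs[MAX_REFS];	/* 该对象引用的其他对象,空槽为NULL */
};

static struct obj *mark_stack[STACK_SZ];
static int top;

static void mark_and_push(struct obj *o)
{
	if (o && !o->marked) {
		o->marked = true;	/* 先标记(对象变灰) */
		mark_stack[top++] = o;	/* 再入栈,等待扫描它的引用 */
	}
}

static void process_mark_stack(void)
{
	while (top > 0) {
		struct obj *o = mark_stack[--top];	/* 出栈(对象变黑) */
		for (int i = 0; i < MAX_REFS; i++)
			mark_and_push(o->refs[i]);	/* 标记并压入它引用的对象 */
	}
}

使用时,先对每一个根对象调用mark_and_push,再调用process_mark_stack;结束后所有marked仍为false的对象即是垃圾。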
1.1.2 immune_region是做什么的?
在一个Heap中有几个Space区,这些space的用途是不一样的,比如Image Space, 其中放的是boot.art和boot.oat的代码段数据段。对于Image Space,其中的数据是不经常更改的,因此,GC是不会完整地扫描该区域。还有Zygote Space, 该区放的是Zygote在Fork该进程之前的Zygote内容,因此也是不经常改变的。根据当前系统的内存短缺情况,采取不同程度的GC扫描策略,于是凡将某个space指定给immune_region的,GC便不会去扫描。可是,不扫描的话,如果该Space中有object的引用修改了,怎么办呢,比如Image Space有个对象A本来引用了B,但后来更改,A引用C了,于是在GC的时候,需要把B当作垃圾,而C被Mark。这就需要额外的数据结构ModUnionTable以及CardTable了。
1.1.3 ModUnionTable与CardTable
ModUnionTable是需要和CardTable配合使用的。CardTable可以看作是一个数组,每一项被称作Card,占用一个字节的空间。代表Heap中连续的128个字节的内存是否有更改。第0个字节(0号索引)对应Heap起始的128个字节,第1个字节(1号索引)对应Heap中后续的128个字节,以此类推。因此CardTable要覆盖这个Heap的所有连续Space区间(包括Bumper Space)。CardTable中的Card可以有3个值,分别是0, 0x70,0x6F. 非0代表该项对应的Heap中相应的object有更改。
在做GC的Mark时,凡是CardTable中属于immune_region的值为0x70的Card, 被记录到ModUnionTable 中有一个类似于set的c++数据结构中,同时将该Card的值减去1,即0x6F。
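按照上面的描述,CardTable的寻址方式和几个取值可以用一小段C示意代码表达(card_table_、heap_begin_等全局变量为假设的简化状态,并非ART源码):

#include <stdint.h>

/* 示意:1个card字节覆盖128字节的堆空间;0表示干净,0x70表示dirty,0x6F表示aged */
#define CARD_SHIFT	7		/* 128 = 1 << 7 */
#define CARD_CLEAN	0x00
#define CARD_DIRTY	0x70		/* 即kCardDirty */
#define CARD_AGED	0x6F

static uint8_t  *card_table_;		/* 覆盖整个Heap连续区间,假设已分配并清零 */
static uintptr_t heap_begin_;		/* Heap起始地址,假设已初始化 */

/* mutator修改了地址obj_addr处对象的引用字段后,把对应的card置为dirty */
static void dirty_card(uintptr_t obj_addr)
{
	card_table_[(obj_addr - heap_begin_) >> CARD_SHIFT] = CARD_DIRTY;
}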
1.1.4 RememberedSet与CardTable
RememberedSet与ModUnionTable的作用是一样的,不同的是ModUnionTable记录Image Space, Zygote Space对其他space的引用的修改。而RememberedSet记录的是Non Moving Space和Main Space对Semi Space引用的修改。RememberedSet虽然只在Semi-Space Collector中起作用,但在CMS的ProcessCards中也会对CardTable中属于Non Moving Space和Main Space的Dirty Card 在RememberedSet中做备份。
1.1.5 Live_Bitmap和Mark_Bitmap
如前所述,Mark Sweep垃圾回收策略采取的是标记跟踪法,Live_Bitmap和Mark_Bitmap便是用来做标记的。它们都是位图数据结构,每一位代表8个字节的空间。Live_Bitmap记录了当前存在于VM进程中的所有object(无论是否被标记过);Mark_Bitmap则是经过Mark的过程后,记录了当前VM进程中所有仍被引用的object。Live_Bitmap和Mark_Bitmap之间的差集,便是所有未被系统引用的object,也就是可以回收的垃圾。
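位图的寻址方式可以用下面的C示意代码表达(简化草图:这里按低位在前的顺序取位,而实际实现是高位在前,见后文Sweep代码中的CLZ/kWordHighBitMask;bitmap_begin/heap_begin为假设的全局状态):

#include <stdint.h>

/* 示意:每一位代表8字节堆空间(对象按8字节对齐) */
#define OBJ_ALIGN_SHIFT	3
#define BITS_PER_WORD	(sizeof(uintptr_t) * 8)

static uintptr_t *bitmap_begin;		/* 位图存储,假设已按堆大小分配并清零 */
static uintptr_t  heap_begin;		/* 堆起始地址,假设已初始化 */

static void bitmap_set(uintptr_t obj_addr)
{
	uintptr_t idx = (obj_addr - heap_begin) >> OBJ_ALIGN_SHIFT;
	bitmap_begin[idx / BITS_PER_WORD] |= (uintptr_t)1 << (idx % BITS_PER_WORD);
}

static int bitmap_test(uintptr_t obj_addr)
{
	uintptr_t idx = (obj_addr - heap_begin) >> OBJ_ALIGN_SHIFT;
	return (bitmap_begin[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1;
}

对每个字做live_word & ~mark_word,得到的置位便对应可回收的垃圾object,这正是后文Sweep代码156行所做的事情。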
1.1.6 allocation_stack和live_stack
所有新分配的object都会被记录在allocation_stack中,然后在Marking的时候,再在Live_Bitmap中打上live的标记。allocation_stack和live_stack其实是一对工作栈和备份栈:GC需要处理allocation_stack的时候,会把两个stack互换,mutator线程新分配的object会压到备份栈中(备份栈此时就当作新的工作栈来用)。
1.1.7 Concurrent
系统在什么情况下用Concurrent,什么时候不用呢?系统定义了6种垃圾收集器,那么具体在做垃圾回收的时候,选择哪一种呢?在系统初始化的时候,会定义一个collector_type_,其取值可以是kCollectorTypeSS/kCollectorTypeGSS/kCollectorTypeMS/kCollectorTypeCMS。前三种都是NonConcurrent的,只有最后一种是Concurrent的。目前系统默认选择的是kCollectorTypeCMS,于是在做垃圾回收的时候,根据sticky/partial/full选择相应的Concurrent mark-sweep垃圾收集器。
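这个选择逻辑可以用一小段C示意代码表达(枚举与函数名均为说明用的假设,并非ART源码;真实代码中收集器对象存放在上文333~339行创建的garbage_collectors_中):

/* 示意:6种MS收集器可以看作 (是否并发) x (sticky/partial/full) 的组合 */
enum gc_type { GC_STICKY, GC_PARTIAL, GC_FULL };
enum collector_type { COLLECTOR_SS, COLLECTOR_GSS, COLLECTOR_MS, COLLECTOR_CMS };

/* 返回0~5之间的一个索引,仅示意存在6种组合 */
static int choose_mark_sweep_collector(enum collector_type ctype, enum gc_type gtype)
{
	int concurrent = (ctype == COLLECTOR_CMS);	/* 只有CMS配置走并发收集器 */
	return concurrent * 3 + (int)gtype;
}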
1.2 Marking Phase
1.2.1 在Marking阶段,如果是sticky/partial的CMS, 会把Image Space和Zygote Space都加入到Immune_region中。并且,对于sticky的CMS,还会将Main Space和Non moving Space的mark_bitmap和live_bitmap绑定(绑定的意思是mark_bitmap和live_bitmap指向同样的bitmap,这样做的目的是在mark的时候直接把live_bitmap和mark_bitmap都标记了,就不会对这两个space做垃圾回收处理了)。Sticky CMS是为了处理新分配出来的对象中的垃圾而不会处理其他Space中的垃圾,因此采用了绑定的方法。而Partial CMS把Zygote Space放在immune_region的原因是,Partial CMS不会把Zygote Space完整地扫描一遍,只会扫描在card table上有修改的部分。对于Full CMS,只会把Image Space加入到Immune_region中。可以看出无论哪一种CMS都会把Image Space加入到Immune_region中,它们都不会完整地扫描Image Space,因为Image Space里面存放的是boot.art和boot.oat,VM进程对其更改的可能性比较小,完整地对这个60+MB的Space扫描一遍性价比太低了。
1.2.2 处理CardTable。扫描CardTable,凡是对应于Image Space和Zygote Space的值为0x70(kCardDirty)的Card都被插入到ModUnionTableCardCache中,并在CardTable中对该Card进行Aged, Aged的做法就是其值为0x70的减去1,非0x70且非0的设置为0;凡是对应于Main Space和Non Moving Space的更改的card,都存储在RememberedSet中。其他非Semi Space在CardTable中有对应的修改的Card, 只进行Aged(好像没有其他这样的Space了)。
2227 void Heap::ProcessCards(TimingLogger& timings, bool use_rem_sets) {
2228 // Clear cards and keep track of cards cleared in the mod-union table.
2229 for (const auto& space : continuous_spaces_) {
2230 accounting::ModUnionTable* table = FindModUnionTableFromSpace(space);
2231 accounting::RememberedSet* rem_set = FindRememberedSetFromSpace(space);
2232 if (table != nullptr) {
2233 const char* name = space->IsZygoteSpace() ? "ZygoteModUnionClearCards" :
2234 "ImageModUnionClearCards";
2235 TimingLogger::ScopedSplit split(name, &timings);
2236 table->ClearCards();
2237 } else if (use_rem_sets && rem_set != nullptr) {
2238 DCHECK(collector::SemiSpace::kUseRememberedSet && collector_type_ == kCollectorTypeGSS)
2239 << static_cast<int>(collector_type_);
2240 TimingLogger::ScopedSplit split("AllocSpaceRemSetClearCards", &timings);
2241 rem_set->ClearCards();
2242 } else if (space->GetType() != space::kSpaceTypeBumpPointerSpace) {
2243 TimingLogger::ScopedSplit split("AllocSpaceClearCards", &timings);
2244 // No mod union table for the AllocSpace. Age the cards so that the GC knows that these cards
2245 // were dirty before the GC started.
2246 // TODO: Need to use atomic for the case where aged(cleaning thread) -> dirty(other thread)
2247 // -> clean(cleaning thread).
2248 // The races are we either end up with: Aged card, unaged card. Since we have the checkpoint
2249 // roots and then we scan / update mod union tables after. We will always scan either card.
2250 // If we end up with the non aged card, we scan it it in the pause.
2251 card_table_->ModifyCardsAtomic(space->Begin(), space->End(), AgeCardVisitor(), VoidFunctor());
2252 }
2253 }
2254 }
在目前的Heap中,所有的Space在2230~2242行都已经涉及到了,因此2251行似乎很多余。2232行处理的是Image Space和Zygote Space,2237行处理的是Non Moving Space和Main Space. 而card table不包含Large Object Space.
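上面提到的Aged规则(即2251行ModifyCardsAtomic所用的AgeCardVisitor)可以用一小段C示意代码表达(函数名为说明用的假设):

#include <stdint.h>

/* 示意:card的老化规则——dirty(0x70)减1变成aged(0x6F),其余非0值清零,0保持不变 */
static uint8_t age_card(uint8_t card)
{
	if (card == 0x70)		/* kCardDirty */
		return card - 1;	/* 变为0x6F */
	return 0;			/* 非dirty的非0旧值清零;原本为0的仍返回0 */
}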
1.2.3 MarkRoots. 从各个入口开始Mark操作(可以把每一个入口所引用的对象看作一棵树,这样ART里所有的对象就组成了一个森林,那些没有挂载到任何一棵根树上的对象就是要被回收的对象)。
1.2.4 MarkReachableObjects. 首先会更新immune_region中的ModUnionTable中记录的dirty Card,对之进行Mark。并压入mark_stack栈中。然后进行RecursiveMark,即递归Mark.
至此,所有在根对象的reference链上的对象都已经标记了,貌似已经可以进行Sweep了。可是不要忘了,之前的标记是在Concurrent的情况下进行的,也就是说mutator线程仍然在并发地跑着,它们在GC工作的时候仍然可以allocate新的对象,并修改已经Mark过的对象的引用。例如之前Mark的时候A->B,在A和B都已经被Mark后,mutator线程新分配了对象C,并将A的引用指向C,即A->C。如果对象C不被再次标记而被当作垃圾回收掉,那么下次访问A的这个引用的时候,就可能造成Segment Fault了。所以,需要在Suspend所有mutator的情况下再进行一遍Mark。可是,CMS的目的是尽可能地减少mutator线程pause的时间,于是CMS的策略是,先在不暂停mutator的情况下再从头标记一遍。
1.2.5 PreCleanCards.
(1)首先会重复步骤1.2.2,不过这一次被操作的对象就是在1.2.2~1.2.4的过程中新产生的对象或者被修改的对象引用了。
(2)从各个入口开始Mark操作,在操作的过程中对于遇到的没有在mark_bitmap上mark过的,在mark_bitmap上做Mark标记。
(3)扫描card table,对1.2.5刚开始时aged过的card以及这个过程中新变dirty的card所对应的对象进行mark操作(即在mark_bitmap上标记,同时丢到mark_stack中)(233行和922行)
(4)递归处理mark_stack中的object并递归处理object的引用。(923行)
233 RecursiveMarkDirtyObjects(false, accounting::CardTable::kCardDirty - 1);
921 void MarkSweep::RecursiveMarkDirtyObjects(bool paused, byte minimum_age) {
922 ScanGrayObjects(paused, minimum_age);
923 ProcessMarkStack(paused);
924 }
1.2.6 前面在Concurrent的情况下经过了两轮扫描,基本上已经扫描得差不多了。但由于mutator一直在运行,不可避免地会修改我们之前已经mark过的"地图"。因此,需要在Stop-The-World的情况下进行第三遍扫描。
62 void PausePhase() EXCLUSIVE_LOCKS_REQUIRED(Locks::mutator_lock_);
这次扫描的时候,除了remarkroot外,还需要扫描cardtable中Dirty Card的部分.
185 RecursiveMarkDirtyObjects(true, accounting::CardTable::kCardDirty);
1.2.7 经过了前面3次扫描以及Mark,我们的mark"地图"已经完备了。可是需要注意,Sweep的操作是相对于live_bitmap进行的:在live_bitmap中标记过、却在mark_bitmap中没有标记的对象被视为垃圾。换句话说,mark_bitmap中标记的对象是live_bitmap中标记对象的子集。但到目前为止live_bitmap标记的对象还不全,因为allocation_stack中的对象我们还没有来得及在live_bitmap中做mark。因此,先把allocation_stack"搁置"起来不让后面的mutator使用,启用备份的live_stack。
2185 void Heap::SwapStacks(Thread* self) {
2186 if (kUseThreadLocalAllocationStack) {
2187 live_stack_->AssertAllZero();
2188 }
2189 allocation_stack_.swap(live_stack_);
2190 }
2) Sweep阶段
Sweep阶段就是把那些白色的对象所占用的内存回收掉。下面这个代码片段基本上就把Sweep的事情说清楚了。
155 for (size_t i = start; i <= end; i++) {
156 word garbage = live[i] & ~mark[i];
157 if (UNLIKELY(garbage != 0)) {
158 uintptr_t ptr_base = IndexToOffset(i) + live_bitmap.heap_begin_;
159 do {
160 const size_t shift = CLZ(garbage);
161 garbage ^= static_cast<size_t>(kWordHighBitMask) >> shift;
162 *pb++ = reinterpret_cast<mirror::Object*>(ptr_base + shift * kAlignment);
163 } while (garbage != 0);
164 // Make sure that there are always enough slots available for an
165 // entire word of one bits.
166 if (pb >= &pointer_buf[buffer_size - kBitsPerWord]) {
167 (*callback)(pb - &pointer_buf[0], &pointer_buf[0], arg);
168 pb = &pointer_buf[0];
169 }
170 }
171 }
156行把live_bitmap中置位的,却没有在mark_bitmap中置位的对象当作垃圾。157~170行就是把垃圾占用的空间给释放掉。
3) Finish 阶段
在1.2.1的时候,说过,有些情况下会bind mark_bitmap to live_bitmap,这样不会回收某个space的垃圾。因此在这个阶段需要unbind。
1541 void Heap::UnBindBitmaps() {
1542 for (const auto& space : GetContinuousSpaces()) {
1543 if (space->IsContinuousMemMapAllocSpace()) {
1544 space::ContinuousMemMapAllocSpace* alloc_space = space->AsContinuousMemMapAllocSpace();
1545 if (alloc_space->HasBoundBitmaps()) {
1546 alloc_space->UnBindBitmaps();
1547 }
1548 }
1549 }
1550 }
另外,之前说过mark_bitmap是live_bitmap的一个子集,mark_bitmap中包含了所有的存活的非垃圾对象,因此交换mark_bitmap和live_bitmap的指针,使mark_bitmap作为下一次GC的live_bitmap. 同时reset 新的mark_bitmap。
300 SwapBitmaps();
2789 void Heap::ClearMarkedObjects() {
2790 // Clear all of the spaces' mark bitmaps.
2791 for (const auto& space : GetContinuousSpaces()) {
2792 accounting::ContinuousSpaceBitmap* mark_bitmap = space->GetMarkBitmap();
2793 if (space->GetLiveBitmap() != mark_bitmap) {
2794 mark_bitmap->Clear();
2795 }
2796 }
2797 // Clear the marked objects in the discontinous space object sets.
2798 for (const auto& space : GetDiscontinuousSpaces()) {
2799 space->GetMarkBitmap()->Clear();
2800 }
2801 }
由于mark_stack的目的是为了mark,所以在Finish阶段,也需要把mark_stack给清掉。
1292 mark_stack_->Reset();
Semi-Space Collector
Semi-Space Collector做垃圾回收的步骤和Mark Sweep是相同的。不过Semi-Space是在NonConcurrent的情况下对Heap进行垃圾回收的,这样看起来就简单了很多。
Semi-Space Collector(下文简称SS)在Mark之前,会指定一个from_space和to_space,以及一个promoted_space。凡是在mark的过程中位于from_space的存活对象,就会被move到to_space中。当mark结束后,from_space中留下来的全是需要清除的垃圾,而to_space中则是mark后存活的对象,并且这些对象紧凑排列,达到了"压缩"的效果。为了达到generation promotion的效果,在一次Mark结束后,把to_space中最后一个对象的边界记录为last_gc_end,然后to_space和from_space交换指针,之前的to_space就作为GC之后的from_space使用,即新的Bump Pointer Space对象都会在新的from_space中分配,并且分配的位置在last_gc_end之后。等到下一次Semi-Space回收来临时,凡是在from_space中last_gc_end位置之前存活下来的对象就不再move到to_space,而是move到promoted_space,一般promoted_space会被指定为main_space。这就是SS的generation promotion(分代晋升)的概念。
当位于from_space的object被move到to_space或者promo_dest_space的时候,会在老的object的monitor字段中记录新的位置(转发地址),以便其他仍引用着它的object更新引用。比如:对象A和C都引用了对象B,即A->B、C->B。如果在扫描A的引用时把B搬移到了to_space中的新位置B',那么更新A->B',同时令B->monitor = 11|B'的地址。这样当C在Marking自己的引用时,发现B的monitor的高两位是11,就知道B已经被移到新的位置了,只需要把引用改成C->B'即可。
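转发地址的用法可以用下面的C示意代码表达(对象头布局、FWD_TAG、forward_object等均为简化假设,真实的ART lock word布局请以源码为准):

#include <stdint.h>
#include <string.h>

/* 示意:把对象从from_space拷贝到to_space,并在旧对象头里记下转发地址(高两位为11) */
#define FWD_TAG	((uintptr_t)3 << (sizeof(uintptr_t) * 8 - 2))

struct object {
	uintptr_t monitor;	/* 简化:既当锁字,也用来存转发地址 */
	size_t    size;		/* 对象大小(含头部) */
	/* ... 对象数据 ... */
};

static struct object *forward_object(struct object *old, char **to_space_pos)
{
	if (old->monitor & FWD_TAG)				/* 已被搬走,直接取出新地址 */
		return (struct object *)(old->monitor & ~FWD_TAG);

	struct object *newobj = (struct object *)*to_space_pos;/* 在to_space中顺序(bump)分配 */
	memcpy(newobj, old, old->size);				/* 拷贝对象内容 */
	*to_space_pos += old->size;
	old->monitor = (uintptr_t)newobj | FWD_TAG;		/* 在旧对象头上记录转发地址 */
	return newobj;
}

A和C在扫描到对B的引用时都调用forward_object:第一次调用完成拷贝,之后的调用只会返回已记录的新地址。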
正如前言所述,Semi-Space Collector会根据情况不同来做GSS还是whole heap collection的SS。GSS只会对Bump Pointer Space采用Semi-Space Collector的方法进行垃圾回收。而whole heap collection的方法除了Bump pointer Space外,还会扫描main space, non moving Space和Large Object Space.(见下面的代码片段,重点是85~88行以及92~95行)
71 void SemiSpace::BindBitmaps() {
72 timings_.StartSplit("BindBitmaps");
73 WriterMutexLock mu(self_, *Locks::heap_bitmap_lock_);
74 // Mark all of the spaces we never collect as immune.
75 for (const auto& space : GetHeap()->GetContinuousSpaces()) {
76 if (space->GetLiveBitmap() != nullptr) {
77 if (space == to_space_) {
78 CHECK(to_space_->IsContinuousMemMapAllocSpace());
79 to_space_->AsContinuousMemMapAllocSpace()->BindLiveToMarkBitmap();
80 } else if (space->GetGcRetentionPolicy() == space::kGcRetentionPolicyNeverCollect
81 || space->GetGcRetentionPolicy() == space::kGcRetentionPolicyFullCollect
82 // Add the main free list space and the non-moving
83 // space to the immune space if a bump pointer space
84 // only collection.
85 || (generational_ && !whole_heap_collection_ &&
86 (space == GetHeap()->GetNonMovingSpace() ||
87 space == GetHeap()->GetPrimaryFreeListSpace()))) {
88 CHECK(immune_region_.AddContinuousSpace(space)) << "Failed to add space " << *space;
89 }
90 }
91 }
92 if (generational_ && !whole_heap_collection_) {
93 // We won't collect the large object space if a bump pointer space only collection.
94 is_large_object_space_immune_ = true;
95 }
96 timings_.EndSplit();
97 }
从85~88行可见,在generational_为true且whole_heap_collection_为false的情况下,non moving space和main space都会被加入到immune_region中;而若whole_heap_collection_为true,则non moving space和main space不会加入到immune_region中,并且也会扫描large object space(94行)。whole heap collection除了对Bump Pointer Space做SS之外,还会对non moving space和main space做mark sweep回收。
那么什么时候需要whole heap collection呢?SemiSpace最初的做法是每做5次generational的Semi-Space回收,就做一次whole heap collection。后来优化为:被promote的object数量达到一定阈值后,进行一次whole heap collection;或者从上一次whole heap collection至今分配的Large Object字节数超过了阈值16MB(kLargeObjectBytesAllocatedThreshold)。
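这个判断逻辑大致可以用下面的C示意代码表达(结构体字段名与promote阈值均为按上文描述作出的假设,具体数值以ART源码为准):

#include <stdbool.h>
#include <stdint.h>

/* 示意:GSS在什么情况下升级为whole heap collection */
#define LARGE_OBJECT_BYTES_THRESHOLD	(16u << 20)	/* 16MB,即kLargeObjectBytesAllocatedThreshold */

struct gss_state {
	uint64_t promoted_since_last_whole_heap;	/* 自上次全堆回收以来被promote的量 */
	uint64_t promoted_threshold;			/* promote阈值,具体取值以源码为准 */
	uint64_t large_object_bytes_allocated;		/* 自上次全堆回收以来分配的Large Object字节数 */
};

static bool need_whole_heap_collection(const struct gss_state *s)
{
	return s->promoted_since_last_whole_heap >= s->promoted_threshold ||
	       s->large_object_bytes_allocated >= LARGE_OBJECT_BYTES_THRESHOLD;
}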
注:
[1]. http://www.ibm.com/developerworks/library/j-ibmjava2/