Monday, June 28, 2021

Design of Cache Controller



Understanding the importance, working principle, and associated terminologies of a Cache Controller

Basics of Cache Memory -

Idea:

-> Some data is important and frequently used by the processor.


-> Most data is not needed as frequently.


-> Keep the important data in CACHE!!


Benefits:


-> Faster access to frequently used data.


-> Less important data is still accessible, stored in cheaper memory.





Unit of caching – a small number of bytes (e.g., 64 bytes)









Note:


1) The instruction cache does not have to worry about updates because the CPU core does not modify the program instructions.


2) The data cache has to handle updates, because the CPU writes data.





Intel Core i7 Processor Example :-


-> L1 Cache – 32 KB, 4 cycles access time


-> L2 Cache – 256 KB, 11 cycles access time


-> L3 Cache – 8 MB, 30-40 cycles access time





Cache Terminologies :-


Cache Hit - The cache contains that block. Access is faster.


Cache Miss - The cache must get the block from main memory; only then can it deliver the block to the CPU. Slow access!!


Eviction/Replacement – A new block is loaded into the cache, and some other block is removed from the cache to make room.


Typical Programs – Tend to execute the same instructions over and over. Most instructions are rarely used.


Hot Spots / Loop Bodies - A small number of instructions are executed a lot.









For a while, only these instructions are executed; they form the working set for that period. The working set shifts over time.





The principle of Locality :-


-> If a byte of memory was used recently, it is likely to be used again soon -> Temporal Locality


-> If a byte of memory was used recently, nearby bytes are likely to be needed soon -> Spatial Locality


-> Both apply to instructions as well as data, as the small sketch below illustrates.
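
For instance, a simple array-summing loop (a generic Python illustration, not code from the article) shows both kinds of locality at once:

data = list(range(1000))

total = 0
for i in range(len(data)):   # the loop-body instructions are fetched over and over -> temporal locality
    total += data[i]         # data[i + 1] sits right next to data[i] in memory -> spatial locality

print(total)                 # 499500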





The working set :-


-> The set of bytes that were used recently.


-> The set of bytes that will be needed soon. We can't really know this set, but the two are very similar. If the working set is kept in the cache, the program will execute faster.





CPU Read/Write Approach for Cache :-


-> When we bring bytes into cache, keep them.


-> Try to keep bytes as long as possible, but prefer the most recently used bytes.


-> When we bring bytes into cache, also bring in nearby bytes.





CPU wants to read -


-> Cache Hit - Read the data from the cache and put it on the processor read bus.


-> Cache Miss - Evict a block (least recently used), then read the block from main memory.





CPU wants to write-


-> Write Hit - Cache contains the block


-> Write through - Cache immediately writes the modified data to main memory


-> Write back - Cache waits, writes block to main memory when the block is evicted


-> Write miss - Cache does not contain the block


-> Write allocate - The cache reads the block from main memory into the cache, then updates the copy in the cache.


-> Write no-allocate - The cache sends the write straight through to memory and does not load the block into the cache.





Typical Approach for Cache Write : -


On write hit -- Use write back: update the copy in the cache and do not write through to memory. The cache needs a 'dirty' bit for each block.


On write miss -- Use write allocate: load the block into the cache, update the copy in the cache, set the dirty bit, and do not immediately update memory.
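
As a rough illustration of this policy, here is a minimal Python sketch of write-back with write-allocate for a single cache line (the Line class, the memory dictionary, and the whole-block write are simplifications of my own, not the article's design):

class Line:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.block = None        # number of the memory block held in this line
        self.data = None

def write(line, block_num, new_data, memory):
    if line.valid and line.block == block_num:    # write hit
        line.data = new_data                      # update only the copy in the cache
        line.dirty = True                         # memory will be updated on eviction
    else:                                         # write miss -> write allocate
        if line.valid and line.dirty:
            memory[line.block] = line.data        # write the dirty victim back first
        line.data = memory.get(block_num)         # read the block from main memory
        line.block = block_num
        line.data = new_data                      # then overwrite it (a real cache merges
        line.valid = True                         # the new bytes into the loaded block)
        line.dirty = True                         # do not immediately update memory

memory = {2: "old block contents"}
line = Line()
write(line, 2, "new block contents", memory)
print(line.dirty, memory[2])     # True old block contents -> memory stays stale until eviction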





Note: Caches are specified by their capacity (C), cache line/block size (b), number of blocks/lines (B) = C/b, and degree of associativity (N) - N-way means N blocks per set.


Number of sets (S) = B/N.
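
These relationships are easy to check with a short Python sketch (the function and parameter names are mine, just for illustration):

import math

def cache_geometry(capacity_bytes, block_bytes, ways, address_bits=32):
    # B = C / b, S = B / N, plus the address field widths
    blocks = capacity_bytes // block_bytes
    sets = blocks // ways
    offset_bits = int(math.log2(block_bytes))   # selects a byte within a block
    index_bits = int(math.log2(sets))           # selects a set
    tag_bits = address_bits - index_bits - offset_bits
    return blocks, sets, tag_bits, index_bits, offset_bits

# Example: 32 KB, 8-way set associative, 64-byte blocks (used later in this article)
print(cache_geometry(32 * 1024, 64, 8))         # (512, 64, 20, 6, 6)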






For the same index/set value there might be multiple tag values.




Types of Caches :-





(1) N-Way Set Associative Cache :-





S sets, each holding N blocks (S = B/N).


Example -


2- Way Set Associative Cache Memory , N = 2


Size of Cache /Capacity = 8 words


Block size (b) = 4 bytes = 1 word


Number of blocks /Lines (B) = 8/1 = 8 Blocks


Number of sets (S) = 8/2 = 4 Sets
Note: N Cache Line/Blocks Per Set












Basic Operation :-


-> Input -- a 32-bit address


-> Extract the 28-bit tag, 2-bit set/index and 2-bit offset


-> Send the 'set' value to the N-way associative memory and retrieve the 'tag' values stored in that set


-> Compare the 'tag' values against the tag from the address


-> If match - Retrieve the 4 bytes of data from the associative memory, use the offset to extract the desired byte, and send it to the CPU.


-> If no match - Select a block to evict (least recently used; needs extra bits/logic to implement)


-> Is it dirty? Write back to memory


-> Send the block address to main memory, get 4 bytes back, update the cache line (tag, data, dirty, valid), extract the desired byte, and send it to the CPU. (We will see the use of the valid (v) and dirty bits later in the article.)
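
The read path above can be modelled in a few lines of Python (a toy model of the 2-way, 4-set, 4-byte-block cache from the example; the LRU handling is deliberately crude and dirty write-backs are only noted in a comment):

OFFSET_BITS, INDEX_BITS, WAYS, NUM_SETS = 2, 2, 2, 4

main_memory = bytearray(range(256)) * 4                     # 1 KiB of dummy data
# each way holds [valid, dirty, tag, data]
cache = [[[False, False, 0, bytearray(4)] for _ in range(WAYS)] for _ in range(NUM_SETS)]
lru = [0] * NUM_SETS                                         # which way to evict next

def read_byte(addr):
    offset = addr & 0b11
    index = (addr >> OFFSET_BITS) & 0b11
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    for way in cache[index]:                                 # compare tags in the selected set
        if way[0] and way[2] == tag:
            return way[3][offset]                            # cache hit
    victim = cache[index][lru[index]]                        # cache miss: pick a victim
    # (if the victim is dirty, a real controller writes it back to memory here)
    block_start = addr & ~0b11
    victim[:] = [True, False, tag, bytearray(main_memory[block_start:block_start + 4])]
    lru[index] ^= 1                                          # the other way is now LRU (crude)
    return victim[3][offset]

print(read_byte(0x10), read_byte(0x11))                      # miss, then hit in the same block -> 16 17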





(2) Fully Associative Cache :-


Only one set with B blocks.


Example -


Block size (b) = 64 bytes


Number of Lines /Blocks (B) = 512 Lines


Size of Cache (C = B*b) = 32 KB (counting only the data, not the tag/status bits)


Number of sets = 1


Address size = 32 bits (26-bit key/tag, 6-bit offset to address the 64 bytes). Note: all cache lines/blocks are under one set.


Basic Operation :-


-> Input -- a 32-bit address


-> Extract the 26-bit tag and 6-bit offset


-> Send tag to associative memory


-> If match - Retrieve the 64 bytes of data from the associative memory, use the offset to extract the desired byte, and send it to the CPU.


-> If no match - Select a block to evict (least recently used; needs extra bits/logic to implement)


-> Is it dirty? Write back to memory


-> Send tag to main memory, get 64 bytes back, update cache line (key, data, dirty, valid), extract the desired byte, send to CPU.
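
The address split for this geometry can be checked directly (the address value below is arbitrary, chosen only for illustration):

addr = 0x1234ABCD                # an arbitrary 32-bit address
offset = addr & 0x3F             # low 6 bits: byte within the 64-byte block
tag = addr >> 6                  # remaining 26 bits: the key/tag (no index, only one set)
print(hex(tag), offset)          # 0x48d2af 13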





(3) Direct Mapped Cache:-


Associative memories are either small or slow (the tag has to be compared against all lines).


The number of sets is equal to the number of cache blocks (B).


Example -


Size of Cache /Capacity = 8 words


Block size (b) = 4 bytes = 1 word


Number of blocks /Lines (B) = 8/1 = 8 Blocks


Number of sets (S) = B = 8 Sets


Address size = 32 bits (27-bit key/tag, 3-bit index/set, 2-bit offset to address the 4 bytes). Note: one cache line/block per set.















Note: In fully associative memory any block can go to any line.


In a direct mapped cache, each block can go into only one cache line.


Example - block 10 can only go to line 2.


Line 2 can hold only block 2, 10, 18, 26, and so on (any block whose number mod 8 equals 2).


Each line still needs to keep info about which block it holds.


But now we avoid the associative lookup. The block we seek can only be in one line. All we have to do is retrieve the line and check to see if it contains the right block.


8 blocks/lines = 2^3, so 3 index bits.


Basic Operation :-


-> From the address, extract tag, index and offset.


-> Use index to find the correct line.


-> The 3-bit index is the address into the cache memory.


-> Read out this line.


-> Compare the tag from the address to the tag in this cache line.


-> If they are the same – cache hit; otherwise – cache miss.


-> Use the offset to extract the desired byte.
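
For the 8-word direct mapped cache above (27-bit tag, 3-bit index, 2-bit offset), the lookup fields can be sketched like this (a small illustration using my own helper name):

def split_address(addr):
    offset = addr & 0b11          # 2 bits: byte within the 4-byte block
    index = (addr >> 2) & 0b111   # 3 bits: which of the 8 lines
    tag = addr >> 5               # remaining 27 bits
    return tag, index, offset

# Blocks 2, 10 and 18 all land on line 2 (block number mod 8 == 2):
for block in (2, 10, 18):
    print(block, "->", split_address(block * 4))   # (0, 2, 0)  (1, 2, 0)  (2, 2, 0)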





Potential Problem with direct mapped cache :-


Blocks 2 and 10 map to the same cache line. What if the working set contains bytes from both of those blocks?


There is a conflict – the cache thrashes (not likely, but it can occur and can impact performance).





The general form for solution : -


Combines features of the set associative cache and the direct mapped cache.


Example - 32 KB, 8-way set associative.


Each associative memory is small and can hold 8 lines.


Many small associative memories.


64 sets * 8 lines per set * 64-byte block size = 32 KB


20-bit tag, 6-bit index, 6-bit offset.












Basic Operation :-


-> Extract the index.


-> This identifies which set to use.


-> Look in that set.


-> See if there is a line with matching tag.


-> Blocks 2 and 514 both map to set 2, but set 2 can hold up to 8 blocks, so conflicts are much less likely.




Putting it all together : -











Which Data to Replace ? ( Cache Data Replacement Policy) :-


In a direct mapped cache, each address maps to a unique block and set. If a set is full when new data must be loaded, the block in that set is replaced with the new data. In set associative and fully associative caches, the cache must choose which block to evict when a cache set is full.

The principle of temporal locality suggests that the best choice is to evict the least recently used block, because it is least likely to be used again soon. Hence, most associative caches have a least recently used (LRU) replacement policy. In a two-way set associative cache, a use bit, U, indicates which way within a set was least recently used. Each time one of the ways is used, U is adjusted to indicate the other way.

For set associative caches with more than two ways, tracking the least recently used way becomes complicated. To simplify the problem, the ways are often divided into two groups and U indicates which group of ways was least recently used. Upon replacement, the new block replaces a random block within the least recently used group. Such a policy is called pseudo-LRU.
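
A minimal Python sketch of such a pseudo-LRU policy for one 8-way set (the grouping into two halves and the random choice follow the description above; the class and method names are only illustrative):

import random

class PseudoLRUSet:
    # one set of an 8-way cache: ways 0-3 form group 0, ways 4-7 form group 1
    def __init__(self):
        self.u = 0                     # which group was least recently used

    def touch(self, way):              # call on every hit or fill
        group = 0 if way < 4 else 1
        self.u = 1 - group             # the other group is now the least recently used

    def victim(self):                  # called on a miss when the set is full
        group = self.u                 # evict from the least recently used group...
        return 4 * group + random.randrange(4)   # ...picking a random way within it

s = PseudoLRUSet()
s.touch(5)                             # a way in group 1 was just used
print(s.victim())                      # victim is one of ways 0-3 (group 0)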





Example : Show the contents of an eight-word two-way set associative cache after executing the following code. Assume LRU replacement, a block size of one word, and an initially empty cache.


lw $t0, 0x04($0)


lw $t1, 0x24($0)


lw $t2, 0x54($0)





Solution:









The first two instructions load data from memory addresses 0x4 and 0x24 into set 1 of the cache, as shown in Figure (a). U = 0 indicates that the data in way 0 was the least recently used. The next memory access, to address 0x54, also maps to set 1 and replaces the least recently used data in way 0, as shown in Figure (b). The use bit, U, is then set to 1 to indicate that the data in way 1 is now the least recently used.
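
The set mapping in this example is easy to verify (a 2-bit byte offset, then a 2-bit set index for the 4 sets):

for addr in (0x04, 0x24, 0x54):
    set_index = (addr >> 2) & 0b11          # address bits [3:2] select one of the 4 sets
    print(hex(addr), "-> set", set_index)   # all three addresses map to set 1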

Note: Designers can implement their own logic for the LRU policy in larger caches.



-----------------------------------------Thank You-------------------------------------------


Reference - Digital Design and Computer Architecture by Harris & Harris


I welcome your feedback, suggestions, and corrections, if any!!! - Happy Learning :)
