## Automated Formal Memory Consistency Verification of Hardware

#### Yatin A. Manerkar

**Princeton University** 

June 23<sup>rd</sup>, 2019



http://www.cs.princeton.edu/~manerkar

#### The Rise of Parallelism...

42 Years of Microprocessor Trend Data



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2017 by K. Rupp.

### Parallel processors are hard to get right! How can we formally verify parallel hardware?



#### **Formal Methods Expert**









- Build proven-correct processor (e.g. Kami) or...
- …construct formal model of implementation and verify that (REMS)
- Formal methods expert carries most of the verification burden

#### **Computer Architect**





- Experts on building processors
- Generally not much formal methods expertise
- Can they share more of the verification burden?



**Formal Methods Expert** 

**Computer Architect** 

# <u>My work:</u> Automated tools that enable engineers to formally verify their systems <u>by themselves</u>!

#### **Case Study: Memory Consistency Verification**




#### Talk Outline

- Overview
- Memory Consistency Background
- PipeProof: All-Program Microarchitectural MCM Verification
- RTLCheck: MCM Verification of Verilog RTL
- Expanding to other domains
- Conclusion

#### **Processors Communicate via Shared Memory**



| Thread 0     | Thread 1                               |
|--------------|----------------------------------------|
| **1** x = 1; | **3** if (y == 1) print("Answer is:"); |
| **2** y = 1; | **4** if (x == 1) print("42");         |

- Can it print **"Answer is: 42"**? Yes, e.g. 1234
- How about just **"42"**? Yes, e.g. 1342
- Could it print **nothing**? Yes, e.g. 3412

These executions obey **Sequential Consistency (SC)** [Lamport79], which requires that the results of the overall program correspond to some in-order interleaving of the statements from each individual thread.
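As a rough illustration of the SC definition above, the following Python sketch enumerates every in-order interleaving of the two threads and collects what each one prints. The encoding of statements as tuples and the helper names (`interleavings`, `run`, `sc_outputs`) are assumptions made for this example, not part of the talk.

```python
from itertools import combinations

# Statements 1 and 2 (Thread 0) and 3 and 4 (Thread 1), encoded as tuples.
THREAD0 = [("store", "x", 1), ("store", "y", 1)]
THREAD1 = [("print_if", "y", "Answer is:"), ("print_if", "x", "42")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list in its own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in slots else next(ib) for k in range(total)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    printed = []
    for op, var, arg in schedule:
        if op == "store":
            mem[var] = arg
        elif mem[var] == 1:  # print_if: print only when the flag reads 1
            printed.append(arg)
    return tuple(printed)

def sc_outputs():
    """All outputs that some SC interleaving can produce."""
    return {run(s) for s in interleavings(THREAD0, THREAD1)}

if __name__ == "__main__":
    for out in sorted(sc_outputs()):
        print(out)
```

Under this model, no interleaving prints only "Answer is:".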

How about just **"Answer is:"**? It depends!

This outcome needs an ordering such as 2134, in which the store to y takes effect before the store to x. No SC interleaving allows that, because it breaks Thread 0's program order.

## Most processors today implement "weak memory models" that relax orderings required by SC!



#### **Answer: Performance!**

Message Passing (mp)

| Core 0             | Core 1  |
|--------------------|---------|
| x = 1;             | r1 = y; |
| y = 1;             | r2 = x; |
| Can r1=1 and r2=0? |         |

- Can improve performance by sending both stores to memory in parallel
- Core 1 can then read r1 = y = 1 and r2 = x = 0: by the time the store of x is complete, Core 1 has observed the reordering!

Fence/synchronization instructions can enforce order between memory operations where needed
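The effect of reordering, and of a fence, can be sketched with a toy model. The model below is a simplification built on stated assumptions, not a description of any real processor: Core 0's two stores may reach memory in either order unless a fence separates them, and Core 1 performs its loads in program order. The function name `mp_outcomes` is hypothetical.

```python
from itertools import permutations

def mp_outcomes(fence):
    """Return the set of (r1, r2) results of the mp test in a toy memory model."""
    stores = ["x", "y"]  # Core 0 program order: x = 1; y = 1;
    # Assumption: with a fence between the stores they reach memory in program
    # order; without it, either order is possible.
    commit_orders = [tuple(stores)] if fence else list(permutations(stores))
    outcomes = set()
    for order in commit_orders:
        # k1 and k2 count how many stores have reached memory when Core 1
        # executes r1 = y and r2 = x. Loads run in program order, so k1 <= k2.
        for k1 in range(len(order) + 1):
            for k2 in range(k1, len(order) + 1):
                r1 = 1 if "y" in order[:k1] else 0
                r2 = 1 if "x" in order[:k2] else 0
                outcomes.add((r1, r2))
    return outcomes

if __name__ == "__main__":
    print("no fence:", sorted(mp_outcomes(fence=False)))
    print("fence:   ", sorted(mp_outcomes(fence=True)))
```

In this sketch the outcome r1=1, r2=0 appears only when the fence is absent.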





## Memory Consistency Models (MCMs)

- Instruction sets (ISAs) represent hardware operations (add, ld, st, ...)
- MCMs similarly represent the orderings among hardware memory ops

[Figure: Compiler and Hardware layers of the stack]

Hardware: "How much can I buffer and reorder memory operations?"

# In a nutshell: MCMs specify what value will be returned when your program does a load!





Memory Consistency Models (MCMs) specify rules and guarantees about the <u>ordering</u> and <u>visibility</u> of accesses to shared memory [Sorin et al., 2011].



- MCMs are specified at interfaces between layers of the stack
  - Upper layers target MCM; lower layers must maintain it for all programs!








- Axiomatic specifications -> Happens-before graphs
  - <u>Cyclic</u> => Impossible, <u>Acyclic</u> => Possible
- Model Checking space of graphs using SMT solvers
- Most tools written in Gallina => can be proven correct



http://check.cs.princeton.edu



So far, tools have found bugs in:

- Widely-used research simulator
- Cache coherence paper
- IBM XL C++ compiler (fixed in v13.1.5)
- In-design commercial processors
- RISC-V ISA specification
- Open-source RTL (Verilog)
- C++11 memory model
- SpectrePrime, MeltdownPrime

Tools:

- PipeCheck [MICRO '14] [IEEE MICRO Top Picks]
- CCICheck [MICRO '15] [Nominated for Best Paper Award]
- COATCheck [ASPLOS '16] [IEEE MICRO Top Picks]
- TriCheck [ASPLOS '17] [IEEE MICRO Top Picks]
- RTLCheck [MICRO '17] [IEEE MICRO Top Picks Honorable Mention]

#### Talk Outline

- Overview and Motivation
- Memory Consistency Background
- PipeProof: All-Program Microarchitectural MCM Verification
- RTLCheck: MCM Verification of Verilog RTL
- Expanding to other domains
- Conclusion



#### PipeProof proves that a microarchitecture respects its ISA MCM

- For all possible programs!
- How do we formally specify
  - ISA-level MCMs?
  - Microarchitectural orderings?



- MCMs often defined using relational patterns
  - [Shasha and Snir TOPLAS 1988] [Alglave et al. TOPLAS 2014]
- ISA-level executions are graphs
  - nodes: instructions, edges: ISA-level relations
- Eg: SC is  $acyclic(po \cup co \cup rf \cup fr)$



| Message passing (mp) litmus test |                              |
|----------------------------------|------------------------------|
| Core 0                           | Core 1                       |
| (i1) x = 1;<br>(i2) y = 1;       | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r1 = 1, r2 = 0       |                              |

Legend: po = program order, co = coherence order, rf = reads-from, fr = from-reads

- Formal specifications of ISA + HLL MCMs in recent years
  - x86 [Owens et al. TPHOLs 2009], ARM [Pulte et al. POPL 2018], C11 [Batty et al. POPL 2011], ...
- Automated formal tools e.g. herd [Alglave et al. TOPLAS 2014]
  - Can formally analyse small test programs against these models
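To make the SC axiom concrete, the sketch below checks acyclic(po ∪ co ∪ rf ∪ fr) for each outcome of the mp litmus test. It is a minimal illustration and not how herd or the Check tools are implemented: the relation encoding and the names `mp_relations`, `is_acyclic`, and `sc_allows` are assumptions for this example, and it relies on `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(edges):
    """True if the directed graph given as (src, dst) pairs has no cycle."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    try:
        TopologicalSorter(preds).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False

def mp_relations(r1, r2):
    """ISA-level edges of mp for the outcome (r1, r2)."""
    po = {("i1", "i2"), ("i3", "i4")}  # program order on each core
    co = set()  # one store per address, so no coherence edges between stores
    rf, fr = set(), set()
    # i3 loads y: reading 1 means it reads from i2, reading 0 means it
    # reads the initial value and so comes before i2 (from-reads).
    if r1 == 1:
        rf.add(("i2", "i3"))
    else:
        fr.add(("i3", "i2"))
    # i4 loads x: same reasoning with store i1.
    if r2 == 1:
        rf.add(("i1", "i4"))
    else:
        fr.add(("i4", "i1"))
    return po | co | rf | fr

def sc_allows(r1, r2):
    return is_acyclic(mp_relations(r1, r2))

if __name__ == "__main__":
    for r1 in (0, 1):
        for r2 in (0, 1):
            verdict = "allowed" if sc_allows(r1, r2) else "forbidden"
            print(f"r1={r1}, r2={r2}: {verdict}")
```

Only r1 = 1, r2 = 0 produces a cycle (i1 → i2 → i3 → i4 → i1), matching the "SC Forbids" row of the litmus test.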



- Developed by PipeCheck [Lustig et al. MICRO 2014]
- Microarchitecture performs instrs. in stages
- Microarchitectural executions are µhb graphs
  - Nodes: instr. sub-events, edges: happens-before relationships
- Cyclic µhb graph  $\rightarrow$  unobservable, Acyclic  $\rightarrow$  observable



(i1) $\stackrel{\text{po}}{\longrightarrow}$ (i2) $\stackrel{\text{rf}}{\longrightarrow}$ (i3) $\stackrel{\text{po}}{\longrightarrow}$ (i4)

| Message passing (mp) litmus test |                              |
|----------------------------------|------------------------------|
| Core 0                           | Core 1                       |
| (i1) x = 1;<br>(i2) y = 1;       | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r1 = 1, r2 = 0       |                              |

Legend: IF = Fetch, EX = Execute, WB = Writeback



#### **Microarchitectural MCM Verification**

#### Microarchitecture









- µSpec DSL [Lustig et al. ASPLOS 2016] is similar to first-order logic (FOL)
  - forall, exists, AND (/\), OR (\/), NOT (~), implication (=>)
  - Has built-in predicates which take memory operations as input
    - e.g. ProgramOrder i j where i and j are loads/stores
  - Predicates can reference nodes and edges (µhb edges closed under transitivity)
    - e.g. EdgeExists ((i1, Fetch), (i2, Fetch))
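One way to read such axioms is as rules that add edges to a µhb graph, which is then checked for cycles. The Python sketch below does this for the mp litmus test on a toy 3-stage in-order pipeline. Everything about the pipeline is an assumption for illustration (in-order IF/EX/WB on each core, loads and stores accessing memory at WB); it is not the µspec of any design from the talk, and the function names are hypothetical.

```python
from itertools import product

STAGES = ("IF", "EX", "WB")
# Micro-ops of the mp litmus test: name -> (core, index in program order)
MICROOPS = {"i1": (0, 0), "i2": (0, 1), "i3": (1, 0), "i4": (1, 1)}

def same_core(a, b):
    return MICROOPS[a][0] == MICROOPS[b][0]

def program_order(a, b):
    return same_core(a, b) and MICROOPS[a][1] < MICROOPS[b][1]

def pipeline_edges():
    """Edges implied by the assumed pipeline axioms, for every micro-op."""
    edges = set()
    for i in MICROOPS:
        # Each instruction passes through IF, then EX, then WB.
        edges.add(((i, "IF"), (i, "EX")))
        edges.add(((i, "EX"), (i, "WB")))
    for a, b in product(MICROOPS, repeat=2):
        # forall microop a, b: SameCore a b /\ ProgramOrder a b
        #   => an edge from a to b at every stage (in-order pipeline)
        if same_core(a, b) and program_order(a, b):
            for stage in STAGES:
                edges.add(((a, stage), (b, stage)))
    return edges

def outcome_edges(r1, r2):
    """Edges implied by the load values (assumption: memory accessed at WB)."""
    edges = set()
    # A load that reads 1 follows the store; a load that reads 0 precedes it.
    edges.add((("i2", "WB"), ("i3", "WB")) if r1 == 1 else (("i3", "WB"), ("i2", "WB")))
    edges.add((("i1", "WB"), ("i4", "WB")) if r2 == 1 else (("i4", "WB"), ("i1", "WB")))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
        succ.setdefault(dst, [])
    state = {}  # node -> "active" while on the DFS stack, "done" afterwards

    def visit(node):
        state[node] = "active"
        for nxt in succ[node]:
            if state.get(nxt) == "active":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in succ)

def observable(r1, r2):
    """Acyclic µhb graph => observable, cyclic => unobservable."""
    return not has_cycle(pipeline_edges() | outcome_edges(r1, r2))
```

On this toy pipeline, `observable(1, 0)` is false because the WB nodes form a cycle, so the outcome that SC forbids is also unobservable.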

#### PipeProof: Automated All-Program MCM Verif.



[Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Aarti Gupta. PipeProof: Automated Memory Consistency Proofs for Microarchitectural Specifications. The 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), October 2018.]


#### Verifying Across All Possible Programs

- Are all forbidden programs microarchitecturally unobservable?
  - If so, then microarchitecture is correct
- Infinite number of forbidden programs
  - E.g.: For SC, must check all possibilities of  $cyclic(po \cup co \cup rf \cup fr)$
- Prove using abstractions and induction
  - Based on Counterexample-guided abstraction refinement [Clarke et al. CAV 2000]



All non-unary cycles containing **fr** (Infinite set)

- Cycle = Transitive Chain (sequence) + Loopback edge (fr)
- Transitive chain: a sequence of ISA-level edges
- ISA-level **transitive chain** => Microarch. level **transitive connection**








#### The Transitive Chain (TC) Abstraction

- Infinite! The set of ISA-level cycles to check is unbounded
- Finite! With transitive connections, a finite set of abstract cycles covers them
- Acyclic graph with transitive connection => **Abstract Counterexample** (i.e. possible bug)
- Transitive connection (green edge) may represent one or multiple ISA-level edges







#### **Refinement Loop: Concretization**

- Replaces transitive connection with a single ISA-level edge
  - All concretizations must be unobservable
  - Observable concretizations are counterexamples (bugs)








- Inductively break down transitive chain
  - Additional constraints may be enough to make execution unobservable

| factorial(n)      | = | factorial(n-1)      | * | n                 |
|-------------------|---|---------------------|---|-------------------|
| Chain of length n | = | Chain of length n-1 | + | "Peeled-off" edge |

If decomposition is an abstract counterexample, repeat concretization and decomposition!

#### Results

- Ran PipeProof on simpleSC (SC) and simpleTSO (TSO<sup>1</sup>) µarches
  - 3-stage in-order pipelines
- TSO verification made feasible by optimizations
  - Explicitly checking all decompositions => case explosion
  - Covering Sets Optimization (eliminate redundant transitive connections)
  - Memoization (eliminate previously checked ISA-level cycles)

|            | simpleSC  | simpleSC<br>(w/ Covering Sets + Memoization) |
|------------|-----------|----------------------------------------------|
| Total Time | 225.9 sec | 19.1 sec                                     |

|            | simpleTSO | simpleTSO<br>(w/ Covering Sets + Memoization) |
|------------|-----------|-----------------------------------------------|
| Total Time | Timeout   | 2449.7 sec (≈ 41 mins)                        |



## **PipeProof Takeaways**

- First Ever Automated All-Program Microarchitectural MCM Verification
  - Designers get both completeness and automation of verification
  - Engineers can verify microarchitectures themselves, before RTL is written!
- Based on techniques from formal methods (CEGAR) [Clarke et al. CAV 2000]
- Transitive Chain (TC) Abstraction models infinite set of executions
- Accolades:
  - Nominated for Best Paper at MICRO 2018
  - "Honorable Mention" in 2018 IEEE Micro Top Picks of Comp. Arch. Conferences



## Talk Outline

- Overview and Motivation
- Memory Consistency Background
- PipeProof: All-Program Microarchitectural MCM Verification
- RTLCheck: MCM Verification of Verilog RTL
- Expanding to other domains
- Conclusion

#### What if I want to verify RTL (Verilog)?

#### **ISA-Level MCM**

$acyclic(po \cup co \cup rf \cup fr)$

**Microarchitectural Orderings**

Axiom "PO\_Fetch":
forall microop "i1", "i2",
SameCore i1 i2 /\ ProgramOrder i1 i2 =>
 AddEdge ((i1, IF), (i2, IF)).

Verified with PipeProof

[RTL Image: Christopher Batten]






[Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Michael Pellauer. RTLCheck: Verifying the Memory Consistency of RTL Designs. The 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), October 2017.]






## SystemVerilog Assertions (SVA)

- SVA: Industry standard for RTL verification, e.g.: ARM [Reid et al. CAV 2016]
  - Based on Linear Temporal Logic (LTL) with regular operators
- Commercial tools (e.g. JasperGold) can formally verify SVA assertions
- Translating µspec to SVA => RTL MCM verification using industry flows
- But it's not that simple!



#### Meaning can be Lost in Translation!

小心地滑 (Caution: Slippery Floor)

[Image: Barbara Younger] [Inspiration: Tae Jun Ham]



#### The µspec/SVA Mismatch

- Tricky to translate µspec to SVA while maintaining µspec semantics
- SVA Verifiers (JasperGold) don't implement full SVA spec!
  - Causes further complications

#### Example: Outcome Filtering

- Filtering litmus test executions to those that have particular values for loads



- In this case, outcome filtering is <u>easy and efficient</u>
- Know load values, so can draw (red) edges based on these values
  - Example: i4 reads 0 => i4 must read mem before write i1



| mp litmus test             |                              |
|----------------------------|------------------------------|
| Core 0                     | Core 1                       |
| (i1) x = 1;<br>(i2) y = 1; | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r1 = 1, r2 = 0 |                              |



## **Outcome Filtering with Temporal Logic**

assume property (a); // e.g. Load i4 returns 0
assert property (b); // e.g. i4 reads mem before write i1

//The above is equivalent to...
assert property ((always a) implies (always b));

In temporal logic syntax (G = always, F = eventually), this becomes:
G a -> G b = (~(G a)) \/ G b = (F ~a) \/ G b

- Assumptions introduce liveness: expensive to check! [Cerny et al. 2010]
- SVA verifiers approximate: only check assumptions until current state
  - This results in a property which is easier to check...
  - ...but makes outcome filtering impossible with such verifiers!
- RTLCheck Solution: Generate properties that handle all test outcomes
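The rewriting above, and the gap left by the approximation, can be checked on small finite traces. The sketch below is only an illustration under stated assumptions: real SVA semantics are defined over longer (possibly infinite) executions, and `approximate` encodes one plausible reading of "only check assumptions until current state", not the documented behaviour of any particular verifier. Each state is a pair (a holds, b holds).

```python
from itertools import product

def a(state):  # the assumption, e.g. "load i4 returns 0"
    return state[0]

def b(state):  # the assertion, e.g. "i4 reads mem before write i1"
    return state[1]

def always(trace, prop):      # G prop on a finite trace
    return all(prop(s) for s in trace)

def eventually(trace, prop):  # F prop on a finite trace
    return any(prop(s) for s in trace)

def exact(trace):
    # G a -> G b
    return (not always(trace, a)) or always(trace, b)

def rewritten(trace):
    # (F ~a) \/ G b
    return eventually(trace, lambda s: not a(s)) or always(trace, b)

def approximate(trace):
    # b is checked at step t whenever a has held at every step up to and
    # including t; later violations of a are not taken into account.
    return all(b(trace[t]) for t in range(len(trace)) if always(trace[: t + 1], a))

STATES = list(product([False, True], repeat=2))
TRACES = [t for n in range(1, 5) for t in product(STATES, repeat=n)]

# The two forms of the property agree on every trace of length 1 to 4.
EQUIVALENT = all(exact(t) == rewritten(t) for t in TRACES)

# a holds at first and is violated later, while b fails at the first step.
WITNESS = ((True, False), (False, True))
```

On `WITNESS` the exact property holds, since the assumption is eventually violated and the execution should be filtered out, but the approximate check reports a failure at the first step. This is the kind of mismatch that defeats outcome filtering.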

## **RTLCheck Takeaways**

- First automated RTL MCM verification for litmus test suites
- Engineers can check MCM properties of their RTL themselves
- Compatible with existing industry flows and tools
- Novel algorithms to translate µspec axioms to temporal SVA properties
  - Ongoing work: Formalise mismatch between µspec and SVA
- Discovered bug in memory implementation of RISC-V V-scale processor
- Accolades:
  - "Honorable Mention" in 2017 IEEE Micro Top Picks of Comp. Arch. Conferences


## Talk Outline

- Overview and Motivation
- Background on MCM Specification and Verification
- PipeProof: All-Program Microarchitectural MCM Verification
- RTLCheck: MCM Verification of Verilog RTL
- Expanding to other domains
- Conclusion



# Security Analysis with CheckMate [Trippel et al. MICRO 2018]

Work by another member of our research group (Caroline Trippel)

#### Her key insight: µhb graphs can be used for reasoning about security!

#### **Microarchitecture + OS Specification in Alloy**



[CheckMate: Automated Exploit Program Generation for Hardware Security Verification. Caroline Trippel, Daniel Lustig, and Margaret Martonosi. In Proceedings of the 51st International Symposium on Microarchitecture (MICRO), October 2018.]



# **Ongoing Work: Verifying Distributed Systems**

- Joint work with Themis Melissaris
- Distributed systems have some similarities to shared-memory systems
  - Distributed protocols (e.g. Paxos) similar to cache coherence protocols
  - Replicated data store consistency models similar to MCMs
- Also have features with no shared-memory analogue!
  - Correctness in the presence of node failures
  - Eventual consistency [Vogels CACM 2009]





## Talk Outline

- Overview and Motivation
- Background on MCM Specification and Verification
- PipeProof: All-Program Microarchitectural MCM Verification
- RTLCheck: MCM Verification of Verilog RTL
- Expanding to other domains
- Conclusion

## Conclusions

### Complexity of computing hardware is increasing

- Ubiquitous parallelism and increased heterogeneity

### Automated formal verification helps engineers handle this complexity

- Give engineers the ability to formally verify their systems themselves
- **PipeProof**: Automated All-Program Microarchitectural MCM Verification
- **RTLCheck**: Per-Program MCM Verification of RTL Designs

### Techniques for MCM analysis applicable to other domains

- e.g. Security [Trippel et al. MICRO 2018] and distributed systems



## **Collaborators**



**Margaret Martonosi** 



**Daniel Lustig** (NVIDIA)



Aarti Gupta



**Michael Pellauer** (NVIDIA)



**Caroline Trippel** 



Hongce Zhang




# **Backup Slides**

- Abstractly represent repeated ISA-level patterns
- Sometimes needed for refinement loop to terminate
- Inductively proven by PipeProof before their use in proof algorithms
- Example: checking for edge from i1 to i5 (TC abstraction support proof)
  - Abstract counterexample
  - Repeating ISA-level pattern: can continue decomposing in this way forever!
  - Chain invariant applied
    - po\_plus = arbitrary number of repetitions of po
    - Next edge peeled off will be something other than po
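The effect of the po\_plus chain invariant can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not PipeProof's implementation: a chain is modelled as a concrete list of ISA-level edge labels, `apply_chain_invariant` is a hypothetical helper name, and po\_plus is assumed to stand for one or more po edges.

```python
from itertools import groupby


def apply_chain_invariant(chain):
    """Collapse every maximal run of "po" edges into one "po_plus" edge.

    Assumption: po_plus abstracts one or more consecutive po edges, so the
    edge that follows a po_plus in the result is never another po.
    """
    abstracted = []
    for label, run in groupby(chain):
        if label == "po":
            abstracted.append("po_plus")  # the whole run becomes one abstract edge
        else:
            abstracted.extend(run)        # other edge labels are kept unchanged
    return abstracted


chain = ["po", "po", "po", "rf", "po", "fr", "fr"]
abstract_chain = apply_chain_invariant(chain)
print(abstract_chain)
```

However many po edges a chain contains, the abstracted chain has a bounded shape, which is why such an invariant can let the refinement loop terminate.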

- Each decomposition creates a new set of transitive connections
  - Can quickly lead to a case explosion
- The Covering Sets Optimization eliminates redundant transitive connections

Graph A has an edge from $x \rightarrow z$ (transitive connection)

Graph B has edges from $y \rightarrow z$ (transitive connection) and $x \rightarrow z$ (by transitivity)

Correctness of A => Correctness of B (since B contains A's transitive connection)

**Checking B explicitly is redundant!**
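The relationship between Graph A and Graph B above can be sketched in Python. This is a minimal illustration, not PipeProof's implementation: graphs are modelled as plain sets of directed edges, `transitive_closure` and `covering_sets` are hypothetical names, and the edge x → y in Graph B is an assumption made so that x → z follows by transitivity.

```python
from itertools import product


def transitive_closure(edges):
    """Return the transitive closure of a set of (src, dst) edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def covering_sets(candidates):
    """Drop every graph whose closure already contains a kept graph's edges.

    candidates maps a graph name to its set of edges. Graphs with fewer
    edges are visited first, so the weaker (more general) graph is kept.
    """
    kept = {}
    ordered = sorted(candidates.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for name, edges in ordered:
        closure = transitive_closure(edges)
        if any(kept_edges <= closure for kept_edges in kept.values()):
            continue  # covered by a kept graph: checking it would be redundant
        kept[name] = edges
    return kept


graph_a = {("x", "z")}
graph_b = {("x", "y"), ("y", "z")}  # x -> z follows by transitivity
result = covering_sets({"A": graph_a, "B": graph_b})
print(sorted(result))
```

Only Graph A survives, because every edge of A is implied by B.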



- Base PipeProof algorithm examines some cycles multiple times
- Memoization eliminates redundant checks of cycles that have already been verified





Same cycle is checked 3 times!

<u>Procedure:</u> If all ISA-level cycles containing edge r<sub>i</sub> have been checked, do not peel off r<sub>i</sub> edges when checking subsequent cycles
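The procedure can be tried out on a bounded toy version of the problem. The Python sketch below is an illustration under assumptions, not PipeProof's algorithm: cycles are modelled as tuples of ISA-level edge labels up to a fixed length (PipeProof itself reasons about all programs symbolically), and the function names are hypothetical.

```python
from itertools import product


def rotations(cycle):
    """All rotations of a cycle; they describe the same ISA-level cycle."""
    return {cycle[i:] + cycle[:i] for i in range(len(cycle))}


def cycles_naive(edge_types, max_len):
    """Every edge-label sequence up to max_len, rotations included."""
    return [c for n in range(1, max_len + 1)
            for c in product(edge_types, repeat=n)]


def cycles_memoized(edge_types, max_len):
    """Once all cycles containing edge r have been checked, stop peeling r."""
    done = set()
    checked = []
    for r in edge_types:
        allowed = [e for e in edge_types if e not in done]
        for n in range(1, max_len + 1):
            for rest in product(allowed, repeat=n - 1):
                checked.append((r,) + rest)
        done.add(r)  # every cycle containing r has now been covered
    return checked


EDGES = ["po", "rf", "fr"]
naive = cycles_naive(EDGES, 3)
memo = cycles_memoized(EDGES, 3)
memo_set = set(memo)
# Each naive cycle must still be covered by some rotation that was checked.
covered = all(rotations(c) & memo_set for c in naive)
print(len(naive), len(memo), covered)
```

In this toy setting the memoized enumeration visits fewer sequences while every cycle still has at least one checked rotation.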



## Filtering Invalid Decompositions

- When decomposing a transitive connection, the decomposition should guarantee the transitive connections of its parent abstract cexes.
- Decompositions that do not do this are invalid and filtered out







## The Adequate Model Over-Approximation

- Addition of an instruction can make unobservable execution observable!
- Need to work with over-approximation of microarchitectural constraints
- PipeProof sets all exists clauses to true as its over-approximation























execution that is observable) is often returned





# Mapping ISA-Level Edges to Microarchitecture

- Translate each edge in ISA-level cycle to microarchitectural constraints
- Do so with user-provided Mapping Axioms
- Example: Mapping of po edges

Blue edges between EX and WB stages added by other FIFO axioms (refer to µspec file)



Axiom "Mapping\_po":
forall microop "i",
forall microop "j",
(HasDependency po i j =>
 AddEdge ((i, Fetch), (j, Fetch), "po\_arch", "blue")).
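As a rough analogue of the `Mapping_po` axiom, the Python sketch below turns ISA-level po pairs into Fetch-to-Fetch edges. It is an illustration only: the tuple encoding of nodes and edges and the function name `map_po` are assumptions, not part of µspec or PipeProof.

```python
def map_po(po_pairs):
    """For every po-related pair (i, j), add an edge (i, Fetch) -> (j, Fetch).

    Each edge is a (source node, destination node, label) tuple, mirroring
    the AddEdge call in the axiom (the colour argument is omitted).
    """
    return {((i, "Fetch"), (j, "Fetch"), "po_arch") for i, j in po_pairs}


# Program order of the mp litmus test: i1 before i2 on Core 0, i3 before i4 on Core 1.
mp_po = [("i1", "i2"), ("i3", "i4")]
edges = map_po(mp_po)
for edge in sorted(edges):
    print(edge)
```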


#### Can "litmus tests" provide complete coverage?

Open question as to whether a set of litmus tests is complete

| mp Litmus Test         |         |
|------------------------|---------|
| Core 0                 | Core 1  |
| x = 1;                 | r1 = y; |
| y = 1;                 | r2 = x; |
| Forbid: r1 = 1, r2 = 0 |         |

Cyclic => Still unobservable

| sb Litmus Test         |         |
|------------------------|---------|
| Core 0                 | Core 1  |
| x = 1;                 | y = 1;  |
| r1 = y;                | r2 = x; |
| Forbid: r1 = 0, r2 = 0 |         |

Acyclic => BUG!

#### **Different tests catch different bugs!**

#### To catch all bugs, must verify across all programs!
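To make the "Forbid" outcomes concrete, the Python sketch below enumerates every sequentially consistent interleaving of mp and sb and collects the reachable (r1, r2) values. It only illustrates what SC allows at the ISA level; it does not model any microarchitecture, and the tuple encoding of instructions is an assumption made for this example.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that preserves each list's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def sc_outcomes(core0, core1):
    """Return the set of (r1, r2) values reachable under SC."""
    outcomes = set()
    for order in interleavings(core0, core1):
        mem = {"x": 0, "y": 0}  # all locations start at 0
        regs = {}
        for op, dst, src in order:
            if op == "st":
                mem[dst] = src        # ("st", address, value)
            else:
                regs[dst] = mem[src]  # ("ld", register, address)
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes


mp = ([("st", "x", 1), ("st", "y", 1)], [("ld", "r1", "y"), ("ld", "r2", "x")])
sb = ([("st", "x", 1), ("ld", "r1", "y")], [("st", "y", 1), ("ld", "r2", "x")])

mp_outcomes = sc_outcomes(*mp)
sb_outcomes = sc_outcomes(*sb)
print("mp:", sorted(mp_outcomes))
print("sb:", sorted(sb_outcomes))
```

Neither forbidden outcome appears in the corresponding set, and the two tests rule out different outcomes, which is consistent with different tests catching different bugs.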



- Don't filter based on outcome
  - Translate <u>all</u> possible outcomes
- Tag each case with appropriate load value constraints
  - Reflect the data constraints required for edge(s)
- Ongoing work: Precisely formalise the µspec/SVA mismatch
  - How much is fundamental? How much is due to SVA verifier approximation?

```
Axiom "Read_Values":
Every load either reads BeforeAllWrites OR reads FromLatestWrite

Property to check:
mapNode(Ld x -> St x, Ld x == 0) or mapNode(St x -> Ld x, Ld x == 1);
```

| mp                         |              |
|----------------------------|--------------|
| Core 0                     | Core 1       |
| (i1) x = 1;                | (i3) r1 = y; |
| (i2) y = 1;                | (i4) r2 = x; |
| SC Forbids: r1 = 1, r2 = 0 |              |

Note: Axioms and properties abstracted for brevity
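The idea of translating every outcome and tagging each case with its load value constraint can be sketched in Python. This is a simplified illustration, not RTLCheck's translation: `read_value_cases` and the dictionary layout are hypothetical, and with more than one store to the same address the FromLatestWrite cases would need extra ordering edges that are omitted here.

```python
def read_value_cases(load, stores, initial_value=0):
    """Enumerate the cases of a Read_Values-style axiom for one load.

    stores is a list of (store name, stored value) pairs to the same address.
    Each case pairs the ordering edge(s) with the load value they require.
    """
    cases = [{
        "kind": "BeforeAllWrites",
        "edges": [(load, store) for store, _ in stores],  # load precedes every store
        "load_value": initial_value,
    }]
    for store, value in stores:
        cases.append({
            "kind": "FromLatestWrite",
            "edges": [(store, load)],  # store precedes the load
            "load_value": value,
        })
    return cases


cases = read_value_cases("Ld x", [("St x", 1)])
for case in cases:
    print(case["kind"], case["edges"], "Ld x ==", case["load_value"])
```

For the mp example this yields the two disjuncts of the property on the slide: Ld x before St x with Ld x == 0, or St x before Ld x with Ld x == 1.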

#### Multi-V-scale: a Multicore Case Study

[Figure: Multi-V-scale diagram showing Core 0 (IF, DX, WB stages) and Memory]


# Bug Discovered in V-scale Mem. Implementation

- When two stores are sent to memory in successive cycles, first of two stores is <u>dropped</u> by memory!
- Bug would occur even in single-core V-scale
- Fixed bug by eliminating intermediate wdata reg




