^{*}In 5 minutes if one skips the introduction and the words in the Blocks of Grey Text.
Click here to hide the introduction and all Blocks of Grey Text.
Word count includes only words not in the Blocks of Grey Text.
0th minute: Introduction 382 words
In CS, when we talk about a problem, we are referring to a set of questions, where each question is an instance of the problem. “Is 7 a prime number?” and “Is 2513414231 a prime number?” are both instances of a problem that can be expressed as: “Is n a prime number?”
Problems can be classified based on the time or space complexity of the algorithms used to compute an answer for every instance of the problem. Among the most easy-to-understand NP-complete problems is the Boolean Satisfiability Problem (aka SATISFIABILITY, or SAT).
The Boolean Satisfiability Problem is also the first problem proven^{{2}} to be NP-complete. A problem is NP-complete if it belongs to the set (or “class” if you prefer) of the hardest problems in NP: hardest in the sense that every problem in NP can be reduced to them. (Thus being able to solve an NP-complete problem is equivalent to being able to solve every problem in NP.)
NP is the set of decision problems for which there exists a nondeterministic Turing machine that would output a value 1, or 0, after at most O(n^{k}) steps, where k is some constant and n is the size of the input. We often refer to such a number of steps (or running time) as polynomial, hence Nondeterministic Polynomial time, or NP.
A Turing machine is basically a mathematical object that formalizes the idea of an algorithm (in Turing’s model of computation). A nondeterministic Turing machine (NDTM) is, in simple words, a theoretical Turing machine that can, in a sense, engage in all possible computations while taking the same time/steps as a deterministic Turing machine. (Note: nondeterministic doesn’t mean probabilistic.) One way I like to think about a nondeterministic Turing machine is that it can only exist in a universe where time is a two-dimensional plane. A deterministic Turing machine (DTM), on the other hand, computes in one-dimensional time (and is defined in terms of a transition function, unlike an NDTM, which relies on a transition relation).
Besides the Boolean Satisfiability Problem, the prime factorization problem, “Given $X$ and $N$, where $1< X < N$, is there a $d$ where $X < d < N$ and $d$ is a prime factor of $N$?”, is also a problem in NP.
Decision problems^{{1}} are problems concerned only with the truth of some statement. That is to say, algorithms for solving decision problems output either True or False (often as a binary value). The problem above is a decision problem, and so is the problem “Is X a prime number?”
I will be diving deeper into computational complexity in an upcoming essay. If you are interested in theoretical CS, I recommend checking these out:
You can just think of NP as the set of problems whose solution can be verified by an efficient algorithm. And that is different from P, the set of problems whose solution can be found by an efficient algorithm. We can easily prove that each problem in P is also in NP, but we are not sure whether it is true that each problem in NP is too in P.
Take the prime factorization problem for example. At the moment, the most efficient algorithm (for integers larger than 100 digits) we can implement on pre-existing computers^{{3}} has a subexponential running time. But we can’t say with certainty that the problem is not in P. Many brave men and women have tried but failed to come up with a polynomial-time algorithm for every instance of the problem. It may simply be the case that we, members of the human species, are not intelligent enough to design such an algorithm. (Or it may be that $NP \not= P$ and the prime factorization problem is simply not in P, in which case we have yet to prove it.)
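The “verified easily vs found easily” distinction is easy to make concrete in code. Below is a minimal Python sketch for the factorization problem, with two simplifications that are mine, not the problem’s: it checks for any prime factor (dropping the $X < d < N$ bound), and trial division stands in for a genuine polynomial-time primality test such as AKS.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small illustrative inputs only."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def verify_certificate(n: int, d: int) -> bool:
    """Verifying a proposed solution: is d a prime factor of n? Cheap."""
    return 1 < d < n and n % d == 0 and is_prime(d)

def find_factor(n: int):
    """Finding a solution: naive search over every candidate. Slow."""
    for d in range(2, n):
        if verify_certificate(n, d):
            return d
    return None
```

The point is the asymmetry: `verify_certificate` does a constant amount of arithmetic on its inputs, while `find_factor` may loop through exponentially many candidates in the number of digits of `n`.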
If we ever come up with an algorithm that can solve all instances of the Boolean Satisfiability Problem in polynomial time, it would mean that $P = NP$, since any problem in NP can be reduced to the Boolean Satisfiability Problem. (Here’s a courageous attempt.)
^{{1}}: More formally, a decision problem is defined as a set of strings $L$ over an alphabet $\Sigma$, for which there exists a Turing machine TM,
So, by the formal definition, the set of prime numbers is a decision problem. And this decision problem is known as PRIMES. In 2002, it was proven that PRIMES is in P.
Here $L$ is often referred to as a language over $\Sigma$.
^{{2}}: First proven by Cook in his 1971 paper, The Complexity of Theorem-Proving Procedures. The result was later known as the Cook–Levin theorem.
^{{3}}: By “pre-existing computers” I mean classical ones that don’t rely on quantum mechanics (for computation). The most efficient factorization algorithm we know of in the 21st century is the one described in Shor’s 1995 paper, a polynomial-time quantum algorithm that makes use of the quantum Fourier transform. The sad (or good) news is that there does not yet exist a quantum computer powerful enough to put the algorithm to any practical use.
1st minute: What is the Boolean Satisfiability Problem? 334 words
The Boolean Satisfiability Problem (or SAT) is the problem of determining whether a propositional statement^{*} is satisfiable. A propositional statement is satisfiable when it is possible to assign some true/false values to the variables in the statement such that the statement yields True. Otherwise the statement is unsatisfiable.
A propositional statement is simply a string made up of variables, brackets and these 3 symbols:
$\neg$ with the meaning not.
$\neg True = False $
$\land$ with the meaning and.
$ False \land True = False $
$\lor$ with the meaning or.
$ False \lor True = True $
These symbols are also known as logical connectives. They’re arranged in the order of precedence above. Here is an example of a propositional statement:
And here is an instance of The Boolean Satisfiability Problem:
The answer is yes, it is satisfiable. One solution:
$b$ assigned: Any value
$c$ assigned: Any value
$d$ assigned: True
Meanwhile, the answer to the question
is no, it is unsatisfiable. That is because there is no value we can assign to $a$ such that $a \land \neg a$ would yield True, and therefore it is impossible for the entire statement to yield True.
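For statements this small, satisfiability can be decided by sheer enumeration. Here is a quick Python sketch (the helper name and the encoding of a statement as a function are my own illustrative choices), checking the unsatisfiable statement $a \land \neg a$:

```python
from itertools import product

def is_satisfiable(variables, formula):
    """Try every True/False assignment. `formula` is any function mapping
    an assignment dict to a bool. Returns (satisfiable?, witness)."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return True, assignment
    return False, None

# a ∧ ¬a: no assignment can make this yield True
sat, _ = is_satisfiable(["a"], lambda v: v["a"] and not v["a"])
```

Of course, this loop runs through $2^n$ assignments for $n$ variables, which is exactly why SAT is interesting.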
^{*}: More technically, it is known as a “formula” or “wff” (“well-formed formula”). “Well-formed” in the sense that it is syntactically correct, such that it can be interpreted in our context. “Formula” in the sense that it is something formed by putting symbols together.
“$\land\land\land A \land \lor B B B$”, for example, is a statement but not a formula, since it is not well-formed.
p.s. In order to know how a symbol works here, we rely on its association with the meaning of some lexical item in English, like the word “not” in the case of $\neg$, and our intuitive understanding of how “not” functions. We need to keep in mind that the symbols by themselves have no absolute meaning: some are even interpreted differently in different systems of logic (classical logic vs intuitionistic logic, for example). But that is outside the scope of this essay, so for the moment let’s just take it that these symbols indeed mean what the words mean.
p.s. Often in propositional logic, you would encounter these symbols as well:
$\Rightarrow$ with the meaning implies (if .. then ..). [$\Leftarrow$ for the other direction]
$ (False \Rightarrow True) = True $^{{4}}
$\Leftrightarrow$ with the meaning if and only if (twoway implication).
$ (False \Leftrightarrow True) = False $
They are omitted because statements that use these symbols can be rewritten using the basic Boolean operations: $\lor$, $\land$, and $\neg$.
Since our main concern is satisfiability here, we’d like to be minimalist, and exclude symbols from our alphabet that are not necessary.
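That rewriting claim is mechanically checkable: define $\Rightarrow$ and $\Leftrightarrow$ in terms of $\neg$ and $\lor$, then compare against their truth tables. A quick Python sketch:

```python
def implies(p, q):
    return (not p) or q                   # p ⇒ q rewritten as ¬p ∨ q

def iff(p, q):
    return implies(p, q) and implies(q, p)  # p ⇔ q as two-way implication

# The rewrites agree with the usual truth table for implication.
IMPLIES_TABLE = {(True, True): True, (True, False): False,
                 (False, True): True, (False, False): True}

for (p, q), expected in IMPLIES_TABLE.items():
    assert implies(p, q) == expected
```

Since both connectives reduce to $\lor$, $\land$, and $\neg$, leaving them out of the alphabet loses no expressive power.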
p.s. In model theory, we can view the assignment of boolean values to the variables in a statement as an interpretation of the statement, and we write
$$I \models F$$
where $I$ is the interpretation and $F$ is the statement (or more technically, the formula^{{1}}), and we say $I$ is a model of $F$. With this in mind, we can say that two formulas $F_1$ and $F_2$ are equivalent as long as each model of $F_1$ is a model of $F_2$, and vice versa.
SAT is a member of a family of problems known as the Constraint Satisfaction Problems (or CSPs). CSP is basically a generalization of SAT, where both of the following are to be defined:

the values a variable can be assigned to (often referred to as domain)

the constraints on what makes a satisfiable statement
Another member of the CSPs is the Satisfiability Modulo Theories Problem (or SMT), where, instead of propositional statements (like the ones we are dealing with in SAT), it is the problem of determining the satisfiability of first-order logic statements, and whether a first-order logic statement is satisfiable depends on the theory of our choice.
Let’s say the theory of our choice is linear arithmetic. Then we would say that the statement below is satisfiable if the variables can take any value in $\mathbb{R}$, the set of real numbers.
It is, however, unsatisfiable if we define $\mathbb{Z}$, the set of integers, to be the domain instead.
● ● ●
Extra (skip this if you want) 213 words
When a satisfiable statement always yields True no matter what values the variables take (every set of true/false assignments to the variables is a solution), we call it a tautology. Here is one:
Interestingly, the problem of determining whether some statement is not a tautology can be easily reduced to the Boolean Satisfiability Problem.
Since a tautology always yields True, negating a tautology gives us a statement that always yields False, and that, my friend, is an unsatisfiable statement. The problem of determining whether some statement $V$ is not a tautology is thus equivalent to the problem of determining whether the negation of $V$ is satisfiable. And that means the “Not-Tautology problem” is also in NP.
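The reduction is a one-liner. Here is a brute-force sketch in Python (helper names are mine; enumeration is fine only for tiny formulas):

```python
from itertools import product

def is_satisfiable(variables, f):
    """Brute-force SAT check over every True/False assignment."""
    return any(f(dict(zip(variables, vs)))
               for vs in product([True, False], repeat=len(variables)))

def is_not_tautology(variables, f):
    """V is not a tautology  ⇔  ¬V is satisfiable."""
    return is_satisfiable(variables, lambda v: not f(v))
```

For example, $a \lor \neg a$ is a tautology, so `is_not_tautology` reports False for it, while the lone variable $a$ (falsifiable by $a = False$) is not a tautology.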
On the other hand, the Boolean Unsatisfiability Problem is in what’s known as coNP, and so is the problem of determining whether a statement is a tautology.
coNP can be thought of as the set of problems whose “no” answers can be verified efficiently. At the moment, “NP = coNP?” remains an open question.
Formally, a coNP problem is a language^{*} over some alphabet Σ for which there exists a Non-Deterministic Turing Machine that would output 0 in polynomial time after taking in any string from the language as input.
Note that this is different from the definition of an NP problem: a language^{*} for which there exists a Non-Deterministic Turing Machine that would output 1 in polynomial time after taking in any string from the language as input.
* : If you are confused by the usage of the word “language”, read {1} (in the block of Grey Text) in 0th minute: Introduction.
All problems in P are in (NP $\cap$ coNP). Among other known problems in (NP $\cap$ coNP) is the prime factorization problem.
● ● ●
From the 2nd minute on we’d be

using the word formula (instead of “statement”) [For the reason, see ^{{1}} in 1st minute: What is the Boolean Satisfiability Problem? if you haven’t]

using SAT as the abbreviation for the Boolean Satisfiability Problem
2nd minute: Classifying SAT 354 words
We can classify instances of SAT based on what form the formulas are in.
3SAT is the set of instances of SAT where the formulas are of the form
$$\bigwedge_{j \in J} (a_j \lor b_j \lor c_j)$$
where $a_j,b_j,c_j$ are called literals. A literal is either an atomic formula or its negation. One example:
An atomic formula is a formula that has no deeper structure: such as “$a$” in “$a \land b$”.
$\bigwedge$ works the same as a loop. We can arrive at the formula above by taking $J = \{ 1,2 \} \text{ and }a_1 = a_2 = x, b_1 = \neg y, b_2 = c_1 = \neg x, c_2 = z $.
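To make the loop analogy concrete, here is a small Python sketch that evaluates a 3CNF formula exactly this way. The encoding of a literal as a (variable, polarity) pair is my own illustrative choice:

```python
def eval_3cnf(clauses, v):
    """Evaluate a 3CNF formula under assignment dict v. The big ∧ over
    j ∈ J is just a loop; each clause is an ∨ of three literals."""
    def lit(name, positive):
        return v[name] if positive else not v[name]
    result = True
    for a, b, c in clauses:               # ⋀ over the clauses
        result = result and (lit(*a) or lit(*b) or lit(*c))
    return result

# J = {1, 2} with a1 = a2 = x, b1 = ¬y, b2 = c1 = ¬x, c2 = z:
# (x ∨ ¬y ∨ ¬x) ∧ (x ∨ ¬x ∨ z)
clauses = [(("x", True), ("y", False), ("x", False)),
           (("x", True), ("x", False), ("z", True))]
```

Note that each clause here contains both $x$ and $\neg x$, so this particular formula happens to be a tautology: the loop returns True under every assignment.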
This is also known as the Conjunctive^{{1}} Normal Form (CNF). To be more precise, the form above is called 3CNF (because it has 3 literals in each clause^{{2}}). kCNF is the type of CNF with k literals in each clause.
More generally, any formula is in CNF as long as it’s of the form:
$$\bigwedge_{i \in I} \Big( \bigvee_{j \in J_i} l_{i,j} \Big)$$
Clauses in a general CNF formula need not contain the same number of literals. Examples:
Instances of SAT on formulas in kCNF are called kSAT, where k is a natural number. The great thing about 3SAT is that all instances of SAT are reducible to 3SAT. And that means 3SAT is also NP-complete. 2SAT, on the other hand, is SAT on 2CNF formulas, and it is not NP-complete.
2SAT is in P (and NL-complete), and not all instances of SAT can be reduced to 2SAT (or else we could conclude that P = NP).
Lastly, we shall talk about HORNSAT, the instances of SAT with Horn formulas. When each clause of a formula contains at most 1 positive literal, we call it a Horn^{{3}} formula. Here are some examples:
Not only is HORNSAT in P (just like 2SAT), it is also P-complete, and solvable in linear time. (That doesn’t mean P = LIN, though.)
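To see why HORNSAT is easy, here is a Python sketch of the classic forward-chaining (marking) idea: start with every variable False and set a variable True only when a clause forces it. As written the loop is quadratic; real linear-time versions keep, per clause, a counter of not-yet-satisfied body literals. The clause encoding is my own.

```python
def horn_sat(clauses):
    """Decide HORNSAT by forward chaining. Each clause is (body, head):
    body a frozenset of variables, head a variable or None, encoding
    (b1 ∧ ... ∧ bn) ⇒ head; head None stands for ⇒ False."""
    true_vars = set()
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if body <= true_vars and head not in true_vars:
                if head is None:
                    return False, None    # a goal clause is violated: UNSAT
                true_vars.add(head)       # the head is forced True
                changed = True
    return True, true_vars
```

The returned set of True variables is the smallest model of the formula, which is a special property of Horn formulas.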
^{{1}}: Conjunctive, since it’s formed by $\land$. It yields False if one of its clauses yields False.
^{{2}}: Formally, a clause is defined to be a disjunction of literals:
Disjunctive, since it’s formed by $\lor$. It yields True if one of its literals yields True.
^{{3}}: Initially, in propositional logic, a Horn clause is of the form
In SAT, since we don’t use the $\Rightarrow$ symbol, the above is rewritten as:
or, in the case that $b$ is an absolute True, we have $(\bigwedge_{a \in A} a) \Rightarrow True$, which gives us
or, in the case that $\bigwedge_{a \in A} a$ is an absolute True, we have $True \Rightarrow b$, giving us
This pretty much sums up generally how instances of SAT are classified. If you are interested in classifying SAT in a more ‘hard core’ manner (viewing SAT itself as a classification problem), check out this 2008 paper by David Devlin and Barry O’Sullivan.
Perhaps you’ve also heard of problems with names like MAXSAT and #SAT. Instances of these problems are not really instances of SAT; it’d be more accurate to say that they are variations of SAT. MAXSAT is the problem of finding a boolean assignment that satisfies the maximum number of clauses in a CNF formula, and #SAT is the problem of counting the number of satisfying assignments.
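Both variations are easy to state in code, if we settle for brute force (fine for tiny formulas only; clauses and formulas are encoded as functions, an illustrative choice of mine):

```python
from itertools import product

def max_sat(variables, clauses):
    """MAXSAT by brute force: the most clauses any single assignment
    satisfies. Each clause maps an assignment dict to a bool."""
    return max(sum(1 for c in clauses if c(dict(zip(variables, vs))))
               for vs in product([True, False], repeat=len(variables)))

def count_sat(variables, formula):
    """#SAT by brute force: how many assignments satisfy the formula."""
    return sum(1 for vs in product([True, False], repeat=len(variables))
               if formula(dict(zip(variables, vs))))
```

For instance, $a \lor b$ has three satisfying assignments out of four, and the two-clause formula $\{a, \neg a\}$ can never have more than one clause satisfied at once.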
3rd minute : SAT Solvers  General Overview 337 words
There are many known algorithms capable of solving some (but not all) instances of SAT pretty efficiently. We often refer to them as SAT solvers. In general, they can be classified as complete or incomplete.
Complete SAT solvers always return either satisfiable or unsatisfiable. Incomplete SAT solvers, on the other hand, are less reliable (they may be unable to prove unsatisfiability when it’s impossible for the formula to yield True, or unable to find a solution when the formula is satisfiable), but they can be a lot more useful in scenarios where the instances of SAT cannot be solved by complete algorithms in reasonable time.
To make things simpler, SAT solvers normally take in formulas in CNF, encoded in DIMACS format, as inputs. Suppose we have a formula that isn’t in CNF; we would first convert it into CNF before feeding it into a SAT solver.
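DIMACS CNF is plain text: “c” comment lines, a “p cnf &lt;variables&gt; &lt;clauses&gt;” header, and clauses as whitespace-separated integers terminated by 0 (a negative integer is a negated variable). A minimal Python reader might look like this; note it assumes one clause per line, which the format does not strictly require:

```python
def parse_dimacs(text):
    """Minimal DIMACS CNF reader. Returns (num_vars, clauses); each clause
    is a list of nonzero ints, negative meaning a negated variable."""
    num_vars, clauses = 0, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("c"):
            continue                      # skip blanks and comments
        if line.startswith("p"):
            _, _, nv, _ = line.split()    # "p cnf <vars> <clauses>"
            num_vars = int(nv)
            continue
        lits = [int(tok) for tok in line.split()]
        assert lits and lits[-1] == 0     # every clause ends with 0
        clauses.append(lits[:-1])
    return num_vars, clauses

example = """c (x1 v -x2) and (x2 v x3)
p cnf 3 2
1 -2 0
2 3 0
"""
```

Here `parse_dimacs(example)` yields 3 variables and the clauses `[[1, -2], [2, 3]]`.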
To compare the performance of different SAT solvers, we give them a large set of formulas and see how well each performs (e.g. by the total time taken, the number of formulas solved in n minutes, or using a methodology called careful ranking). Such a set of formulas is often referred to as a benchmark instance. A solver that previously lost to another solver may do better if a different type of benchmark instance is used.
In the SAT 2014 competition, Lingeling beat all the other solvers when evaluated using application-type benchmark instances, but lost to glucose when hard-combinatorial-type instances were used.
Due to the NP-complete nature of SAT, we can use SAT solvers to solve real-world problems in areas like circuit design (for Combinational Equivalence Checking) and artificial intelligence (for Automated Planning and Scheduling). Application-type benchmark instances consist of formulas from real-world problems.
Many modern SAT solvers are based on the original DPLL algorithm designed in the 1960s. In summary, DPLL is a complete algorithm that works by assigning a variable some boolean value, then checking whether this breaks the overall satisfiability: if it doesn’t, pick a new variable and check again; if it does, alter the assigned value and check again (if both True/False values have been tried, go back to the previous variable, alter it and check). We repeat this until we arrive at either one of these two ends:

All variables are assigned some boolean value without breaking the overall satisfiability, and hence the formula is satisfiable.

No matter what boolean values a set of variables is assigned, the overall satisfiability breaks, and hence the formula is unsatisfiable.
4th and 5th minutes : SAT Solvers  DPLL and More 613 words
As you may have already imagined, DPLL is basically a depth-first search, and that means at its heart it uses backtracking: a brute-force-like technique that, rather than checking every single possible solution one by one (which is what’s known as brute force), builds candidate solutions part by part in a combinatorial fashion [e.g. building a solution variable by variable as described above], discarding whole sets of candidates that are deemed invalid by their parts.
In short, a backtracking algorithm is very similar to a classical bruteforce algorithm, except that it is slightly smarter and thus more efficient. Here is an example. Imagine you are given 9 numbers
and your task is to find whether there is a set of $n$ numbers adding up to a prime number $x$, where $n > 3$ and $x < 14000$. One example of an algorithm that uses backtracking would first
form a set of 3 numbers (e.g. first 3 numbers $4015,1635,8744$)
get its sum (in this case $4015+1635+8744=14394$ )
check if it makes sense to go on adding more numbers into this set of 3 numbers.
In our case $14394>14000$, so it makes no sense to go on; we abandon the last number, 8744, in our set, make a new set of 3 numbers $4015,1635,1236$, and do the same check to see whether it makes sense to go on (in this case $4015+1635+1236=6886$). Since this did not exceed the limit of 14000, we continue to
add a new number into the set
get its sum (in this case it didn’t exceed 14000, so we can continue)
now that $n>3$ (we have 4 numbers), we check whether the sum is prime: if not, we add more numbers (repeat step 1)
once we exceed the limit, we abandon the last number just like before (hence the backtracking)… We keep doing so until we have
either tried every sensible combination (sum < 14000), but there is no prime
or we arrive at a solution, a set of numbers that meets our criteria.
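The steps above can be written as a short backtracking search in Python. The 9 numbers of the example aren’t reproduced here, so the sketch works on any list of positive numbers; the “does it make sense to go on” test is exactly the partial-sum prune described above:

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def search(numbers, chosen=(), start=0, limit=14000):
    """Backtracking over subsets: extend `chosen` one number at a time,
    pruning any branch whose partial sum already reaches the limit
    (the numbers are positive, so the sum can only grow)."""
    total = sum(chosen)
    if total >= limit:
        return None                       # abandon this branch: backtrack
    if len(chosen) > 3 and is_prime(total):
        return chosen                     # n > 3 numbers with a prime sum
    for i in range(start, len(numbers)):
        found = search(numbers, chosen + (numbers[i],), i + 1, limit)
        if found:
            return found
    return None
```

Unlike plain brute force over all $2^9$ subsets, one failed prefix here discards every extension of that prefix in a single step.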
Read up on the Eight queens puzzle and its backtracking algorithm and you’ll have a complete grasp of the concept (if you don’t already).
The other 2 techniques classical DPLL uses are

pure literal elimination: The idea is simple. If a literal appears only as positive or only as negative (but not both, and hence we call it pure) in the formula,

we assign a value to the literal for it to yield True.
and delete all the clauses that contain it from the formula.
This is because, since it is pure, we can just make it yield True and it won’t have any other consequence for the overall satisfiability.
In the formula above, $a$ and $c$ are both pure, so we can give $a = True$ and $c = False$, and delete all the clauses that contain them. After that we’d end up with no clauses. And that means the formula is satisfiable with
$a$: True
$b$: Any value
$c$: False


unit propagation: It is obvious that for clauses made up of 1 literal, that 1 literal has to evaluate to True for the formula to yield True. Therefore
we assign a value for the literal to yield True

delete the clauses that contain the literal, and remove instances of it in opposite polarity from the remaining clauses (the opposite of positive is negative, and vice versa).
Take this CNF for example:
After the 1st round of unit propagation, we would have $a = True$ and reduce the original formula to:
Great! We can do a 2nd round of unit propagation with $\neg b$: set $b = False$ and reduce it to:
And here we have it, the original formula is satisfiable with
$a$: True
$b$: False
$c$: False
$d$: Any value
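The formulas in the two walkthroughs above are shown as images, so here is a sketch of both techniques on hypothetical stand-in formulas, with DIMACS-style integer literals (1 = $a$, -1 = $\neg a$, and so on):

```python
def pure_literal_eliminate(clauses):
    """Assign every pure literal so it yields True and delete the clauses
    containing it, repeating until no pure literal is left."""
    assignment = {}
    while True:
        literals = {lit for clause in clauses for lit in clause}
        pure = {lit for lit in literals if -lit not in literals}
        if not pure:
            return clauses, assignment
        for lit in pure:
            assignment[abs(lit)] = lit > 0
        clauses = [c for c in clauses if not pure.intersection(c)]

def unit_propagate(clauses):
    """Repeatedly pick a unit clause, make its literal True, delete the
    clauses it satisfies, and strip its opposite literal from the rest."""
    assignment = {}
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            return clauses, assignment
        assignment[abs(unit)] = unit > 0
        clauses = [[l for l in c if l != -unit]
                   for c in clauses if unit not in c]

# (a ∨ ¬b) ∧ (a ∨ b ∨ ¬c) ∧ (¬c ∨ b): a and ¬c are pure, b is not.
pure_left, pure_asg = pure_literal_eliminate([[1, -2], [1, 2, -3], [-3, 2]])

# (a) ∧ (¬a ∨ ¬b) ∧ (b ∨ ¬c): propagation alone settles a, b and c.
unit_left, unit_asg = unit_propagate([[1], [-1, -2], [2, -3]])
```

Both calls end with an empty clause list, i.e. the formulas are satisfied, with any unmentioned variable free to take any value. A conflict shows up as an empty clause instead: `unit_propagate([[1], [-1]])` returns `([[]], {1: True})`.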
● ● ●
When we end up in a situation where there is an empty clause (a clause with 0 literals), it indicates that there is a set of variables such that no matter what values we assign to them, we can’t satisfy the formula. And thus it is unsatisfiable.
Here is an example where after one round of unit propagation, we’d have an empty clause, indicating that no matter what values we assign $a$ and $b$, it won’t satisfy the formula.

More technically, we would say that unit propagation and pure literal elimination are used at each decision level. We start off at decision level 0, and the decision level increases by 1 each time we assign a boolean value to a variable.
We often call such variables “assigned variables”. And that is different from “implied variables”: variables that get their values from pure literal elimination and unit propagation.
Here is a simple implementation of classical DPLL in Haskell:
To run it, simply invoke dpllStart with a parameter in the form ([Int], [[Int]]), and it would return either UNSAT with an empty list or SAT with a list of boolean assignments.
I’ve integrated a parser for parsing CNF formulas in DIMACS format into the implementation above.
Below is DPLL in pseudocode with JavaScript-like syntax. (I have omitted the array that stores variable assignments.)
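The Haskell listing and the pseudocode are not reproduced here, but the same classical DPLL (unit propagation, pure literal elimination, chronological backtracking) fits in a short Python sketch of my own, again with DIMACS-style integer literals:

```python
def assign(clauses, lit, assignment):
    """Make lit yield True: drop satisfied clauses, strip the opposite literal."""
    return ([[x for x in c if x != -lit] for c in clauses if lit not in c],
            {**assignment, abs(lit): lit > 0})

def dpll(clauses, assignment=None):
    """Classical DPLL sketch. Returns a satisfying assignment dict
    (unmentioned variables may take any value), or None for UNSAT."""
    if assignment is None:
        assignment = {}
    # unit propagation
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        clauses, assignment = assign(clauses, unit, assignment)
    if [] in clauses:
        return None                       # empty clause: conflict, backtrack
    if not clauses:
        return assignment                 # no clauses left: satisfiable
    # pure literal elimination
    literals = {l for c in clauses for l in c}
    for l in literals:
        if -l not in literals:
            return dpll(*assign(clauses, l, assignment))
    # branch: try both values of the first literal of the first clause
    l = clauses[0][0]
    result = dpll(*assign(clauses, l, assignment))
    if result is not None:
        return result
    return dpll(*assign(clauses, -l, assignment))
```

The two return paths in the middle are exactly the “two ends” from the overview: all clauses satisfied (SAT) or an empty clause forced no matter what (backtrack, and UNSAT once both branches fail).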
● ● ●
To improve the performance of a SAT solver, we can use a heuristic function to determine which variable to assign a boolean value to at each decision (rather than following a fixed order). We often refer to this as a decision heuristic. One example is VSIDS, introduced in CHAFF, a SAT solver that won the 2006 SAT competition.
We often call something a heuristic when there is not much rigorous maths behind it, so it is not guaranteed to be optimal or perfect, but it works well after some tests, and that is the main reason why it is implemented and used. Usually, a heuristic function (or algorithm) is characterized by being simple, quick and effective.
The idea of VSIDS is simple: a variable is chosen for each decision according to a ranking of the variables that changes over time. Each variable is ranked by its “activity score”, a floating-point value that is increased by some constant, $C$, every time the variable shows up in a newly generated clause through a process called “conflict clause learning”. Every now and then, all activity scores are divided by another constant, $N$, hence shrinking over time, and “decaying” in that sense. For that reason, it is called VSIDS: Variable State Independent Decaying Sum.
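A minimal sketch of that bookkeeping, with illustrative constants (real solvers tune $C$ and $N$, and use clever data structures instead of a plain dict):

```python
class VSIDS:
    """Sketch of VSIDS-style activity scores: bump on learned clauses,
    decay periodically, branch on the highest-activity variable."""
    def __init__(self, variables, c=1.0, n=2.0):
        self.activity = {v: 0.0 for v in variables}
        self.c, self.n = c, n

    def on_learned_clause(self, clause_vars):
        for var in clause_vars:           # bump: var appears in a new clause
            self.activity[var] += self.c

    def periodic_decay(self):
        for var in self.activity:         # every now and then, shrink scores
            self.activity[var] /= self.n

    def pick_branch_var(self, unassigned):
        """Decision heuristic: choose the most active unassigned variable."""
        return max(unassigned, key=self.activity.__getitem__)
```

The decay is what makes the ranking favor variables involved in recent conflicts: old bumps fade away while fresh ones stay large.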
To learn more about VSIDS you can check out Understanding VSIDS Branching Heuristics in ConflictDriven ClauseLearning SAT Solvers and The Effect of VSIDS on SAT Solver Performance.
Conflict clause learning is a technique that generates a “conflict clause” by using an implication graph to examine how a conflict arises after unit propagation. A conflict happens when all literals in a clause are forced to yield False (which is the case when we have an empty clause, if our algorithm deletes False literals). Thus the process above is often referred to as conflict analysis. If you are interested in learning more about it, check out this great article (with pictures) by msoos from Wonderings of a SAT geek. Wanna dive deeper? You’d certainly find these papers useful:
Towards Understanding and Harnessing the Potential of Clause Learning
Efficient Conflict Driven Learning in a Boolean Satisfiability Solver.
SAT solvers that use this technique are also known as Conflict-Driven Clause-Learning (CDCL) SAT solvers.
Other ways of improving performance of DPLLbased algorithms include

adding conflict clauses (generated by conflict analysis as described above) into the formula, to prevent the solver from getting into the same conflict again.

implementing unit propagation using a technique that watches 2 literals at a time, instead of keeping track of all literals, in which case you’d need to sacrifice pure literal elimination. To learn more, read this 2001 paper.

Other than using heuristics for decisions as described above, we can use heuristics for backtracking. That is actually what’s done in many CDCL SAT solvers.
Backtracking like the one in classical DPLL is called chronological backtracking, chronological in the sense that we simply go back to the previous decision level. In heuristic backtracking, we go back to a decision level (or backtrack point) chosen by a heuristic function. The principle behind it is that we should learn from conflicts and avoid the mistake of trying out all the different variable assignments below a decision level, only to find out that every assignment at that level leads to unsatisfiability. That is to say, we rely on the heuristic to help us figure out whether the boolean assignment at some previous decision level may have been a bad pick, so that we should backtrack all the way up.

implementing some restart policies, which force the solver to backtrack to decision level 0 (or some other level) when some condition is met. This is to prevent the solver from getting “stuck” at some part of the search space for a long time. To learn more about it, check out The Effect of Restarts on the Efficiency of Clause Learning and Dynamic Restart Policies.
and many, many more.
End Note  Further reading:

Boolean Satisfiability: Theory and Engineering (2014) 755 words

Why is 2SAT in P? (a CS.SE post) 300+ words

The Quest for Efficient Boolean Satisfiability Solvers (2002) ~9k words

The SAT Phase Transition (1994) ~5k words

Satisfiability Solvers a chapter from Handbook of Knowledge Representation

Empirical Study of the Anatomy of Modern Sat Solvers ~5k words

Handbook of Satisfiability a book

Algorithms for the Satisfiability Problem: A Survey (1996) ~60k words

Theory and Applications of Satisfiability Testing (SAT 2014), a book. Also available on Springer.