Recently, I’ve been trying to write a language server for Souffle. One of the most important jobs of a language server is to provide a list of auto-completion items to the editor. If you have used any modern editor/IDE, you know what I’m talking about, and you know this feature can be very smart!
Well, the functionality comes from two parts. First, the server understands your request’s semantic context: e.g., when your cursor stops at `class_name->[cursor]`, it gives you a list of attribute names of that class.
Second, the server fuzzy-matches your input against a list of candidates so that you don’t need to type the precise prefix to get a match. For example, with Clangd, “uptr” can be matched to “unique_ptr”.
I’m gonna cover one of the algorithms for implementing fuzzy matching: the Needleman-Wunsch (NW) algorithm. NW is used mostly (or originated) in bioinformatics; due to the terminology and paradigm differences, it is a bit tedious to go through the materials in that field. So I’ll talk about it from a SE’s perspective. Later, I’ll show you how native NW does not work directly in auto-completion use cases and requires extra tuning. There are also two implementations I found that use NW 1, ccls and clangd. You can find a corresponding implementation in the clangd source code for every part I cover in this post.
Needleman-Wunsch
The general goal of NW is to calculate a “score” between the pattern string $P$ and the word string $W$. The score is calculated by deciding what kind of mutations need to be applied to transform $P$ into $W$.
Let’s first define some notation. Let $NW(P, W)$ be our optimal solution for strings $P$ and $W$, given $P = P_0 \ldots P_n$ and $W = W_0 \ldots W_m$. At each position, we have three choices, and each choice results in either a score gain or a score loss. Starting from the last position $(n, m)$, we can:
- Match $P_n$ with $W_m$: this results in a gain if the two characters match; otherwise, we lose score. Then continue matching $P[0:n-1]$ with $W[0:m-1]$.
- Ignore $P_n$ or $W_m$: this results in a score loss. If we ignore $P_n$, we continue matching $P[0:n-1]$ with $W[0:m]$, and vice versa.
- Ignore both $P_n$ and $W_m$: this results in a (higher) score loss. Continue matching $P[0:n-1]$ with $W[0:m-1]$.
The intuition behind those rules should be pretty straightforward. All we need to do is calculate the best score we can possibly obtain. The formulation comes naturally as a dynamic programming problem: \[ NW(P[0:n], W[0:m]) = \max \begin{cases} s + NW(P[0:n-1], W[0:m-1]) \newline s + NW(P[0:n], W[0:m-1]) \newline s + NW(P[0:n-1], W[0:m]) \end{cases} \]
Where $s$ is either the bonus or the penalty, depending on the choice. I’ll skip the tedious implementation details and talk about how we implement the heuristics.
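To make the recurrence concrete, here is a minimal sketch of it in Python. This is my own toy version, not clangd’s implementation, and the score values (+2 match, −1 mismatch, −1 gap) are illustrative assumptions:

```python
def nw_score(pattern: str, word: str,
             match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    n, m = len(pattern), len(word)
    # dp[i][j] = best score aligning pattern[:i] against word[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap      # ignore pattern characters
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap      # ignore word characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if pattern[i - 1] == word[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + s,   # match / mismatch P_i and W_j
                dp[i][j - 1] + gap,     # ignore W's character
                dp[i - 1][j] + gap,     # ignore P's character
            )
    return dp[n][m]
```

For example, `nw_score("uptr", "unique_ptr")` matches all four pattern characters and pays for six skipped word characters, giving 4·2 − 6 = 2.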
Disallow Mismatch
The first interesting thing about clangd’s implementation is that it disallows mismatches. That is, “unw_ptr” will never match “unique_ptr”. At first this seems counterintuitive: how is fuzzy matching useful if it disallows any typo? The reason is simple: this is not spell correction, where you must allow mismatches. For auto-completion, if we allow mismatches, we will inevitably produce many false positives. So it all depends on the use case: the chance that a user fails to find their word because they can’t tell they made a typo is not something we should account for. The same goes for file fuzzy matching, where one or two mismatches are allowed because remembering all the file names can be hard — a user might type “_” where the file name actually contains “-”. Also, if you use a fuzzy matcher to find historical commands in your terminal, it probably allows some mismatches. Anyway, turning off mismatches is easy: you simply forbid the algorithm from taking the match step when the two characters differ.
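In the toy DP above, this amounts to making the diagonal step available only when the characters are equal — effectively a score of negative infinity for a mismatch. A sketch of that tweak (again my own illustration, with assumed score values):

```python
NEG_INF = float("-inf")

def nw_score_no_mismatch(pattern: str, word: str,
                         match: int = 2, gap: int = -1) -> float:
    n, m = len(pattern), len(word)
    dp = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j and pattern[i - 1] == word[j - 1]:
                # the match step exists only for equal characters
                dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + match)
            if j:
                dp[i][j] = max(dp[i][j], dp[i][j - 1] + gap)  # skip word char
            if i:
                dp[i][j] = max(dp[i][j], dp[i - 1][j] + gap)  # skip pattern char
    return dp[n][m]
```

Note that this rule alone still lets the algorithm *skip* a mistyped pattern character instead of matching it; fully rejecting typos also needs the next rule.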
Disallow Ignoring Pattern
The pattern refers to the input from the user. The rule is that we must consume all the characters from the user’s input: every pattern character must match something. The rationale behind this is similar: we assume the user knows what they are after. Again, an auto-completion tool is not there to let users make mistakes; its purpose is to let users type the word they want quickly. To disable this, simply disallow the case of the recurrence that skips a pattern character.
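Putting both heuristics together, the matcher only succeeds when the pattern is a subsequence of the word. Here is a sketch combining the two rules (my own illustration with assumed scores, returning `None` for “no match”):

```python
NEG_INF = float("-inf")

def fuzzy_score(pattern: str, word: str,
                match: int = 2, gap: int = -1):
    """No mismatches allowed, and every pattern character must be
    consumed; returns None when the pattern cannot be matched."""
    n, m = len(pattern), len(word)
    dp = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dp[0][j] = j * gap              # we may still skip word characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = NEG_INF
            if pattern[i - 1] == word[j - 1] and dp[i - 1][j - 1] > NEG_INF:
                best = dp[i - 1][j - 1] + match   # exact match only
            if dp[i][j - 1] > NEG_INF:
                best = max(best, dp[i][j - 1] + gap)  # skip a word character
            dp[i][j] = best                 # note: no pattern-skip case
    return None if dp[n][m] == NEG_INF else dp[n][m]
```

With this version, `fuzzy_score("uptr", "unique_ptr")` still succeeds, while `fuzzy_score("unw_ptr", "unique_ptr")` is rejected outright.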
Bonus and Penalty
Finally, we give an extra bonus/penalty based on the situation. In Clangd’s implementation, a bonus is given if the first characters of the two words match. This makes sense, since you usually don’t start typing a word from its middle. Clangd also assigns each character a special role. For example, in “get_string”, “g” and “s” will have the role HEAD because they are the heads of segments; everything else will have the role TAIL. When two HEAD characters are matched, a bonus point is awarded. On the other hand, matching characters with different roles incurs a minor penalty, suggesting to the algorithm that this is probably not how it wants to match the two words (a suboptimal match).
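A toy version of the role idea might look like the following — this is my own simplification, not clangd’s actual role tables, and it only handles `snake_case` and simple CamelCase:

```python
HEAD, TAIL, SEPARATOR = "head", "tail", "separator"

def char_roles(word: str) -> list:
    """Label each character as the head of a segment, a tail
    character, or a separator."""
    roles = []
    prev = None
    for ch in word:
        if ch in "_-":
            roles.append(SEPARATOR)
        elif ch.isupper() and (prev is None or not prev.isupper()):
            roles.append(HEAD)     # start of a CamelCase segment
        elif prev is None or prev in "_-":
            roles.append(HEAD)     # start of word, or right after a separator
        else:
            roles.append(TAIL)
        prev = ch
    return roles
```

During matching, a HEAD-to-HEAD match would earn the bonus, while a HEAD pattern character landing on a TAIL word character would take the minor penalty.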
Conclusion
OK, this is just a quick and short post. I talked briefly about the fuzzy matching algorithm and how to tune it to work effectively in an auto-completion use case. This post is totally for me personally and has zero educational purpose. If you are interested, please go look at the source code of Clangd. The documentation is very clear and explains the idea way better than I just did. After studying the algorithm for half of my day, I just felt it might be worth a post…
-
I did not try too hard on this, so I could have missed a lot of other implementations. ↩︎