Writing a Crossword Helper
The Problem
I have gotten into crossword puzzles, and I enjoy the thrill of finishing one. Sometimes, though, I'll get stuck, and there's nothing in my entire imagination that I can so much as Google to get the answer.
Take the situation below, for example. LAIC and RIME? These were both new words to me, and they were intersecting.
I was at a total loss. Almost...
The Idea
What if I wrote a program that — now hear me out — guessed for me? I could use my computer's nigh-limitless imagination and have it do the creative work of coming up with guesses while I would inspect them to evaluate their sanity.
A Solution
Iteration I
Guessing Letters
The first program I wrote was a simple backtracking script that created every word possible given a string with spaces.
So:

```
$ python guesser.py -w 'an '
['ana' ... 'anz']
```
With "LA C" and "R ME" I knew I needed only vowels, so I updated the script and got back `['laac', 'laic', ...]`.
It turned out that "RIME" looked familiar and sure enough "laic" and "rime" were the answers!
But it was a hollow victory. The app was still dumb: it didn't know English, so guessing a word with many missing characters could take a very long time. With three missing letters it would need to make 26³, or 17,576, guesses. And I would have to read them all.
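For illustration, here's a minimal sketch of that kind of brute-force filler (hypothetical; not the original script): it replaces every blank with every letter of the alphabet, so three blanks really do mean 26³ guesses.

```python
from itertools import product
from string import ascii_lowercase

def dumb_guess(pattern, alphabet=ascii_lowercase):
    """Yield every way of filling the spaces in `pattern` with letters."""
    blanks = [i for i, ch in enumerate(pattern) if ch == ' ']
    for letters in product(alphabet, repeat=len(blanks)):
        guess = list(pattern)
        for i, ch in zip(blanks, letters):
            guess[i] = ch
        yield ''.join(guess)

# Three unknown letters means 26**3 = 17,576 guesses.
print(sum(1 for _ in dumb_guess('a   ')))  # 17576
```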
I thought I could make it better.
A Solution I Can Be Proud Of
Iteration II
Use a Corpus
The problem with the dumb-guess implementation is the size of the search space (i.e., there are too many guesses). I needed to reduce the number of guesses my program makes without eliminating potential answers.
I once took a computational linguistics class and learned about the Natural Language Toolkit and corpora. NLTK is a Python library for working with natural language, and a corpus is a collection of text, often marked up (e.g., with word or paragraph boundaries).
So I pulled a corpus into my application and used it to filter the guesses.
```python
from nltk.corpus import brown
# ...
if guess in brown.words():
    print(guess)
```
Now it had better guesses, but it was very slow. The search space for me, the user, had gone way down, but the program still had to make the same number of guesses and, for each guess, look through all 1,161,192 words of the Brown Corpus.
Iteration III
Use a Trie
All I wanted to do was say whether a string was a word. NLTK corpora can do a lot more than that: some can connect a word to related words (think synonyms and roots).
My application performed a linear search when checking whether a string was a word: it looked through every single word the corpus contains and checked whether one of them matched the input string. For this problem, there's already a much faster solution: tries.
I learned about tries (also called prefix trees) from grinding Leetcode problems and their solutions. Using a trie to store the corpus lets the application check a word in time proportional to the word's length rather than the corpus size (O(k) vs. O(n)), dramatically faster than the linear scan in the first implementation.
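To make the idea concrete, here's a tiny dict-based trie sketch (not the protobuf version used later in this post). Membership testing walks exactly one node per character, which is where the O(k) lookup comes from.

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks end-of-word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def contains(trie, word):
    """Walk one edge per character; O(len(word)) regardless of corpus size."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

trie = build_trie(['laic', 'rime', 'ride'])
print(contains(trie, 'rime'))  # True
print(contains(trie, 'rims'))  # False
```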
Because I didn't want to build the trie every time I ran the program, I serialized it. For fun I tried both pickle and Protocol Buffers to store my trie. It turned out that the protobuf encoding was three times smaller than the pickle.
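The pickle side of that comparison looks like this (the protobuf side depends on the `node.proto` schema, so it's omitted here): a round-trip to bytes and back, whose length can be compared against the protobuf encoding.

```python
import pickle

# A toy dict-based trie standing in for the real corpus trie;
# '$' marks end-of-word.
trie = {'l': {'a': {'i': {'c': {'$': True}}}}}

data = pickle.dumps(trie)        # serialize to bytes
restored = pickle.loads(data)    # deserialize
assert restored == trie

print(len(data))  # size in bytes, to compare against the protobuf encoding
```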
Iteration IV: Use a Generative Algorithm
While implementing the trie I realized that my algorithm didn't have to try every letter for every unknown letter in a word. Instead it just needed to follow the tree's edges: if no edge 'e' exists from a node, the algorithm never even needs to ask whether "eee" is a word.
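Here's that pruning idea as a runnable sketch over a plain dict trie (hypothetical; the final implementation below walks protobuf nodes instead): unknown letters branch only on edges that actually exist, and known letters that have no edge kill the whole branch immediately.

```python
def guess(node, pattern, prefix=''):
    """Yield dictionary words matching `pattern`, where ' ' is a blank."""
    if not pattern:
        if node.get('$'):  # '$' marks end-of-word
            yield prefix
        return
    ch, rest = pattern[0], pattern[1:]
    if ch == ' ':
        # Blank: branch only on edges that exist in the trie.
        for edge, child in node.items():
            if edge != '$':
                yield from guess(child, rest, prefix + edge)
    elif ch in node:
        # Known letter: follow its edge, or prune the branch entirely.
        yield from guess(node[ch], rest, prefix + ch)

# Toy trie containing 'laic' and 'lace'.
trie = {'l': {'a': {'i': {'c': {'$': True}}, 'c': {'e': {'$': True}}}}}
print(list(guess(trie, 'la c')))  # ['laic']
```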
So here's my final implementation:
Trie Implementation
```python
# Trie.py
import node_pb2

class Trie:
    def __init__(self, words=None, root=None):
        # Mutable default arguments are shared across calls,
        # so build fresh defaults here instead.
        self.root = root if root is not None else node_pb2.Node()
        for word in (words or []):
            if word.isalpha():
                self.add_word(word)

    def add_word(self, word):
        return self._add_word(word.lower(), self.root)

    def _add_word(self, word, root):
        if word == '':
            root.term = True
            return root
        fl = ord(word[0])
        # Accessing a protobuf map of messages creates the child node
        # if it doesn't exist yet, so no explicit insert is needed.
        return self._add_word(word[1:], root.children[fl])

    def __getitem__(self, word):
        return self.get_word(word.lower(), self.root)

    def get_word(self, word, root):
        if word == '':
            return root.term
        fl = ord(word[0])
        # Check membership first so lookups don't materialize empty children.
        if fl not in root.children:
            return False
        return self.get_word(word[1:], root.children[fl])
```
```python
# guesser.py
import node_pb2
from Trie import Trie

class Guesser(object):
    def __init__(self, trie_path='words.proto'):
        n = node_pb2.Node()
        with open(trie_path, 'rb') as f:
            n.ParseFromString(f.read())
        self.trie_corpus = Trie(root=n)

    def guess(self, word, curr='', idx=0, root=None):
        if root is None:
            root = self.trie_corpus.root
        if len(curr) == len(word):
            if root.term:
                print(curr)
            return
        if word[idx] == ' ':
            # Unknown letter: branch only on edges that exist.
            for c in root.children.keys():
                self.guess(word, curr + chr(c), idx + 1,
                           root=root.children[c])
        else:
            # Known letter: follow its edge, or prune this branch.
            fl = ord(word[idx])
            if fl not in root.children:
                return
            self.guess(word, curr + word[idx], idx + 1,
                       root=root.children[fl])

    def get_n_lengths(self, n, root=None, res=None, prefix=''):
        if root is None:
            root = self.trie_corpus.root
        if res is None:
            res = []
        if n == 0:
            if root.term:
                res.append(prefix)
            return res
        for child in root.children.keys():
            self.get_n_lengths(n - 1, root=root.children[child],
                               res=res, prefix=prefix + chr(child))
        return res
```
© Sam Berg