Writing a Crossword Helper
The Problem
I have gotten into crossword puzzles, and I enjoy the thrill of finishing one. Sometimes, though, I'll get stuck, and there's nothing in my entire imagination that I can so much as Google to get the answer.
Take the situation below, for example. LAIC and RIME? These were both new words to me, and they were intersecting.
I was at a total loss. Almost...
The Idea
What if I wrote a program that — now hear me out — guessed for me? I could use my computer's nigh-limitless imagination and have it do the creative work of coming up with guesses while I would inspect them to evaluate their sanity.
A Solution
Iteration I
Guessing Letters
The first program I wrote was a simple backtracking script that created every word possible given a string with spaces.
So:

```
$ python guesser.py -w 'an '
['ana' ... 'anz']
```
With "LA C" and "R ME" I knew I needed only vowels, so I updated the script and got back `['laac', 'laic', ...]`.
It turned out that "RIME" looked familiar and sure enough "laic" and "rime" were the answers!
But it was a hollow victory. The app was still dumb: it didn't know English, so guessing a word with many missing characters could take a very long time. With three missing letters it would need to make 26³, or 17,576, guesses. And I would have to read them all.
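For illustration, here's a minimal sketch of that kind of brute-force filler (hypothetical; not the original script): it replaces every blank with every letter of the alphabet, so three blanks really do mean 26³ guesses.

```python
from itertools import product
from string import ascii_lowercase

def dumb_guess(pattern, alphabet=ascii_lowercase):
    """Yield every way of filling the spaces in `pattern` with letters."""
    blanks = [i for i, ch in enumerate(pattern) if ch == ' ']
    for letters in product(alphabet, repeat=len(blanks)):
        guess = list(pattern)
        for i, ch in zip(blanks, letters):
            guess[i] = ch
        yield ''.join(guess)

# Three unknown letters means 26**3 = 17,576 guesses.
print(sum(1 for _ in dumb_guess('a   ')))  # 17576
```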
I thought I could make it better.
A Solution I Can Be Proud Of
Iteration II
Use a Corpus
The problem with the dumb-guess implementation is the size of the search space (i.e., there are too many guesses). I needed to reduce the number of guesses my program makes without eliminating potential answers.
I once took a computational linguistics class and learned about the Natural Language Toolkit and corpora. NLTK is a Python library for working with natural language, and a corpus is a collection of text, often marked up (e.g., with word or paragraph boundaries).
So I pulled a corpus into my application and used it to filter the guesses.
```python
from nltk.corpus import brown
# ...
if guess in brown.words():
    print(guess)
```
Now it had better guesses, but it was very slow. The search space for me, the user, had gone way down, but the program still had to make the same number of guesses and, for each guess, look through all 1,161,192 words of the Brown Corpus.
Iteration III
Use a Trie
All I wanted to do was say whether a string was a word. NLTK corpora can do a lot more than that: some can connect a word to related words (think synonyms and roots).
My application performed a linear search when checking whether a string was a word: it looked through every single word the corpus contains and checked whether one of them matched the input string. For this problem, there's already a much faster solution: tries.
I learned about tries (also called prefix trees) from grinding Leetcode problems and their solutions. Using a trie to store the corpus lets the application check a word in time proportional to the word's length rather than the corpus size (O(k) vs. O(n)), dramatically faster than the linear scan in the first implementation.
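To make the idea concrete, here's a tiny dict-based trie sketch (not the protobuf version used later in this post). Membership testing walks exactly one node per character, which is where the O(k) lookup comes from.

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks end-of-word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def contains(trie, word):
    """Walk one edge per character; O(len(word)) regardless of corpus size."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

trie = build_trie(['laic', 'rime', 'ride'])
print(contains(trie, 'rime'))  # True
print(contains(trie, 'rims'))  # False
```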
Because I didn't want to build the trie every time I ran the program, I serialized it. For fun I tried both pickle and Protocol Buffers to store my trie. It turned out that the protobuf encoding was three times smaller than the pickle.
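The pickle side of that comparison looks like this (the protobuf side depends on the `node.proto` schema, so it's omitted here): a round-trip to bytes and back, whose length can be compared against the protobuf encoding.

```python
import pickle

# A toy dict-based trie standing in for the real corpus trie;
# '$' marks end-of-word.
trie = {'l': {'a': {'i': {'c': {'$': True}}}}}

data = pickle.dumps(trie)        # serialize to bytes
restored = pickle.loads(data)    # deserialize
assert restored == trie

print(len(data))  # size in bytes, to compare against the protobuf encoding
```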
Iteration IV: Use a Generative Algorithm
While implementing the trie I realized that my algorithm didn't have to try every letter for every unknown letter in a word. Instead it just needed to follow the tree's edges: if no edge 'e' exists from a node, the algorithm never even needs to ask whether "eee" is a word.
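Here's that pruning idea as a runnable sketch over a plain dict trie (hypothetical; the final implementation below walks protobuf nodes instead): unknown letters branch only on edges that actually exist, and known letters that have no edge kill the whole branch immediately.

```python
def guess(node, pattern, prefix=''):
    """Yield dictionary words matching `pattern`, where ' ' is a blank."""
    if not pattern:
        if node.get('$'):  # '$' marks end-of-word
            yield prefix
        return
    ch, rest = pattern[0], pattern[1:]
    if ch == ' ':
        # Blank: branch only on edges that exist in the trie.
        for edge, child in node.items():
            if edge != '$':
                yield from guess(child, rest, prefix + edge)
    elif ch in node:
        # Known letter: follow its edge, or prune the branch entirely.
        yield from guess(node[ch], rest, prefix + ch)

# Toy trie containing 'laic' and 'lace'.
trie = {'l': {'a': {'i': {'c': {'$': True}}, 'c': {'e': {'$': True}}}}}
print(list(guess(trie, 'la c')))  # ['laic']
```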
So here's my final implementation:
Trie Implementation
```python
# Trie.py
import node_pb2

class Trie:
    def __init__(self, words=None, root=None):
        # Mutable default arguments are shared across calls,
        # so build fresh defaults here instead.
        self.root = root if root is not None else node_pb2.Node()
        for word in (words or []):
            if word.isalpha():
                self.add_word(word)

    def add_word(self, word):
        return self._add_word(word.lower(), self.root)

    def _add_word(self, word, root):
        if word == '':
            root.term = True
            return root
        fl = ord(word[0])
        # Accessing a protobuf map of messages creates the child node
        # if it doesn't exist yet, so no explicit insert is needed.
        return self._add_word(word[1:], root.children[fl])

    def __getitem__(self, word):
        return self.get_word(word.lower(), self.root)

    def get_word(self, word, root):
        if word == '':
            return root.term
        fl = ord(word[0])
        # Check membership first so lookups don't materialize empty children.
        if fl not in root.children:
            return False
        return self.get_word(word[1:], root.children[fl])
```
```python
# guesser.py
import node_pb2
from Trie import Trie

class Guesser(object):
    def __init__(self, trie_path='words.proto'):
        n = node_pb2.Node()
        with open(trie_path, 'rb') as f:
            n.ParseFromString(f.read())
        self.trie_corpus = Trie(root=n)

    def guess(self, word, curr='', idx=0, root=None):
        if root is None:
            root = self.trie_corpus.root
        if len(curr) == len(word):
            if root.term:
                print(curr)
            return
        if word[idx] == ' ':
            # Unknown letter: branch only on edges that exist.
            for c in root.children.keys():
                self.guess(word, curr + chr(c), idx + 1,
                           root=root.children[c])
        else:
            # Known letter: follow its edge, or prune this branch.
            fl = ord(word[idx])
            if fl not in root.children:
                return
            self.guess(word, curr + word[idx], idx + 1,
                       root=root.children[fl])

    def get_n_lengths(self, n, root=None, res=None, prefix=''):
        if root is None:
            root = self.trie_corpus.root
        if res is None:
            res = []
        if n == 0:
            if root.term:
                res.append(prefix)
            return res
        for child in root.children.keys():
            self.get_n_lengths(n - 1, root=root.children[child],
                               res=res, prefix=prefix + chr(child))
        return res
```
© Sam Berg