Coding Mistakes

Appearances can be deceiving. Just because a block of code looks good doesn’t mean that it’s error free; the code could be inefficient, or confusing to read, or it could leave the door wide open for malicious hacking attacks. A good programmer wants to be vigilant against such coding faux pas!

Since the best way to learn is by example, we’ll start by diving into the nitty-gritty of a small 50-line Python script that acts as a simple “proof-reader”.

First, the program downloads a list of English-language words from the Internet and stores them in a list, creating a simple dictionary. Next, it prompts the user to enter the name of a text file. The program then reads through this file and puts the new words in a different list. Finally, the program works through the user’s words one by one to see if they’re in the dictionary. If not, it highlights the offending mistakes in red and displays the text back to the user.

There are a few subtleties:

  1. The dictionary words are stored in a set, not a list. The only important difference (for us) is that lists have a specific order while sets don’t, and this makes it considerably faster to check for a particular word inside a set.
  2. The program “sanitizes” the user’s words, which is a fancy way to say it cleans them up by removing attached punctuation marks and converting capital letters into lowercase equivalents. Otherwise words like “However,” and “as:” would be flagged as mistakes since they don’t exactly match the entries “however” and “as”.
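
Both subtleties can be sketched in a few lines. The snippet below is a minimal illustration written for Python 3 (the script in this article targets Python 2). The helper name `sanitize` and the tiny three-word dictionary are assumptions made for the example; they are not part of the original script.

```python
# Punctuation marks to strip, matching the list used in the script.
PUNCTUATION = [".", ",", "!", "?", ";", ":", "(", ")"]

def sanitize(word):
    """Hypothetical helper: lowercase a word and remove the marks listed above."""
    cleaned = word.lower()
    for symbol in PUNCTUATION:
        cleaned = cleaned.replace(symbol, "")
    return cleaned

# A tiny stand-in dictionary (the real one is downloaded from the internet).
dictionary_words = set(["however", "as", "code"])

print(sanitize("However,"))                    # however
print(sanitize("as:") in dictionary_words)     # True
print(sanitize("Codde!") in dictionary_words)  # False
```

Membership tests on a set rely on hashing, so `in` takes roughly the same time however many words the set holds, whereas a list has to be scanned element by element.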

Take a peek at the code and see if you can spot where it goes wrong, or where potential problems might crop up. Don’t be intimidated if there are lines you don’t instantly understand! If a function is unfamiliar, pop it into a search engine and see if you can find an explanation. You can also play around with removing lines to see how their absence changes the program.

The Code

import urllib2
from termcolor import colored

running = True
while running:

    # Import dictionary words from internet link
    filepath = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
    file = urllib2.urlopen(filepath).read()
    words = file.split("\n") # "\n" is a newline character (the enter key). Here, we're separating words on different lines

    # Sanitize inputs
    for i in range(len(words)):
        words[i] = words[i].lower() # make capital letters lowercase

    words = set(words) # convert from a list to a set

    # Import words from local file
    filepath2 = raw_input("Enter name of file: ") # prompt user for name of file
    print("") # Add space in console display
    file2 = open(filepath2, 'r').read()
    words2 = file2.split(" ") # separate words using spaces

    # Create list of sanitized inputs
    words3 = [] # create a new, empty list
    for i in range(len(words2)):
        words3.append(words2[i].lower())

        # Remove punctuation marks
        punctuation = [".", ",", "!", "?", ";", ":", "(", ")"]
        for symbol in punctuation:
            words3[i] = words3[i].replace(symbol, "")

    # Check for mistakes using dictionary, highlight mistakes in red in the original list
    for i in range(len(words3)):
        if words3[i] not in words:
            words2[i] = colored(words2[i], "red")

    # Display results in terminal
    for word in words2:
        print(word),
    print("\n")

    # Ask user if they'd like to proof another file
    answer = raw_input("Would you like to proofread another file? (y/n) ")
    if answer == "y":
        running = True
    else:
        running = False

    print("")

Unfortunately, this particular script may not run in an online Python IDE. It is written for Python 2: “urllib2”, which we use to download content from the internet, ships with Python 2’s standard library, but the “termcolor” package, which we use to colour the text, must be installed separately, and many online programming environments don’t offer access to custom packages.

To run the script yourself, I recommend downloading the free version of PyCharm. Other IDEs like Komodo Edit or Atom work just as well, though each has a slightly different method for installing packages such as termcolor. Instructions for PyCharm are linked under “Learn More” at the end of this article.

Finally, make sure that the text file you want to open is in the same folder as your script; otherwise you’ll have to specify the entire file path.

When the script runs, it prints your text back to the terminal with any unrecognized words highlighted in red.

Problem #1: Naming Conventions

Variable names should be descriptive. If your data structure is a list of words, then “words” is a decent name choice, although “words_list” might be even better since it tells us about the structure and the content of the variable. This is extra important in Python, which is dynamically typed: a variable carries no type declaration, so its name is the main clue to what it holds. However, even with that slight improvement, we still get code like this:

for i in range(len(words_list3)):
    if words_list3[i] not in words_list:
        words_list2[i] = colored(words_list2[i], "red")

Convoluted enough to confuse even experienced programmers!

When you have many similar variables, it’s a good idea to make their names extremely distinct, even if it means using long, awkward names. Consider:

for i in range(len(sanitized_user_words)):
    if sanitized_user_words[i] not in dictionary_words:
        original_user_words[i] = colored(original_user_words[i], "red")

Much more intuitive! Even though list interaction and indexing are tricky to read, we can tell the program is checking our sanitized words against the dictionary words, and if there’s a mistake, it updates the user’s original words to a red-coloured version of themselves.
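
If the indexing still feels hard to follow, one alternative is to walk the two lists in step and build a new list rather than modifying the original in place. The sketch below is written for Python 3 and is only one possible approach. The `highlight` function is a hypothetical stand-in for termcolor’s `colored` (it wraps the word in a plain ANSI escape code) so that the example runs without extra packages, and the sample words are invented.

```python
def highlight(word):
    """Hypothetical stand-in for colored(word, "red"): wraps the word in ANSI red."""
    return "\033[31m" + word + "\033[0m"

dictionary_words = {"the", "cat", "sat"}
original_user_words = ["The", "caat", "sat."]
sanitized_user_words = ["the", "caat", "sat"]

# zip() pairs each original word with its sanitized twin, so no index is needed.
checked_words = []
for original, sanitized in zip(original_user_words, sanitized_user_words):
    if sanitized not in dictionary_words:
        checked_words.append(highlight(original))  # unknown word: mark it
    else:
        checked_words.append(original)             # known word: keep as typed

print(" ".join(checked_words))
```

Only “caat” is wrapped in the colour codes; the other two words pass through unchanged.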

Are there any other variable names you think could be improved?

Problem #2: Exception Handling

Whenever you open a file or connect to a website there’s always a chance something will go wrong. The file could be corrupted, or your internet connection could unexpectedly break. In these scenarios the program crashes without explanation, leaving the user annoyed and confused.

In programming, risky operations should always be enclosed in try and except clauses. The “try” block contains the risky code, and if an exception (error) occurs during runtime, the program switches over to the code in the “except” block. If everything goes hunky-dory, the “except” block is ignored.

Typically, the code in the “except” block is used to clean up the program (close files, databases, and network connections) and print a user-friendly message that explains the error or the crash. More advanced programs might attempt to diagnose and fix the error.

The modified code looks like this:

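import sys  # needed for sys.exit()

try:
    dict_file = urllib2.urlopen(dict_filepath).read()
except Exception:  # unlike a bare "except:", this still lets Ctrl+C stop the program
    print("An error occurred while trying to import dictionary words.")
    sys.exit(1)  # a non-zero exit status signals that the program failed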
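
To watch the try/except flow without needing a network connection, here is a minimal sketch for Python 3 that applies the same idea to a local file. The function name `read_text_file` and the file names are assumptions made for the example. It catches `OSError` (the family of errors Python 3 raises when a file cannot be opened or read) rather than every possible exception.

```python
import os
import tempfile

def read_text_file(path):
    """Hypothetical helper: return the file's contents, or None if it cannot be read."""
    try:
        # "with" closes the file automatically, even if read() raises an error.
        with open(path, "r") as handle:
            return handle.read()
    except OSError as error:
        # Reached only when opening or reading fails (e.g. the file is missing).
        print("Could not read {}: {}".format(path, error))
        return None

# Demonstration: one file that exists and one that does not.
folder = tempfile.mkdtemp()
good_path = os.path.join(folder, "essay.txt")
with open(good_path, "w") as handle:
    handle.write("hello world")

good = read_text_file(good_path)                               # try block succeeds
missing = read_text_file(os.path.join(folder, "missing.txt"))  # except block runs
print(good)
print(missing)
```

Returning `None` instead of exiting leaves the decision to the caller, who can ask for a different file name or stop the program.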

Problem #3: Validating User Input

Most users are smart, reasonable people, but there’s always that one guy who inputs a letter when the program asks for a number. As a rule of thumb, always validate user input. Are there any letters in that phone number? Does the user-entered e-mail address contain the mandatory “@” symbol? At the very least, input validation prevents awkward crashes, and at best, it stops hackers from using a program to break into a computer.

In our proof-reader, we ask the user to enter the name of the file. Perhaps we want to check the file extension — does the name end in “.txt”, currently the only type of text file we can handle?

user_filepath = raw_input("Enter name of file: ") # prompt user for name of file
if not user_filepath.endswith(".txt"):
    print("Invalid file name")
    sys.exit(1) # a non-zero exit status signals an error
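
The other checks mentioned above can be sketched the same way. The functions below are deliberately rough illustrations for Python 3, with hypothetical names and invented sample inputs. Real phone numbers and e-mail addresses have more rules than this.

```python
import os

def looks_like_phone_number(text):
    """True if only digits remain once spaces and dashes are removed."""
    digits = text.replace(" ", "").replace("-", "")
    return digits.isdigit()

def looks_like_email(text):
    """Very rough check: exactly one "@" with something on both sides."""
    name, at, domain = text.partition("@")
    return bool(name) and at == "@" and bool(domain) and "@" not in domain

def is_readable_text_file(path):
    """True if the name ends in .txt and the file actually exists."""
    return path.endswith(".txt") and os.path.isfile(path)

print(looks_like_phone_number("555-0100"))   # True
print(looks_like_phone_number("555-O1OO"))   # False (letter O, not zero)
print(looks_like_email("user@example.com"))  # True
print(looks_like_email("user.example.com"))  # False
print(is_readable_text_file("essay.pdf"))    # False
```

Checking that the file exists before opening it catches typos early, though the open call can still fail for other reasons.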

Problem #4: What goes in the While Loop?

If you run the program yourself, you’ll notice that creating the dictionary takes a lot of time: a few seconds, to be precise. To be fair, we’re connecting to a server over a network and downloading hundreds of thousands of words. A few seconds may not seem like much, but it’s irritating to the user waiting in limbo.

So here’s the question — do we really need to make the dictionary from scratch at every iteration? Or would the program work equally well if we make the dictionary once and then start the loop?
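
One way to answer the question is to count how often the slow step runs. The sketch below (Python 3) replaces the download with a hypothetical `build_dictionary` function that only increments a counter, and uses three invented one-line “files”. It illustrates the idea and is not taken from the script.

```python
download_count = 0  # how many times the "slow" step has run

def build_dictionary():
    """Stand-in for the slow download; it only counts how often it is called."""
    global download_count
    download_count += 1
    return {"hello", "world"}

def find_mistakes(text, dictionary_words):
    """Return the words in text that are not in the dictionary."""
    return [word for word in text.split(" ") if word not in dictionary_words]

texts = ["hello world", "hello wurld", "helo world"]

# Version 1: the dictionary is rebuilt inside the loop, once per file.
download_count = 0
for text in texts:
    dictionary_words = build_dictionary()
    find_mistakes(text, dictionary_words)
calls_inside_loop = download_count

# Version 2: the dictionary is built once, before the loop starts.
download_count = 0
dictionary_words = build_dictionary()
results = [find_mistakes(text, dictionary_words) for text in texts]
calls_before_loop = download_count

print(calls_inside_loop, calls_before_loop)  # 3 1
print(results)  # [[], ['wurld'], ['helo']]
```

Both versions find the same mistakes, but the second runs the slow step only once however many files the user checks.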

Costly, slow operations may be unavoidable, but your job as a programmer is to minimize them as much as possible. This could mean being clever about when you start your loop. It could mean “caching” — keeping big files stored locally on your computer so that you don’t have to re-fetch them from a website or database. Or, in extreme cases, it could mean finding a more efficient programming language or algorithm.

Typically, “costly” programming operations involve networking, reading and writing to databases, manipulating graphics or iterating through lists with lots of elements.
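
Caching can be sketched in a few lines. In this Python 3 example the function `load_dictionary`, the cache file name, and the `fake_fetch` stand-in for the real download are all assumptions made for illustration. The slow step runs only when no local copy exists yet.

```python
import os
import tempfile

def load_dictionary(cache_path, fetch_words):
    """Return the word list text, calling fetch_words() only if no cache file exists."""
    if os.path.isfile(cache_path):
        with open(cache_path, "r") as cache_file:
            return cache_file.read()       # cache hit: use the local copy
    text = fetch_words()                   # cache miss: do the slow step
    with open(cache_path, "w") as cache_file:
        cache_file.write(text)             # save a copy for next time
    return text

fetch_calls = []

def fake_fetch():
    """Hypothetical stand-in for the real download; records each call."""
    fetch_calls.append(1)
    return "apple\nbanana\ncherry"

cache_path = os.path.join(tempfile.mkdtemp(), "words_cache.txt")
first = load_dictionary(cache_path, fake_fetch)   # fetches and saves
second = load_dictionary(cache_path, fake_fetch)  # reads the saved copy
print(len(fetch_calls))  # 1
print(first == second)   # True
```

A real cache also needs a way to refresh itself when the online word list changes, which this sketch leaves out.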

Problem #5: Hard-Coding

“Hard-coding” means using a value directly inside your code instead of encapsulating that value inside a variable. For example:

words = file.split("\n")

Instead of:

separator = "\n"
words = file.split(separator)

This can lead to problems when expanding a program, especially when there are numerous (sometimes thousands of!) files of code. What if you change your program to download the dictionary from a different source, only this source separates its words using semicolons? Since hard-coding is subtle, the offending lines can easily get lost. Programmers might even forget that the hard-coded value exists!
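
To make the semicolon scenario concrete, here is a small Python 3 sketch. The function `parse_words` and the sample strings are hypothetical. With the separator held in a variable, switching to a differently formatted source means changing one line.

```python
def parse_words(raw_text, separator):
    """Split raw text on the separator and return a set of lowercase words."""
    pieces = raw_text.split(separator)
    return set(piece.strip().lower() for piece in pieces if piece.strip())

# One word per line, as in the current source.
separator = "\n"
print(sorted(parse_words("Apple\nBanana\n", separator)))      # ['apple', 'banana']

# A hypothetical new source that uses semicolons: only this line changes.
separator = ";"
print(sorted(parse_words("Apple;Banana;Cherry", separator)))  # ['apple', 'banana', 'cherry']
```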

As a general rule, use variables for any and all values that might change in a program’s future. A helpful practice is to cluster variables at the top of your file, so that they’re easy to find and you can’t forget about them. In the proof-reader, you might include the following after importing packages but before entering the while loop:

filepath = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
punctuation = [".", ",", "!", "?", ";", ":", "(", ")"]
separator = "\n"
highlight_color = "red"

That said, hard-coding isn’t always bad. Here’s one example:

file2 = open(filepath2, 'r').read()

The second argument — ‘r’ — indicates that we’re opening the file for a reading operation and not a writing operation. This is a core feature of the program; it’s not going to change no matter how many tools we add to the software. In this case, using an extra variable would just clutter up the code.

What if you can guarantee that the code is never, ever going to change? If you’re writing a quick script at a hackathon, or as a proof-of-concept, then hard-coding is convenient and saves time. But in a professional setting, software always changes and evolves. Better to build good habits and avoid hard-coding from the start.

Updated Code Listing

import urllib2
import sys
from termcolor import colored

dict_filepath = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
punctuation = [".", ",", "!", "?", ";", ":", "(", ")"]
highlight_color = "red"

print("Welcome to the simple proof-reader! Please wait while we prepare the program.\n")

# Import dictionary words from internet link
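try:
    dict_file = urllib2.urlopen(dict_filepath).read()
except Exception:  # unlike a bare "except:", this still lets Ctrl+C stop the program
    print("An error occurred while trying to import dictionary words.")
    sys.exit(1)  # a non-zero exit status signals that the program failed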

dictionary_words = dict_file.split("\n")  # separate words on different lines

# Sanitize inputs
for i in range(len(dictionary_words)):
    dictionary_words[i] = dictionary_words[i].lower()  # make capital letters lowercase

dictionary_words = set(dictionary_words)  # convert from a list to a set

""" START THE MAIN LOOP """
running = True
while running:

    # Import words from local file
    user_filepath = raw_input("Enter name of file: ") # prompt user for name of file
    if not user_filepath.endswith(".txt"):
        print("Invalid file name")
        sys.exit(1) # a non-zero exit status signals an error

    print("") # Add space in console display

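    try:
        user_file = open(user_filepath, 'r').read()
    except IOError:  # raised in Python 2 when the file is missing or unreadable
        print("Oops! That file could not be opened.")
        sys.exit(1)  # a non-zero exit status signals that the program failed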

    original_user_words = user_file.split(" ") # separate words using spaces

    # Create list of sanitized inputs
    sanitized_user_words = [] # create a new, empty list
    for i in range(len(original_user_words)):
        sanitized_user_words.append(original_user_words[i].lower())

        # Remove punctuation marks
        for symbol in punctuation:
            sanitized_user_words[i] = sanitized_user_words[i].replace(symbol, "")

    # Check for mistakes using dictionary, highlight mistakes in red in the original list
    for i in range(len(sanitized_user_words)):
        if sanitized_user_words[i] not in dictionary_words:
            original_user_words[i] = colored(original_user_words[i], highlight_color)

    # Display results in terminal
    for word in original_user_words:
        print(word),
    print("\n")

    # Ask user if they'd like to proof another file
    proof_again = raw_input("Would you like to proofread another file? (y/n) ")
    if proof_again == "y":
        running = True
    else:
        running = False

    print("")

Conclusion

What a makeover!

While we could still suggest more changes, most of these “improvements” are a question of taste. For example, we could make our prompts more descriptive and user-friendly. We could also split the main body of code into separate functions, which might help when expanding our program by allowing us to reuse snippets of code. Some programmers would claim this makes the code more confusing. Other programmers would find it easier to read.

Writing good code is an art as well as a science. If all these “best practices” seem overwhelming, just remember: if your code works, then you’ve already done the hard part! The rest is just polish.

Learn More

Exception Handling in Python

http://www.techbeamers.com/use-try-except-python/
https://www.tutorialspoint.com/python/python_exceptions.htm

Tips for better programming

https://www.sitepoint.com/10-tips-for-better-coding/
https://www.codementor.io/learn-programming/steve-klabniks-9-words-advice-programming-beginners
https://realpython.com/python-beginner-tips/

How to choose variable names

https://www.w3resource.com/python/python-variable.php
http://wiki.c2.com/?GoodVariableNames
https://www.python.org/dev/peps/pep-0008/#naming-conventions

Installing packages in PyCharm

https://www.jetbrains.com/help/pycharm/installing-uninstalling-and-upgrading-packages.html

Free Python IDEs

https://www.jetbrains.com/pycharm/download/#section=windows
https://www.activestate.com/komodo-ide/downloads/edit
https://atom.io/