Writing an S-Expression parser in Ruby

S-Expressions are a very versatile syntax for defining data structures and program code, used by most Lisp-derived programming languages.

When experimenting with new languages, S-Expressions are fantastic because they are extremely flexible yet require only minimal effort to parse. In this article I am going to cover the basic principles of creating a robust S-Expression parser in Ruby.

Note: if you just want a complete and working S-Expression parser library for Ruby, please check out Sexpistol.

Note: Since writing this article I have found a more performant and concise parsing method using StringScanner. I will be writing an updated post shortly. If you would like to have a look at the completed code using the new method, check it out on GitHub. The code in this article is still relevant as an introduction.

Our parser

One of the first things we should have a look at is a simple example of some program code written as an S-Expression:

(define test (lambda () (
  begin 
  (print "test")
  (print 1)
)))

(test)

The first step in most parsers is to break the input text up into ‘tokens’, and to do that we have to know what the possible tokens are in our target grammar. In our case we have it easy, as there are only five distinct kinds of token that we care about initially:

Opening parentheses: (
Closing parentheses: )
Symbols: define, begin, print, etc.
String literals: "test", "foo", etc.
Integer literals: 2, 5, 99, etc.

The first thing we should notice is that one of these tokens is a ‘special case’: it can contain sequences of characters that we would normally treat as tokens in and of themselves. This complicated token is the string literal.

In order to simplify the problem for ourselves the first thing we are going to do is find all of the string literals, copy them into an array, and then replace them with a special placeholder string:

def extract_string_literals( string )
  string_literal_pattern = /"([^"\\]|\\.)*"/
  string_replacement_token = "___+++STRING_LITERAL+++___"
  # Find and extract all the string literals
  string_literals = []
  string.gsub(string_literal_pattern) {|x| string_literals << x}
  # Replace all the string literals with our special placeholder token
  string = string.gsub(string_literal_pattern, string_replacement_token)
  # Return the modified string and the array of string literals
  return [string, string_literals]
end

After running a string through this method we have a new string in which we do not have to worry about any special cases. There is no possibility that a parenthesis is going to be detected inside a string literal, which makes our life much easier.

Next let’s split up the input into its individual tokens. We can do this quite simply by adding spaces around each opening or closing parenthesis and then splitting the string up on whitespace like so:
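
The two gsub passes can also be folded into one. The sketch below is a hypothetical variant (the name extract_string_literals_single_pass and the two constants are assumptions, not part of the original code) that records each literal and substitutes the placeholder inside the same block. It should behave the same way for the kind of input covered in this article.

```ruby
# Hypothetical single-pass variant of extract_string_literals.
STRING_LITERAL_PATTERN = /"(?:[^"\\]|\\.)*"/
STRING_REPLACEMENT_TOKEN = "___+++STRING_LITERAL+++___"

def extract_string_literals_single_pass(string)
  string_literals = []
  # The block's return value becomes the replacement text, so each literal
  # is recorded and swapped for the placeholder in one pass.
  stripped = string.gsub(STRING_LITERAL_PATTERN) do |literal|
    string_literals << literal
    STRING_REPLACEMENT_TOKEN
  end
  [stripped, string_literals]
end

stripped, literals = extract_string_literals_single_pass('(print "a (b)")')
# stripped => "(print ___+++STRING_LITERAL+++___)"
# literals => ["\"a (b)\""]
```

Note that the parentheses inside the literal never reach the tokenizer, which is the whole point of this step.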

def tokenize_string( string )
  string = string.gsub("(", " ( ")
  string = string.gsub(")", " ) ")
  token_array = string.split(" ")
  return token_array
end
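
To make the effect of the padding concrete, here is a compact sketch of the same idea together with its result on a small input. The name tokenize_compact is made up for this illustration, and it pads both kinds of parenthesis with a single regular expression rather than two calls.

```ruby
# Hypothetical compact variant of tokenize_string.
def tokenize_compact(string)
  # Pad every parenthesis with spaces, then split on runs of whitespace.
  string.gsub(/[()]/) { |paren| " #{paren} " }.split
end

tokens = tokenize_compact("(define test (lambda ()))")
# tokens => ["(", "define", "test", "(", "lambda", "(", ")", ")", ")"]
```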

Now that we have an array of tokens, we need to add our string literals back into their correct places in the array. We can do this by iterating over our array of tokens and detecting our string replacement token ___+++STRING_LITERAL+++___:

def restore_string_literals( token_array, string_literals )
  return token_array.map do |x|
    if(x == '___+++STRING_LITERAL+++___')
      # Since we've detected that a string literal needs to be replaced we
      # will grab the first available string from the string_literals array
      string_literals.shift
    else
      # This is not a string literal so we need to just return the token as it is
      x
    end
  end
end
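
One thing to be aware of is that shift empties the string_literals array that the caller passed in. If that matters, a possible alternative is to walk the literals with an external enumerator instead. The following is only a sketch: restore_without_mutation and PLACEHOLDER are names invented for this example.

```ruby
PLACEHOLDER = "___+++STRING_LITERAL+++___"

# Hypothetical non-mutating variant of restore_string_literals.
def restore_without_mutation(token_array, string_literals)
  # An external enumerator hands out the literals in order without
  # modifying the array that was passed in.
  literals = string_literals.each
  token_array.map { |token| token == PLACEHOLDER ? literals.next : token }
end

literals = ['"hi"']
restored = restore_without_mutation(["(", "print", PLACEHOLDER, ")"], literals)
# restored => ["(", "print", "\"hi\"", ")"]
# literals still contains its one element afterwards
```

If there are more placeholders than literals, literals.next raises StopIteration, whereas the shift-based version would quietly insert nil.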

Finally we have a complete and clean array of tokens from our S-Expression! The next step is parsing these tokens into whatever form we require. In this case we are going to turn each token into an object of its relevant Ruby class: symbols to Symbol, integers to Integer (Fixnum in Ruby versions before 2.4), and strings to String. To do this we first need to be able to detect each type of token, so we will define four simple methods to help us:

# A helper method to take care of the repetitive stuff for us
def is_match?( string, pattern)
  match = string.match(pattern)
  return false unless match
  # Make sure that the matched pattern consumes the entire token
  match[0].length == string.length
end

# Detect a symbol
def is_symbol?( string )
  # Anything other than parentheses, single or double quote and commas
  return is_match?( string, /[^\"\'\,\(\)]+/ ) 
end

# Detect an integer literal
def is_integer_literal?( string )
  # Any number of numerals optionally preceded by a plus or minus sign
  return is_match?( string, /[\-\+]?[0-9]+/ ) 
end

# Detect a string literal
def is_string_literal?( string )
  # Any characters except double quotes 
  # (except if preceded by a backslash), surrounded by quotes
  return is_match?( string, /"([^"\\]|\\.)*"/) 
end
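
The length comparison in is_match? is there to ensure the pattern covers the whole token. Another way to express the same requirement is to anchor each pattern with \A and \z. The sketch below assumes Ruby 2.4 or later (for Regexp#match?) and uses a made-up classify_token method and constant names. The order of the checks matters, because an integer such as 42 also satisfies the symbol pattern.

```ruby
# Anchored versions of the three patterns: \A and \z force the match to
# span the entire token, so no separate length check is needed.
INTEGER_PATTERN = /\A[-+]?[0-9]+\z/
STRING_PATTERN  = /\A"(?:[^"\\]|\\.)*"\z/
SYMBOL_PATTERN  = /\A[^"',()]+\z/

# Hypothetical helper that names the kind of token it was given.
def classify_token(token)
  # Integers are tested first because they would also pass as symbols.
  return :integer if INTEGER_PATTERN.match?(token)
  return :string  if STRING_PATTERN.match?(token)
  return :symbol  if SYMBOL_PATTERN.match?(token)
  nil
end

classify_token("-42")    # => :integer
classify_token('"test"') # => :string
classify_token("print")  # => :symbol
classify_token("(")      # => nil
```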

Now we are able to detect all of the tokens in our grammar! Next we need to convert them to our target Ruby objects:

def convert_tokens( token_array )
  converted_tokens = []
  token_array.each do |t|
    converted_tokens << "(" and next if( t == "(" )
    converted_tokens << ")" and next if( t == ")" )
    converted_tokens << t.to_i and next if( is_integer_literal?(t) )
    converted_tokens << t.to_sym and next if( is_symbol?(t) )
    # Caution: eval runs the literal as Ruby code, so "#{...}" inside a
    # string will be executed. Only use this on input you trust.
    converted_tokens << eval(t) and next if( is_string_literal?(t) )
    # If we haven't recognized the token by now we need to raise
    # an error as there are no more rules left to check against!
    raise ArgumentError, "Unrecognized token: #{t}"
  end
  return converted_tokens
end
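
Because eval treats the string literal as Ruby source, anything inside an interpolation such as "#{...}" would be executed. For untrusted input, one option is to strip the quotes and resolve the escapes by hand. This is a minimal sketch, not the original article's approach: unescape_string_literal and the ESCAPES table are assumptions, and only a handful of escape sequences are recognized.

```ruby
# The few escape sequences this sketch understands. Any other escaped
# character is kept as the character itself.
ESCAPES = { "n" => "\n", "t" => "\t", '"' => '"', "\\" => "\\" }.freeze

# Hypothetical replacement for eval(t) on a string literal token.
def unescape_string_literal(token)
  body = token[1..-2] # drop the surrounding double quotes
  body.gsub(/\\(.)/) { ESCAPES.fetch($1, $1) }
end

unescape_string_literal('"say \"hi\""') # => "say \"hi\""
unescape_string_literal('"#{1 + 1}"')   # => "\#{1 + 1}" (not evaluated)
```

The trade-off is that escapes eval would have understood, such as \u sequences, are not interpreted here.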

Now we have a nice array of tokens that represent our data! Unfortunately it is still a flat array, so our last step needs to be to re-create the structure defined in the input text using arrays. To do this in an elegant way we are going to use a recursive method:

def re_structure( token_array, offset = 0 )
  struct = []
  while( offset < token_array.length )
    if(token_array[offset] == "(")
      # Multiple assignment from the array that re_structure() returns
      offset, tmp_array = re_structure(token_array, offset + 1)
      struct << tmp_array
    elsif(token_array[offset] == ")")
      break
    else
      struct << token_array[offset]
    end
    offset += 1
  end
  return [offset, struct]
end
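
The recursive method accepts unbalanced input without complaint: a missing or surplus parenthesis simply produces a differently shaped result. As an illustration of how that could be detected, here is a sketch of an iterative variant that keeps an explicit stack of open lists. The name nest_tokens is hypothetical. Like the recursive version, it cannot tell a converted string whose content is exactly "(" or ")" apart from a parenthesis token.

```ruby
# Hypothetical stack-based alternative to re_structure that reports
# unbalanced parentheses.
def nest_tokens(tokens)
  stack = [[]] # the bottom entry collects the top-level expressions
  tokens.each do |token|
    case token
    when "("
      stack.push([]) # start a new nested list
    when ")"
      raise ArgumentError, "Unexpected closing parenthesis" if stack.length == 1
      finished = stack.pop
      stack.last << finished # attach the completed list to its parent
    else
      stack.last << token
    end
  end
  raise ArgumentError, "Missing closing parenthesis" unless stack.length == 1
  stack.first
end

nest_tokens(["(", :a, "(", :b, 1, ")", ")"]) # => [[:a, [:b, 1]]]
```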

Now that we have all the pieces we can parse S-Expressions by simply chaining the methods together!

def parse( string )
  string, string_literals = extract_string_literals( string )
  token_array = tokenize_string( string )
  token_array = restore_string_literals( token_array, string_literals )
  token_array = convert_tokens( token_array )
  s_expression = re_structure( token_array )[1]
  return s_expression
end

Feeding in an S-Expression like:

(this (is a number 1( example "s-expression")))

produces:

[[:this, [:is, :a, :number, 1, [:example, "s-expression"]]]]

Obviously we would like to encapsulate this code into a class for neatness and convenience. If you don’t want to do this yourself, you can check out Sexpistol, a pre-implemented and tested S-Expression parser library for Ruby.
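
The note at the top of this article mentions StringScanner as a more concise approach. The following is not that updated code, only a rough sketch of what a StringScanner-based tokenizer might look like. The method name scan_tokens and the exact patterns are assumptions. Because the scanner consumes string literals as single tokens, the placeholder trick is no longer needed.

```ruby
require "strscan"

# Hypothetical tokenizer built on StringScanner from the standard library.
def scan_tokens(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace only separates tokens
    elsif (paren = scanner.scan(/[()]/))
      tokens << paren
    elsif (literal = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      tokens << literal # the whole string literal, quotes included
    elsif (atom = scanner.scan(/[^\s()"]+/))
      tokens << atom # symbols and integers, still as raw text
    else
      raise ArgumentError, "Unexpected input at position #{scanner.pos}"
    end
  end
  tokens
end

scan_tokens('(print "a (b)" 1)')
# => ["(", "print", "\"a (b)\"", "1", ")"]
```

The resulting array has the same shape as the one produced by the extract, tokenize, and restore steps, so it could be passed on to the conversion step unchanged.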

For the full example code please see this Gist