Reading CSV Files Using the Scanner Class

For the third time in the last couple of years, I’ve found myself reaching for Apple’s Scanner class. Every time I use it I find myself tripping over the deceptively simple API.

This is my attempt to document it in a way that I understand for future reference.

(You’re welcome, future me.)

The Problem

Let’s say we have a string which represents the contents of a CSV file:

import Foundation

let csvString = """
"Apple, Inc.","One Infinite Loop",Cupertino,CA,95014
"Google, Inc.","1600 Amphitheatre Parkway","Mountain View",CA,94043
"""

We want to break the cells up into individual strings. At first glance, this seems like it should be easy:

let lines = csvString.components(separatedBy: .newlines)
var cells: [String] = []
for line in lines {
	cells.append(contentsOf: line.components(separatedBy: ","))
}
print(cells)

Done.

Except, not. Here’s the output:

["\"Apple", " Inc.\"", "\"One Infinite Loop\"", "Cupertino", "CA", "95014", "\"Google", " Inc.\"", "\"1600 Amphitheatre Parkway\"", "\"Mountain View\"", "CA", "94043"]

It’s broken "Apple, Inc." into two cells: "Apple and  Inc.". It’s done the same with "Google, Inc.". It’s kept the double quotes everywhere and, for the company names, each pair of quotes is now split across two cells.

It’s a mess, in other words.

We need some rules here:

  1. If the cell separator is in between two quotes, then it should be ignored.
  2. Otherwise, it should be used to separate the cells into individual strings.
  3. If a cell is surrounded by quotes, then those quotes should be discarded.

These rules already make what we need more complicated than what the basic String methods can provide.
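To make the three rules concrete before bringing in Scanner, here is a minimal sketch of them as a plain character-by-character loop. This is not the approach this post takes, and splitCells and its insideQuotes flag are names I've made up purely for illustration.

```swift
import Foundation

// Hypothetical helper: applies the three rules with a simple "inside quotes" flag.
func splitCells(_ line: String, separator: Character = ",") -> [String] {
    var cells: [String] = []
    var current = ""
    var insideQuotes = false

    for character in line {
        if character == "\"" {
            // Rule 3: quotes are discarded, they only flip the flag.
            insideQuotes.toggle()
        } else if character == separator && !insideQuotes {
            // Rule 2: a separator outside quotes ends the current cell.
            cells.append(current)
            current = ""
        } else {
            // Rule 1: a separator inside quotes is kept as ordinary text.
            current.append(character)
        }
    }
    // Whatever is left after the last separator is the final cell.
    cells.append(current)
    return cells
}

print(splitCells("\"Apple, Inc.\",\"One Infinite Loop\",Cupertino,CA,95014"))
```

It works for these inputs, but every new rule means another flag and another branch, which is where a purpose-built parser starts to look attractive.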

We could use regular expressions if we wanted to spend days writing something that we would never be able to understand ever again, or…

Break Out the Scanner!

Scanner (documentation here) is a string parser that is built into Foundation and is simple but incredibly powerful. Pass it a string, then tell it what to look out for and it will go through the string character by character.

It’ll report back when it finds something interesting and give us the opportunity to do something with what it has found. This could be a single character or a string of characters, depending on how far it has moved since our last query.

The methods to interact with it are simple to understand at an abstract level, but it is worth breaking things down into stages to fully understand what’s going on as it moves through the string.
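As a first taste of that movement, here is a small sketch that walks a scanner across a two-cell string. It assumes the newer String-returning methods, scanUpToString(_:) and scanString(_:), which need macOS 10.15 / iOS 13 or later; the rest of this post uses the older NSString-based variants.

```swift
import Foundation

// Assumes the String-returning Scanner methods (macOS 10.15 / iOS 13 or later).
let scanner = Scanner(string: "Cupertino,CA")

let city = scanner.scanUpToString(",")   // everything before the first comma
let comma = scanner.scanString(",")      // consumes the comma itself
let state = scanner.scanUpToString(",")  // no comma left, so this runs to the end

print(city ?? "nil", comma ?? "nil", state ?? "nil", scanner.isAtEnd)
```

Each call either moves the scanner's position forward or leaves it where it was, and isAtEnd only becomes true once the last call has consumed the remaining characters.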

We’ll start with a new Playground and create a function stub that looks like this:

import Cocoa
func scan( _ line : String, separatedBy separator : String = "," ) -> [String] {
	// Placeholder so the stub compiles; the real body is filled in below.
	return []
}

(I’m using a macOS Playground, but as Scanner is a Foundation class it should work equally well in an iOS Playground if import Cocoa is swapped for import Foundation.)

In order to make sure that my CSV scanner is comprehensive, I’ll create a small test suite of lines covering different CSV formatting variants:

// 1. 
let cellCount = 6
// 2. 
let line1 = "Item1,Item2,Item3,Item4,Item5,Item6"
let line2 = "\"Item1\",,Item3,Item4,,"
let line3 = "\"\",\"\",Item3,Item4,Item5,"
let line4 = "\"\",\"\",\"Item3\",\"Item4\",,Item6"
let line5 = "\"\",\"I,t,e,m,2,\",\"Item3\",\"Item4\",,"
let line6 = "Item1,,\"Item3\",Item4,,"
let line7 = ",\"\",\"Item3\",Item4,,"
let lines = [line1, line2,line3,line4, line5,line6,line7]
  1. Every line will have the same number of cells. This variable is used to verify that all the cells are represented in the array returned from the scan function.
  2. Declare all the individual lines we want to test and put them in an array.
// 1. 
for (idx, line) in lines.enumerated() {
	let cells = scan(line)
	// 2.
	assert(cells.count == cellCount, "Cell count should always be \(cellCount) (current count: \(cells.count) on line \(idx + 1))")
	// 3.
	for (cellIdx, cell) in cells.enumerated() {
		if line == line1 {
			assert(cell == "Item\(cellIdx + 1)", "Item is not correct")
		}
		if line == line2 {
			switch cellIdx {
			case 1,4,5:
				assert(cell == "", "Item is not correct on line \(idx + 1): \(cell)")
			default:
				assert(cell == "Item\(cellIdx + 1)", "Item is not correct on line \(idx + 1): \(cell)")
			}
		}
		// ... Rest of the line tests go here
	}
	// Every assertion held for this line.
	print("Line \(idx + 1) passed OK")
}
  1. Go through each line and run the scan function.
  2. Assert that the correct number of cells is being returned using the cellCount variable. One of the main reasons why the incorrect number might be returned is if the scanner function doesn’t include empty strings in the returned array to represent cells with nothing in them.
  3. Finally, go through each cell and ensure that strings within the cells are correct based on the cell index. Each cell should be empty or include a variation on the string Item<cellNumber> at the correct cell index (e.g. Item1 at cell index 0, Item2 at cell index 1, etc.).
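
An alternative to the per-cell switch is to write out the expected cells for each line and compare whole arrays. The sketch below does that for three of the test lines. The expectedCells table and the failingLines helper are hypothetical names, and the expected values are my own reading of the rules above. The parser is passed in as a closure so the sketch does not depend on the finished scan function.

```swift
import Foundation

// Expected cells for three of the test lines, written out by hand.
let expectedCells: [String: [String]] = [
    "Item1,Item2,Item3,Item4,Item5,Item6": ["Item1", "Item2", "Item3", "Item4", "Item5", "Item6"],
    "\"Item1\",,Item3,Item4,,": ["Item1", "", "Item3", "Item4", "", ""],
    "\"\",\"I,t,e,m,2,\",\"Item3\",\"Item4\",,": ["", "I,t,e,m,2,", "Item3", "Item4", "", ""],
]

// Hypothetical helper: returns the lines whose parsed cells differ from the table.
func failingLines(using parse: (String) -> [String]) -> [String] {
    return expectedCells
        .filter { parse($0.key) != $0.value }
        .map { $0.key }
        .sorted()
}

// The naive split only gets the quote-free line right.
let naiveFailures = failingLines { $0.components(separatedBy: ",") }
print("Naive split fails on \(naiveFailures.count) of \(expectedCells.count) lines")
```

Comparing arrays checks the cell count and the cell contents in one go, at the cost of a less specific failure message.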

Right now, it fails at the first assertion.

Time to fill out the function!

Low-Hanging Fruit

Let’s deal with the easy situation first. If there are no double quotes in the line, then we can safely break the string apart at the separator using the naive solution.

Under the scan(_ line : String, separatedBy separator: String = ",") -> [String] { declaration, we’ll add:

guard line.range(of: "\"") != nil else {
	return line.components(separatedBy: separator)
}

Check the log and we see:

Line 1 passed OK

Easy!

The Hard Part

That only works for the simplest CSV files. So, under the guard statement, we’ll set up some variables:

// 1. 
var cells: [String] = []
// 2. 
var value: NSString?
// 3. 
var separatorValue: NSString?
// 4. 
let textScanner = Scanner(string: line)
  1. This empty array will be populated with strings representing each cell value in the CSV input (probably, hopefully).
  2. This NSString will be passed into the inout parameter of the scanner functions as a reference. The scanner will initialise it and populate it with the characters it has scanned when the functions are called, or leave it nil if no characters have been scanned.
  3. I want to keep track of the values of the cells separately from the separators. This second NSString will allow me to keep hold of any separators the scanner finds as it passes through the string as there are some situations where this information is going to be important.
  4. Finally, initialise the scanner with the current line of the CSV (a String, passed into my scan(_:separatedBy:) function).

Let the scanning begin!

Under the variables, we’ll start with this:

while !textScanner.isAtEnd {
	value = nil
	// Loop contents go here...
}
return cells

The scanner works by stepping through each character in a string sequentially. There’s an internal pointer that keeps track of its position and, when it reaches the end, the isAtEnd property is set to true.

While it is not at the end, we should reset the value variable to nil at each pass through the loop.

Right now, the scanner will never reach the end as we don’t call any methods that force the pointer to move forward.

Let’s do that next. Inside the while loop, directly under value = nil, we’ll add:

// 1.
if textScanner.scanString("\"", into: nil) {
	// 2.
	textScanner.scanUpTo("\"", into: &value)
	// 3. 
	textScanner.scanString("\"", into: nil)
	// 4.
	if let hasValue = value as String? {
		cells.append(hasValue)
	}
}
  1. The scanString(_:into:) method looks at the next characters and, if they match the passed string, moves past them and populates the second parameter (an inout NSString object) with the match. It only returns true if the next characters match the passed string. This means we can use it to check if the next character in the line is a double quote. If it is, we need to start capturing the string inside the quote. By passing nil as the second parameter, we are telling the scanner to ignore this first double quote character and move its location forward by one.
  2. The scanUpTo(_:into:) method starts from the current location of the scanner and moves through the string until it reaches the first string parameter (or the end of the string). If it scanned any characters along the way, it returns true and populates the second inout NSString parameter with what it found between where it started and the position of the double quote character. If the double quote is the very next character, nothing is scanned, it returns false and value is left untouched. Note that, because it is scanning up to the character(s), it does not include the character(s), which is why…
  3. We need to disappear the second double quote. Again, passing nil as the second parameter is effectively saying ignore this character and move the location forward by one.
  4. If value is not nil, this means that characters were found between the first and second double quotes and we should add them to our cells array. If it is nil, that means there were some double quotes but nothing between them.
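
Here are the first three steps in isolation, run against a single quoted cell that contains a comma. This is a sketch using the newer String-returning methods (macOS 10.15 / iOS 13 or later) rather than the NSString variants above, and quotedScanner, quotedCell and remainder are names made up for the example.

```swift
import Foundation

// Assumes the String-returning Scanner methods (macOS 10.15 / iOS 13 or later).
let quotedScanner = Scanner(string: "\"Apple, Inc.\",Cupertino")
var quotedCell: String?

if quotedScanner.scanString("\"") != nil {            // step 1: eat the opening quote
    quotedCell = quotedScanner.scanUpToString("\"")   // step 2: capture up to the closing quote
    _ = quotedScanner.scanString("\"")                // step 3: eat the closing quote
}

// What is left over for the separator handling to deal with.
let remainder = String(quotedScanner.string[quotedScanner.currentIndex...])

print(quotedCell ?? "nil", "|", remainder)
```

The comma inside the quotes ends up in the cell, while the comma after the closing quote is still waiting in the remainder.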

We’ll handle this nil case when we deal with the cell separators below.

Speaking of which, let’s continue:

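// 1.
if textScanner.scanUpTo(separator, into: &value) {
	// 2. Reset first: a failed scan leaves the previous separator in place.
	separatorValue = nil
	textScanner.scanString(separator, into: &separatorValue)
	// 3.
	cells.append(value! as String)
	// 4.
	if textScanner.isAtEnd, separatorValue != nil {
		cells.append("")
	}
} else {
	// 5.
	let foundSeparator = textScanner.scanString(separator, into: nil)
	// 6. 
	if value == nil {
		cells.append("")
	}
	// 7. Only a separator we actually consumed can mean a trailing empty cell.
	if foundSeparator, textScanner.isAtEnd {
		cells.append("")
	}
}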

1. If there has been a double quoted item or the location is at an empty item (for example: the string “,,,,” is a valid CSV line, but it is all empty items), then this will be false because the next character is the separator (and therefore the scanner hasn’t moved forward). If it’s true, then there are characters before the separator that do not have double quotes around them.

Scan them into the value string. It will still be nil at this point: the only way it could have been populated is by a double quoted item, and in that case the scanner would now be sitting on a separator, so textScanner.scanUpTo(separator, into: &value) would have returned false.

2. This time we need to keep hold of the scan result, as it will tell us if we’re at the end of the line or not.

3. We’ll add everything up to the separator to the cells array.

4. If the separatorValue is not nil and we’re at the end of the line, then that means the last item of this line was a single separator character (e.g. a comma). We therefore need to add an empty item to represent the last (empty) cell (e.g. a CSV that looks like this "Item 1", "Item 2",, represents 4 cells with the last two being empty).

5. If the next character was a separator, then we first need to ignore and move past this separator.

6. If we’ve reached this far, and value is still nil, then either the first item was empty (,Item2,Item3,Item4) or the string between the quotes was empty ("","Item 2","Item 3","Item 4"). Either way, add a blank item to the cells array to represent it (this takes care of the empty double quotes case mentioned earlier).

7. Finally, we just ate a separator which pushed the scanner forward one character. If that takes us to the end of the line, then that means the last character on the line was a separator and we need to account for that by adding an empty string to the cells array.
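
The trailing-separator rule in steps 4 and 7 is the same counting the naive split already does for lines without quotes, which makes for an easy sanity check. A quick sketch (the variable names are just for illustration):

```swift
import Foundation

// Two filled cells followed by two empty ones.
let trailing = "Item 1,Item 2,,".components(separatedBy: ",")
print(trailing.count)   // 4

// Four separators and nothing else: five empty cells.
let allEmpty = ",,,,".components(separatedBy: ",")
print(allEmpty.count)   // 5
```

In other words, a line with n separators outside quotes should always come back as n + 1 cells, and the scanner has to add the empty strings itself to keep that promise.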

Run through all the test strings, and the console shows this:

Line 1 passed OK
Line 2 passed OK
Line 3 passed OK
Line 4 passed OK
Line 5 passed OK
Line 6 passed OK
Line 7 passed OK

Excellent!

A lot of dealing with Scanner is imagining where the location is in the string and how we might move it forward through the string.

This is complicated by the fact that all of this takes place in a loop and so we have to imagine where the location is at the end of every loop and what characters might be next up when the loop restarts.

Despite the necessity of these mental gymnastics, the results of using Scanner are often robust and cover many of the edge cases that simpler solutions might miss.
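
For comparison, here is a rough sketch of the same idea written against the newer String-returning Scanner methods (scanString(_:) and scanUpToString(_:), macOS 10.15 / iOS 13 or later). It is not the function built in this post: scanCells is a hypothetical name and the loop is restructured to read one cell, then one separator, per pass.

```swift
import Foundation

// A sketch only: scanCells is a hypothetical name, and this assumes the
// String-returning Scanner methods (macOS 10.15 / iOS 13 or later).
func scanCells(_ line: String, separatedBy separator: String = ",") -> [String] {
    let scanner = Scanner(string: line)
    var cells: [String] = []

    while true {
        let cell: String
        if scanner.scanString("\"") != nil {
            // Quoted cell: keep everything up to the closing quote, then drop the quote.
            cell = scanner.scanUpToString("\"") ?? ""
            _ = scanner.scanString("\"")
        } else {
            // Unquoted cell: keep everything up to the next separator (possibly nothing).
            cell = scanner.scanUpToString(separator) ?? ""
        }
        cells.append(cell)

        // No separator after the cell means the line is finished.
        if scanner.scanString(separator) == nil { break }
    }
    return cells
}

print(scanCells("\"Apple, Inc.\",\"One Infinite Loop\",Cupertino,CA,95014"))
```

Because every pass appends exactly one cell and then either consumes a separator or stops, the trailing empty cell falls out of the loop shape instead of needing its own isAtEnd checks.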

I just have to hope that this explanation will still be clear enough for me six months from now when I next have to do some string parsing…

The Playground for this post is available on GitHub.