Monday, January 18, 2010

sscanf Tips and Tricks

It's rare to see sscanf used for string parsing in production code. I'm not entirely sure why it's shunned, because sscanf can parse a wide variety of strings in a single line of code, and the alternatives are typically far more verbose and complicated. Parsing character strings letter-by-letter? Yuck. I think the issue is that its powers are hard to wield, and there aren't many "Beginner's Guide to Sscanf Mastery" tutorials like the ones we have for regular expressions. Also, the format language of sscanf is quirkier and less expressive than regular expressions, so lots of users give up on it early. Don't be so hasty.

Before we begin, things to know about sscanf:

• sscanf's return value is the number of input items it successfully matched and stored (or EOF if the input ran out before the first conversion). This differs from sprintf, which returns the number of generated characters. The number of matches doesn't seem too useful on the surface--shouldn't it just equal the number of specifiers in the format string? It does only when the whole input matches; sscanf stops at the first mismatch, so a smaller value tells you how far parsing got. (And if you need to know the number of characters sscanf has consumed, you can use %n for this; %n stores the count without adding to the return value. Note that Microsoft's policy of disabling %n by default for security reasons applies to the printf family; as far as I know, %n in sscanf works in Visual C++ without any extra setup.)

• sscanf requires the literal, non-whitespace text in your format string to match the input string exactly. For instance, if your string is "You scored 1000 points!" and your format is "You scored %d points!", your number will be read in without incident. But if your format string is "YOU SCORED %d POINTS!", sscanf will give up before it reads the "1000" value, because the strings up until that point don't match up. It's case-sensitive, and unlike regular expressions, there's no option to disable case sensitivity. To work around this limitation, make your input string all lowercase and write the format string in lowercase too. If you're using Visual C++, _strlwr does this in place. Other platforms will need to make do with a standards-compliant _strlwr equivalent, such as the inelegant but compact line below (it needs <algorithm>, <cctype>, and <cstring>, and the unsigned char parameter avoids undefined behavior for negative char values):
std::transform(myString, myString + strlen(myString), myString, [](unsigned char c) { return (char)std::tolower(c); });

• sscanf treats any and all whitespace as equal. In other words, any run of whitespace in the format string matches any run of whitespace in the input, of any kind and in any quantity, including none at all. So if your input string is "This\n is a \t\t\t test", it will match the format string "This is \n\r\t a test". No problem. If you need precise whitespace matching, you can sometimes use %c (which, unlike most specifiers, does not skip leading whitespace) and then verify the character manually, but generally sscanf isn't going to be very convenient for you. Fortunately, in most cases, you don't care about the precise type or quantity of whitespace. It's a little odd that the API is completely strict about case and completely non-strict about whitespace, but that's how it works.

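• sscanf's %s specifier skips any leading whitespace and then stops reading as soon as it encounters whitespace again, so it reads exactly one word. This may be useful in some cases, but in many other cases I've found it to be unhelpful: a value such as "John Smith" comes back as just "John". Also, give %s a maximum width (for example, %63s for a 64-byte buffer), because without one it can write past the end of your buffer.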

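• One of sscanf's least-known tricks is that it can do character groups, just like regular expressions. For instance, "%[A-Za-z]" is like %s, but will stop reading as soon as it encounters any non-alphabetic character. (Strictly speaking, the C standard leaves the meaning of a dash inside the brackets implementation-defined, but the common implementations treat it as a range.) The caret inverts the effect; "%[^=:]" will keep reading and consuming any character, whitespace included, until it finds an equals sign or colon. Unlike %s, a character group does not skip leading whitespace, and it fails if it can't match at least one character.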
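
To make the points above concrete, here's a minimal sketch rather than a definitive implementation. The helper names (parseScore, firstWord, parseKeyValue), the "key = value" input shape, and the buffer sizes are all my own assumptions for illustration; none of them come from a library. It shows the return value and %n working together, %s stopping at whitespace, and character groups picking up where %s falls short.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: parses "You scored <n> points!".
// Succeeds only if the number was read AND the trailing literal matched.
bool parseScore(const char* input, int* score, int* consumed) {
    *consumed = -1;  // stays -1 if sscanf gives up before it reaches %n
    // %n stores the number of characters consumed so far.
    // It does not count toward the return value, so a full match returns 1.
    int matched = std::sscanf(input, "You scored %d points!%n", score, consumed);
    return matched == 1 && *consumed >= 0;
}

// Hypothetical helper: %s skips leading whitespace, then reads one word.
// The width 63 leaves room for the terminating '\0' in a 64-byte buffer.
std::string firstWord(const char* input) {
    char word[64] = "";
    std::sscanf(input, "%63s", word);
    return word;
}

// Hypothetical helper: splits "key = value" or "key: value".
bool parseKeyValue(const char* input, std::string* key, std::string* value) {
    char k[32] = "";
    char v[64] = "";
    // " %31[^=: ]" : skip leading whitespace, read up to '=', ':' or a space
    // " %*[=:]"    : skip whitespace, consume the separator without storing it
    // " %63[^\n]"  : skip whitespace, read the rest of the line, spaces included
    // The suppressed %*[...] is not counted, so a full match returns 2.
    if (std::sscanf(input, " %31[^=: ] %*[=:] %63[^\n]", k, v) != 2) {
        return false;
    }
    *key = k;
    *value = v;
    return true;
}
```

With these helpers, parseScore accepts "You scored 1000 points!" (and the same text with extra tabs or newlines between the words), but rejects "YOU SCORED 1000 POINTS!" because of the case mismatch. firstWord("John Smith") yields only "John", while parseKeyValue("name = John Smith", ...) yields the key "name" and the full value "John Smith". Checking %n after a trailing literal is the part worth remembering: the return value alone can't tell you whether " points!" actually matched, since literals aren't counted.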

(to be continued)
