[wp-hackers] Portable tokenising from the shell
David Anderson
david at wordshell.net
Sat Dec 1 16:57:08 UTC 2012
Hi,
Some of you may remember an earlier discussion about parsing JSON
output, which is one of the formats available from api.wordpress.org.
JSON was the format most suitable for parsing portably from a
Bourne/Bash shell.
This guy has implemented such a parser already:
http://github.com/dominictarr/JSON.sh
One part of the parser is the tokeniser, which splits the JSON up into
tokens:
local ESCAPE='(\\[^u[:cntrl:]]|\\u[0-9a-fA-F]{4})'
local CHAR='[^[:cntrl:]"\\]'
local STRING="\"$CHAR*($ESCAPE$CHAR*)*\""
local NUMBER='-?(0|[1-9][0-9]*)([.][0-9]*)?([eE][+-]?[0-9]*)?'
local KEYWORD='null|false|true'
local SPACE='[[:space:]]+'
grep -E -o "$STRING|$NUMBER|$KEYWORD|$SPACE|."
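It's an interesting use of grep; basically it matches *everything* (the
final "." catches any character the other patterns don't), but splits
the input up into separate tokens: strings, numbers, keywords, runs of
whitespace, and single characters.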
However... my research shows that the "-o" switch (which causes grep to
output only each matched portion, one per line) is not part of POSIX,
but is nonetheless available in GNU (hence Linux and Cygwin),
Free/Net/OpenBSD and Mac OS X - but not in Solaris (either in the grep
in /usr/bin or in /usr/xpg4/bin).
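Whether a particular grep has "-o" can also be probed at run time. This
is only a sketch: the function name grep_has_o is made up for the
example, and it assumes a grep without "-o" either fails or prints
something other than the bare match.

```shell
# Hypothetical helper: succeeds only if this grep understands -o.
# With -o, grep prints just the matched text ("b"); a grep without
# the switch should print an error or something else instead.
grep_has_o() {
  [ "$(printf 'ab\n' | grep -o 'b' 2>/dev/null)" = "b" ]
}
```

It could be used as "if grep_has_o; then ...; else ...; fi" to choose
between the grep pipeline and some fallback.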
So it's not fully portable. My question: does anyone have sufficient
sed or awk skills to advise me how to reproduce the above in one of
those? As I said, it's a tokeniser that splits the input into the
discrete chunks indicated. I'm an awk novice. I'm trying to write code
that assumes only POSIX or, failing that, the common subset of
GNU/BSD/Mac/Solaris. If I fail I can use various hacks (e.g. search for
perl and use that if found, otherwise search for PHP and use that), but
it'd be nice if I didn't have to resort to multiple code paths in that
way.
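One possible shape for an awk version, as a rough sketch rather than a
tested drop-in: build the same pattern, anchor it at the start of the
remaining input, and repeatedly peel off whatever match() finds. The
function name tokenise is made up for the example. It assumes an awk
that has match() and POSIX character classes (on Solaris that probably
means nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk, but I have
not verified that). The {4} interval is written out by hand, because
not every awk accepts interval expressions.

```shell
# Hypothetical helper: read JSON on stdin, print one token per line.
tokenise() {
  awk '
  BEGIN {
    # {4} is spelled out: interval expressions are not safe in every awk.
    hex     = "[0-9a-fA-F]"
    escape  = "(\\\\[^u[:cntrl:]]|\\\\u" hex hex hex hex ")"
    char    = "[^[:cntrl:]\"\\\\]"
    string  = "\"" char "*(" escape char "*)*\""
    number  = "-?(0|[1-9][0-9]*)([.][0-9]*)?([eE][+-]?[0-9]*)?"
    keyword = "null|false|true"
    space   = "[[:space:]]+"
    # Anchor at the start so match() always takes the next token.
    token   = "^(" string "|" number "|" keyword "|" space "|.)"
  }
  {
    rest = $0
    while (rest != "") {
      # Fall back to one character if nothing matches.
      len = (match(rest, token) && RLENGTH > 0) ? RLENGTH : 1
      print substr(rest, 1, len)
      rest = substr(rest, len + 1)
    }
  }'
}
```

Used as "tokenise < file.json", this prints whitespace runs as tokens
too, just as the grep version does, so they would still need filtering
out afterwards.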
Many thanks,
David
--
WordShell - WordPress fast from the CLI - www.wordshell.net