Friday, January 24, 2014

Windows Powershell - Extract Json property values

Hi guys,

I've had to extract json property values, from a CSV document. Sounds bizarre, right? We've got the CSV file exported from a database, where a JSON was stored in one of the columns. As the JSON is not any close to native to relational database, we couldn't query the JSON column to extract data from it.

The problem

Can be restated as "How can you extract JSON property values from any text file" or even generalized "How to extract anything from a large file". Our requirement would be 'do this fast and easy in Windows environment and be able to process any text file'.

Sample data

We'll be using this short and simple json file for our example:


client.js
[
    {
        "firstName": "A",
        "clientId": 2
    },
    {
        "firstName": "B",
        "clientId": 14
    },
    {
        "firstName": "C",
        "clientId": 15
    },
    {
        "firstName": "D",
        "clientId": 3
    }
]

Decompose

Let's decompose the problem into few simple steps:


  • Searching through text is a complex operation. The most common approach would be to use Regular Expressions as they provide flexibility and support in practically any programming language.
    • The json language complies to several simple rules, so extracting from it would be easy. A regex for extracting the client id is: '"clientId": (?(\d+))'. It matches the "clientId" property name, then a colon :, then a new  named group starts, which expects a number (\d+).
  • Instant coding and execution leads us to some scripting language. The scripting language does not need compilation, creating new projects or anything alike which could slow us down.
    • We'll be using Powershell to execute the regular expression, as it's a common language for executing such activities on Windows.

Script

# set our client id regex, which can extract client id fields from the text
$clientIdRegex = '"clientId": (?(\d+))'

# search through the file with "select-string", providing our regex as a pattern and find all matches
# pipe the matches to a foreach, which traverses them and then pipe each match to a foreach which extracts
# the group clientid and get's its value.
select-string -Path .\clients.js -Pattern $clientIdRegex -AllMatches | ForEach-Object {$_.Matches } | ForEach-Object { $_.groups["clientid"].Value }

And that's the whole script - just two lines!

# If we need to output the values in a file, we need just to pipe the output into a file like this: | ForEach-Object { $_.groups["clientid"].Value } > output.txt

Note: In most powershell scripts, you would see the usage of the alias "%" instead of ForEach-Object. Here it's used to illustrate the idea.

In conclusion

Knowing script languages is powerful tool in our utility belt. They can facitilate actions which should be done only once and there's no need for a maintainable and robust solution. Plus, they are easily edited and executed.