Fun with PowerShell: Deduplicating Records

Let's deduplicate the list of Avengers movies we got in the previous post.

In the previous post, we got a list of Avengers movies from the Open Movie Database and printed it onto the screen.

$movies = Invoke-RestMethod "http://www.omdbapi.com/?apikey=$key&s=Avengers"

$movies.search | Format-List
Title  : The Avengers
Year   : 2012
imdbID : tt0848228
Type   : movie
Poster : https://m.media-amazon.com/images/M/MV5BNDYxNjQyMjAtNTdiOS00NGYwLWFmNTAtNThmYjU5ZGI2YTI1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg

Title  : Avengers: Age of Ultron
Year   : 2015
imdbID : tt2395427
Type   : movie
Poster : https://m.media-amazon.com/images/M/MV5BMTM4OGJmNWMtOTM4Ni00NTE3LTg3MDItZmQxYjc4N2JhNmUxXkEyXkFqcGdeQXVyNTgzMDMzMTg@._V1_SX300.jpg

Title  : Avengers: Infinity War
Year   : 2018
imdbID : tt4154756
Type   : movie
Poster : https://m.media-amazon.com/images/M/MV5BMjMxNjY2MDU1OV5BMl5BanBnXkFtZTgwNzY1MTUwNTM@._V1_SX300.jpg

Title  : The Avengers
Year   : 1998
imdbID : tt0118661
Type   : movie
Poster : https://m.media-amazon.com/images/M/MV5BYWE1NTdjOWQtYTQ2Ny00Nzc5LWExYzMtNmRlOThmOTE2N2I4XkEyXkFqcGdeQXVyNjUwNzk3NDc@._V1_SX300.jpg

Title  : The Avengers: Earth's Mightiest Heroes
Year   : 2010–2012
imdbID : tt1626038
Type   : series
Poster : https://m.media-amazon.com/images/M/MV5BYzA4ZjVhYzctZmI0NC00ZmIxLWFmYTgtOGIxMDYxODhmMGQ2XkEyXkFqcGdeQXVyNjExODE1MDc@._V1_SX300.jpg

Title  : Ultimate Avengers
Year   : 2006
imdbID : tt0491703
Type   : movie
Poster : https://m.media-amazon.com/images/M/MV5BNDFmZTkxMjktMzRiYS00YzMwLWFhZDctOTQ2N2NlOTAyZDJhXkEyXkFqcGdeQXVyNjgzNDU2ODI@._V1_SX300.jpg

Title  : Ultimate Avengers II
Year   : 2006
imdbID : tt0803093
Type   : movie
Poster : https://m.media-amazon.com/images/M/MV5BZjI3MTI5ZTYtZmNmNy00OGZmLTlhNWMtNjZiYmYzNDhlOGRkL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_SX300.jpg

Title  : The Avengers
Year   : 1961–1969
imdbID : tt0054518
Type   : series
Poster : https://m.media-amazon.com/images/M/MV5BZWI4ZWM4ZWQtODk1ZC00MzMxLThlZmMtOGFmMTYxZTAwYjc5XkEyXkFqcGdeQXVyMTk0MjQ3Nzk@._V1_SX300.jpg

Title  : Avengers Assemble
Year   : 2013–
imdbID : tt2455546
Type   : series
Poster : https://m.media-amazon.com/images/M/MV5BMTY0NTUyMDQwOV5BMl5BanBnXkFtZTgwNjAwMTA0MDE@._V1_SX300.jpg

Title  : Avengers Assemble
Year   : 2013–
imdbID : tt2455546
Type   : series
Poster : https://m.media-amazon.com/images/M/MV5BMTY0NTUyMDQwOV5BMl5BanBnXkFtZTgwNjAwMTA0MDE@._V1_SX300.jpg

This is great, but the real power of PowerShell comes from the ability to manipulate these results. Let's extract just the title from each element of the list.

$movies.Search | foreach Title
The Avengers
Avengers: Age of Ultron
Avengers: Infinity War
The Avengers
The Avengers: Earth's Mightiest Heroes
Ultimate Avengers
Ultimate Avengers II
The Avengers
Avengers Assemble
Avengers Assemble
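
If you're wondering what `foreach Title` is actually doing, here is a rough sketch of three ways to pull one property off every object in a list. The `$sample` array is made-up data standing in for `$movies.Search`, so you can run it without an API key.

```powershell
# Hypothetical stand-in for $movies.Search (not real API output).
$sample = @(
    [pscustomobject]@{ Title = 'The Avengers';      imdbID = 'tt0848228' }
    [pscustomobject]@{ Title = 'Avengers Assemble'; imdbID = 'tt2455546' }
)

# 1. The shorthand used above: foreach is an alias for ForEach-Object,
#    and a bare property name means "give me this property of each item".
$a = $sample | foreach Title

# 2. The longhand with an explicit script block; $_ is the current item.
$b = $sample | ForEach-Object { $_.Title }

# 3. Member-access enumeration (PowerShell 3 and later): asking the array
#    for a property it doesn't have returns it from each element instead.
$c = $sample.Title

$a   # prints "The Avengers" and "Avengers Assemble" on separate lines
```

All three produce the same two strings. The first is the quickest to type, the second is the easiest to extend when you need more than a property lookup.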

Pretty neat, but I spot a couple of duplicates. Let's take a look at the raw data again.

$movies.Search | select title, year, imdbID
Title                                  Year      imdbID
-----                                  ----      ------
The Avengers                           2012      tt0848228
Avengers: Age of Ultron                2015      tt2395427
Avengers: Infinity War                 2018      tt4154756
The Avengers                           1998      tt0118661
The Avengers: Earth's Mightiest Heroes 2010–2012 tt1626038
Ultimate Avengers                      2006      tt0491703
Ultimate Avengers II                   2006      tt0803093
The Avengers                           1961–1969 tt0054518
Avengers Assemble                      2013–     tt2455546
Avengers Assemble                      2013–     tt2455546

The select command (a shorthand for Select-Object) allows us to pare down the data we're getting so we can take a closer look. It looks like we have two exact copies of Avengers Assemble (with the same imdbID) but three different movies called The Avengers.

We want to dedupe the list of movies based on their imdbID. The sort command (shorthand for Sort-Object) has exactly what we need. We can sort a list of objects by one of their properties, and if we also pass -uniq (an abbreviation of -Unique) to sort, it keeps only the first object for each distinct value of that property.

$movies.Search | sort imdbId -uniq | select title, year, imdbId
Title                                  Year      imdbID
-----                                  ----      ------
The Avengers                           1961–1969 tt0054518
The Avengers                           1998      tt0118661
Ultimate Avengers                      2006      tt0491703
Ultimate Avengers II                   2006      tt0803093
The Avengers                           2012      tt0848228
The Avengers: Earth's Mightiest Heroes 2010–2012 tt1626038
Avengers: Age of Ultron                2015      tt2395427
Avengers Assemble                      2013–     tt2455546
Avengers: Infinity War                 2018      tt4154756
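
To try the dedupe step without calling the API, here is a small sketch on hand-written data. The `$sample` array is hypothetical and only mimics the shape of `$movies.Search`.

```powershell
# Hypothetical stand-in for $movies.Search, with one repeated imdbID.
$sample = @(
    [pscustomobject]@{ Title = 'Avengers Assemble'; imdbID = 'tt2455546' }
    [pscustomobject]@{ Title = 'The Avengers';      imdbID = 'tt0848228' }
    [pscustomobject]@{ Title = 'Avengers Assemble'; imdbID = 'tt2455546' }
)

# Sort by imdbID and keep one record per distinct imdbID.
$deduped = $sample | Sort-Object -Property imdbID -Unique

$deduped | Select-Object -Property Title, imdbID
$deduped.Count   # 2
```

One thing to keep in mind: because the dedupe is done by sorting, the results come back ordered by imdbID rather than in the order the API returned them, which is what happened in the table above.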

Pretty cool so far, but what if we want to convert the result to JSON?

No problem! The ConvertTo-Json command will convert the objects to JSON for us.

$movies.Search | sort imdbId -uniq | select title, year, imdbId | ConvertTo-Json
[
  {
    "Title": "The Avengers",
    "Year": "1961–1969",
    "imdbID": "tt0054518"
  },
  {
    "Title": "The Avengers",
    "Year": "1998",
    "imdbID": "tt0118661"
  },
  {
    "Title": "Ultimate Avengers",
    "Year": "2006",
    "imdbID": "tt0491703"
  },
  {
    "Title": "Ultimate Avengers II",
    "Year": "2006",
    "imdbID": "tt0803093"
  },
  {
    "Title": "The Avengers",
    "Year": "2012",
    "imdbID": "tt0848228"
  },
  {
    "Title": "The Avengers: Earth's Mightiest Heroes",
    "Year": "2010–2012",
    "imdbID": "tt1626038"
  },
  {
    "Title": "Avengers: Age of Ultron",
    "Year": "2015",
    "imdbID": "tt2395427"
  },
  {
    "Title": "Avengers Assemble",
    "Year": "2013–",
    "imdbID": "tt2455546"
  },
  {
    "Title": "Avengers: Infinity War",
    "Year": "2018",
    "imdbID": "tt4154756"
  }
]
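
As a quick sketch of going the other way, ConvertFrom-Json (which shows up in the Get-Command listing below) turns JSON text back into objects. The `$sample` data is made up for the example.

```powershell
# Hypothetical sample objects, shaped like the deduped movie list.
$sample = @(
    [pscustomobject]@{ Title = 'The Avengers';      imdbID = 'tt0848228' }
    [pscustomobject]@{ Title = 'Avengers Assemble'; imdbID = 'tt2455546' }
)

# Objects -> JSON text (a single multi-line string).
$json = $sample | ConvertTo-Json

# -Compress drops the whitespace, which is handy for logs or one-liners.
$compact = $sample | ConvertTo-Json -Compress

# JSON text -> objects again, so the usual property access works.
$roundTrip = $json | ConvertFrom-Json
$roundTrip[1].Title   # Avengers Assemble
```

Once the JSON is back in object form, everything from earlier in the post (foreach, select, sort) applies to it again.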

You might be asking: how are you supposed to find the ConvertTo-Json command? The cool thing about PowerShell is that commands are conventionally named <Verb>-<Noun>. Because of that convention, PowerShell's Get-Command can search all of the available commands by verb or by noun.

Get-Command  -Noun Json
CommandType     Name                                               Version    Source
-----------     ----                                               -------    ------
Cmdlet          ConvertFrom-Json                                   6.1.0.0    Microsoft.PowerShell.Utility
Cmdlet          ConvertTo-Json                                     6.1.0.0    Microsoft.PowerShell.Utility
Cmdlet          Test-Json                                          6.1.0.0    Microsoft.PowerShell.Utility

One pretty cool thing about Get-Command: because it returns an array of objects like any other PowerShell command, we can use what we already learned to tone down the noise of this list.

Get-Command -Noun Json | select Name
Name
----
ConvertFrom-Json
ConvertTo-Json
Test-Json
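
Get-Command can filter on the verb half of the name as well. Here is a small sketch that combines both filters and also shows the difference between selecting a property and expanding it; the variable names are just for illustration.

```powershell
# Filter on both halves of the <Verb>-<Noun> name at once.
$match = Get-Command -Verb ConvertTo -Noun Json

# select Name keeps objects that have a single Name property...
$asObjects = $match | Select-Object -Property Name

# ...while -ExpandProperty unwraps the property into plain strings.
$asStrings = $match | Select-Object -ExpandProperty Name

$asStrings   # ConvertTo-Json
```

The objects are nicer for tables, the strings are nicer when you want to feed the names into something else.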

We already covered the fact that you can get help about any command by using -? or Get-Help. But on top of printing out the help in your console, you can open up the help in your default web browser.

Get-Help ConvertTo-Json -Online

That opens your default web browser at the online documentation for the command.

Shorthand vs. Longhand

Now that we've seen Get-Command -Noun and Get-Command -Verb, we can understand why PowerShell's commands can seem so verbose, but also why we wouldn't want the PowerShell designers to just stick with the shorthand names.

Let's take a look at all of the aliased commands that work with objects:

How did I get that list?
Get-Alias |
foreach ResolvedCommand |
where Noun -eq "object" |
foreach name

First, I asked PowerShell for a list of all of the aliases in the system. Next, I mapped each alias to its ResolvedCommand property, which is the command the alias points to. Then I restricted the results to the resolved commands whose noun was Object, and finally extracted the Name property from each match.

How did I figure that out?

PS1> Get-Alias
CommandType     Name                                               Version    Source
-----------     ----                                               -------    ------
Alias           % -> ForEach-Object
Alias           ? -> Where-Object
Alias           ac -> Add-Content
Alias           algm -> Add-LocalGroupMember                       1.0.0.0    Microsoft.PowerShell.LocalAccounts
Alias           cat -> Get-Content
Alias           cd -> Set-Location
...

PS1> # look for members inside of the alias object that have something to do
     # with a command
PS1> Get-Alias | Get-Member | where Name -like "*command*"
Name                MemberType     Definition
----                ----------     ----------
CommandType         Property       System.Management.Automation.CommandTypes CommandType {get;}
ReferencedCommand   Property       System.Management.Automation.CommandInfo ReferencedCommand {get;}
ResolvedCommand     Property       System.Management.Automation.CommandInfo ResolvedCommand {get;}
ResolvedCommandName ScriptProperty System.Object ResolvedCommandName {get=$this.ResolvedCommand.Name;}

PS1> # map over the resolved commands, which give us back the same
     # thing as Get-Command, which we used before, so we know how to
     # filter it by Noun. Let's map over noun, just to be sure. We'll
     # use `select -first`, an important feature we haven't covered yet
     # to avoid a massive amount of spew, and then pipe into ConvertTo-Json
     # to make the output clear.
PS1> Get-Alias | foreach ResolvedCommand | foreach Noun | select -first 10 | ConvertTo-Json

[
    "Object",
    "Object",
    "Content",
    "PSSnapin",
    "Content",
    "Location",
    "String",
    "Location",
    "Content",
    "Host"
]

Looks good! We can now use where and foreach to finish the job.

Because I was lazy, I also used PowerShell to create the bulleted list to paste into my blog software.

Get-Alias |
foreach ResolvedCommand |
where Noun -eq "object" |
foreach name |
foreach { "* ``$psitem``" }
* `ForEach-Object`
* `Where-Object`
* `Compare-Object`
* `Compare-Object`
* `ForEach-Object`
* `Group-Object`
* `Measure-Object`
* `Select-Object`
* `Sort-Object`
* `Tee-Object`
* `Where-Object`
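
Notice that some names show up twice in that list. That's because more than one alias can point at the same command (for example, both % and foreach resolve to ForEach-Object). Since this post is about deduplicating, here is a sketch of the same pipeline with the duplicates removed. The exact set of names depends on which aliases exist on your platform, so I've left the output out.

```powershell
# Same pipeline as above, with one extra stage to drop repeated names.
# Sort-Object is spelled out here instead of the sort alias.
$unique = Get-Alias |
    foreach ResolvedCommand |
    where Noun -eq "object" |
    foreach Name |
    Sort-Object -Unique

$unique
```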

In general, the aliases are much prettier:

How'd I get this list?

```powershell
Get-Alias |
where { $psitem.ResolvedCommand.Noun -eq "object" } |
foreach DisplayName |
foreach { "* ``$_``" }
```

When working interactively in the shell, the short names are really great. But the longer canonical names follow the <Verb>-<Noun> convention, which makes them more discoverable. Generally speaking, long-time PowerShell authors also find the longer names more readable in scripts that will need to be maintained.

Bottom line: both the longhand and the shorthand have their place, and you will probably find yourself using the shorthand versions almost exclusively when working interactively at the command line.
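
To make the comparison concrete, here is a sketch of one pipeline from this post written both ways. The variable names `$short` and `$long` are mine, and the longhand spells out parameter names that the shorthand leaves implicit.

```powershell
# Shorthand: what I'd type at the prompt.
$short = Get-Alias |
    foreach ResolvedCommand |
    where Noun -eq "object" |
    foreach Name

# Longhand: what I'd rather read in a script six months from now.
$long = Get-Alias |
    ForEach-Object -MemberName ResolvedCommand |
    Where-Object -Property Noun -EQ -Value 'Object' |
    ForEach-Object -MemberName Name

# Both pipelines should produce the same names in the same order.
($short -join ',') -eq ($long -join ',')
```

The longhand is more typing, but every stage says exactly which command and which parameter it's using.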

Bonus Points: Digging Into the Command

One last thing: even though Get-Command gives us a simplified table of the commands that matched, there's way more information inside.

Get-Command -Name ConvertTo-Json | Get-Member


   TypeName: System.Management.Automation.CmdletInfo
Name                MemberType     Definition
----                ----------     ----------
Equals              Method         bool Equals(System.Object obj)
GetHashCode         Method         int GetHashCode()
GetType             Method         type GetType()
ResolveParameter    Method         System.Management.Automation.ParameterMetadata ResolveParameter(string name)
ToString            Method         string ToString()
CommandType         Property       System.Management.Automation.CommandTypes CommandType {get;}
DefaultParameterSet Property       string DefaultParameterSet {get;}
Definition          Property       string Definition {get;}
HelpFile            Property       string HelpFile {get;}
ImplementingType    Property       type ImplementingType {get;}
Module              Property       psmoduleinfo Module {get;}
ModuleName          Property       string ModuleName {get;}
Name                Property       string Name {get;}
Noun                Property       string Noun {get;}
Options             Property       System.Management.Automation.ScopedItemOptions Options {get;set;}
OutputType          Property       System.Collections.ObjectModel.ReadOnlyCollection[System.Management.Automation.PSTypeName] OutputType {get;}
Parameters          Property       System.Collections.Generic.Dictionary[string,System.Management.Automation.ParameterMetadata] Parameters {get;}
ParameterSets       Property       System.Collections.ObjectModel.ReadOnlyCollection[System.Management.Automation.CommandParameterSetInfo] ParameterS…
PSSnapIn            Property       System.Management.Automation.PSSnapInInfo PSSnapIn {get;}
RemotingCapability  Property       System.Management.Automation.RemotingCapability RemotingCapability {get;}
Source              Property       string Source {get;}
Verb                Property       string Verb {get;}
Version             Property       version Version {get;}
Visibility          Property       System.Management.Automation.SessionStateEntryVisibility Visibility {get;set;}
DLL                 ScriptProperty System.Object DLL {get=$this.ImplementingType.Assembly.Location;}
HelpUri             ScriptProperty System.Object HelpUri {get=$oldProgressPreference = $ProgressPreference…

Armed with that information, we can give ourselves a better table of information about the JSON facilities that come with PowerShell.

Get-Command -Noun Json | select Name, Noun, Verb, HelpUri
Name             Noun Verb        HelpUri
----             ---- ----        -------
ConvertFrom-Json Json ConvertFrom https://go.microsoft.com/fwlink/?LinkID=217031
ConvertTo-Json   Json ConvertTo   https://go.microsoft.com/fwlink/?LinkID=217032
Test-Json        Json Test        https://msdn.microsoft.com/en-us/library/bb299886.aspx
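
The Parameters member in that Get-Member listing is a dictionary, so it can be explored like any other object. Here is a small sketch that lists the parameters of ConvertTo-Json and looks at one of them; `$cmd` is just a local variable for the example.

```powershell
# Grab the command object once so we can poke at it.
$cmd = Get-Command -Name ConvertTo-Json

# The dictionary keys are the parameter names. The list also includes
# common parameters such as Verbose that every cmdlet gets.
$cmd.Parameters.Keys | Sort-Object

# Each value is a ParameterMetadata object; ParameterType is the .NET
# type the parameter expects.
$cmd.Parameters['Depth'].ParameterType
```

This is a handy way to check which parameters a command accepts without leaving the pipeline.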

By now you should be getting the picture: since PowerShell works with collections of objects rather than text files, you can apply the same basic techniques to any kind of collection you come across.

As your fundamental skills improve, you'll be able to manipulate not only JSON documents, but also processes, files, and even functions and aliases in the same way!
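
As a parting sketch, here is the same sort-and-dedupe trick from the movie list applied to running processes instead of JSON. The output depends entirely on what's running on your machine, so I haven't included any.

```powershell
# One entry per distinct process name, limited to the first five.
$names = Get-Process |
    Sort-Object -Property ProcessName -Unique |
    Select-Object -First 5 -ExpandProperty ProcessName

$names
```

Swap Get-Process for Get-ChildItem or Get-Alias and the rest of the pipeline keeps working, as long as you pick a property those objects actually have.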