Fast extraction of duplicate array items into a new array

Posted by blackrat on April 18, 2007

Whilst creating a natural language parser, one of the things I was presented with was multiple merged dictionaries, which needed some processing. I was asked to supply a list of duplicated words back, and after trawling the web finding slow code, I decided that going the fast, but inefficient (in terms of space) was the way to go.
The object is to return an array of just those elements that are duplicated in as little time as possible, from a 50,000 word list. One sort and one lookup per element was what I came up with


array=["long","array","with","lots","and","lots","and","lots","of","array","duplicates","long","and","array"]
arr2=array.sort
arr3=[]
arr4=[]
arr2.each { |a| (arr3[-1]==a) ? (arr4[-1]!=a) ? arr4 << a : "" : arr3 << a }
arr3 => [”and”,”array”,”duplicates”,”long”,”lots”,”of”,”with”]
arr4 => [”and”,”array”,”long”,”lots”]

If you are sure that there are fewer duplicates in the returned array of just duplicated elements, you can remove the arr4 checks at the expense of a final .uniq! pass. i.e.


array=["long","array","with","lots","and","lots","and","lots","of","array","duplicates","long","and","array"]
arr2=array.sort
arr3=[]
arr4=[]
arr2.each { |a| (arr3[-1]==a) ? arr4 << a : arr3 << a }
arr4.uniq!

arr3 => [”and”,”array”,”duplicates”,”long”,”lots”,”of”,”with”]
arr4 => [”and”,”array”,”long”,”lots”]
Trackbacks

Use this link to trackback from your own site.

Comments

You must be logged in to leave a response.