PHP Faster in Array

Recently I was porting some legacy PHP code into a Laravel framework. I came upon an interesting function in the legacy code with a name that intrigued me and got me wondering how useful this could be.

/** * Faster array lookup than in_array() where the needle is a key of the haystack. * Haystack should be in the form of [ 'key' => 1 ]. * The value of the haystack is arbitrary since it's just a placeholder. The key is used for the lookup. * * @param $needle * @param $haystack * * @return bool */ function fast_in_array($needle, $haystack) { return isset($haystack[$needle]); }

In the place it was being used, the array only had one element in it, so not very useful. But I noted it for possible later use.

One of the requirements for the project was to send out blast emails. Blasts could be sent out to a few users or up to 50,000 users. They contained optional sections, substitutions variables, custom signatures, tracking codes etc.

The emails were written by professional writers and customized for each sender to appear as a personal email from sender to recipient. For this reason, it is very important that the same email never went out to the same person twice.

Upon sending, we would look through sent logs (in a data store) and remove any users that already received that specific email.

Originally I looped through all the “sent” objects and compare their emails to recipients objects and remove any matches. So two nested for each loops. Almost an N-Squared O(N2) solution. In other words, an algorithm whose performance is directly proportional to the square of the size of the input data set. Not quite in this case since it was not iterating over the same data sets, but close enough.

This is fine when working will small sets of data. However as that data grows, it will start increasing run times dramatically.

foreach ( $sents as $sent ){ foreach ($users as $key => $user ){ if($user->email === $sent->email){ unset($users[$key]); } } }

Here are some time results with different data sets using nested iterations:

Number of users: 300
Number of sents : 100
Elapsed time: 0.0031440258026123

Very decent for a small data sets. But what happens when the data set size increases to 30,000 users and 10,000 “sents".

Number of users: 30000
Number of sents : 10000
Elapsed time: 104.06943202019

We can see that it is now taking a significant amount of time to process this.

And what happens if we increase it to 60,000 users but keeps the “sents" at 10,000.

Number of users: 60000
Number of sents : 10000
Elapsed time: 226.92668390274

Doubling the original users array more than doubles the processing time. So lets try implementing the fast_in_array function.

The first thing we need to do is create a new array which has the element we are searching for as the key.

$haystack = []; foreach ($sents as $sent){ $haystack[$sent->email] = 1; }

Next, we still need to loop through our users once but instead of the nested loop we use our fast_in_array function.

foreach ( $users as $key => $user ){ if ( fast_in_array( $user->email, $haystack ) ){ unset($users[$key]); } }

At 300 users and 100 "sents" we get:

Number of users: 300
Number of sents : 100
Elapsed time: 0.00028800964355469

A significant improvement over the nested loops, but insignificant at this point.

So lets increase it to 30,000 users and 10,000 sents.

Number of users: 30000
Number of sents : 10000
Elapsed time: 0.040152072906494

We go from 104.069 secs to 0.040 secs. Now that is a very significant improvement.

And when we try 60,000 users and 10,000?

Number of users: 60000
Number of sents : 10000
Elapsed time: 0.078256845474243

We now go from 226.927 secs to 0.078 secs. Unbelievable performance gain.

So why is this?

Surely a few reasons, but we know PHP stores arrays in C using a hash table. My guess would be that the keys are sorted and when using PHP’s isset() it does not need to iterate over the complete set of elements before it knows when it is there or not. But I would need to dig in some PHP C code to confirm that.

You can view the complete PHP code used to perform these tests at: https://github.com/mikepmtl/faster-in-array

.