How I found you?

1. Introduction

In this project I want to gather as many agencies' email addresses as possible, along with additional information, to decide which agencies to prioritize and when to send my proposal email.

2. California

2.1. Plan

  • 1st stage
    • Scraping marketing agencies' records from Google Maps.
  • 2nd stage
    • Removing duplicates from the database.
    • Removing agencies with no website from the database.
  • 3rd stage
    • Scraping the latest 10 reviews.
    • Removing agencies whose 10 most recent reviews are one star.
    • Removing agencies whose 10 most recent reviews were all submitted more than two months ago.
  • 4th stage

Below is a plot of the agency count at each step.

2.2. Analysis

  • 2007 was the starting year of the marketing-agency era; here are the reasons, from this article I found, Link:
    • The iPhone 1 was released in 2007.
    • Facebook launched Facebook Pages in 2007.
    • The tipping point for Twitter's popularity was SXSW 2007.
    • Google launched universal search in 2007.
    • Radiohead self-released their album In Rainbows in 2007.
    • The Amazon Kindle launched in 2007.
    • HubSpot released its first product in 2007.
    • The first edition of the book The New Rules of Marketing and PR was released in June 2007.

Below is a plot showing the relationship between the business' filing date, the domain creation date, and overall success as measured by rating and review count.

  • Where:
    • X-axis is the business' filing date.
    • Y-axis is the interval in years between the domain creation and the business filing; if:
      • Y is negative: the domain was created before the business.
      • Y is positive: the business was created before the domain.
    • Point size is determined by the business' number of reviews on Google Maps.
    • Point color is determined by the business' rating on Google Maps.

Below are some self-explanatory plots.

3. Australia

3.1. Plan

  • 1st stage
    • Scraping marketing agencies' records from Google Maps.
  • 2nd stage
    • Removing duplicates from the database.
    • Removing agencies with no website from the database.
  • 3rd stage
    • Scraping email addresses from agencies' websites.
    • Removing agencies with no email address from the database.
    • Validating agencies' email addresses.
    • Removing agencies with invalid or unreachable email addresses.
  • 4th stage
    • Scraping reviews data.
    • Scraping domain data.
    • Scraping business data.

Below is a plot of the agency count at each step.

3.2. Analysis

Disclaimer: data regarding agencies' creation dates, whether from business or domain records, is not available for all of them.
Colors are taken from here: https://en.wikipedia.org/wiki/Australian_state_and_territory_colours

  • When were agencies created?
  • Assuming that review count is a metric of success, do agencies that started earlier get more reviews?

Below are some self-explanatory plots.

4. Plan

4.1. Scraping Google Maps

Due to Google's restrictions against bots, this process is semi-automated.
I used Selenium to open Google Maps, searched "x agency near x city", and executed a script that records the results from the browser into a data frame.
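Since Selenium only hands back the visible text of each result card, the recording script mostly boils down to parsing that text. Below is a minimal sketch of such a parser; the card layout (name, then "X stars N Reviews", then the business type) is an assumption based on my sample, and `parse_card` is a hypothetical helper, not the exact script I ran.

```python
import re

def parse_card(card_text: str) -> dict:
    """Parse the visible text of one Google Maps result card into a record.

    Assumed card format (hypothetical, based on my scraped sample):
        "Cyrusson Inc\n5.0 stars 14 Reviews\nMarketing agency"
    """
    lines = card_text.strip().split("\n")
    record = {"name": lines[0], "stars": None, "reviews": None, "btype": lines[-1]}
    # Rating and review count share one line, e.g. "5.0 stars 14 Reviews".
    m = re.search(r"([\d.]+) stars? ([\d,]+) Reviews?", card_text)
    if m:
        record["stars"] = float(m.group(1))
        record["reviews"] = int(m.group(2).replace(",", ""))
    return record

print(parse_card("Cyrusson Inc\n5.0 stars 14 Reviews\nMarketing agency"))
```

Each parsed record then becomes one row of the data frame shown below.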

The full raw data can be found here, but here is a sample:

name                  gurl  wurl  stars                  btype
Sojern Inc            url   url   3.4 stars, 7 reviews   Advertising agency
Segal Communications  url   url   5.0 stars, 5 reviews   Public relations firm
Cyrusson Inc          url   url   5.0 stars, 14 reviews  Marketing agency

4.2. Getting more Geo-data

From the variable "gurl" in the raw data above, I can extract the coordinates of a given agency.
To send emails exactly during working hours I needed the timezone each agency is in, so I used the coordinates to get the city and country name, among much other information.
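The coordinate extraction itself is a one-line regex. A minimal sketch, assuming the gurl carries coordinates in the common `@lat,lng,zoom` form (the example URL and the helper name are illustrative; real gurl values may need the `!3d<lat>!4d<lng>` variant handled too):

```python
import re

def coords_from_gurl(gurl: str):
    """Extract (lat, lng) from a Google Maps URL of the '@lat,lng,zoom' form."""
    m = re.search(r"@(-?\d+\.\d+),(-?\d+\.\d+)", gurl)
    if m is None:
        return None  # URL variant without inline coordinates
    return float(m.group(1)), float(m.group(2))

# Hypothetical example URL:
url = "https://www.google.com/maps/place/Example/@34.0522,-118.2437,15z/"
print(coords_from_gurl(url))  # (34.0522, -118.2437)
```

From the coordinates, a third-party library such as timezonefinder can then resolve the IANA timezone name offline.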

The full raw data can be found here.

PS: if you are wondering why I couldn't get the city name in the first phase: Google Maps results are inconsistent about displaying that information.


4.3. Getting Domain public information

The older an agency, the better, or so I assume; the year of an agency's creation will be a good variable to put in the prioritization matrix, but where can I find this information?

Luckily, I had the idea of looking up the agency's web domain creation date instead; sadly, some agencies have restricted such info from being publicly shared, but I can work without it.

Another variable for the metric is the agency's web domain expiration date: we are in 2024, so if a domain owner has extended the expiration date to 2028, it means the owner trusts in the agency's future success.

The full raw data can be found here.

4.4. Getting Website data

The data that should be gathered from each agency website is:

  • Email address
  • Keywords like: career, job, opportunity, and remote.
  • Data about web page content:
    • pagel: Number of characters (letters, numbers, symbols) found in the agency's homepage HTML.

      url                   count
      lowkeydigital.com.au  7000855
      cavalry.co.nz         14156

      My reasoning for capturing this is that the more characters there are in the HTML, the lower the chances the website was made by a full-stack web developer.

    • paget: Number of characters (letters) found in the agency's homepage text.
      Paired with the variable above, I can get a ratio of how many elements are on a page versus how much text; with trial and error, hopefully, I can get a metric for prioritizing; more on that in a later chapter.

      url               count
      limit.agency      136655
      madebymoment.com  200
    • pagei: Number of image tags in the agency's homepage.

      url                  count
      infinitymediala.com  7652
      36creative.com       0
    • pagea: Number of anchor/link tags in the agency's homepage.

      url                        count
      seoagencylosangelesca.com  4644
      momentumagency.co.nz       1

The data can be found here.
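These page variables can all be collected with a small stand-alone parser. Below is a minimal stdlib-only sketch, not the exact scraper I used; the sample HTML and the helper names are illustrative, and fetching the homepage itself would be a separate HTTP step (e.g. with urllib).

```python
import re
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Count <img> and <a> tags and accumulate visible text length."""
    def __init__(self):
        super().__init__()
        self.images = 0
        self.anchors = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self.anchors += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def page_variables(html: str) -> dict:
    stats = PageStats()
    stats.feed(html)
    # Crude email pattern; good enough for a first pass over homepages.
    emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html))
    return {
        "pagel": len(html),         # characters in the raw HTML
        "paget": stats.text_chars,  # characters of visible text
        "pagei": stats.images,      # <img> tags
        "pagea": stats.anchors,     # <a> (anchor/link) tags
        "emails": emails,
    }

sample = ('<html><body><a href="/">Home</a><img src="x.png">'
          '<p>Contact: hello@example.com</p></body></html>')
print(page_variables(sample))
```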

4.5. Prioritizing metric

  • Removing all observations without an email on their website.
  • Making a metric for the website data frame:
    • The more text and images, the better.
    • The fewer HTML elements and spammy links, the better.
    • Additional keyword counts like: career, job, opportunity, and remote.
      \[Wf_{score}=\frac{\log{(Page_t)}*\log{(Page_i+2)}^2}{\log{(Page_l)}*\log{(Page_a+1)}}+\log{(KeywordsCount)*2}\]
  • Laplace's rule of succession
    We could use Google Maps reviews, but if one agency has 5 stars from 3 reviews and another has 4.9 stars from 10 reviews… which one is better to choose?
    Laplace's rule of succession answers this question by adding two pseudo-reviews to the rating, one one-star and one five-star; more about this method in 3Blue1Brown's video.
    \[LRS_i=\frac{Rating_{x_i}*RatingCount_{x_i}+1+5}{RatingCount_{x_i}+2}\]
  • Domain importance
    The earlier an agency's domain was created, the more stable the business; and the more months remain before the domain expires, the more confident and committed the business is.
    \[Domain_{im}=\log(CreationDate+1)+\log(MonthsUntilExpiration+1)\]
  • Score

    • PS: The score is not, by any means, an evaluation of an agency's success; it is just a way, for me, to prioritize reaching agencies via email, given the data that I collected.
    • PPS: I simply added the scores together, because some businesses keep their domain information private, and a zero in \(Domain_{im}\) is not a true zero, so multiplication would be unfair.

    \[Score=Wf_{score}+LRS_i+ Domain_{im}\]
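As a concrete check of the review component, the \(LRS_i\) formula above can be computed directly, and it settles the 5-stars-from-3-reviews versus 4.9-from-10 question posed earlier (the helper name is mine):

```python
def lrs(rating: float, count: int) -> float:
    """Laplace's rule of succession: add one 1-star and one 5-star pseudo-review."""
    return (rating * count + 1 + 5) / (count + 2)

# 5.0 stars from 3 reviews vs 4.9 stars from 10 reviews:
print(lrs(5.0, 3))   # 4.2
print(lrs(4.9, 10))  # ~4.583 -> the 10-review agency ranks higher
```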

The final data can be found here.

4.6. Automating email outreach

Several factors come into play when sending the emails:

  1. ProtonMail, the email provider I used because my full name was available at the time, restricts sending to 50 emails per hour and 150 emails per day.
  2. Timezones are a thing… if I want my email to be read, and not overshadowed, I need to send it during working hours, preferably 11 AM; this makes the prioritization metric somewhat less useful.
    Example: if an agency in LA scored higher than one in NY, the NY agency will still receive the email first… because of time; as I write this, I realize I need to understand timezones better.
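The timezone part can be handled with Python's zoneinfo: for each agency, compute the next 11 AM local time expressed in UTC, and sort the send queue by that first, then by score. A minimal sketch (the function name is mine; zoneinfo relies on the system tzdata, or the tzdata package, being present):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def next_11am_utc(now_utc: datetime, tz_name: str) -> datetime:
    """Next 11:00 local time in tz_name, expressed as a UTC datetime."""
    local = now_utc.astimezone(ZoneInfo(tz_name))
    target = local.replace(hour=11, minute=0, second=0, microsecond=0)
    if target <= local:              # 11 AM already passed today locally
        target += timedelta(days=1)
    return target.astimezone(UTC)

now = datetime(2024, 3, 18, 8, 0, tzinfo=UTC)  # 08:00 UTC
la = next_11am_utc(now, "America/Los_Angeles")
ny = next_11am_utc(now, "America/New_York")
print(ny < la)  # True: New York reaches 11 AM local before Los Angeles
```

Sorting by this timestamp means the NY agency from the example above is simply scheduled earlier, without discarding the priority score within each timezone.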

4.7. Letter example

Date: 2024-03-17 Sun 23:46

Created: 2025-07-28 Mon 01:55
