Data Science, Big Data and SEO: Connecting the Dots

Chris Tomlinson April 2016

We live amidst an explosion of information. The Internet and digital technology have not only opened up access to much of the accumulated knowledge of humankind, but also granted access to an incredible diversity of data about almost anything imaginable – so-called “big data”.

This information is invaluable to businesses – it provides them with concrete, hard evidence on which to base virtually any type of business decision, from choosing a bricks-and-mortar location to creating products, adding product features, marketing to potential customers, website SEO and everything in between.

However, just because that information is out there doesn’t mean that it’s easily accessible. Most of it is raw, meaning that it needs to be parsed, handled, shaped and moulded into something with which you can work. Otherwise, it’s just so much gibberish. This is where data science comes in. Data science takes big data’s information and makes it usable and understandable to mere humans.

What is data science, and how can it affect SEO and your traction online? In this guide, we’ll explore the connections between data science, big data and search engine optimisation.

What Is Data Science?

According to NYU, data science is, “an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modelling, statistics, analytics and mathematics.” Of course, that sheds very little light on the topic.

For a slightly more enlightening definition, we should turn to Forbes. In an article for the magazine titled A Very Short History of Data Science, author Gil Press writes, “The term “data science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years.”

So, data science is both new and very old. In fact, humans have been utilising data science for as long as there has been a need to analyse information in order to make decisions. It’s not solely related to business, either. Caesar used rudimentary data science to analyse information provided by his generals and commanders. It was used by Carthaginian merchants when determining where to create their trade routes. Hannibal used it when crossing the Alps with his elephants – it goes on and on, back into the misty depths of history.

Really, it’s only the name that has changed, along with the fact that data scientists, who are tasked with making sense of this information, now have an incredibly vast amount of data with which to work. Here are a few facts that might be rather “eye opening”:

  • The past two years have seen more information created than in the entire previous history of humanity.
  • By 2020, we will be generating 1.7 MB of new data per second of every day for every person on the planet.
  • 40,000 search queries are conducted every second through Google alone.
  • Only 23% of businesses have a plan for dealing with this deluge of data.

Data Science, Big Data and SEO

The terms “data science” and “data scientist” are both pretty new, coined only a few years back. D.J. Patil spoke of creating them in Building Data Science Teams. He said, “Starting in 2008, Jeff Hammerbacher and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialisation … we realised that as our organisations grew, we both had to figure out what to call the people on our teams.

‘Business analyst’ seemed too limiting. ‘Data analyst’ was a contender, but we felt that title might limit what people could do. After all, many of the people on our teams had deep engineering expertise. ‘Research scientist’ was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams.

It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”

From that, we can infer that data science is essentially the use of data, science and mathematics to extract the key information needed to make massive changes within a business or organisation. Now, let’s tackle the other elephant in the room – big data.

What Is Big Data?

If you were to ask a dozen CEOs what they thought big data was, you would likely receive a dozen different answers, all of which were correct, but also all of which were wrong. Big data is one of the most misunderstood terms in the modern world, and for good reason. It’s big. It’s used in myriad different ways. So, what is it, really?

Lisa Arthur sums it up well in her article What Is Big Data. “Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis”. What she doesn’t say, but implies, is that data science is needed to extract value and meaning from the sea of big data. Raw information isn’t particularly useful. It needs to be changed into a format that human beings can digest, that can give us the means to connect the dots and make more informed decisions. Otherwise, it’s just noise. Nor is big data limited to the digital world.

Arthur goes on to explain that, “Some people like to constrain big data to digital inputs like web behaviour and social network interactions; however, the CMOs and CIOs I talk with agree that we can’t exclude traditional data derived from product transaction information, financial records and interaction channels, such as the call centre and point-of-sale. All of this is big data, too, even though it may be dwarfed by the volume of digital data that’s now growing at an exponential rate.”

We can gain further insight by considering Gartner’s definition of big data, which has been misconstrued and truncated. According to Gartner, big data “is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” From that, we can see that there are the infamous “three V’s” that need to be explained. We’ll do that below.

Volume: Volume is perhaps the simplest to understand, simply because most business owners, managers and CEOs have first-hand experience with the sheer amount of information now available to them.

Velocity: It’s tempting to think that velocity refers to real-time analysis, but that’s only part of the equation. It also refers to the speed of change and evolution within that data, and how to connect data sets arriving at different speeds, rather than at a steady, predictable pace.

Variety: Variety is where things get really interesting. Businesses now have access to an unprecedented variety of information, from an incredible number of sources. There are tweets, elevator logs, emails, social media interaction, financial data sheets, P & L statements and more. Perhaps most interestingly, the most valuable data is already owned by and stored within the business, but it’s not being utilised. This is called “dark data”, and transforming it from an unknown, unseen entity into usable information can make a dramatic difference to organisations of all sizes.

In fact, entrepreneur and data scientist Matt Bentley points out that most businesses are actually using less than one per cent of their available data. “With all the great new marketing tools that have been developed in recent years, the amount of data available to us is increasing exponentially,” he says. “And thanks to mobile, Internet of Things, and a better understanding of unstructured and semi-structured data, you can bet that there’s a lot more coming. And how much of that data are we currently analysing and putting to good use? Just 0.5%. That’s right – we’re investing billions in new tools and software to let us collect and store more marketing data, and then we’re essentially throwing 99.5% of it away.”

We don’t have to look very far to see just how vast, varied and important big data and data science are to today’s organisations. Toyota and Microsoft just partnered to create a new firm, called Toyota Connected. Based in Texas, in the US, the firm will focus on “expanding the automaker’s abilities in data centre management, data science and data management.” According to Tech Republic, “Toyota Connected is a collaboration between Toyota and Microsoft, as the company will use Microsoft Azure to build and deliver its data-based tools and services.”

In a press release, Toyota Connected’s CEO, Zack Hicks, said, “From telematics services that learn from your habits and preferences, to use-based insurance pricing models that respond to actual driving patterns, to connected vehicle networks that can share road conditions and traffic information, our goal is to deliver services that make lives easier.”

This is just one example of how big data can be accessed, parsed, harnessed and put to use for an organisation, as well as for the individuals that organisation serves. Of course, the real challenge is transforming the potential value of your data into something more – with real, concrete value. Moreover, it need not be constrained to the physical world – the online realm can also benefit from big data analysis. How does this differ from conventional business analytics?

According to Asab Imam, “Traditional analytics analysis is concerned with extracting actionable insights from historical data. Data science (and big data) enables you to create machine learning models that will help predict a future outcome”. So, from that we can see that traditional business analytics is backwards looking, while big data and data science are forward looking. It’s about analysing past information and current data to accurately determine future events or a future path forward for your business.

There’s another difference here, as well. Traditional business analytics is dependent on very structured data. It’s rigidly segmented, and easily accessible. Big data is not. It’s unstructured and semi-structured, available in both “domesticated” varieties, as well as “in the wild”. Big data includes everything from text messages to phone calls to social media updates and more, while traditional business analytics tends to include information stored in a database or a spreadsheet.

How Do Data Science and Big Data Affect SEO?

Not quite sure how SEO, data science and big data converge? Don’t worry. You’re not alone. While the glimmerings of that connection should be apparent, it can be difficult to see beyond the mere outline. We will break things down into their separate parts to help connect the dots and show just how essential data science is to your SEO efforts.

Before we do that, though, let’s look at the “why” of the situation. Why do you need data science or access to big data for search engine optimisation? Isn’t it just a process of researching keywords, implementing them into quality content, building links out and back to your website, and interacting on social media? Actually, there’s a lot more to it than that.

Consider the fact that while Google gives us a lot of information, the company leaves a lot unsaid. Google says there are more than 200 factors in ranking online, but doesn’t say which of them matter most. There’s a lot of guesswork involved if you’re trying to build your website’s ranking, and a lot of costly trial and error. That cost comes both in monetary terms and in wasted time and effort. By harnessing big data and then assessing it with data science, you’re able to find the correlations in the data, and then determine which factors really affect your rankings, and which matter less.

For instance, you might have a page that performs very well, while another performs very poorly. Why is that? It’s tempting to look at things like social shares of the page. If the one that performs well has more social shares, then the natural conclusion seems to be that social sharing affects Google’s algorithm. However, the search engine giant has repeatedly denied that this is the case. It’s a correlation, not proof of causality.

To truly determine the cause of the different performance levels, you have to dig into the signals for each page (the one that wins traffic, and the one that loses traffic). Once that has been done, you cannot stop, as this once again only shows correlation, not causality. Data modelling must be employed to see beyond patterns and trends to the actual cause of traffic differences. In many instances, you’ll find that they have little or nothing to do with the correlations. Of course, most SEO tools available to you lack the capabilities to do this.
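As a minimal sketch of the point about correlation, the Python below computes a Pearson correlation between social shares and ranking position for a handful of pages. All figures are invented for illustration, and `pearson` is a hypothetical helper written here, not part of any SEO tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: social shares and SERP position (1 = top) per page.
shares = [500, 320, 150, 90, 20]
position = [2, 5, 9, 14, 30]

r = pearson(shares, position)
print(f"shares vs position: r = {r:.2f}")
```

A strongly negative value here only says that pages with more shares tend to sit higher in the results. It says nothing about whether shares cause the ranking, since both could be driven by a third factor such as content quality.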

Applying Data Science to SEO: How It Works

Applying data science to SEO requires specialised knowledge and training, but it can be done. There are several required steps, though. First, you need to choose the desired outcome. In most instances, this will be ranking position, although ranking is no longer considered the most important metric in SEO. With that being said, it is still crucial. It’s just not the sole measurement of success. There are other reasons to choose ranking as your outcome, such as the lack of extraneous variables that can alter the outcome (seasonality, for instance).

The next step is to choose your predictor variables. As mentioned, Google uses over 200 different signals to determine ranking. Moreover, each signal can have up to 50 variations. The job of data science is to determine what those signals are, since Google leaves us in the dark about that. Yes, there are some good guesses, informed guesses, but they’re still guesses until Google confirms them, which it is not likely to do. Identifying those signals requires technology so that you can collect the data in an organised manner. Data sources should be wide and varied, and should include your content, your site architecture, links (inbound and outbound), social data and a great deal more.

Then you’ll need to explore the information gained, model different rankings, and validate the resultant model. Modelling cannot be done by hand, though – it’s simply too vast. You’ll need modelling automation with machine learning to do this. Finally, that information needs to be put into a visual format, such as a chart, graph or something similar, so that the human brain can comprehend what it’s seeing.
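The steps above (choose an outcome, choose predictors, model, validate) can be sketched in miniature as follows. This is a deliberately tiny illustration, assuming a single predictor (inbound links) and invented page data. A real model would use many signals and a machine learning library, not a hand-written least-squares fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def mean_abs_error(slope, intercept, xs, ys):
    """Average absolute gap between predicted and observed outcome."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical pages: (inbound links, ranking position).
pages = [(5, 40), (12, 33), (20, 26), (31, 18), (45, 9), (8, 38), (27, 20), (38, 12)]

train, test = pages[:6], pages[6:]  # hold back two pages for validation
slope, intercept = fit_line(*zip(*train))
error = mean_abs_error(slope, intercept, *zip(*test))
print(f"position = {slope:.2f} * links + {intercept:.1f}; held-out error {error:.1f}")
```

Validating on pages the model has not seen is what separates a usable model from one that merely memorises the data it was fitted on.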

Potential Uses of Data Science for SEO

At first glance, it might seem that the applications for data science in the realm of SEO are a little limited. That’s not the case. They’re actually unlimited – it applies to every single aspect of search engine optimisation and bringing in targeted traffic to your website. Note that we didn’t say holding the #1 position in the SERPs. Again, this is not as important as it once was, simply because the top rank on a page for a single keyword phrase may or may not be of any benefit. Consider someone searching for “pizza London”.

Chances are good that they’re looking for a pizza restaurant. So, a business with the top slot in the SERPs would get the lion’s share of the traffic. However, someone searching for “pizza” may or may not want a restaurant. They might want a recipe, or pictures, or something else entirely. So, if you’re ranking for “pizza” and not for “pizza London”, and run such a restaurant, then the traffic you gain from Google might be worthless. They’ll click and bounce, or not click at all. Now, let’s take a closer look at some of the potential uses of data science in your SEO efforts.

Content Marketing

A lot of focus has been placed on content marketing, and rightly so. Without it, your website is essentially “dead in the water”. Google rewards quality content, and the longer the content, the better. However, not all content is created equal. You might have created compelling, original, interesting content that offers value to your audience, but fail to see any change in website traffic or conversions.

Data science can be applied to determine the problem, and to help you approach the prospect of content creation and sharing logically and scientifically. For instance, using data science, you can emulate website crawling to identify content topics, measure signals and how those signals affect business KPIs, including things like social shares, pages per visit, time spent per page and more.

In addition, this can help you home in on the best types of content for your specific audience. Perhaps they prefer longer blog posts to reports, or maybe they prefer receiving content through Facebook, rather than downloading it from your website. This can be taken as far as determining the appropriate reading levels for your title tags. In fact, one data scientist was able to pinpoint that an improvement in title tag reading ease would increase a website’s ranking by a full 20 positions. This is a perfect example of just how important data science is to content on your site and on-page SEO.
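To make “reading ease” concrete, here is a rough sketch that scores title tags with the Flesch reading ease formula. The source does not say which readability measure that data scientist used, so Flesch is an assumption. The syllable counter is a crude vowel-group heuristic, and the titles are invented.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

# Hypothetical title tags to compare.
titles = [
    "Cheap Pizza in London",
    "Comprehensive Optimisation Methodologies for Organisational Visibility",
]
for title in titles:
    print(f"{flesch_reading_ease(title):7.1f}  {title}")
```

The shorter, plainer title scores far higher. Scores like these could then be used as one predictor variable among many, not as a ranking factor in their own right.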

The possibilities are myriad and varied, but the fact remains that data science can help ensure that you’re creating the right type of content, and then making it available to your audience in the ways and locations that they prefer. Ultimately, that will boost performance, brand recognition, conversion and more.

Analysing the Competition

It’s essential that you know where you stand with regard to competitors. This is simple due diligence. However, that doesn’t mean that it’s easy to do. How do you determine what changes they’re making? How do you tell what’s working for them and what’s not, and how that compares with your own efforts? Again, data science can be applied to let you approach this conundrum scientifically and logically.

It can allow you to map changes made to your competitors’ websites, as well as other things, such as customer or visitor feedback regarding the business. Based on this information, you can extrapolate whether or not those same changes or activities would be beneficial for your own business through a direct, side-by-side comparison. You can also determine the underlying factors in your competitors’ successes or failures, and incorporate what helps them succeed, while avoiding the causes of failure.

There are many different factors that must be compared when analysing your competitors. For instance, you need to know their approximate position above or below you in the SERPs for specific keywords. You also need to ensure that they follow a similar business model. As an example, a traditional software developer and vendor might not be a particularly close competitor to a SaaS provider, even though they might offer very similar options. You also need to exclude any non-industry websites that contain similar information (think Wikipedia). This whole process begins with identifying the top-performing websites for every single keyword or keyword phrase you target, and then monitoring their performance for a month.
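A minimal sketch of that monitoring step might look like the following. It assumes you already have daily position observations from some rank-tracking source. The domains, keyword and positions are invented, and `average_positions` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily observations: (day, keyword, domain, position).
observations = [
    (1, "pizza london", "yoursite.example", 6),
    (1, "pizza london", "rival.example", 3),
    (1, "pizza london", "wikipedia.org", 1),
    (2, "pizza london", "yoursite.example", 5),
    (2, "pizza london", "rival.example", 4),
    (2, "pizza london", "wikipedia.org", 1),
]
EXCLUDED = {"wikipedia.org"}  # non-industry sites, as the text suggests

def average_positions(rows, excluded=EXCLUDED):
    """Mean position per (keyword, domain), skipping excluded domains."""
    grouped = defaultdict(list)
    for _day, keyword, domain, position in rows:
        if domain not in excluded:
            grouped[(keyword, domain)].append(position)
    return {key: mean(values) for key, values in grouped.items()}

ranked = sorted(average_positions(observations).items(), key=lambda kv: kv[1])
for (keyword, domain), avg in ranked:
    print(f"{keyword:15} {domain:20} {avg:.1f}")
```

Over a full month of observations, the same grouping shows which competitors sit consistently above or below you for each keyword.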

Technical Issue Impact on Your Business

Technology is all well and good, but it doesn’t always work the way it is supposed to. You’ve been there – unforeseen website outages, pages not displaying, garbled links and more are all part and parcel of being online. However, it can be difficult to assess the impact of these technical issues on your website. For instance, what impact does an improperly loading page have on your success? What about a 404 page error being displayed when your social media followers attempt to visit the link you provide? What if your site is down for an hour?

Each of these can be crucial, with potentially devastating effects on your business. Amazon estimated that if its pages slowed down by just a single second, it would cost the retail giant billions per year. Of course, chances are good that your business is nowhere near the size of Amazon, but the results can be just as harmful.

Data science can be applied to allow you to drill down into the effects of technical issues on your business. Server logs are the obvious source of this information, but there are many others, particularly for businesses without their own in-house servers. From webpage loading times to website outages and everything in between, this data can help you make informed decisions that affect visibility, reachability, profitability and more.
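As an illustration of mining server logs, the sketch below counts response codes and missing pages. It assumes logs in the Common Log Format. The sample lines are invented, and the regular expression only covers the request and status fields.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line.
LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

sample_log = """\
203.0.113.5 - - [12/Apr/2016:10:01:02 +0000] "GET /pricing HTTP/1.1" 200 5120
203.0.113.9 - - [12/Apr/2016:10:01:07 +0000] "GET /old-offer HTTP/1.1" 404 312
198.51.100.2 - - [12/Apr/2016:10:02:11 +0000] "GET /old-offer HTTP/1.1" 404 312
198.51.100.7 - - [12/Apr/2016:10:03:45 +0000] "GET /blog HTTP/1.1" 500 0
"""

def summarise(log_text):
    """Count responses per status code and missing pages per path."""
    statuses, missing = Counter(), Counter()
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue  # skip lines that do not look like requests
        statuses[match["status"]] += 1
        if match["status"] == "404":
            missing[match["path"]] += 1
    return statuses, missing

statuses, missing = summarise(sample_log)
print(dict(statuses))
print(missing.most_common(1))
```

Joining counts like these against traffic or revenue for the same period is one way to put a figure on what a broken link or an outage actually costs.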

Off-Page SEO

While on-page SEO is crucial, off-page SEO cannot be neglected. Inbound links from other websites are invaluable both for sending you real traffic and for boosting your authority and ranking with Google. However, not all links are created equal, and it can be very difficult to determine which partners are the most valuable to your business. There’s also the problem of separating referral traffic that converts from traffic that bounces, or from visitors that stay on your site for a decent length of time but never take a desired action.

You can manually parse through years of reports from referral sources, but that can be an incredibly daunting task, not to mention the sheer amount of time required. Data science can be applied to determine patterns in referral traffic so that you can identify the most valuable partners. Not only will this help ensure that you’re working with the right ones, but it will also affect your efforts in terms of relationship building now and in the future.
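A simple version of that referral analysis can be sketched as below. The visit records are invented, and treating a single-page visit as a bounce is an assumption made for this example, not a universal definition.

```python
from collections import defaultdict

# Hypothetical visits: (referrer, pages_viewed, converted).
visits = [
    ("partner-a.example", 1, False),
    ("partner-a.example", 1, False),
    ("partner-a.example", 4, True),
    ("partner-b.example", 3, True),
    ("partner-b.example", 5, True),
    ("partner-b.example", 1, False),
]

def referrer_stats(rows):
    """Per referrer: visit count, bounce rate and conversion rate."""
    totals = defaultdict(lambda: {"visits": 0, "bounces": 0, "conversions": 0})
    for referrer, pages_viewed, converted in rows:
        entry = totals[referrer]
        entry["visits"] += 1
        entry["bounces"] += pages_viewed == 1  # single-page visit counts as a bounce
        entry["conversions"] += converted
    return {
        ref: {
            "visits": t["visits"],
            "bounce_rate": t["bounces"] / t["visits"],
            "conversion_rate": t["conversions"] / t["visits"],
        }
        for ref, t in totals.items()
    }

stats = referrer_stats(visits)
for ref, s in sorted(stats.items(), key=lambda kv: kv[1]["conversion_rate"], reverse=True):
    print(f"{ref:20} conv {s['conversion_rate']:.0%}  bounce {s['bounce_rate']:.0%}")
```

Sorting by conversion rate, not by raw visit count, is what brings the most valuable partners to the top.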

The Most Important Consideration

When it comes to data science, whether applied to SEO or not, there are some important considerations to make. The single most important one is the quality of the data used. If inaccurate information is used, then your results will be likewise inaccurate. General, broad-scope information will create general results, and it’s impossible to extrapolate much from this. Detailed, granular, accurate information is absolutely essential, and this is one area where business owners and CEOs often struggle. How do you ensure that the data you’re using is accurate? Is it based on uncontrollable variables? Is it skewed for some other reason? Here are a few of the most important factors that go into determining the quality of the data in question.

Seasonality: Seasonality is a factor that will affect a business’ profitability, as well as search engine rankings. For instance, a business that specialises in winter coats will see their rankings change throughout the course of the year, peaking slightly before winter, and then declining once warmer weather arrives. The same applies to other companies with primarily seasonal offerings – beach balls, outdoor gear, hiking equipment and the like. With that being said, a global reach can dramatically alter this.

Accuracy: The accuracy of the information in question is of paramount importance. That is why one of the first steps data scientists take is “cleansing”. Essentially, this process helps to separate incorrect information from accurate data, allowing it to be eliminated before parsing, modelling and other steps. For instance, a mailing list might contain any number of incorrect names, partial addresses, outdated addresses, and the like. Cleansing would remove those inaccurate entries, leaving a cleaner list for parsing. Obviously, this is a very basic example, but it serves to highlight the importance very well. If those inaccurate entries were to remain in the list, a business could spend untold amounts of money attempting to market to someone who could not be reached with the information in hand.
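The mailing list example can be sketched in a few lines. This is only an illustration: the records are invented, the email pattern is deliberately loose, and real cleansing would normally involve more checks (outdated addresses, for instance, cannot be detected by pattern alone).

```python
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose check

def cleanse(records):
    """Drop entries with a blank name, a malformed email or a repeated email."""
    seen, clean = set(), []
    for record in records:
        name = record.get("name", "").strip()
        email = record.get("email", "").strip().lower()
        if not name or not EMAIL.match(email) or email in seen:
            continue
        seen.add(email)
        clean.append({"name": name, "email": email})
    return clean

# Hypothetical mailing list with typical problems.
mailing_list = [
    {"name": "Ada Jones", "email": "ada@example.com"},
    {"name": "", "email": "nobody@example.com"},         # missing name
    {"name": "Bo Smith", "email": "bo.smith@example"},    # partial address
    {"name": "Ada Jones", "email": "ADA@example.com "},   # duplicate
]
print(cleanse(mailing_list))
```

Only the first record survives. The other three would otherwise have cost money to market to with no chance of reaching anyone new.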

Scale: The general rule here is this – the more accurate information you use, the better. The scale of your data must be equal to the task. For instance, a small sampling of information regarding competitors and their use of the same keywords may or may not be representative. The only way to truly tell how many competitors are using the same keywords as your business is to scale up your inquiry.

Visualisation: Data has very little meaning in its raw state. Yes, you might be able to extrapolate some value from raw numbers, but it is crucial to put business information into visual format so that it can be compared, digested and understood. This applies to even seemingly basic information, such as business performance over the course of a year. It can be understood in a text-based list format, but how much easier is it to understand and then extrapolate further value when it’s put into a pie chart, bar graph or other visual medium?

In Conclusion

When everything is said and done, your SEO efforts could all be for naught if you’re not using data science. Without this invaluable tool, there is no guarantee that you’re seeing the entire picture, and it is more crucial than ever before that businesses large and small make informed choices based on accurate information that not only covers historical occurrences, but is able to help predict future outcomes. This is the role of data science and its use of big data.

Applying data science to the exponentially increasing amount of data available to you allows predictive steps to be taken based on logic and scientific principles. It eliminates the guesswork and ensures that the true cause can be determined, rather than basing actions on mere correlations.

The first step in applying data science to your SEO efforts is to identify the keywords you target that your competitors also target. This should cover all of your competitors and all of the keywords that you’re targeting. Track these for at least 30 days to determine accurate positioning of your competitors both above and below you, and to chart changes in their performance with different keywords.

You also need to ensure that you’re applying data science to your content marketing efforts. By thoroughly analysing the results of your content marketing efforts, you can predict what needs to change in order to see better results. For example, you might find that your particular audience gravitates toward long form on-page content, rather than downloadable content like ebooks or reports. Others might find that their demographic is very locked into social media and only rarely ventures to websites or blogs. Based on this information and the predicted success of a particular change to your efforts, you can take steps to ensure that you’re reaching your audience.

Data science can be applied to all other areas of SEO as well, from page ranking to inbound linking and other off-page SEO strategies. It truly is a “game changer”. However, it is absolutely crucial that the data utilised is completely clean and accurate. Once cleansed and parsed, it must be modelled and then visualised in a format that humans can not only understand, but from which we can extract meaning. Charts, graphs and even entire webpages can be created to highlight the results and ensure that stakeholders are able to take in and digest the most crucial pieces of information.

The real challenge with data science for SEO is that you must form the right hypothesis first. All efforts in this area are driven by the need to prove or disprove that hypothesis. The right hypothesis will lead to accurate answers, and to further questions that can drive your efforts into the future. The wrong hypothesis will lead to inaccurate answers, or even a dead end.

Data science is not something with which most business owners or CEOs have much experience. It’s a highly specialised field that requires in-depth knowledge, expertise and training. Despite its importance to business success in the modern world, it’s a skillset too often lacking from today’s businesses. At Peppersack, we offer cutting-edge data science solutions designed to drive your success forward through predictive solutions based on the most accurate information. We ensure that you’re not tossing away 99.5% of your data, and that you’re able to make informed decisions based on the most accurate predictions. Contact us today to learn more about how we can help you move forward in this increasingly competitive world.

Works Cited:

https://datascience.berkeley.edu/about/what-is-data-science/

http://datascience.nyu.edu/what-is-data-science/

http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/2/#79723ad1318e

http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/#2e9552434872

http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/#3c49afa155da

http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

https://www.quora.com/Can-you-utilize-data-science-for-SEO

http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/2/#3b05a5ab6256

http://www.techrepublic.com/article/toyota-partners-with-microsoft-azure-on-new-data-science-company-toyota-connected/

https://artios.io/making-google-seo-predictable-with-data-science/

http://earnworthy.com/matt-bentley-interview/

Peppersack is a digital marketing agency based in Manchester, UK. Peppersack specialises in Inbound and Content Marketing. We build websites for our clients and support them with a range of services including Campaign Development, SEO, Content Generation, Social Media Marketing, Technical Support and Analytics.