This is where the data to build AI comes from

AI is all about data. Reams and reams of data are needed to train algorithms to do what we want, and what goes into the AI models determines what comes out. But here’s the problem: AI developers and researchers don’t really know much about the sources of the data they are using. AI’s data collection practices are immature compared with the sophistication of AI model development. Massive data sets often lack clear information about what is in them and where it came from. 

The Data Provenance Initiative, a group of over 50 researchers from both academia and industry, wanted to fix that. They wanted to know, very simply: Where does the data to build AI come from? They audited nearly 4,000 public data sets spanning over 600 languages, 67 countries, and three decades. The data came from 800 unique sources and nearly 700 organizations. 

Their findings, shared exclusively with MIT Technology Review, show a worrying trend: AI’s data practices risk concentrating power overwhelmingly in the hands of a few dominant technology companies. 

In the early 2010s, data sets came from a variety of sources, says Shayne Longpre, a researcher at MIT who is part of the project. 

The data came not just from encyclopedias and the web, but also from sources such as parliamentary transcripts, earnings calls, and weather reports. Back then, AI data sets were specifically curated and collected from different sources to suit individual tasks, Longpre says.

Then transformers, the architecture underpinning language models, were invented in 2017, and the AI sector started seeing performance get better the bigger the models and data sets were. Today, most AI data sets are built by indiscriminately hoovering material from the internet. Since 2018, the web has been the dominant source for data sets used in all media, such as audio, images, and video, and a gap between scraped data and more curated data sets has emerged and widened.

“In foundation model development, nothing seems to matter more for the capabilities than the scale and heterogeneity of the data and the web,” says Longpre. The need for scale has also boosted the use of synthetic data massively.

The past few years have also seen the rise of multimodal generative AI models, which can generate videos and images. Like large language models, they need as much data as possible, and the best source for that has become YouTube. 

For video models, over 70% of data for both speech and image data sets comes from one source.

This could be a boon for Alphabet, Google’s parent company, which owns YouTube. Whereas text is distributed across the web and controlled by many different websites and platforms, video data is extremely concentrated in one platform.

“It gives a huge concentration of power over a lot of the most important data on the web to one company,” says Longpre. 

And because Google is also developing its own AI models, its massive advantage also raises questions about how the company will make this data available for competitors, says Sarah Myers West, the co–executive director at the AI Now Institute.

“It’s important to think about data not as though it’s sort of this naturally occurring resource, but it’s something that is created through particular processes,” says Myers West.

“If the data sets on which most of the AI that we’re interacting with reflect the intentions and the design of big, profit-motivated corporations—that’s reshaping the infrastructures of our world in ways that reflect the interests of those big corporations,” she says.

This monoculture also raises questions about how accurately the human experience is portrayed in the data set and what kinds of models we are building, says Sara Hooker, the vice president of research at the technology company Cohere, who is also part of the Data Provenance Initiative.

People upload videos to YouTube with a particular audience in mind, and the way people act in those videos is often intended for very specific effect. “Does [the data] capture all the nuances of humanity and all the ways that we exist?” says Hooker. 

Hidden restrictions

AI companies don’t usually share what data they used to train their models. One reason is that they want to protect their competitive edge. The other is that because of the complicated and opaque way data sets are bundled, packaged, and distributed, they likely don’t even know where all the data came from.

They also probably don’t have complete information about any constraints on how that data is supposed to be used or shared. The researchers at the Data Provenance Initiative found that data sets often have restrictive licenses or terms attached to them, which should limit their use for commercial purposes, for example.

“This lack of consistency across the data lineage makes it very hard for developers to make the right choice about what data to use,” says Hooker.

It also makes it almost impossible to be completely certain you haven’t trained your model on copyrighted data, adds Longpre.

More recently, companies such as OpenAI and Google have struck exclusive data-sharing deals with publishers, major forums such as Reddit, and social media platforms on the web. But this becomes another way for them to concentrate their power.

“These exclusive contracts can partition the internet into various zones of who can get access to it and who can’t,” says Longpre.

The trend benefits the biggest AI players, who can afford such deals, at the expense of researchers, nonprofits, and smaller companies, who will struggle to get access. The largest companies also have the best resources for crawling data sets.

“This is a new wave of asymmetric access that we haven’t seen to this extent on the open web,” Longpre says.

The West vs. the rest

The data that is used to train AI models is also heavily skewed to the Western world. Over 90% of the data sets that the researchers analyzed came from Europe and North America, and fewer than 4% came from Africa. 

“These data sets are reflecting one part of our world and our culture, but completely omitting others,” says Hooker.

The dominance of the English language in training data is partly explained by the fact that the internet is still over 90% in English, and there are still a lot of places on Earth where there’s really poor internet connection or none at all, says Giada Pistilli, principal ethicist at Hugging Face, who was not part of the research team. But another reason is convenience, she adds: Putting together data sets in other languages and taking other cultures into account requires conscious intention and a lot of work. 

The Western focus of these data sets becomes particularly clear with multimodal models. When an AI model is prompted for the sights and sounds of a wedding, for example, it might only be able to represent Western weddings, because that’s all that it has been trained on, Hooker says. 

This reinforces biases and could lead to AI models that push a certain US-centric worldview, erasing other languages and cultures.

“We are using these models all over the world, and there’s a massive discrepancy between the world we’re seeing and what’s invisible to these models,” Hooker says. 

The Download: AI tracking birds, and a pig kidney transplant

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

AI is changing how we study bird migration

In a warming world, migratory birds face many existential threats. Scientists rely on a combination of methods to track the timing and location of their migrations, but each has shortcomings. And there’s another problem: Most birds migrate at night, when it’s more difficult to identify them visually and while most birders are in bed.

For over a century, acoustic monitoring has hovered tantalizingly out of reach as a method that would solve ornithologists’ woes. Now, finally, machine-learning tools are unlocking a treasure trove of acoustic data for ecologists. Read the full story.

—Christian Elliot

This story is from the forthcoming magazine edition of MIT Technology Review, set to go live on January 6—it’s all about the exciting breakthroughs happening in the world right now. If you don’t already, subscribe to receive a copy.

A woman in the US is the third person to receive a gene-edited pig kidney

Towana Looney, a 53-year-old woman from Alabama, has become the third living person to receive a kidney transplant from a gene-edited pig. 

Looney, who donated one of her kidneys to her mother back in 1999, developed kidney failure several years later following a pregnancy complication that caused high blood pressure. She started dialysis treatment in December of 2016 and was put on a waiting list for a kidney transplant soon after.

But it was difficult to find a match. So Looney’s doctors recommended the experimental pig organ as an alternative. After eight years on the waiting list, Looney was authorized to receive the kidney. Read the full story.

—Jessica Hamzelou

Roundtables: The Worst Technology Failures of 2024

Each year, MIT Technology Review publishes a list of the worst technologies of the past 12 months.

Antonio Regalado, our senior editor for biomedicine, sat down to discuss 2024’s worst failures with our executive editor Niall Firth in a subscriber-exclusive online Roundtable event yesterday. Watch their conversation about what made the cut, and to make sure you don’t miss out in the future, subscribe.

MIT Technology Review Narrated: Meet the radio-obsessed civilian shaping Ukraine’s drone defense

Despite it being over 100 years old, radio technology is still critical in almost all aspects of modern warfare—including in the drones that have come to dominate the Russia-Ukraine war. 

Serhii “Flash” Beskrestnov, who has been obsessed with radios since childhood, has become an unlikely hero of the conflict, sharing advice and intel. His work may determine the future of Ukraine, and wars far beyond it.

This is our latest story to be turned into an MIT Technology Review Narrated podcast, which we’re publishing each week on Spotify and Apple Podcasts. Just navigate to MIT Technology Review Narrated on either platform, and follow us to get all our new content as it’s released.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Conspiracy theories are still circulating about those mysterious drones
What are they? And where have they come from? (NY Mag $)
+ Authorities are attempting to quell public hysteria, but theories abound. (WP $)
+ Realistically, they’re probably just standard drones out for a night-time flight. (AP News)

2 AI poses a major threat to the power grid
That’s according to the US industry watchdog, which is feeling the pressure. (FT $)
+ AI’s emissions are about to skyrocket even further. (MIT Technology Review)

3 SpaceX and Elon Musk are under investigation 
US federal agencies are probing their repeated failures to comply with reporting rules. (NYT $)

4 Nvidia has unveiled a tiny, affordable AI supercomputer
Which is handy for roboticists looking to bypass connecting to remote data centers. (Gizmodo)
+ While it’s not the company’s most powerful device, it’s pretty speedy. (WSJ $)
+ Microsoft is gobbling up more of Nvidia’s chips than anyone else. (FT $)
+ Blacklisted Chinese AI chip firms gained access to cutting-edge UK tech. (The Guardian)

5 Bitcoin’s value is rocketing even higher
The industry continues to boom in the wake of Trump’s election victory. (Bloomberg $)
+ So much so, luxury brands are weighing up accepting crypto payments. (Reuters)

6 Hepatitis B is an extremely treatable disease
So why are so many people still dying from it? (New Yorker $)
+ We’re starting to understand the mysterious surge of hepatitis in children. (MIT Technology Review)

7 Earth briefly had a second moon
And scientists believe it originated from the actual moon we know and love. (New Scientist $)

8 The future of deep-sea mining
A set of rules governing how we should do it is highly contentious, and up for debate. (Hakai Magazine)
+ These deep-sea “potatoes” could be the future of mining for renewable energy. (MIT Technology Review)

9 Resist the temptation to outsource your Christmas shopping to a bot 
You never know what you’ll end up with. (Insider $)
+ It’s probably quicker to browse the web yourself. (WP $)

10 Our snacks could soon be designed by AI 🍪
Confectionery giant Mondelez is using the tech to tweak recipes and test new ones. (WSJ $)
+ Forget cookies—this creamy vegan cheese was made with AI. (MIT Technology Review)

Quote of the day

“It takes a lot for an uber-wealthy, creative-type CEO, many of whom lean left, to suck it up and deal with Trump. But what choice do they have?”

—A Washington lobbyist explains to the Financial Times why the steady stream of tech executives paying their respects to US President-elect Donald Trump shows no sign of slowing.

The big story

What does GPT-3 “know” about me?

August 2022

One of the biggest stories in tech is the rise of large language models that produce text that reads like a human might have written it.

These models’ power comes from being trained on troves of publicly available human-created text hoovered up from the internet. If you’ve posted anything even remotely personal in English on the internet, chances are your data might be part of some of the world’s most popular LLMs.

Melissa Heikkilä, MIT Technology Review’s AI reporter, wondered what data these models might have on her—and how it could be misused. So she put OpenAI’s GPT-3 to the test. Read about what she found.

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or tweet ’em at me.)

+ 2024 was a seriously weird year, as evidenced by this completely bonkers list.
+ Who knew Seal was such a grunge head?
+ These Charli xcx Christmas mashups will haunt my dreams forever, and not in a good way.
+ Next summer I feel the need to level up my sandcastle game.

Charts: Global Investor Trends 2024

For its fourth annual “Global Investor Survey,” PwC queried 345 global investors and analysts in September 2024 across various regions, asset classes, and investment approaches. The goal was to understand respondents’ expectations for the companies they invest in and cover and their views on risks and technology, including generative AI.

Thirty-six percent of respondents replied that the companies they own or cover are “highly” or “extremely” exposed to geopolitical conflict and cyber risk.

Investors emphasized the importance of companies adapting business models. Technological change was the top priority, followed by government regulation, shifts in customer preferences, and supply chain instability.

Generative AI will drive significant performance gains without compromising employees’ roles, according to the respondents. Sixty-six percent expect the companies they invest in to achieve productivity boosts from AI within the next year, while 63% foresee revenue growth and 62% predict increased profitability.

Moreover, the data shows that investors are more inclined to view AI as an opportunity than as a challenge.

OpenAI Announces 1-800-ChatGPT via @sejournal, @martinibuster

OpenAI just rolled out voice access to ChatGPT by phone and text access through the WhatsApp messaging system. The new services allow users to talk to and message ChatGPT to get answers. The phone access method enables users with an unstable or no data connection to use ChatGPT from a telephone while on the go, even without a ChatGPT account.

Speak With ChatGPT By Phone

Speaking with ChatGPT only requires setting up ChatGPT as a contact, using the 1-800-ChatGPT phone number, which in digits is 1-800-242-8478. Once it is added to the phone’s contacts list, a user can phone and speak with ChatGPT to get answers.

Screenshot of video presenters pointing downward to a banner that reads Call Toll Free 1-800-ChatGPT

The presenters phoned ChatGPT with an iPhone, an old flip phone, and a rotary dial telephone to demonstrate that an ordinary phone call is all it takes to reach ChatGPT and get answers. You can do it on the road or at home from a landline.

The functionality is currently only available in the United States and is limited to 15 minutes of free calling per month. However, you can also download the ChatGPT app and create an account to speak even longer.

Image of a man speaking with ChatGPT with an old fashioned rotary phone

An example phone call involved asking ChatGPT to explain reinforcement learning as if to a five-year-old.

ChatGPT spoke the following answer:

“Sure! Imagine you have a robot friend and you want to teach it to clean up your room you give it a treat every time it does a good job that’s reinforcement fine-tuning the robot learns to do better by getting rewards.”

ChatGPT On WhatsApp

OpenAI also announced a way to reach ChatGPT with WhatsApp, and it’s available to users anywhere in the world. The demonstration showed the presenters accessing 1-800-ChatGPT on WhatsApp through the mobile phone’s contacts list. But it can also be accessed by scanning the following QR code.

Screenshot Of ChatGPT On WhatsApp QR Code

The WhatsApp experience is currently limited to texting with ChatGPT and users can access it without having an account. OpenAI is working on ways to authenticate the WhatsApp access with a ChatGPT account and to be able to search with images.

Facts About New Access Methods

The new functionalities use the GPT-4o mini model. OpenAI engineers created these new functionalities over the past few weeks, which is pretty amazing.

Watch the announcement of the new ways to interact with ChatGPT:

1-800-ChatGPT

What An Enterprise Client Wants From Their SEO Agency via @sejournal, @danielkcheung

A lasting relationship with an enterprise client is good for business. It gives you authority, it gives your staff exposure to how large organizations work, and there should be a financial element.

But winning the pitch is just the beginning. Many agencies make the grave mistake of misunderstanding their role in the relationship, and it is this: The relationship is transactional – at least, at first.

This is what an enterprise client needs from you:

  • Clear, consistent communication.
  • Alignment with business objectives.
  • Flexibility.
  • Integrity.
  • Operational efficiency and responsiveness.
  • Proactive problem-solving.

In many ways, these six things overlap on a Venn diagram. And when you get it right, we will see you not as an external vendor but as an extension of ourselves.

1. Clear, Concise Communication

Enterprise clients don’t just want emails or reports. They want clarity, alignment, and confidence that you’re on the same page.

If they’re left guessing what’s happening, you’re failing.

Always ask yourself: What is the message you wish to convey?

Oftentimes, less is more.

Instead of a lengthy email, sometimes a 15-minute Teams call will not only address the topic but also the bigger picture and the next steps. But this doesn’t mean long emails don’t have a place—they do.

Context is everything.

I get it – it’s purely subjective from one point-of-contact to the next. But that’s the name of the game.

Be adaptable, and don’t assume. Reach out and ask how your point-of-contact prefers to communicate.

At the start of every engagement, even when I was agency-side, I would ask for clear directions on ways of working.

And lastly, before you hit send, ask yourself: Does this have to be mentioned/asked/challenged right now?

How to do this effectively:

  • Send a pre-read ahead of time. Strive for two business days ahead of time, 24 hours worst case.
  • Include an executive summary at the beginning of every presentation. There is nothing worse than having to skim-read a deck full of slides without an executive summary. I’m busy. My boss is busy; their boss is busy. If you’ve made the effort to create a presentation deck, put a TL;DR at the front for us.
  • Sign up to use the same team collaboration app as your client for quick updates. Most people don’t reply to emails immediately. Instant messaging such as Slack or Teams? Completely different rules. Plus, I speak from personal experience that shooting a Slack message in a dedicated channel takes far less mental bandwidth than crafting an email. The best part is that it works both ways, so win-win!

Here’s the truth: Great communication reduces friction, builds trust, and keeps you in the loop when priorities shift.

2. Alignment With Business Goals

Increasing quarter-on-quarter traffic is nice. Rankings are cool. But if you’re not moving the needle on their actual business goals – revenue, customer retention, market share – you’re just noise.

The biggest lesson I learned when moving from agency-side to client-side was this: If your recommendation doesn’t ladder up directly to business goals, then you’re wasting everyone’s time with your research, audits, and recommendations.

And, as SEO professionals, we default to problem identification mode because that’s how most of us got started. That is, we find all the problems related to a particular pillar (e.g., content, technical SEO, off-page) and mistake this list for the strategy.

This is what I did the first week I started my first enterprise SEO role.

I fired up Screaming Frog and found all the things.

But I had no context as to who was responsible for resolving each issue and what their priorities were.

What may seem like an important SEO activity may not be a business priority.

How to do this effectively:

  • Transparency goes both ways. Just as an enterprise client expects you to be transparent, you can expect the same in return by asking what their strategic pillars are for the quarter or year. To go a step further, get tangible guidance by asking your point-of-contact what their objectives and key results (OKRs) are. Trust me, they’ll have them because that’s corporate life.
  • Bring the right people along the journey. If you wish to propose adding content, ask your point-of-contact what stakeholders are involved in making this change happen. If you’ve discovered crawling, indexing, and rendering issues, ask who can make changes to robots.txt, to the , and to the frontend stack. Chances are, they’re all separate teams who work in their own silos and backlogs.

The lesson is this: Your role is not to fix all the things because you simply cannot. Instead, take a minute to understand who’s who because even your point-of-contact is an advocate, not an executioner.

Why this matters: Clients don’t want SEO in a silo. We want strategies that tie into our biggest priorities.

Why? Because our performance review and bonuses rely on them. So, speak our language and show us how SEO helps us win.

3. Flexibility

Markets change. Leadership changes direction. Enterprise clients want partners who can adapt to their evolving needs without skipping a beat.

I don’t care that you’ve sunk 80 hours into something I asked you to do. I only care about what is top-of-mind right now.

It’s nothing personal. I probably feel frustrated, just as you do. But priorities shift, so learn to go with the flow and be an asset instead of a blocker.

When you’ve got my back, I’ve got yours because we’re in this together.

The key: Be agile. Show that you’re not just a plan-follower but a partner who can pivot without losing focus on results.

4. Integrity

Integrity is the currency of trust. Enterprise clients need to know they can rely on you to tell the truth, even when it’s inconvenient or uncomfortable.

If there’s a mistake, own it. If timelines slip, address it early. If you think the client’s ask won’t work, say so – and back it up with data or reasoning. The worst thing you can do is over-promise and under-deliver.

Recently, a vendor blamed their lack of access to a SharePoint file.

Perhaps it was true; SharePoint can be fickle with external vendors. But the fact that this was their explanation when I asked why there was a delay in the delivery disappointed me greatly.

In my mind, I assumed they were overextended and did not get around to the task.

There’s a really easy fix to this: Every time your client shares a file with you, open it and see if you have the required access. Don’t wait two weeks because that’s too late and sends a very bad message.

Similarly, not all campaigns go to plan. For example, perhaps your digital PR campaign didn’t produce the results you expected. That’s fine.

The second worst thing you can do is lie about it. The worst thing you can do is buy backlinks to pad the numbers.

Enterprise search marketers know that there are no guarantees with Google. What my boss, their boss, and their boss expect are learnings.

What did we learn from this exercise?

What can we do better next time?

Did we document what we didn’t plan and why in a wiki so that we don’t make the same investment in something that doesn’t work?

The flip side of this is to stand up for yourself because I’m not looking for a lackey. Not every idea I come up with is appropriate, and I expect – no, rely on – you to tell the truth even when it’s inconvenient or uncomfortable.

5. Operational Efficiency And Responsiveness

Enterprise projects are a symphony of moving parts, and delays in one area can cascade into chaos. Your job is to deliver fast, precise work while minimizing bottlenecks.

I think most enterprise SEO professionals will agree with me on this – my calendar is full. On some days, it’s literally back-to-back meetings with different stakeholders.

I don’t have the time or mental bandwidth to hold your hand.

When I give you a task, speed matters – not just for execution but for acknowledgment. A simple “We’re on it, here’s when you can expect an update” goes a long way in showing you’re reliable.

Efficiency isn’t just about working quickly – it’s about working smart. Streamline processes, remove redundancies, and bring structure to chaos. Help us feel like we’re in capable hands, no matter how derailed the project gets.

The play: Deliver fast, precise work, and be responsive. They’ll keep coming back to the agency that gets it done.

6. Proactive Problem-Solving Without Access To First-Party Data

Enterprise clients often operate in silos, and as an external agency, you’re rarely handed direct access to our analytics platforms.

Limited access to first-party data is the norm, but that doesn’t excuse you from identifying issues or presenting solutions.

The best agencies thrive under constraints. If you don’t have access to first-party data, get creative with proxies. Use publicly available tools, competitive analysis, and trend data to craft recommendations.

When possible, suggest ways the client can share aggregated insights or anonymized data that protect internal policies while giving you enough to work with.

Your goal is to demonstrate that you can solve problems without needing to see everything. And if data gaps are creating risks, flag them early.

Be proactive in suggesting solutions, such as data clean rooms or integrations that can provide the insights you need without breaching compliance.

Do this instead: Leverage external data sources.

If there’s one thing I know from working agency-side, it is that you have access to all the tools. So, pull insights from other data sources to identify patterns and opportunities.

Why this matters: First-party data is a privilege, not a given. By showing that you can deliver value despite limitations, you position yourself as a resilient, resourceful partner who doesn’t let obstacles stand in the way of results.

The 6 Pillars Of Enterprise SEO Success

Enterprise clients don’t just want vendors. We want long-term partners because procurement and onboarding are painful.

Here’s what we value most:

  • Clear communication that aligns and inspires.
  • Strategies tied directly to business outcomes.
  • Agility in the face of shifting priorities.
  • Integrity and transparency in every interaction.
  • Efficiency that respects our time and resources.
  • Creative problem-solving that delivers results, even under constraints.

Master these six pillars, and you’ll become more than a service provider. You’ll be a partner we fight to keep.


Featured Image: wee dezign/Shutterstock

GA4 Metrics Every Advertiser Should Pay Attention To via @sejournal, @timothyjjensen

While paying attention to metrics in ad platforms is crucial to the success of any online advertising initiative, you can’t ignore what users are doing after they click the ad.

Sure, you can measure the website conversions in your ad platforms, but what else are people doing on your site that could be informative to your campaigns?

Google Analytics can help you gain insight into the steps beyond the initial click, answering questions such as:

  • How much time are users spending on your landing page, and are they looking at other pages on your site?
  • How many paid users are coming to your site for the first time, and how many have previously been on the site before interacting with ads?
  • What other channels have led them to your site in addition to paid?
  • Are they watching the videos you’ve embedded in your site?
  • What percentage of users are adding items to their carts and not checking out right away?

In this article, let’s take a look at several key metrics in Google Analytics 4 (GA4) that can help with these questions and more.

Key Event Counts & Rates

Key events in GA4 correlate to what you consider your primary business success metrics. These will vary based on your goals but may include lead form submissions, online account creation, purchases, or event registrations, to name a few options.

Confusingly, while key events may match what you think of as “conversions” in other channels, GA4 currently reserves the “conversions” nomenclature specifically for Google Ads conversions tracked via a linked account.

Events are a core functionality in GA4, with every action a user takes on the site potentially correlating to an event (from page views to form submissions).

However, you should think carefully about what events actually matter to your business bottom line to be marked as a key event.

Additionally, the key event rate is another metric you should consider. When looking at the session level, you will notice what percentage of sessions resulted in a key event taking place.
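
As a rough illustration of the session-level rate described above, the sketch below computes the share of sessions with at least one key event, broken out by channel. The session records and field names (`channel`, `key_events`) are hypothetical stand-ins for whatever an export of your own data looks like; this is not GA4's API.

```python
from collections import defaultdict

# Hypothetical session records (assumed shape, not a GA4 export format).
sessions = [
    {"channel": "Paid Search", "key_events": 1},
    {"channel": "Paid Search", "key_events": 0},
    {"channel": "Paid Search", "key_events": 2},
    {"channel": "Paid Search", "key_events": 0},
    {"channel": "Paid Social", "key_events": 0},
    {"channel": "Paid Social", "key_events": 1},
    {"channel": "Paid Social", "key_events": 0},
    {"channel": "Paid Social", "key_events": 0},
]


def session_key_event_rate_by_channel(sessions):
    """Return {channel: share of sessions with at least one key event}."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for s in sessions:
        totals[s["channel"]] += 1
        # A session counts once, no matter how many key events it had.
        if s["key_events"] > 0:
            converted[s["channel"]] += 1
    return {ch: converted[ch] / totals[ch] for ch in totals}


rates = session_key_event_rate_by_channel(sessions)
for channel, rate in rates.items():
    print(f"{channel}: {rate:.0%}")
```

Note that a session with two key events still counts once toward the rate, which is why the rate and the raw key event count can tell different stories.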

GA4 Acquisition Report. Screenshot from Google Analytics, November 2024

When looking at key events, you have a few useful ways to incorporate them, including:

  • View event counts and event rates by channel and by source/medium. For instance, you can compare key event rates between paid search and paid social to see which is more likely to yield qualified visits.
  • Look at performance by landing page to see which entry points attract the users most likely to take action. Are there any pages with low session volume and high key event rates that may be worth promoting more?
  • Filter specific key events to compare which ones have the highest volume and event rates. For instance, if you offer both a demo request and a free trial, you can compare which drives the most interest from paid search vs. paid social.
  • Compare attribution models (Advertising > Attribution > Attribution Models) to see how many key events are attributed to each source and channel when using a last-click model vs. a data-driven one.
    • Last-click attribution credits the key event to the last non-direct source by which a user arrived on the site.
    • Data-driven attribution distributes credit between sources based on your account’s prior data. Factors may include time between visits from various sources, number of interactions, devices, and more. While this is, unfortunately, a “black box” model on Google’s part, it will help to weigh more toward sources that may have influenced consideration when users visit your site multiple times before taking action.
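
To make the last-click rule concrete, here is a minimal sketch that credits each key event to the last non-direct source in a user's path. The paths and source labels are invented for illustration, and the function ignores details a real implementation would need (such as lookback windows). Data-driven attribution is not sketched because, as noted above, its weighting is not public.

```python
from collections import Counter

DIRECT = "(direct)"  # label assumed here for direct visits


def last_click_source(touchpoints):
    """Return the last non-direct source in an ordered list of sources.

    Falls back to direct when every visit in the path was direct.
    """
    for source in reversed(touchpoints):
        if source != DIRECT:
            return source
    return DIRECT


# Hypothetical paths: ordered sources for users who triggered a key event.
paths = [
    ["paid_social", "paid_search", DIRECT],
    ["paid_search"],
    ["organic", "paid_social"],
    [DIRECT],
]

credit = Counter(last_click_source(p) for p in paths)
print(credit)
```

In the first path, paid social introduced the user but receives no credit under last-click, which is the kind of earlier influence a data-driven model is meant to surface.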

General Event Counts & Rates

While not every event should be considered a key event, you should take the time to look at other events that can give clues to user engagement on the site.

If you’ve turned on Enhanced Measurement, you can see events for actions such as scroll activity, file downloads, outbound clicks, and video interaction (for embedded YouTube videos).

While the specific application of these events will vary based on how your site is set up, here are a few ways they could be used in your analysis:

  • Determine if it is worth including an embedded video on your landing page. Are people watching the video, and if so, how far are they viewing on average? Additionally, are users who watch the video also more likely to submit a lead form or complete a purchase in the same session?
  • Weigh the importance of content below the fold on your landing page. Are a decent percentage of people bothering to scroll, or are most just viewing what is immediately visible when reaching the site?
  • Assess the value of downloadable content. If you’re offering a PDF such as an ebook or spec sheet, what percentage of people are clicking to download it?

To take these out-of-the-box events a step further, you can also set up custom events for more advanced tracking.

Ecommerce Metrics

For those promoting ecommerce sites, GA4 offers a robust set of metrics that allow you to analyze the full path of purchase behavior.

Once you’ve set up your site to incorporate ecommerce events, you can view them in Reports under Life Cycle > Monetization or Business Objectives > Sales. (Note that GA4 still displays the Life Cycle Collection instead of Business Objectives as default in certain circumstances.)

GA4 Ecommerce Report (screenshot from Google Analytics, November 2024)

Here are a few important metrics you should be paying attention to:

  • Transactions: The total number of purchases
  • Revenue: Get an idea of the income from on-site purchases, and use it to calculate return on ad spend (ROAS).
  • Add to cart: Understand how many users are expressing enough interest to add an item to their cart, and look at abandonment rates during the checkout process.
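
The two calculations mentioned above can be sketched as follows. The figures are made up, and the abandonment formula (one minus purchases divided by add-to-cart events) is a common simplification rather than a GA4-defined metric.

```python
def roas(revenue, ad_spend):
    """Return on ad spend: revenue earned per unit of ad spend."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    return revenue / ad_spend


def cart_abandonment_rate(add_to_carts, purchases):
    """Share of add-to-cart events that did not end in a purchase."""
    if add_to_carts == 0:
        return 0.0
    return 1 - purchases / add_to_carts


# Hypothetical monthly figures for one campaign.
print(roas(revenue=12000, ad_spend=3000))                     # 4.0
print(cart_abandonment_rate(add_to_carts=400, purchases=100)) # 0.75
```

A ROAS of 4.0 means four units of revenue for every unit spent. Whether that is profitable depends on margins, which this sketch does not cover.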

Engagement Metrics

GA4 introduced new metrics to assess how much interest users are showing while on your site, offering a more robust approach to measurement than the much-maligned bounce rate that was omnipresent in previous reporting.

In order to qualify as an “engaged session,” a session needs to last longer than 10 seconds, include two page views or screen views, or have a key event fire.

GA4 Engagement Overview (screenshot from Google Analytics, November 2024)

Looking at engaged sessions in addition to total sessions will offer a more accurate picture of how often people spent at least enough time on the site to absorb some of the content vs. immediately leaving.

The engagement rate will show you what percentage of sessions fit the “engaged” criteria, offering clues as to which landing pages and sources will most likely drive qualified individuals.
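
The criteria above translate into a simple check. This is a sketch only: the `Session` fields are hypothetical names, and "two page views or screen views" is read here as at least two.

```python
from dataclasses import dataclass


@dataclass
class Session:
    duration_seconds: float  # time spent in the session
    views: int               # page views plus screen views
    key_events: int          # number of key events fired


def is_engaged(session):
    """Apply the three criteria; meeting any one is enough."""
    return (
        session.duration_seconds > 10
        or session.views >= 2
        or session.key_events >= 1
    )


def engagement_rate(sessions):
    """Engaged sessions divided by total sessions."""
    if not sessions:
        return 0.0
    return sum(is_engaged(s) for s in sessions) / len(sessions)


sample = [
    Session(duration_seconds=4, views=1, key_events=0),   # not engaged
    Session(duration_seconds=25, views=1, key_events=0),  # long enough
    Session(duration_seconds=6, views=3, key_events=0),   # multiple views
    Session(duration_seconds=8, views=1, key_events=1),   # key event fired
]
print(engagement_rate(sample))  # 0.75
```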

Another metric is average engagement time, showing the average amount of time spent on the site either by session (visit level) or by user (individual level), depending on the report you are viewing.

Finally, you can view new vs. returning users to get an idea of which channels will most likely drive people to the site for the first time vs. those who have previously interacted with it.

Of course, note that these metrics aren’t perfect (with cross-device users and privacy settings complicating accuracy), but they can at least give you a rough idea.

Additionally, be mindful that channels where you’ve focused more heavily on retargeting will naturally drive more returning users.

Ad Platform Integrated Metrics

If you have a Google Ads account, you should link it to your GA4 account in order to automatically pass through metrics from your campaigns. This will offer more robust data than just relying on UTM parameters.

To ensure Google Ads data is flowing correctly, make sure you have admin access to the account, turn on auto-tagging, and link the proper Ads account ID to the correct GA4 property.

GA4 Advertising Report (screenshot from Google Analytics, November 2024)

You can see this data correlated with GA4 key events under Advertising > Planning > Google Ads.

If desired, you can import cost data from non-Google ad platforms and view corresponding metrics in the Planning > All Channels report.

GA4 Conversions Report (screenshot from Google Analytics, November 2024)

Additionally, the Advertising > Conversion Performance report allows you to select a Google Ads account and Ads-based conversions to view counts by various breakdowns. This lets you compare totals for these conversions from Ads vs. other channels.

As an added bonus, you can also see a few GA4 metrics directly within the Ads interface if you select the option to “Import app and web metrics” when setting up your link.

The % engaged sessions (the percentage of total sessions qualifying as “engaged”), events/sessions, and average engagement duration are the three available as of the time of publishing.

These can be useful to get a quick view of which campaigns, ads, etc., will most likely attract users willing to spend time on your site.

As a side note, you should not be overly concerned about matching up sessions and click totals perfectly. These can vary for a number of reasons:

  • A session is only counted when a page is viewed and any previous session has timed out. By default, if a user returns to the site within 30 minutes of their last activity, they will still be within the same session.
  • A user could click and leave the site before the GA4 code has time to fire, in which case the click would be counted, and a session would not register.
  • Some types of Google Ads campaigns count clicks for actions that may not entail visiting a website. For instance, Demand Gen campaigns include clicks to open Gmail ads.
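
The first point is easier to see with a small example. The sketch below groups one user's page views into sessions using a 30-minute inactivity timeout. The timestamps are invented, and the logic is a simplification for illustration, not GA4's actual sessionization code.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # default timeout described above


def count_sessions(hit_times, timeout=SESSION_TIMEOUT):
    """Count sessions for one user from page view timestamps."""
    sessions = 0
    previous = None
    for current in sorted(hit_times):
        # A new session starts on the first hit or after a long enough gap.
        if previous is None or current - previous > timeout:
            sessions += 1
        previous = current
    return sessions


# Hypothetical user who clicked an ad three times in one morning.
hits = [
    datetime(2024, 11, 1, 9, 0),
    datetime(2024, 11, 1, 9, 20),  # 20 minutes later: same session
    datetime(2024, 11, 1, 10, 5),  # 45 minutes later: new session
]
print(count_sessions(hits))  # 2
```

Here, three ad clicks produce only two sessions, so the click and session totals would not match even though nothing is broken.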

Start Analyzing Your Paid Traffic

Now that we’ve reviewed several important GA4 metrics, think about how you can apply this data when managing your PPC campaigns.

Understanding the metrics available to you is one important step in mastering GA4, but being able to segment data and understand context is the other crucial step.

Be sure to review these metrics both at the channel and source level, as well as for individual landing pages you’re pushing traffic to.

Wherever possible, incorporate takeaways from GA4 into your PPC reporting as well to show insight beyond the ad platform data.


Featured Image: PeopleImages.com – Yuri A/Shutterstock

Get instant clarity on your SEO with the new Yoast SEO Dashboard!

With the new Yoast SEO Dashboard, you can see how your site is doing at a glance. Instead of hunting through pages, you’ll find all your key metrics in one place. It’s easy to spot which posts need attention, see where you can improve, and figure out what to tackle first so you can spend more time refining your content and less time searching for data.

We know it can be challenging to improve your content when you have to dig through every post and page. That’s why we created a dashboard that instantly shows your site’s SEO performance. It streamlines the process so you can focus on what matters: making your content shine.

What you get from the Yoast SEO Dashboard:

  • Top-level overview of your SEO: see critical insights at a glance without hunting through individual pages
  • Filterable views of SEO and readability scores: quickly spot where you can make the most significant improvements

Designed for clarity and direction, our dashboard makes it easy to check in, prioritize your next steps, and enhance your content strategy immediately. It’s straightforward, efficient, and all about helping you work smarter.

How to access the Yoast SEO Dashboard:

To access the Yoast SEO Dashboard, you just need to:

  1. Ensure your Yoast SEO is up-to-date: In your WordPress admin area, go to “Plugins” and update Yoast SEO to the latest version
  2. Navigate to the Dashboard: Click on “Yoast SEO” in your WordPress sidebar to land on the dashboard, or click on “Dashboard”
  3. Explore your insights: Review the overview, filter your scores, and start working through your task list

Ready to see how much simpler managing your SEO can be?
Get Yoast SEO today and let your new dashboard guide you toward better performance.

10 Hosting Trends Agencies Should Watch In 2025

This post was sponsored by Bluehost. The opinions expressed in this article are the sponsor’s own.

Which hosting service is best for agencies?

How do I uncover what will be best for my clients in 2025?

What features should my hosting service have in 2025?

Hosting has evolved well beyond keeping websites online.

Hosting providers must align their services to meet clients’ technological needs and keep up with constantly changing technological advances.

Today, quality hosting companies must focus on speed, security, and scalability. Staying ahead of hosting trends is critical to maintaining competitive offerings, optimizing workflows, and meeting client demands.

So, what should you watch for in 2025?

The next 12 months promise significant shifts in hosting technologies, with advancements in AI, automation, security, and sustainability leading the way.

Understanding and leveraging these trends enables agencies and professionals to provide better client experiences, streamline operations, and reduce the negative effects of future industry changes.

Trend 1: Enhanced AI & Automation Implemented In Hosting

AI and automation are already transforming hosting, making it smarter and more efficient for service providers, agencies, brands, and end-point customers alike.

Hosting providers now leverage AI to optimize server performance, predict maintenance needs, and even supplement customer support with AI-driven features like chatbots.

As a result, automating routine tasks such as backups, updates, and resource scaling reduces downtime and the need for manual intervention. These innovations are game-changing for those managing multiple client sites and will become increasingly important in 2025.

It only makes sense.

Automated systems free up valuable time, allowing you more time to focus on strategic growth instead of tedious maintenance tasks. AI-powered insights can also identify performance bottlenecks, enabling you to address issues before they impact your website or those of your clients.

Agencies that adopt these technologies this year will not only deliver exceptional service but also be able to position themselves as forward-thinking.

Bluehost embraces automation with features like automated backups, one-click updates, and a centralized dashboard for easy site management. These tools streamline workflows, enabling agencies and professionals to manage multiple sites with minimal effort while ensuring optimal performance.

Trend 2: Multi-Cloud & Hybrid Cloud Solutions Are Now Essential

In 2025, as businesses demand more flexibility and reliability from their online infrastructure, multi-cloud and hybrid cloud solutions will become essential in the hosting world.

These approaches offer the best of both worlds:

  • The ability to leverage multiple cloud providers for redundancy and performance.
  • The option to combine public and private cloud environments for greater control and customization.

For agencies managing diverse client needs, multi-cloud and hybrid cloud strategies provide the scalability and adaptability required to meet modern demands. Multi-cloud solutions allow agencies to distribute their clients’ workloads across multiple cloud providers, ensuring that no single point of failure disrupts their operations.

This feature is particularly valuable for agencies with high-traffic websites, where downtime or slow performance can have a significant impact on revenue and user experience. Hybrid cloud solutions, on the other hand, let agencies blend the scalability of public clouds with the security and control of private cloud environments.

This service is ideal for clients with sensitive data or compliance requirements, such as ecommerce or healthcare businesses.

Bluehost Cloud provides scalable infrastructure and tools that enable agencies to customize hosting solutions to fit their clients’ unique requirements. Our cloud hosting solution’s elastic architecture ensures that websites can handle sudden traffic spikes without compromising speed or reliability.

Additionally, our intuitive management dashboard allows agencies to easily monitor and allocate resources across their client portfolio, making it simple to implement tailored solutions for varying workloads.

By adopting multi-cloud and hybrid cloud strategies, agencies can offer their clients enhanced performance, improved redundancy, and greater control over their hosting environments.

With our scalable solutions and robust toolset, agencies can confidently deliver hosting that grows with their clients’ businesses while maintaining consistent quality and reliability. This flexibility not only meets today’s hosting demands but also helps position your agency for long-term success in a rapidly evolving digital landscape.

Trend 3: Edge Computing & CDNs Replace AMP For Improving Website Speed

As online audiences grow, the demand for faster, more responsive websites has never been higher. Edge computing and Content Delivery Networks (CDNs) are at the forefront of this evolution, enabling websites to reduce latency significantly. For agencies managing clients with diverse and international audiences, these technologies are crucial for improving user experience and ensuring website performance remains competitive.

Edge computing brings data processing closer to the end user by leveraging servers located at the “edge” of a network, reducing the time it takes for information to travel.

Combined with CDNs that cache website content on servers worldwide, these technologies ensure faster load times, smoother navigation, and better performance metrics.

These features are especially beneficial for media-heavy or high-traffic websites, where even a slight delay can impact engagement and conversions.

Bluehost integrates with leading CDN solutions to deliver content quickly and efficiently to users across the globe. By leveraging a CDN, Bluehost ensures that websites load faster regardless of a visitor’s location, enhancing user experience and SEO performance.

This integration simplifies the optimization of site speed for agencies with multiple clients. By adopting edge computing and CDN technology, you can help your clients achieve faster load times, improved site stability, and higher customer satisfaction.

Bluehost’s seamless CDN integration enables you to deliver a hosting solution that meets the expectations of a modern, global audience while building trust and loyalty with your clients.

Trend 4: Core Web Vitals & SEO Hosting Features Make Or Break Websites

Core Web Vitals play an important role in today’s SEO, as Google is increasingly emphasizing website performance and user experience in its ranking algorithms. Today, loading speed, interactivity, and visual stability impact a site’s ability to rank well in search results and keep visitors engaged.

That means optimizing Core Web Vitals isn’t just an SEO task for agencies managing client websites. Fast load times and responsive design are critical parts of delivering a high-quality digital experience. For example, metrics like Largest Contentful Paint (LCP), which measures how quickly a page’s main content loads, depend heavily on hosting infrastructure.
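
For a sense of what "optimized" means in practice, the sketch below buckets LCP measurements using the thresholds Google publishes for this metric (good at 2.5 seconds or less, poor above 4 seconds). The sample timings and site names are hypothetical.

```python
def classify_lcp(seconds):
    """Bucket a Largest Contentful Paint time using Google's thresholds."""
    if seconds <= 2.5:
        return "good"
    if seconds <= 4.0:
        return "needs improvement"
    return "poor"


# Hypothetical LCP readings (in seconds) for three client sites.
readings = {"client-a": 1.8, "client-b": 3.2, "client-c": 5.1}
for site, lcp in readings.items():
    print(site, classify_lcp(lcp))
```

Because slow server response delays everything that follows, a site sitting in the middle bucket is often a reasonable candidate for a hosting or caching review before deeper front-end work.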

Agencies need hosting solutions optimized for these metrics to ensure their clients’ sites stay competitive in the SERPs.

Bluehost offers a WordPress-optimized hosting environment with features specifically designed to improve load times and server response speeds. From advanced caching technology to robust server architecture, Bluehost ensures that sites meet Core Web Vitals standards with ease.

Additionally, our hosting solutions include tools for monitoring site performance, allowing agencies to proactively address any issues that could impact rankings or user experience.

By prioritizing Core Web Vitals and leveraging SEO-focused hosting features, agencies can enhance their clients’ visibility, engagement, and overall online success. With Bluehost’s optimized hosting solutions, you’ll have the tools and infrastructure needed to deliver fast, stable, and high-performing websites that delight users and search engines.

Trend 5: Sustainable Hosting Practices Help Reduce Energy Consumption

Sustainability is no longer just a buzzword. It’s a key consideration for businesses and agencies alike. As 2025 progresses, more clients will prioritize environmentally conscious practices, and hosting providers will step up to offer greener solutions, such as energy-efficient data centers and carbon offset programs.

Migrating to a sustainable hosting provider not only supports client values but also demonstrates a commitment to responsible business practices, which will resonate more with consumers in 2025 than ever before.

Efficient hosting practices reduce energy consumption and create a more sustainable digital ecosystem. They also allow you to help clients meet their environmental goals without compromising on performance.

These benefits are especially valuable for clients with higher energy and performance demands, such as those in ecommerce, media-heavy, or high-traffic industries.

Bluehost has long been recognized as a trusted hosting provider that operates with efficiency in mind.

Our robust, energy-efficient infrastructure already aligns with the sustainability goals of environmentally conscious clients.

In addition, our long-standing reputation, proven history with WordPress, and demonstrable reliability enhance your clients’ sustainability objectives, ensuring they can operate responsibly and confidently.

By choosing sustainable hosting practices and partners like Bluehost, you can contribute to a greener digital future while reinforcing your clients’ environmental commitments and strengthening client relationships by aligning with their values.

Trend 6: Security Must Be A Core Offering

Security is a non-negotiable priority for any website. Cyber threats like data breaches, malware, and DDoS attacks are on the rise, and the consequences of a breach, including lost revenue, damaged reputations, and potential legal issues, can devastate clients. As a result, offering secure hosting solutions with proactive security measures is essential to safeguarding clients’ businesses and building trust.

These key features include SSL certificates, which protect sensitive data while boosting SEO rankings and user trust, and regular malware scans to prevent vulnerabilities.

They should also include automated backups that enable quick restoration in the event of a crash or attack and provide comprehensive protection and peace of mind. Essential security features are standard in Bluehost hosting plans, including SSL certificates, daily automated backups, and proactive malware scanning.

These built-in tools eliminate the need for additional solutions, added complexity, or costs. For agencies, our security features reduce risks for your clients and provide peace of mind.

By choosing a hosting provider like Bluehost, you can prioritize client security, reinforce client trust, and minimize emergencies, allowing you to avoid spending time and resources addressing threats or repairing damage.

In short, by partnering with Bluehost, security becomes a core part of your agency’s value proposition.

Trend 7: Hosting Optimized For AI & Machine Learning Is Key To High Visibility On SERPs

As artificial intelligence and machine learning become increasingly integrated with websites and applications in 2025, hosting providers must keep pace with the increasing demands these technologies place on infrastructure.

AI-driven tools like chatbots, recommendation engines, and predictive analytics require significant computational power and seamless data processing.

AI and machine learning applications often involve handling large datasets, running resource-intensive algorithms, and maintaining real-time responsiveness. Hosting optimized for these needs ensures that websites can perform reliably under heavy workloads, reducing latency and downtime and delivering consistent performance.

If you plan to be successful, you’ll also require scalable hosting solutions. These solutions allow resources to expand dynamically with demand, accommodate growth, and handle traffic surges.

Bluehost’s scalable hosting is built to support advanced tools and applications, making it an ideal choice for agencies working on AI-driven projects. Our robust infrastructure delivers consistent performance, and its flexibility allows you to scale easily as your clients’ needs evolve. By leveraging Bluehost, agencies can confidently deliver AI-integrated websites that meet modern performance demands.

Trend 8: Managed Hosting Helps You Focus More On Profits

In 2025, websites will become increasingly complex. Businesses will require higher performance and reliability, and everyone will be looking to operate as lean and efficiently as possible. These trends mean managed hosting will become the go-to solution for agencies and their clients.

Managed hosting shifts time-intensive technical maintenance away from agencies and business owners by including features such as automatic updates, performance monitoring, and enhanced security. In short, managed hosting enables you to simplify workflows, save time, and deliver consistent quality to your clients.

These hosting services are particularly valuable for WordPress websites, where regular updates, plugin compatibility checks, and security enhancements occur frequently but are essential to maintaining optimal performance.

Managed hosting also typically includes tools like staging environments, which allow agencies to test changes and updates without risking disruptions to live sites and ensure you can deliver a seamless experience to clients.

Bluehost offers managed WordPress hosting that includes automatic updates, staging environments, and 24/7 expert support. These features allow you to handle technical details efficiently while focusing on delivering results for your clients without added stress or time.

Trend 9: The Shift Toward Decentralized Hosting Boosts Your Brand’s Longevity

In 2025, expect to see decentralized hosting gain attention as a futuristic approach to web hosting. Like Bitcoin and similar advancements, the technology leverages blockchain technology and peer-to-peer networks to create hosting environments that prioritize privacy, resilience, and independence from centralized control.

While this model appears to provide exciting new opportunities, it’s still in the early stages. It faces challenges in scalability, user-friendliness, and widespread adoption, so agencies typically can’t rely on it for client sites yet.

Decentralized hosting may become a viable option for specific use cases, such as privacy-focused projects or highly distributed systems. However, centralized hosting providers still offer the best balance of reliability, scalability, and accessibility for most businesses and agencies today.

For these reasons, agencies managing client websites will continue to focus on proven, reliable hosting solutions that deliver consistent performance and robust support.

So, while decentralized hosting may gain traction this year, Bluehost will continue to provide a trustworthy hosting environment designed to meet the needs of modern websites. With a strong emphasis on reliability, scalability, and user-friendly management tools, we offer a proven solution agencies can depend on to deliver exceptional client results.

Trend 10: Scalable Hosting For High-Growth Websites Is Key For Growth

As businesses grow, their websites will experience increasing traffic and resource demands. High-growth websites, such as e-commerce platforms, content-heavy blogs, or viral marketing campaigns, require hosting solutions that can scale instantly. And scalable hosting is critical to delivering consistent user experiences and avoiding downtime during peak periods.

Scalable hosting like Bluehost ensures your clients’ websites can easily adjust resources like bandwidth, storage, and processing power to meet fluctuating demands. Our scalable hosting solutions are designed for high-growth websites. Our unmetered bandwidth and infrastructure were built to handle traffic surges, ensuring websites remain fast and accessible.

These features make us the ideal choice for agencies looking to future-proof their clients’ hosting needs.

As the digital landscape continues to evolve in 2025, keeping up with the latest trends in hosting is essential for agencies to provide top-tier service, drive client satisfaction, and maintain a competitive edge. From AI and automation to scalability and security, the future of hosting demands flexible, efficient solutions tailored to modern needs.

By understanding and leveraging these trends, you can position your agency as a trusted partner and deliver exceptional results to your clients, whether by adopting managed hosting or integrating CDNs.

Bluehost hosting will meet today’s demands while helping to prepare agencies like yours for tomorrow. With features like 100% uptime guaranteed through our Service Level Agreement (SLA), 24/7 priority support, and built-in tools like SSL certificates, automated backups, and advanced caching, Bluehost offers a robust and reliable hosting environment.

Additionally, Bluehost Cloud makes switching easy and cost-effective with $0 migration costs and credit for remaining contracts, giving you the flexibility to transition seamlessly without the high cost.

Take your agency’s hosting strategy to the next level with Bluehost. Discover how our comprehensive hosting solutions can support your growth, enhance client satisfaction, and keep your business ahead of the curve.


Image Credits

Featured Image: Image by Bluehost. Used with permission.

Marketing Trend: Consumers Prefer Relatability via @sejournal, @martinibuster

A new iStock 2025 Marketing Trends report finds declining consumer trust in social media and influencers, emphasizing the importance of relatability over perfection for marketers and businesses.

Trust For Marketing Success

The iStock report finds that 81% of consumers don’t trust content on social media. Nevertheless, they still turn to visual platforms like TikTok and Instagram Reels for discovery and inspiration. In terms of influence, 64% of consumers trust businesses over celebrities and influencers, particularly brands that align with their values (58%).

Authenticity And Real-User Content (RUC)

iStock’s data shows that consumer perception of influencer “realness” has declined, with 67% of people trusting traditional advertising over sponsored influencer posts. iStock recommends what it calls Real-User Content (RUC): images and videos that project realness. iStock also highlighted video as a strong trend that marketers should consider, as more consumers turn to video content for learning and inspiration.

iStock recommends that marketers focus on being “real, truthful, and original” as the key to building trust. While authenticity is important, iStock is emphasizing offering real stories and being relatable as opposed to content that reflects virtually unattainable perfection.

They write:

“This change is affecting how people interact with visual content, especially on social media. Despite people’s lack of trust, they still find these platforms valuable, 82% of users still go to places like TikTok, Instagram Reels, and YouTube Shorts for video content to learn something new or get inspiration. In other words, people want the benefits of social media, without the negative effects. This shift has also made video-driven social search more popular, where platforms focused on video are no longer just for scrolling —they’ve become places to search and discover. In 2025, to succeed, you need to speak directly to your audience, this approach will always be more effective than a flood of generic posts.”

The report recommends radical honesty by showing the company in ways that include imperfect moments. iStock’s 2025 Marketing Trends report shows an approach to connecting with consumers in a way that reflects qualities of realness that people are looking for in the content they consume.

Read iStock’s report:

Crack the Code on Trust: 2025 Marketing Insights for Small Businesses

Featured Image by Shutterstock/HAKINMHAN

Bing Search Updates: Faster, More Precise Results via @sejournal, @MattGSouthern

Microsoft has announced updates to Bing’s search infrastructure incorporating large language models (LLMs), small language models (SLMs), and new optimization techniques.

This update aims to improve performance and reduce costs in search result delivery.

In an announcement, the company states:

“At Bing, we are always pushing the boundaries of search technology. Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities. While transformer models have served us well, the growing complexity of search queries necessitated more powerful models.”

Performance Gains

Using LLMs in search systems can create problems with speed and cost.

To solve these problems, Bing has trained SLMs, which it claims deliver roughly 100 times the throughput of LLMs.

The announcement reads:

“LLMs can be expensive to serve and slow. To improve efficiency, we trained SLM models (~100x throughput improvement over LLM), which process and understand search queries more precisely.”

Bing also uses NVIDIA TensorRT-LLM to improve how well SLMs work.

TensorRT-LLM is a tool that helps reduce the time and cost of running large models on NVIDIA GPUs.

Impact On “Deep Search”

According to a technical report from Microsoft, integrating NVIDIA’s TensorRT-LLM technology has enhanced the company’s “Deep Search” feature.

Deep Search leverages SLMs in real time to provide relevant web results.

Before optimization, Bing’s original transformer model had a 95th percentile latency of 4.76 seconds per batch (20 queries) and a throughput of 4.2 queries per second per instance.

With TensorRT-LLM, the latency was reduced to 3.03 seconds per batch, and throughput increased to 6.6 queries per second per instance.

This represents a 36% reduction in latency and a 57% decrease in operational costs.
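
The latency figure can be checked from the numbers quoted above, and the same arithmetic applied to throughput gives a rise of about 57%. How throughput maps onto operational cost is not spelled out in the figures given here, so this sketch covers only the two measured quantities.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Figures quoted above: seconds per batch, and queries per second per instance.
latency_change = percent_change(4.76, 3.03)
throughput_change = percent_change(4.2, 6.6)

print(f"latency: {latency_change:+.1f}%")        # latency: -36.3%
print(f"throughput: {throughput_change:+.1f}%")  # throughput: +57.1%
```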

The company states:

“… our product is built on the foundation of providing the best results, and we will not compromise on quality for speed. This is where TensorRT-LLM comes into play, reducing model inference time and, consequently, the end-to-end experience latency without sacrificing result quality.”

Benefits For Bing Users

This update brings several potential benefits to Bing users:

  • Faster search results with optimized inference and quicker response times
  • Improved accuracy through enhanced capabilities of SLM models, delivering more contextualized results
  • Cost efficiency, allowing Bing to invest in further innovations and improvements

Why Bing’s Move to LLM/SLM Models Matters

Bing’s switch to LLM/SLM models and TensorRT-LLM optimization could impact the future of search.

As users ask more complex questions, search engines need to better understand and deliver relevant results quickly. Bing aims to do that using smaller language models and advanced optimization techniques.

While we’ll have to wait and see the full impact, Bing’s move sets the stage for a new chapter in search.


Featured Image: mindea/Shutterstock