Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.
Thousands of images—including identifiable faces—were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. The study that details the breach was published on arXiv earlier this month.
The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”
The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.)
A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).
Examples of identity-related documents found in CommonPool’s small-scale data set show a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.
COURTESY OF THE RESEARCHERS
When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use either.
CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022.
While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool researchers did not respond to emailed questions.
And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are] many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those models would carry similar privacy risks.
Good intentions are not enough
“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found).
Indeed, the curators of DataComp CommonPool were themselves aware it was likely that PII would appear in the data set and did take some measures to preserve privacy, including automatically detecting and blurring faces. But in their limited data set, Hong’s team found and validated over 800 faces that the algorithm had missed, and they estimated that overall, the algorithm had missed 102 million faces in the entire data set. On the other hand, they did not apply filters that could have recognized known PII character strings, like emails or Social Security numbers.
“Filtering is extremely hard to do well,” says Agnew. “They would have had to make very significant advancements in PII detection and removal that they haven’t made public to be able to effectively filter this.”
Examples of résumé documents and personal disclosures found in CommonPool’s small-scale data set. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals. Image courtesy of the researchers.
There are other privacy issues that the face blurring doesn’t address. While the blurring filter is automatically applied, it is optional and can be removed. Additionally, the captions that often accompany the photos, as well as the photos’ metadata, often contain even more personal information, such as names and exact locations.
Another privacy mitigation measure comes from Hugging Face, a platform that distributes training data sets and hosts CommonPool, which integrates with a tool that theoretically allows people to search for and remove their own information from a data set. But as the researchers note in their paper, this would require people to know that their data is there to start with. When asked for comment, Florent Daudens of Hugging Face said that “maximizing the privacy of data subjects across the AI ecosystem takes a multilayered approach, which includes but is not limited to the widget mentioned,” and that the platform is “working with our community of users to move the needle in a more privacy-grounded direction.”
In any case, just getting your data removed from one data set probably isn’t enough. “Even if someone finds out their data was used in a training data set and … exercises their right to deletion, technically the law is unclear about what that means,” says Tiffany Li, an associate professor of law at the University of San Francisco School of Law. “If the organization only deletes data from the training data sets—but does not delete or retrain the already trained model—then the harm will nonetheless be done.”
The bottom line, says Agnew, is that “if you web-scrape, you’re going to have private data in there. Even if you filter, you’re still going to have private data in there, just because of the scale of this. And that’s something that we [machine-learning researchers], as a field, really need to grapple with.”
Reconsidering consent
CommonPool was built on web data scraped between 2014 and 2022, meaning that many of the images likely date to before 2020, when ChatGPT was released. So even if it’s theoretically possible that some people consented to having their information publicly available to anyone on the web, they could not have consented to having their data used to train large AI models that did not yet exist.
And with web scrapers often scraping data from each other, an image that was originally uploaded by the owner to one specific location would often find its way into other image repositories. “I might upload something onto the internet, and then … a year or so later, [I] want to take it down, but then that [removal] doesn’t necessarily do anything anymore,” says Agnew.
The researchers also found numerous examples of children’s personal information, including depictions of birth certificates, passports, and health status, but in contexts suggesting that they had been shared for limited purposes.
“It really illuminates the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous to people who have been using the internet with one framework of risk, never assuming it would all be hoovered up by a group trying to create an image generator,” says Ben Winters, the director of AI and privacy at the Consumer Federation of America.
Finding a policy that fits
Ultimately, the paper calls for the machine-learning community to rethink the common practice of indiscriminate web scraping and also lays out the possible violations of current privacy laws represented by the existence of PII in massive machine-learning data sets, as well as the limitations of those laws’ ability to protect privacy.
“We have the GDPR in Europe, we have the CCPA in California, but there’s still no federal data protection law in America, which also means that different Americans have different rights protections,” says Marietje Schaake, a Dutch lawmaker turned tech policy expert who currently serves as a fellow at Stanford’s Cyber Policy Center.
Besides, these privacy laws apply to companies that meet certain criteria for size and other characteristics. They do not necessarily apply to researchers like those who were responsible for creating and curating DataComp CommonPool.
And even state laws that do address privacy, like California’s consumer privacy act, have carve-outs for “publicly available” information. Machine-learning researchers have long operated on the principle that if it’s available on the internet, then it is public and no longer private information, but Hong, Agnew, and their colleagues hope that their research challenges this assumption.
“What we found is that ‘publicly available’ includes a lot of stuff that a lot of people might consider private—résumés, photos, credit card numbers, various IDs, news stories from when you were a child, your family blog. These are probably not things people want to just be used anywhere, for anything,” says Hong.
Hopefully, Schaake says, this research “will raise alarm bells and create change.”
This article previously misstated Tiffany Li’s affiliation. This has been fixed.
Google has updated its machine learning crash course with new videos and modules on large language models and automated machine learning. These courses are useful introductions to the technologies behind modern search engines and generative AI, information that will make you a better SEO.
What Is Google’s Machine Learning Crash Course?
Google’s machine learning course is an easy-to-understand introduction to machine learning. It shows what machine learning is all about and how it can be useful to you and your business.
The different courses are self-contained modules, beginning with introductions to the fundamentals of Linear Regression, Logistic Regression, and Binary Classification Models.
The other modules cover:
Data: How to work with machine learning data.
Advanced Machine Learning Models: Introductions to Neural Networks, Embeddings, and Large Language Models.
Real-world ML: Best practices for deploying machine learning models in the real world.
The Large Language Models module is a new addition to the courses and is a good way to get up to speed fast with the technology and be conversant about it.
Google’s documentation shows what students learn with the module:
“Define a few different types of language models and their components.
Describe how large language models are created and the importance of context and parameters.
Identify how large language models take advantage of self-attention.
Reveal three key problems with large language models.
Explain how fine-tuning and distillation can improve a model’s predictions and efficiency.”
Google recommends first taking six other courses before starting the LLM module, so as to gain an understanding of the fundamentals. The six recommended courses look very interesting:
Introduction to Machine Learning
Linear regression
Working with categorical data
Datasets, generalization, and overfitting
Neural networks
Embeddings
The courses for linear regression, neural networks, and embeddings can arguably be called essential for SEOs because these technologies have been a major part of how search ranking algorithms work. Obtaining a basic understanding of these technologies will improve your ability to understand how the back end of a search engine works.
Many misleading ideas are popular in the SEO community because they sound like common sense, much like some answers from generative AI that sound plausible but are hallucinations. Learning what these technologies are and how they work will help you become a better search marketer.
In an essay last week, Sam Altman, the CEO of OpenAI, argued that the accelerating capabilities of AI will usher in an idyllic “Intelligence Age,” unleashing “unimaginable” prosperity and “astounding triumphs” like “fixing the climate.”
It’s a promise that no one is in a position to make—and one that, when it comes to the topic of climate change, fundamentally misunderstands the nature of the problem.
More maddening, the argument suggests that the technology’s massive consumption of electricity today doesn’t much matter, since it will allow us to generate abundant clean power in the future. That casually waves away growing concerns about a technology that’s already accelerating proposals for natural-gas plants and diverting major tech companies from their corporate climate targets.
By all accounts, AI’s energy demands will only continue to increase, even as the world scrambles to build larger, cleaner power systems to meet the increasing needs of EV charging, green hydrogen production, heat pumps, and other low-carbon technologies. Altman himself reportedly just met with White House officials to make the case for building absolutely massive AI data centers, which could require the equivalent of five dedicated nuclear reactors to run.
It’s a bedrock perspective of MIT Technology Review that technological advances can deliver real benefits and accelerate societal progress in meaningful ways. But for decades researchers and companies have oversold the potential of AI to deliver blockbuster medicines, achieve superintelligence, and free humanity from the need to work. To be fair, there have been significant advances, but nothing on the order of what’s been hyped.
Given that track record, I’d argue you need to develop a tool that does more than plagiarize journalism and help students cheat on homework before you can credibly assert that it will solve humanity’s thorniest problems, whether the target is rampant poverty or global warming.
All those advances are still relatively incremental. But let’s say AI does bring about an energy miracle. Perhaps its pattern-recognition prowess will deliver the key insight that finally cracks fusion—a technology that Altman is betting on heavily as an investor.
That would be fantastic. But technological advances are just the start—necessary but far from sufficient to eliminate the world’s climate emissions.
How do I know?
Because between nuclear fission plants, solar farms, wind turbines, and batteries, we already have every technology we need to clean up the power sector. This should be the low-hanging fruit of the energy transition. Yet in the largest economy on Earth, fossil fuels still generate 60% of the electricity. The fact that so much of our power still comes from coal, petroleum, and natural gas is a regulatory failure as much as a technological one.
“As long as we effectively subsidize fossil fuels by allowing them to use the atmosphere as a waste dump, we are not allowing clean energy to compete on a level playing field,” Zeke Hausfather, a climate scientist at the independent research organization Berkeley Earth, wrote on X in a response to Altman’s post. “We need policy changes, not just tech breakthroughs, to meet our climate goals.”
That’s not to say there aren’t big technical problems we still need to solve. Just look at the continuing struggles to develop clean, cost-competitive ways of fertilizing crops or flying planes. But the fundamental challenges of climate change are sunk costs, development obstacles, and inertia.
We’ve built and paid for a global economy that spews out planet-warming gases, investing trillions of dollars in power plants, steel mills, factories, jets, boilers, water heaters, stoves, and SUVs that run on fossil fuels. And few people or companies will happily write off those investments so long as those products and plants still work. AI can’t remedy all that just by generating better ideas.
To raze and replace the machinery of every industry around the world at the speed now required, we will need increasingly aggressive climate policies that incentivize or force everyone to switch to cleaner plants, products, and practices.
But with every proposal for a stricter law or some big new wind or solar farm, forces will push back, because the plan will hit someone’s wallet, block someone’s views, or threaten the areas or traditions someone cherishes. Climate change is an infrastructure problem, and building infrastructure is a messy human endeavor.
Tech advances can ease some of these issues. Cheaper, better alternatives to legacy industries make hard choices more politically palatable. But there are no improvements to AI algorithms or underlying data sets that solve the challenge of NIMBYism, the conflict between human interests, or the desire to breathe the fresh air in an unsullied wilderness.
To assert that a single technology—that just happens to be the one your company develops—can miraculously untangle these intractable conflicts of human society is at best self-serving, if not a little naïve. And it’s a troubling idea to proclaim at a point when the growth of that very technology is threatening to undermine the meager progress the world has begun to make on climate change.
As it is, the one thing we can state confidently about generative AI is that it’s making the hardest problem we’ve ever had to solve that much harder to solve.
The US Department of Homeland Security (DHS) plans to collect and analyze photos of the faces of migrant children at the border in a bid to improve facial recognition technology, MIT Technology Review can reveal. This includes children “down to the infant,” according to John Boyd, assistant director of the department’s Office of Biometric Identity Management (OBIM), where a key part of his role is to research and develop future biometric identity services for the government.
As Boyd explained at a conference in June, the key question for OBIM is, “If we pick up someone from Panama at the southern border at age four, say, and then pick them up at age six, are we going to recognize them?”
Facial recognition technology (FRT) has traditionally not been applied to children, largely because training data sets of real children’s faces are few and far between, and consist of either low-quality images drawn from the internet or small sample sizes with little diversity. Such limitations reflect the significant sensitivities regarding privacy and consent when it comes to minors.
In practice, the new DHS plan could effectively solve that problem. According to Syracuse University’s Transactional Records Access Clearinghouse (TRAC), 339,234 children arrived at the US-Mexico border in 2022, the last year for which numbers are currently available. Of those children, 150,000 were unaccompanied—the highest annual number on record. If the face prints of even 1% of those children had been enrolled in OBIM’s craniofacial structural progression program, the resulting data set would dwarf nearly all existing data sets of real children’s faces used for aging research.
It’s unclear to what extent the plan has already been implemented; Boyd tells MIT Technology Review that to the best of his knowledge, the agency has not yet started collecting data under the program, but he adds that as “the senior executive,” he would “have to get with [his] staff to see.” He could only confirm that his office is “funding” it. Despite repeated requests, Boyd did not provide any additional information.
Boyd says OBIM’s plan to collect facial images from children under 14 is possible due to recent “rulemaking” at “some DHS components,” or sub-offices, that have removed age restrictions on the collection of biometric data. US Customs and Border Protection (CBP), the US Transportation Security Administration, and US Immigration and Customs Enforcement declined to comment before publication. US Citizenship and Immigration Services (USCIS) did not respond to multiple requests for comment. OBIM referred MIT Technology Review back to DHS’s main press office.
DHS did not comment on the program before publication, but sent an emailed statement afterward: “The Department of Homeland Security uses various forms of technology to execute its mission, including some biometric capabilities. DHS ensures all technologies, regardless of type, are operated under the established authorities and within the scope of the law. We are committed to protecting the privacy, civil rights, and civil liberties of all individuals who may be subject to the technology we use to keep the nation safe and secure.”
Boyd spoke publicly about the plan in June at the Federal Identity Forum and Exposition, an annual identity management conference for federal employees and contractors. But close observers of DHS that we spoke with—including a former official, representatives of two influential lawmakers who have spoken out about the federal government’s use of surveillance technologies, and immigrants’ rights organizations that closely track policies affecting migrants—were unaware of any new policies allowing biometric data collection of children under 14.
That is not to say that all of them are surprised. “That tracks,” says one former CBP official who has visited several migrant processing centers on the US-Mexico border and requested anonymity to speak freely. He says “every center” he visited “had biometric identity collection, and everybody was going through it,” though he was unaware of a specific policy mandating the practice. “I don’t recall them separating out children,” he adds.
“The reports of CBP, as well as DHS more broadly, expanding the use of facial recognition technology to track migrant children is another stride toward a surveillance state and should be a concern to everyone who values privacy,” Justin Krakoff, deputy communications director for Senator Jeff Merkley of Oregon, said in a statement to MIT Technology Review. Merkley has been an outspoken critic of both DHS’s immigration policies and of government use of facial recognition technologies.
Beyond concerns about privacy, transparency, and accountability, some experts also worry about testing and developing new technologies using data from a population that has little recourse to provide—or withhold—consent.
Could consent “actually take into account the vast power differentials that are inherent in the way that this is tested out on people?” asks Petra Molnar, author of The Walls Have Eyes: Surviving Migration in the Age of AI. “And if you arrive at a border … and you are faced with the impossible choice of either: get into a country if you give us your biometrics, or you don’t.”
“That completely vitiates informed consent,” she adds.
This question becomes even more challenging when it comes to children, says Ashley Gorski, a senior staff attorney with the American Civil Liberties Union. DHS “should have to meet an extremely high bar to show that these kids and their legal guardians have meaningfully consented to serve as test subjects,” she says. “There’s a significant intimidation factor, and children aren’t as equipped to consider long-term risks.”
Murky new rules
The Office of Biometric Identity Management, previously known as the US Visitor and Immigrant Status Indicator Technology Program (US-VISIT), was created after 9/11 with the specific mandate of collecting biometric data—initially only fingerprints and photographs—from all non-US citizens who sought to enter the country.
Since then, DHS has begun collecting face prints, iris and retina scans, and even DNA, among other modalities. It is also testing new ways of gathering this data—including through contactless fingerprint collection, which is currently deployed at five sites on the border, as Boyd shared in his conference presentation.
Since 2023, CBP has been using a mobile app, CBP One, for asylum seekers to submit biometric data even before they enter the United States; users are required to take selfies periodically to verify their identity. The app has been riddled with problems, including technical glitches and facial recognition algorithms that are unable to recognize darker-skinned people. This is compounded by the fact that not every asylum seeker has a smartphone.
Then, just after crossing into the United States, migrants must submit to collection of biometric data, including DNA. For a sense of scale, a recent report from Georgetown Law School’s Center on Privacy and Technology found that CBP has added 1.5 million DNA profiles, primarily from migrants crossing the border, to law enforcement databases since it began collecting DNA “from any person in CBP custody subject to fingerprinting” in January 2020. The researchers noted that an overrepresentation of immigrants—the majority of whom are people of color—in a DNA database used by law enforcement could subject them to over-policing and lead to other forms of bias.
Generally, these programs only require information from individuals aged 14 to 79. DHS attempted to change this back in 2020, with proposed rules for USCIS and CBP that would have expanded biometric data collection dramatically, including by age. (USCIS’s proposed rule would have doubled the number of people from whom biometric data would be required, including any US citizen who sponsors an immigrant.) But the USCIS rule was withdrawn in the wake of the Biden administration’s new “priorities to reduce barriers and undue burdens in the immigration system.” Meanwhile, for reasons that remain unclear, the proposed CBP rule was never enacted.
This would make it appear “contradictory” if DHS were now collecting the biometric data of children under 14, says Dinesh McCoy, a staff attorney with Just Futures Law, an immigrant rights group that tracks surveillance technologies.
Neither Boyd nor DHS’s media office would confirm which specific policy changes he was referring to in his presentation, though MIT Technology Review has identified a 2017 memo, issued by then-Secretary of Homeland Security John F. Kelly, that encouraged DHS components to remove “age as a basis for determining when to collect biometrics.”
The DHS’s Office of the Inspector General (OIG) referred to this memo as the “overarching policy for biometrics at DHS” in a September 2023 report, though none of the press offices MIT Technology Review contacted—including the main DHS press office, OIG, and OBIM, among others—would confirm whether this was still the relevant policy; we have not been able to confirm any related policy changes since then.
The OIG audit also found a number of fundamental issues related to DHS’s oversight of biometric data collection and use—including that its 10-year strategic framework for biometrics, covering 2015 to 2025, “did not accurately reflect the current state of biometrics across the Department, such as the use of facial recognition verification and identification.” Nor did it provide clear guidance for the consistent collection and use of biometrics across DHS, including age requirements.
But there is also another potential explanation for the new OBIM program: Boyd says it is being conducted under the auspices of the DHS’s undersecretary of science and technology, the office that leads much of the agency’s research efforts. Because it is for research, rather than to be used “in DHS operations to inform processes or decision making,” many of the standard restrictions for DHS use of face recognition and face capture technologies do not apply, according to a DHS directive.
Do you have any additional information on DHS’s craniofacial structural progression initiative? Please reach out with a non-work email to tips@technologyreview.com or securely on Signal at 626.765.5489.
Some lawyers allege that changing the age limit for data collection via department policy, not by a federal rule, which requires a public comment period, is problematic. McCoy, for instance, says any lack of transparency here amplifies the already “extremely challenging” task of “finding [out] in a systematic way how these technologies are deployed”—even though that is key for accountability.
Advancing the field
At the identity forum and in a subsequent conversation, Boyd explained that this data collection is meant to advance the development of effective FRT algorithms. Boyd leads OBIM’s Future Identity team, whose mission is to “research, review, assess, and develop technology, policy, and human factors that enable rapid, accurate, and secure identity services” and to make OBIM “the preferred provider for identity services within DHS.”
Driven by high-profile cases of missing children, there has long been interest in understanding how children’s faces age. At the same time, there have been technical challenges to doing so, both preceding FRT and with it.
At its core, facial recognition identifies individuals by comparing the geometry of various facial features in an original face print with subsequent images. Based on this comparison, a facial recognition algorithm assigns a percentage likelihood that there is a match.
But as children grow and develop, their bone structure changes significantly, making it difficult for facial recognition algorithms to identify them over time. (These changes tend to be even more pronounced in children under 14. In contrast, as adults age, the changes tend to be in the skin and muscle, and have less variation overall.) More data would help solve this problem, but there is a dearth of high-quality data sets of children’s faces with verifiable ages.
“What we’re trying to do is to get large data sets of known individuals,” Boyd tells MIT Technology Review. That means taking high-quality face prints “under controlled conditions where we know we’ve got the person with the right name [and] the correct birth date”—or, in other words, where they can be certain about the “provenance of the data.”
For example, one data set used for aging research consists of 305 celebrities’ faces as they aged from five to 32. But these photos, scraped from the internet, contain too many other variables—such as differing image qualities, lighting conditions, and distances at which they were taken—to be truly useful. Plus, speaking to the provenance issue that Boyd highlights, their actual ages in each photo can only be estimated.
Another tactic is to use data sets of adult faces that have been synthetically de-aged. Synthetic data is considered more privacy-preserving, but it too has limitations, says Stephanie Schuckers, director of the Center for Identification Technology Research (CITeR). “You can test things with only the generated data,” Schuckers explains, but the question remains: “Would you get similar results to the real data?”
(Hosted at Clarkson University in New York, CITeR brings together a network of academic and government affiliates working on identity technologies. OBIM is a member of the research consortium.)
Schuckers’s team at CITeR has taken another approach: an ongoing longitudinal study of a cohort of 231 elementary and middle school students from the area around Clarkson University. Since 2016, the team has captured biometric data every six months (save for two years of the covid-19 pandemic), including facial images. They have found that the open-source face recognition models they tested can in fact successfully recognize children three to four years after they were initially enrolled.
But the conditions of this study aren’t easily replicable at scale. The study images are taken in a controlled environment, all the participants are volunteers, the researchers sought consent from parents and the subjects themselves, and the research was approved by the university’s Institutional Review Board. Schuckers’s research also promises to protect privacy by requiring other researchers to request access, and by providing facial datasets separately from other data that have been collected.
What’s more, this research still has technical limitations, including that the sample is small, and it is overwhelmingly Caucasian, meaning it might be less accurate when applied to other races.
Schuckers says she was unaware of DHS’s craniofacial structural progression initiative.
Far-reaching implications
Boyd says OBIM takes privacy considerations seriously, and that “we don’t share … data with commercial industries.” Still, OBIM has 144 government partners with which it does share information, and it has been criticized by the Government Accountability Office for poorly documenting who it shares information with, and with what privacy-protecting measures.
Even if the data does stay within the federal government, OBIM’s findings regarding the accuracy of FRT for children over time could nevertheless influence how—and when—the rest of the government collects biometric data, as well as whether the broader facial recognition industry may also market its services for children. (Indeed, Boyd says sharing “results,” or the findings of how accurate FRT algorithms are, is different than sharing the data itself.)
That this technology is being tested on people who are offered fewer privacy protections than would be afforded to US citizens is just part of the wider trend of using people from the developing world, whether they are migrants coming to the border or civilians in war zones, to help improve new technologies.
In fact, Boyd previously helped advance the Department of Defense’s biometric systems in Iraq and Afghanistan, where he acknowledged that individuals lacked the privacy protections that would have been granted in many other contexts, despite the incredibly high stakes. Biometric data collected in those war zones—in some areas, from every fighting-age male—was used to identify and target insurgents, and being misidentified could mean death.
These projects subsequently played a substantial role in influencing the expansion of biometric data collection by the Department of Defense, which now happens globally. And architects of the program, like Boyd, have taken important roles in expanding the use of biometrics at other agencies.
“It’s not an accident” that this testing happens in the context of border zones, says Molnar. Borders are “the perfect laboratory for tech experimentation, because oversight is weak, discretion is baked into the decisions that get made … it allows the state to experiment in ways that it wouldn’t be allowed to in other spaces.”
But, she notes, “just because it happens at the border doesn’t mean that that’s where it’s going to stay.”
Update: This story was updated to include comment from DHS.
This new series of articles focuses on working with LLMs to scale your SEO tasks. We hope to help you integrate AI into SEO so you can level up your skills.
Following this, it’s time to flex your “AI knowledge muscles” by learning how to use text embeddings to find keyword cannibalization.
We will start with OpenAI’s text embeddings and compare them.
Model | Dimensionality | Pricing | Notes
text-embedding-ada-002 | 1536 | $0.10 per 1M tokens | Great for most use cases.
text-embedding-3-small | 1536 | $0.002 per 1M tokens | Faster and cheaper, but less accurate.
text-embedding-3-large | 3072 | $0.13 per 1M tokens | More accurate for complex, long text-related tasks; slower.
(*Tokens can be thought of, roughly, as words.)
But before we start, you need to install Python and Jupyter on your computer.
Jupyter is a web-based tool for professionals and researchers. It allows you to perform complex data analysis and machine learning model development using any programming language.
Don’t worry – it’s really easy and takes little time to finish the installations. And remember, ChatGPT is your friend when it comes to programming.
Open your Windows command line or terminal on Mac.
Type these commands: pip install jupyterlab and pip install notebook
Run Jupyter with this command: jupyter lab
We will use Jupyter to experiment with text embeddings; you’ll see how fun it is to work with!
But before we start, you must sign up for OpenAI’s API and set up billing by adding funds to your balance.
OpenAI API billing settings
Once you’ve done that, set up email notifications to inform you when your spending exceeds a certain amount under Usage limits.
Then, obtain API keys under Dashboard > API keys, which you should keep private and never share publicly.
OpenAI API keys
Now, you have all the necessary tools to start playing with embeddings.
Open your computer command terminal and type jupyter lab.
You should see something like the below image pop up in your browser.
Click on Python 3 under Notebook.
jupyter lab
In the opened window, you will write your code.
As a small task, let’s group similar URLs from a CSV. The sample CSV has two columns: URL and Title. Our script’s task will be to group URLs with similar semantic meanings based on the title so we can consolidate those pages into one and fix keyword cannibalization issues.
Here are the steps you need to do:
Install the required Python libraries with the following command in your PC’s terminal (or in a Jupyter notebook cell): pip install openai pandas scikit-learn numpy unidecode
The ‘openai’ library is required to interact with the OpenAI API to get embeddings, and ‘pandas’ is used for data manipulation and handling CSV file operations.
The ‘scikit-learn’ library is necessary for calculating cosine similarity, and ‘numpy’ is essential for numerical operations and handling arrays. Lastly, unidecode is used to clean text.
Then, download the sample sheet as a CSV, rename the file to pages.csv, and upload it to your Jupyter folder where your script is located.
Set your OpenAI API key to the key you obtained in the step above, and copy-paste the code below into the notebook.
Run the code by clicking the play triangle icon at the top of the notebook.
import pandas as pd
import openai
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import csv
from unidecode import unidecode
# Function to clean text
def clean_text(text: str) -> str:
    # First, replace known problematic (mojibake) sequences with their correct equivalents.
    # The longer sequences must be replaced before the bare 'â€' so they are not broken.
    replacements = {
        'â€“': '–',  # en dash
        'â€™': '’',  # right single quotation mark
        'â€œ': '“',  # left double quotation mark
        'â€˜': '‘',  # left single quotation mark
        'â€”': '—',  # em dash
        'â€': '”',   # right double quotation mark (kept last on purpose)
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    # Then, use unidecode to transliterate any remaining problematic Unicode characters
    text = unidecode(text)
    return text
# Load the CSV file with UTF-8 encoding from the root folder of the Jupyter project
df = pd.read_csv('pages.csv', encoding='utf-8')

# Clean the 'Title' column to remove unwanted symbols
df['Title'] = df['Title'].apply(clean_text)

# Set your OpenAI API key
openai.api_key = 'your-api-key-goes-here'

# Function to get embeddings
def get_embedding(text):
    response = openai.Embedding.create(input=[text], engine="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Generate embeddings for all titles
df['embedding'] = df['Title'].apply(get_embedding)

# Create a matrix of embeddings
embedding_matrix = np.vstack(df['embedding'].values)

# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(embedding_matrix)

# Define similarity threshold (titles at or above 0.9 are treated as near-duplicates)
similarity_threshold = 0.9

# Create a list to store groups
groups = []

# Keep track of visited indices
visited = set()

# Group similar titles based on the similarity matrix
for i in range(len(similarity_matrix)):
    if i not in visited:
        # Find all similar titles
        similar_indices = np.where(similarity_matrix[i] >= similarity_threshold)[0]

        # Log comparisons
        print(f"\nChecking similarity for '{df.iloc[i]['Title']}' (Index {i}):")
        print("-" * 50)
        for j in range(len(similarity_matrix)):
            if i != j:  # Ensure that a title is not compared with itself
                similarity_value = similarity_matrix[i, j]
                comparison_result = 'greater' if similarity_value >= similarity_threshold else 'less'
                print(f"Compared with '{df.iloc[j]['Title']}' (Index {j}): similarity = {similarity_value:.4f} ({comparison_result} than threshold)")

        # Add these indices to visited
        visited.update(similar_indices)

        # Add the group to the list
        group = df.iloc[similar_indices][['URL', 'Title']].to_dict('records')
        groups.append(group)
        print(f"\nFormed Group {len(groups)}:")
        for item in group:
            print(f" - URL: {item['URL']}, Title: {item['Title']}")

# Check if groups were created
if not groups:
    print("No groups were created.")

# Define the output CSV file
output_file = 'grouped_pages.csv'

# Write the results to the CSV file with UTF-8 encoding
with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Group', 'URL', 'Title']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for group_index, group in enumerate(groups, start=1):
        for page in group:
            cleaned_title = clean_text(page['Title'])  # Ensure no unwanted symbols in the output
            writer.writerow({'Group': group_index, 'URL': page['URL'], 'Title': cleaned_title})
            print(f"Writing Group {group_index}, URL: {page['URL']}, Title: {cleaned_title}")

print(f"Output written to {output_file}")
This code reads a CSV file, ‘pages.csv,’ containing titles and URLs, which you can easily export from your CMS or get by crawling a client website using Screaming Frog.
Then, it cleans the titles from non-UTF characters, generates embedding vectors for each title using OpenAI’s API, calculates the similarity between the titles, groups similar titles together, and writes the grouped results to a new CSV file, ‘grouped_pages.csv.’
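One note: the script above uses the pre-1.0 interface of the ‘openai’ Python library (openai.Embedding.create). If you have a newer version installed (openai 1.0 or later), a minimal sketch of the equivalent embedding call looks like this:

from openai import OpenAI

client = OpenAI(api_key='your-api-key-goes-here')

def get_embedding(text):
    # Same idea as in the script above, expressed with the 1.0+ client interface
    response = client.embeddings.create(input=[text], model="text-embedding-ada-002")
    return response.data[0].embedding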
In the keyword cannibalization task, we use a similarity threshold of 0.9, which means if cosine similarity is less than 0.9, we will consider articles as different. To visualize this in a simplified two-dimensional space, it will appear as two vectors with an angle of approximately 25 degrees between them.
In your case, you may want to use a different threshold, like 0.85 (approximately 31 degrees between them), and run it on a sample of your data to evaluate the results and the overall quality of matches. If it is unsatisfactory, you can increase the threshold to make it more strict for better precision.
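If you want to check which angle a given threshold corresponds to, here is a quick sketch:

import numpy as np

for threshold in (0.9, 0.85, 0.5):
    # The angle whose cosine equals the threshold, converted from radians to degrees
    print(threshold, round(np.degrees(np.arccos(threshold)), 1))
# 0.9 -> 25.8 degrees, 0.85 -> 31.8 degrees, 0.5 -> 60.0 degrees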
You can install ‘matplotlib’ via terminal.
pip install matplotlib
And use the Python code below in a separate Jupyter notebook to visualize cosine similarities in two-dimensional space on your own. Try it; it’s fun!
import matplotlib.pyplot as plt
import numpy as np
# Define the angle for cosine similarity of 0.9. Change here to your desired value.
theta = np.arccos(0.9)
# Define the vectors
u = np.array([1, 0])
v = np.array([np.cos(theta), np.sin(theta)])
# Define the 45 degree rotation matrix
rotation_matrix = np.array([
    [np.cos(np.pi/4), -np.sin(np.pi/4)],
    [np.sin(np.pi/4),  np.cos(np.pi/4)]
])
# Apply the rotation to both vectors
u_rotated = np.dot(rotation_matrix, u)
v_rotated = np.dot(rotation_matrix, v)
# Plotting the vectors
plt.figure()
plt.quiver(0, 0, u_rotated[0], u_rotated[1], angles='xy', scale_units='xy', scale=1, color='r')
plt.quiver(0, 0, v_rotated[0], v_rotated[1], angles='xy', scale_units='xy', scale=1, color='b')
# Setting the plot limits to only positive ranges
plt.xlim(0, 1.5)
plt.ylim(0, 1.5)
# Adding labels and grid
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.title('Visualization of Vectors with Cosine Similarity of 0.9')
# Show the plot
plt.show()
I usually use 0.9 and higher for identifying keyword cannibalization issues, but you may need to lower it to around 0.5 when dealing with redirects for old articles, since an old article may not have a nearly identical fresher counterpart, only a partially related one.
For redirects, it may also be better to embed the title concatenated with the meta description, rather than the title alone.
So, it depends on the task you are performing. We will review how to implement redirects in a separate article later in this series.
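As a quick sketch of that concatenation idea, reusing df and get_embedding() from the script above and assuming your export also contains a 'Meta Description' column (the sample sheet used here does not):

# Hypothetical 'Meta Description' column: concatenate it with the title before embedding
df['text_for_embedding'] = df['Title'] + ". " + df['Meta Description'].fillna('')
df['embedding'] = df['text_for_embedding'].apply(get_embedding)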
Now, let’s review the results with the three models mentioned above and see how they were able to identify close articles from our data sample from Search Engine Journal’s articles.
Data Sample
From the list, we already see that the 2nd and 4th articles cover the same topic on ‘meta tags.’ The articles in the 5th and 7th rows are pretty much the same – discussing the importance of H1 tags in SEO – and can be merged.
The article in the 3rd row doesn’t have any similarities with any of the articles in the list but has common words like “Tag” or “SEO.”
The article in the 6th row is again about H1, but not exactly the same as H1’s importance to SEO. Instead, it represents Google’s opinion on whether they should match.
Articles on the 8th and 9th rows are quite close but still different; they can be combined.
text-embedding-ada-002
By using ‘text-embedding-ada-002,’ we precisely found the 2nd and 4th articles with a cosine similarity of 0.92 and the 5th and 7th articles with a similarity of 0.91.
Screenshot from Jupyter log showing cosine similarities
And it generated output with grouped URLs by using the same group number for similar articles. (colors are applied manually for visualization purposes).
Output sheet with grouped URLs
For the 2nd and 3rd articles, which have the common words “Tag” and “SEO” but are otherwise unrelated, the cosine similarity was 0.86. This shows why a high similarity threshold of 0.9 or greater is necessary: if we set it to 0.85, the output would be full of false positives and could suggest merging unrelated articles.
text-embedding-3-small
Quite surprisingly, ‘text-embedding-3-small’ didn’t find any matches at our similarity threshold of 0.9 or higher.
The cosine similarity was 0.76 for the 2nd and 4th articles, and 0.77 for the 5th and 7th.
To better understand this model through experimentation, I’ve added a slightly modified version of the 1st row with ’15’ vs. ’14’ to the sample.
“14 Most Important Meta And HTML Tags You Need To Know For SEO”
“15 Most Important Meta And HTML Tags You Need To Know For SEO”
An example which shows text-embedding-3-small results
On the contrary, ‘text-embedding-ada-002’ gave 0.98 cosine similarity between those versions.
Title 1 | Title 2 | Cosine Similarity
14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.92
14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.76
Here, we see that this model is not quite a good fit for comparing titles.
text-embedding-3-large
This model’s dimensionality is 3072, which is twice that of ‘text-embedding-3-small’ and ‘text-embedding-ada-002’, both of which have 1536 dimensions.
As it has more dimensions than the other models, we could expect it to capture semantic meaning with higher precision.
However, it gave the 2nd and 4th articles a cosine similarity of 0.70, and the 5th and 7th articles a similarity of 0.75.
I’ve tested it again with slightly modified versions of the first article with ’15’ vs. ’14’ and without ‘Most Important’ in the title.
“14 Most Important Meta And HTML Tags You Need To Know For SEO”
“15 Most Important Meta And HTML Tags You Need To Know For SEO”
“14 Meta And HTML Tags You Need To Know For SEO”
Title 1 | Title 2 | Cosine Similarity
14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.95
14 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.93
14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.70
15 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.86
So we can see that ‘text-embedding-3-large’ is underperforming compared to ‘text-embedding-ada-002’ when we calculate cosine similarities between titles.
I want to note that the accuracy of ‘text-embedding-3-large’ increases with the length of the text, but ‘text-embedding-ada-002’ still performs better overall.
Another approach could be to strip away stop words from the text. Removing these can sometimes help focus the embeddings on more meaningful words, potentially improving the accuracy of tasks like similarity calculations.
The best way to determine whether removing stop words improves accuracy for your specific task and dataset is to empirically test both approaches and compare the results.
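Here is a minimal sketch of that idea, using a small hand-picked stop word list (a real project might pull a fuller list from a library such as NLTK):

# Hypothetical minimal stop word list; extend it for real projects
stop_words = {"a", "an", "and", "the", "for", "to", "of", "in", "on", "with", "you", "your"}

def strip_stop_words(title: str) -> str:
    # Keep only the words that are not in the stop word list
    return " ".join(word for word in title.split() if word.lower() not in stop_words)

print(strip_stop_words("14 Most Important Meta And HTML Tags You Need To Know For SEO"))
# -> "14 Most Important Meta HTML Tags Need Know SEO"

You would then embed the stripped titles instead of the raw ones and compare the resulting groupings against the unstripped run.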
Conclusion
With these examples, you have learned how to work with OpenAI’s embedding models and can already perform a wide range of tasks.
For similarity thresholds, you need to experiment with your own datasets and see which thresholds make sense for your specific task by running it on smaller samples of data and performing a human review of the output.
Please note that the code in this article is not optimal for large datasets, since it regenerates the text embeddings for every article each time anything in your dataset changes, in order to evaluate each row against the others.
To make it efficient, we must use vector databases and store embedding information there once generated. We will cover how to use vector databases very soon and change the code sample here to use a vector database.
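Until we get to vector databases, a minimal stopgap (not part of the workflow above) is to cache embeddings locally so that unchanged titles are not re-embedded on every run. This sketch assumes the get_embedding() function from the script above and a hypothetical cache file name:

import json
import os

CACHE_FILE = 'embedding_cache.json'  # hypothetical local cache file

# Load any previously saved embeddings (a mapping of title -> embedding vector)
cache = {}
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, 'r', encoding='utf-8') as f:
        cache = json.load(f)

def get_embedding_cached(text):
    # Only call the API for titles we have not embedded before
    if text not in cache:
        cache[text] = get_embedding(text)  # get_embedding() as defined in the script above
    return cache[text]

# df['embedding'] = df['Title'].apply(get_embedding_cached)

# Persist the cache for the next run
with open(CACHE_FILE, 'w', encoding='utf-8') as f:
    json.dump(cache, f)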
But the question is, how can you make the most out of AI other than using a chatbot user interface?
For that, you need a profound understanding of how large language models (LLMs) work and to learn at least a basic level of coding. And yes, coding is absolutely necessary to succeed as an SEO professional nowadays.
This is the first of a series of articles that aim to level up your skills so you can start using LLMs to scale your SEO tasks. We believe that in the future, this skill will be required for success.
We need to start from the basics. It will include essential information, so later in this series, you will be able to use LLMs to scale your SEO or marketing efforts for the most tedious tasks.
Unlike other similar articles you may have read, we will start here from the end. The video below illustrates what you will be able to do after reading all the articles in the series on how to use LLMs for SEO.
Our team uses this tool to make internal linking faster while maintaining human oversight.
Did you like it? This is what you will be able to build yourself very soon.
Now, let’s start with the basics and equip you with the required background knowledge in LLMs.
What Are Vectors?
In mathematics, vectors are objects described by an ordered list of numbers (components) corresponding to the coordinates in the vector space.
A simple example of a vector is a vector in two-dimensional space, which is represented by (x, y) coordinates, as illustrated below.
Sample two-dimensional vector with x=13 and y=8 coordinates notating as (13,8)
In this case, the coordinate x=13 represents the length of the vector’s projection on the X-axis, and y=8 represents the length of the vector’s projection on the Y-axis.
Vectors that are defined with coordinates have a length, which is called the magnitude of a vector, or norm. For our two-dimensional simplified case, it is calculated by the formula: magnitude = √(x² + y²).
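For the (13, 8) vector above, a quick check in Python (a minimal sketch using NumPy) gives a magnitude of about 15.26:

import numpy as np

# Magnitude (norm) of the (13, 8) vector: sqrt(13**2 + 8**2) = sqrt(233)
print(np.linalg.norm([13, 8]))  # ≈ 15.26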
However, mathematicians went ahead and defined vectors with an arbitrary number of abstract coordinates (X1, X2, X3 … Xn), which is called an “N-dimensional” vector.
In the case of a vector in three-dimensional space, that would be three numbers (x,y,z), which we can still interpret and understand, but anything above that is out of our imagination, and everything becomes an abstract concept.
And here is where LLM embeddings come into play.
What Is Text Embedding?
Text embeddings are a subset of LLM embeddings, which are abstract high-dimensional vectors representing text that capture semantic contexts and relationships between words.
In LLM jargon, “words” are called data tokens, with each word being a token. More abstractly, embeddings are numerical representations of those tokens, encoding relationships between any data tokens (units of data), where a data token can be an image, sound recording, text, or video frame.
In order to calculate how close words are semantically, we need to convert them into numbers. Just like you subtract numbers (e.g., 10-6=4) and you can tell that the distance between 10 and 6 is 4 points, it is possible to subtract vectors and calculate how close the two vectors are.
Thus, understanding vector distances is important in order to grasp how LLMs work.
There are different ways to measure how close vectors are:
Euclidean distance.
Cosine similarity or distance.
Jaccard similarity.
Manhattan distance.
Each has its own use cases, but we will discuss only commonly used cosine and Euclidean distances.
What Is The Cosine Similarity?
It measures the cosine of the angle between two vectors, i.e., how closely those two vectors are aligned with each other.
Euclidean distance vs. cosine similarity
It is defined as follows: cosine similarity = (A · B) / (|A| × |B|), where the dot product of the two vectors is divided by the product of their magnitudes, a.k.a. lengths.
Its values range from -1, which means completely opposite, to 1, which means identical. A value of ‘0’ means the vectors are perpendicular.
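Here is a minimal sketch of that formula in NumPy, using toy two-dimensional vectors rather than real embeddings:

import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the two vectors' magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1, 0]), np.array([1, 0])))   # 1.0  (identical direction)
print(cosine_sim(np.array([1, 0]), np.array([0, 1])))   # 0.0  (perpendicular, unrelated)
print(cosine_sim(np.array([1, 0]), np.array([-1, 0])))  # -1.0 (opposite direction)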
In terms of text embeddings, achieving the exact cosine similarity value of -1 is unlikely, but here are examples of texts with 0 or 1 cosine similarities.
Cosine Similarity = 1 (Identical)
“Top 10 Hidden Gems for Solo Travelers in San Francisco”
“Top 10 Hidden Gems for Solo Travelers in San Francisco”
These texts are identical, so their embeddings would be the same, resulting in a cosine similarity of 1.
Cosine Similarity = 0 (Perpendicular, Which Means Unrelated)
“Quantum mechanics”
“I love rainy day”
These texts are totally unrelated, resulting in a cosine similarity of 0 between their BERT embeddings.
(Note: In the next chapters, we will practice working with embeddings in detail using Python and Jupyter.)
Examples shown with Vertex AI’s ‘text-embedding-preview-0409’ model and OpenAI’s ‘text-embedding-3-small’ model.
We are skipping the case with cosine similarity = -1 because it is highly unlikely to happen.
If you try to get cosine similarity for text with opposite meanings like “love” vs. “hate” or “the successful project” vs. “the failing project,” you will get 0.5-0.6 cosine similarity with Google Vertex AI’s ‘text-embedding-preview-0409’ model.
It is because the words “love” and “hate” often appear in similar contexts related to emotions, and “successful” and “failing” are both related to project outcomes. The contexts in which they are used might overlap significantly in the training data.
Cosine similarity can be used for the SEO tasks covered in this series, such as finding keyword cannibalization, building internal links, and implementing redirects at scale.
Cosine similarity focuses on the direction of the vectors (the angle between them) rather than their magnitude (length). As a result, it can capture semantic similarity and determine how closely two pieces of content align, even if one is much longer or uses more words than the other.
Deep diving and exploring each of these will be a goal of upcoming articles we will publish.
What Is The Euclidean Distance?
In case you have two vectors A(X1, Y1) and B(X2, Y2), the Euclidean distance is calculated by the following formula: d(A, B) = √((X2 - X1)² + (Y2 - Y1)²).
It is like using a ruler to measure the distance between two points (the red line in the chart above).
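A minimal sketch of that calculation with two toy points:

import numpy as np

a = np.array([1, 2])
b = np.array([4, 6])
# Straight-line ("ruler") distance: sqrt((4 - 1)**2 + (6 - 2)**2) = 5.0
print(np.linalg.norm(a - b))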
Euclidean distance can be used for SEO tasks such as detecting duplicate or near-duplicate content.
Here is an example of a Euclidean distance calculation with a value of 0.08, nearly 0, for duplicate content where paragraphs are just swapped – meaning the distance is essentially 0, i.e., the content we compare is practically the same.
Euclidean distance calculation example of duplicate content
Of course, you can use cosine similarity instead, and it will detect the duplicate content with a cosine similarity of 0.9 out of 1 (almost identical).
Here is a key point to remember: You should not merely rely on cosine similarity but use other methods, too, as Netflix’s research paper suggests that using cosine similarity can lead to meaningless “similarities.”
We show that cosine similarity of the learned embeddings can in fact yield arbitrary results. We find that the underlying reason is not cosine similarity itself, but the fact that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities.
As an SEO professional, you don’t need to fully comprehend that paper. Just remember that the research shows other distance methods, such as Euclidean distance, should also be considered, depending on your project’s needs and the results you are getting, to reduce false positives.
What Is L2 Normalization?
L2 normalization is a mathematical transformation applied to vectors to make them unit vectors with a length of 1.
To explain in simple terms, let’s say Bob and Alice walked a long distance. Now, we want to compare their directions. Did they follow similar paths, or did they go in completely different directions?
“Alice” is represented by a red dot in the upper right quadrant, and “Bob” is represented by a green dot.
However, since they have traveled far from the origin, it becomes difficult to measure the angle between their paths.
At the same time, the fact that they ended up far from each other doesn’t necessarily mean their paths point in different directions.
L2 normalization is like bringing both Alice and Bob back to the same closer distance from the starting point, say one foot from the origin, to make it easier to measure the angle between their paths.
Now, we see that even though they are far apart, their path directions are quite close.
A Cartesian plane with a circle centered at the origin.
This means that we’ve removed the effect of their different path lengths (a.k.a. vector magnitudes) and can focus purely on the direction of their movements.
In the context of text embeddings, this normalization helps us focus on the semantic similarity between texts (the direction of the vectors).
Most embedding models, such as OpenAI’s ‘text-embedding-3-large’ or Google Vertex AI’s ‘text-embedding-preview-0409’ models, return pre-normalized embeddings, which means you don’t need to normalize them yourself.
But, for example, the BERT model ‘bert-base-uncased’ does not return pre-normalized embeddings.
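If you are working with a model that does not normalize its outputs, a minimal L2-normalization sketch in Python looks like this (the embedding values are made up):

```python
# A minimal L2 normalization sketch (the embedding values are hypothetical).
import numpy as np

def l2_normalize(vector: np.ndarray) -> np.ndarray:
    # Divide the vector by its L2 norm (length) so the result has a length of 1.
    return vector / np.linalg.norm(vector)

embedding = np.array([3.0, 4.0])          # length = 5
unit_embedding = l2_normalize(embedding)  # [0.6, 0.8], length = 1
print(np.linalg.norm(unit_embedding))     # 1.0
```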
Conclusion
This was the introductory chapter of our series of articles to familiarize you with the jargon of LLMs, which I hope made the information accessible without needing a PhD in mathematics.
If you still have trouble memorizing these terms, don’t worry. As we cover the next sections, we will refer back to the definitions introduced here, and you will come to understand them through practice.
The next chapters will be even more interesting:
Introduction To OpenAI’s Text Embeddings With Examples.
Introduction To Google’s Vertex AI Text Embeddings With Examples.
Introduction To Vector Databases.
How To Use LLM Embeddings For Internal Linking.
How To Use LLM Embeddings For Implementing Redirects At Scale.
Putting It All Together: LLMs-Based WordPress Plugin For Internal Linking.
Many of you may say that there are tools you can buy that do these types of things automatically, but those tools cannot handle many project-specific tasks that require a custom approach.
Using SEO tools is always great, but having skills is even better!
Believe it or not, PPC automation was around long before the days of ChatGPT.
But with the rise in popularity of AI, machine learning, and automation – where does that leave PPC management?
Sticking strictly to manual PPC management has been shown to limit efficiency and margins.
But PPC automation layering isn’t just for pros. Whether you’re managing one account or 10 accounts, having a human component alongside automation can be the key to success.
This in-depth guide will teach you:
What PPC automation layering is.
How automation has impacted Google Ads.
How automation has impacted PPC practitioners.
Use cases for PPC automation layering.
What Is Automation Layering?
PPC automation layering is the strategic use of multiple automation tools and rules to manage and optimize PPC campaigns.
The main goal of PPC automation layering is to improve the efficiency and effectiveness of your PPC efforts.
This is where automation layering comes in.
Automation layering leverages different automation features, technologies, and sometimes third-party tools at various levels of your campaign management.
Some examples of automation layering include:
Smart Bidding strategies – Ad platforms take care of keyword bidding based on goals input within campaign settings. Examples of Smart Bidding include Target CPA, Target ROAS, Maximize Conversions, and more.
Automated PPC rules – Ad platforms can run specific account rules on a schedule based on the goal of the rule. An example would be to have Google Ads pause time-sensitive sale ads on a specific day and time.
PPC scripts – These are blocks of code that give ad platforms certain parameters to look out for and then have the platform take a specific action if those parameters are met.
Google Ads Recommendations tab – Google reviews campaign performance and puts together recommendations for PPC marketers to either act on or dismiss if irrelevant.
Third-party automation tools – Tools such as Google Ads Editor, Optmyzr, Adalysis, and more can help take PPC management to the next level with their automated software and additional insights.
See the pattern here?
Automation and machine learning turn the inputs of PPC marketers into management outputs that drive better campaign results.
How Has Automation Impacted Google Ads?
Over the years, Google Ads (as well as Microsoft Ads, Meta, etc.) have introduced countless automation features for marketers.
So, what’s their main goal here?
To make paid media management easier for advertisers.
In recent years, Google Ads has made a shift towards small to medium-sized businesses (SMBs), with claims that setting up a campaign is as easy as inputting a few settings and letting Google handle the rest.
More experienced advertisers have seen firsthand the shift from hands-on PPC management to a more “hands-off” approach driven by automation features.
While everyone can agree that easier paid media management sounds great, the learning curve for marketers has been far from perfect.
Automation has essentially taken over many of the day-to-day management tasks that PPC advertisers were used to doing.
For example, a few automation features built into the Google Ads platform include:
Keyword and campaign bid management.
Audience expansion.
Automated ad asset creation.
Keyword expansion.
And much more.
In theory, having Google Ads take over much of the busy work sounds great.
But in reality, the shift toward automation has been far from seamless.
This leads us to the next big question: Will automation replace PPC marketers?
Does Automation Replace PPC Experts?
Job layoffs and restructuring due to automation are certainly a sensitive topic.
Add in a looming economic slowdown, and you’ve got many marketers clinging to their roles as hard as they can.
The marketing industry has seen shifts in job responsibilities, specifically in the content space, thanks to platforms like ChatGPT, Bard, and more.
But it’s time to settle this debate once and for all.
Automation will not replace the need for PPC marketers.
Now, keep in mind that automation has already replaced many of the day-to-day tasks that PPC experts have been used to doing for the past decade or so.
What we have, and will continue to see, is a shift in the role of PPC experts.
Since automation and machine learning take the role of day-to-day management, PPC experts will spend more time doing things such as:
Analyzing data and data quality.
Strategic decision making.
Reviewing and optimizing outputs from automation.
Automation and machines are great at pulling levers, making overall campaign management more efficient.
But automation tools alone cannot replace the human touch of crafting a story from data and insights.
This is the beauty of PPC automation layering.
Lean into what automation tools have to offer, which leaves you more time to become a more strategic PPC marketer.
PPC Automation Layering Use Cases
There are many ways that PPC marketers and automation technologies can work together for optimal campaign results.
Below are just a few examples of how to use automation layering to your advantage.
1. Make The Most Of Smart Bidding Capabilities
As mentioned earlier in this guide, Smart Bidding is one of the most useful PPC automation tools.
Google Ads has developed its own automated bidding strategies to take the guesswork out of manual bid management.
However, Smart Bidding is not foolproof and certainly not a “set and forget” strategy.
Smart Bidding outputs can only be as effective as the inputs given to the machine learning system.
So, how should you use automation layering for Smart Bidding?
First, pick a Smart Bidding strategy that best fits an individual campaign goal. You can choose from:
Target cost per action (CPA).
Target return on ad spend (ROAS).
Maximize conversions.
Maximize conversion value.
Whenever starting a Smart Bidding strategy, it’s important to put some safeguards in place to reduce the volatility in campaign performance.
This could mean setting up an automated rule to alert you whenever significant volatility is reported, such as:
Spike in cost per click (CPC) or cost.
Dip in impressions, clicks, or cost.
Either of these scenarios could be due to learning curves in the algorithm, or it could be an indicator that your bids are too low or too high.
For example, say a campaign has a set target CPA goal of $25, but then all of a sudden, impressions and clicks fall off a cliff.
This could mean that the target CPA is set too low, and the algorithm has throttled ad serving to show ads only to the individual users it thinks are most likely to purchase.
Without having an alert system in place, campaign volatility could go unnoticed for hours, days, or even weeks if you’re not checking performance in a timely manner.
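As a rough illustration (this is not Google Ads code; the metrics below are a hypothetical export you might pull from a scheduled report), a simple volatility check could look like this:

```python
# A rough sketch of a volatility check on exported campaign metrics.
# The numbers are hypothetical; in practice they might come from a scheduled report export.
yesterday = {"impressions": 12000, "clicks": 480, "avg_cpc": 1.10}
today = {"impressions": 4500, "clicks": 150, "avg_cpc": 1.95}

def pct_change(old: float, new: float) -> float:
    return (new - old) / old * 100

alerts = []
if pct_change(yesterday["avg_cpc"], today["avg_cpc"]) > 30:
    alerts.append("CPC spiked by more than 30%")
if pct_change(yesterday["impressions"], today["impressions"]) < -30:
    alerts.append("Impressions dropped by more than 30%")

for alert in alerts:
    print(alert)  # Hook this into an email or Slack notification as needed
```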
2. Interact With Recommendations & Insights To Improve Automated Outputs
The goal of the ad algorithms is to get smarter every day and improve campaign performance.
But again, automated outputs are only as good as the input signals they are given at the start.
Many experienced PPC marketers tend to write off the Google Ads Recommendations or Insights tab due to perceptions of receiving irrelevant suggestions.
However, these systems are meant to learn from marketers’ input in order to optimize better.
Just because a recommendation is given in the platform does not mean you have to implement it.
The beauty of this tool is you have the ability to dismiss the opportunity and then tell Google why you’re dismissing it.
There’s even an option for “this is not relevant.”
Be willing to interact with the Recommendations and Insights tab on a weekly or bi-weekly basis to help better train the algorithms for optimizing performance based on what you signal as important.
Over the weeks and months, these recommendations and insights can save you significant time, which in turn gives you more room to focus on strategic measures in your account(s).
3. Automate Competitor Analysis With Tools
It’s one thing to ensure your ads and campaigns are running smoothly at all times.
Next-level strategy is using automation to keep track of your competitors and what they’re doing.
Multiple third-party tools have competitor analysis features to alert you on items such as:
Keyword coverage.
Content marketing.
Social media presence.
Market share.
And more.
Keep in mind that these tools require a paid subscription, but many are also useful in automation areas beyond competitor analysis.
The goal is not simply to keep up with your competitors and copy what they’re doing.
Setting up automated competitor analysis helps you stay informed, so you can reinforce your market positioning or react in ways that set you apart from competitor content.
In Summary
PPC automation layering has been around for a while and continues to mature each year.
While automation has replaced certain aspects of a PPC manager’s job, it can’t replace the necessary human components of an effective PPC strategy.
Use this guide to learn how to use automation layering to your advantage to get the most bang for your buck.
Celebrate the Holidays with some of SEJ’s best articles of 2023.
Our Festive Flashback series runs from December 21 – January 5, featuring daily reads on significant events, fundamentals, actionable strategies, and thought leader opinions.
2023 has been quite eventful in the SEO industry and our contributors produced some outstanding articles to keep pace and reflect these changes.
Catch up on the best reads of 2023 to give you plenty to reflect on as you move into 2024.
As an SEO professional, you’ve heard about ChatGPT and BARD – or even used them yourself.
You might also be familiar with a handful of machine learning models from Google, such as BERT and RankBrain.
These are all great applications of machine learning.
But it isn’t always immediately apparent how machine learning can be useful for your BAU (business as usual) tasks or daily SEO to-dos.
There’s also the question of ethics, as in data privacy and fairness, when it comes to using machine learning in your work.
However, understanding how machine learning works in search (and in real life) can only work to your advantage as an SEO pro – whether you’re technical or not.
In this comprehensive article, learn the types of machine learning and ethics associated with the technology, then take inspiration from seven machine learning examples that have had a great impact on our world.
As an added bonus, at the end of this article, explore eight machine learning examples directly applied to SEO.
What Is Machine Learning?
Machine learning is a subset of artificial intelligence (AI) which enables systems to learn and improve without the need to be explicitly programmed. This is achieved by using large data sets that train the parameters in the model.
Perhaps the most well-known examples of machine learning currently are ChatGPT and BARD – and while this post won’t be focusing on them, I decided to ask them both to explain to me what machine learning is. Here are their responses:
ChatGPT
“Machine learning is a type of artificial intelligence that allows software applications to learn from the data and become more accurate in predicting outcomes without explicit programming. In other words, it is a method of teaching computers to learn from the data they are given, and then to make decisions or predictions based on that learning.
The goal of machine learning is to develop algorithms that can learn patterns in data, and then use those patterns to make decisions or predictions about new data. This is done by training the machine learning algorithm on a dataset of known inputs and outputs, and then using that knowledge to make predictions on new, unseen data.
Some common applications of machine learning include image recognition, natural language processing, fraud detection, and recommendation systems.”
Screenshot from ChatGPT, April 2023
BARD
“Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.
Machine learning can be used for a variety of tasks, including:
Classification: Sorting data into categories. For example, a spam filter might classify emails as spam or not spam.
Regression: Predicting a continuous value. For example, a weather forecast might predict the temperature tomorrow.
Clustering: Grouping data points together based on their similarity. For example, a recommender system might cluster users together based on their past purchases.
Anomaly detection: Identifying unusual data points. For example, a fraud detection system might identify transactions that are likely to be fraudulent.
Machine learning is a powerful tool that can be used to solve a variety of problems. However, it is important to note that machine learning algorithms are only as good as the data they are trained on. If the data is biased, the algorithm will be biased as well. Additionally, machine learning algorithms can be susceptible to overfitting, which is when the algorithm learns the training data too well and does not generalize well to new data.
Despite these challenges, machine learning is a rapidly growing field with a wide range of applications. As the amount of data available to us continues to grow, machine learning is likely to play an even greater role in our lives in the future.”
Screenshot from BARD, April 2023
Supervised Learning
As the name suggests, in supervised learning, the model learns under supervision. It is trained with labeled data, and the output is manually reviewed.
The machine learns from the labeled data, which is then used to make future predictions.
Once the output is received, the model remembers it and will use it for the next operation.
Image from author, April 2023
There are two main types of supervised learning: classification and regression.
Classification
Classification is when the output variable is categorical, with two or more classes that the model can identify; for example, true or false and dog or cat.
Examples of this include predicting whether emails are likely to be spam or whether an image is of a dog or cat.
In both of these examples, the model will be trained on data that is either classified as spam or not spam, and whether an image contains a dog or cat.
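To make this concrete, here is a minimal supervised classification sketch using scikit-learn and a tiny, made-up spam dataset (a toy illustration, not a production model):

```python
# A minimal supervised classification sketch with scikit-learn (toy spam example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now",        # spam
    "Claim your free gift today",  # spam
    "Meeting agenda for Monday",   # not spam
    "Lunch tomorrow at noon?",     # not spam
]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)    # Turn the labeled texts into features
model = MultinomialNB().fit(X, labels)  # Learn from the labeled examples

new_email = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new_email))         # Most likely ['spam']
```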
Regression
This is when the output variable is a real or continuous value, and there is a relationship between the variables. Essentially, a change in one variable is associated with a change that occurs in the other variable.
The model then learns the relationship between them and predicts what the outcome will be depending on the data it is given.
For example, predicting humidity based on a given temperature value or what the stock price is likely to be at a given time.
Unsupervised Learning
Unsupervised learning is when the model uses unlabeled data and learns by itself, without any supervision. Essentially, unlike supervised learning, the model will act on the input data without any guidance.
It does not require any labeled data, as its job is to look for hidden patterns or structures in the input data and then organize it according to any similarities and differences.
For example, if a model is given pictures of both dogs and cats, it isn’t already trained to know the features that differentiate both. Still, it can categorize them based on patterns of similarities and differences.
Image from author, April 2023
There are also two main types of unsupervised learning: clustering and association.
Clustering
Clustering is the method of sorting objects into groups (clusters) so that objects that are similar to each other end up in the same cluster, while dissimilar objects end up in different clusters.
Examples of this include recommendation systems and image classifying.
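As a minimal clustering sketch (toy two-dimensional points and scikit-learn’s KMeans; a real system would cluster far richer features):

```python
# A minimal clustering sketch with scikit-learn's KMeans (made-up 2D points).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # one natural group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # another natural group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)
print(kmeans.labels_)  # Each point is assigned to one of the two clusters
```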
Association
Association is rule-based and is used to discover the probability of the co-occurrence of items within a collection of values.
Examples include fraud detection, customer segmentation, and discovering purchasing habits.
Semi-supervised Learning
Semi-supervised learning bridges both supervised and unsupervised learning by using a small section of labeled data, together with unlabeled data, to train the model. It, therefore, works for various problems, from classification and regression to clustering and association.
Semi-supervised learning can be used if there is a large amount of unlabeled data, as it only requires a small portion of the data to be labeled to train the model, which can then be applied to the remaining unlabeled data.
Google has used semi-supervised learning to better understand language used within a search to ensure it serves the most relevant content for a particular query.
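As a minimal sketch of the idea (scikit-learn’s SelfTrainingClassifier with toy one-dimensional data, where -1 marks unlabeled samples):

```python
# A minimal semi-supervised sketch: unlabeled samples are marked with -1,
# and the model assigns labels to them itself during training (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.10], [0.20], [0.90], [1.00], [0.15], [0.95]])
y = np.array([0, 0, 1, 1, -1, -1])  # -1 = unlabeled

model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict([[0.12], [0.97]]))  # Expected: [0 1]
```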
Reinforcement Learning
Reinforcement learning is when a model is trained to return the optimum solution to a problem by taking a sequential approach to decision-making.
It uses trial and error from its own experiences to define the output, with rewards for positive behavior and negative reinforcement if it is not working towards the goal.
The model interacts with the environment that has been set up and comes up with solutions without human interference.
Human interference will then be introduced to provide either positive or negative reinforcement depending on how close to the goal the output is.
Examples include robotics – think robots working on a factory assembly line – and gaming, with AlphaGo as the most famous example. Here, the model was trained to beat the world Go champion by using reinforcement learning to define the best approach to winning the game.
Machine Learning Ethics
There is no doubt that machine learning has many benefits, and the use of machine learning models is ever-growing.
However, it’s important to consider the ethical concerns that come with using technology of this kind. These concerns include:
The accuracy of a machine learning model and whether it will generate the correct output.
Bias in the data that is used to train models, which causes a bias in the model itself, and, therefore, a bias in the outcome. If there is historical bias in data, that bias will often be replicated throughout.
The fairness in outcomes and the overall process.
Privacy – particularly with data that is used to train machine learning models – as well as the accuracy of the outcomes and predictions.
7 Machine Learning Examples In The Real World
1. Netflix
Netflix uses machine learning in a number of ways to provide the best experience for its users.
The company is also continually collecting large amounts of data, including ratings, the location of users, the length of time for which something is watched, if content is added to a list, and even whether something has been binge-watched.
This data is then used to further improve its machine learning models.
Content Recommendations
TV and movie recommendations on Netflix are personalized to each individual user’s preferences. To do this, Netflix deployed a recommendation system that considers previous content consumed, users’ most viewed genres, and content watched by users with similar preferences.
Auto-generated Thumbnails
Netflix discovered that the images used on the browse screen make a big difference in whether users watch something or not.
It, therefore, uses machine learning to create and display different images according to a user’s individual preferences. It does this by analyzing a user’s previous content choices and learning the kind of image that is more likely to encourage them to click.
These are just two examples of how Netflix uses machine learning on its platform. If you want to learn more about how it is used, you can check out the company’s research areas blog.
2. Airbnb
With millions of listings in locations across the globe at different price points, Airbnb uses machine learning to ensure users can find what they are looking for quickly and to improve conversions.
There are a number of ways the company deploys machine learning, and it shares a lot of details on its engineering blog.
Image Classification
As hosts can upload images for their properties, Airbnb found that a lot of images were mislabeled. To try and optimize user experience, it deployed an image classification model that used computer vision and deep learning.
The project aimed to categorize photos based on different rooms. This enabled Airbnb to show listing images grouped by room type and ensure the listing follows Airbnb’s guidelines.
In order to do this, it retrained the image classification neural network ResNet50, with a small number of labeled photos. This enabled it to accurately classify current and future images uploaded to the site.
Search Ranking
To provide a personalized experience for users, Airbnb deployed a ranking model that optimized search and discovery. The data for this model came from user engagement metrics such as clicks and bookings.
Listings started by being ordered randomly, and then various factors were given a weight within the model – including price, quality, and popularity with users. The more weight a listing had, the higher it would be displayed in listings.
This has since been optimized further, with training data including the number of guests, price, and availability also included within the model to discover patterns and preferences to create a more personalized experience.
3. Spotify
Spotify uses a recommendation algorithm that predicts a user’s preference based on a collection of data from other users. This is due to numerous similarities that occur between music types that clusters of people listen to.
Playlists are one way it can do this, using statistical methods to create personalized playlists for users, such as Discover Weekly and daily mixes.
It can then use further data to adjust these depending on a user’s behavior.
With personal playlists also being created in the millions, Spotify has a huge database to work with – particularly if songs are grouped and labeled with semantic meaning.
This has allowed the company to recommend songs to users with similar music tastes. The machine learning model can serve songs to users with a similar listening history to aid music discovery.
Natural Language
With Natural Language Processing (NLP) algorithms enabling computers to understand text better than ever before, Spotify is able to categorize music based on the language used to describe it.
It can scrape the web for text on a particular song and then use NLP to categorize songs based on this context.
This also helps algorithms identify songs or artists that belong in similar playlists, which further helps the recommendation system.
4. Detecting Fake News
While AI tools such as machine learning content generators can be a source of fake news, machine learning models that use natural language processing can also be used to assess articles and determine whether they include false information.
Social network platforms use machine learning to find words and patterns in shared content that could indicate fake news is being shared and flag it appropriately.
5. Health Detection
There is an example of a neural network that was trained on over 100,000 images to distinguish dangerous skin lesions from benign ones. When tested against human dermatologists, the model could accurately detect 95% of skin cancer from the images provided, compared to 86.6% by the dermatologists.
As the model missed fewer melanomas, it was determined to have a higher sensitivity and was continually trained throughout the process.
There is hope that machine learning and AI, together with human intelligence, may become a useful tool for faster diagnosis.
Other ways image detection is being used in healthcare include identifying abnormalities in X-rays or scans and identifying key markups that may indicate an underlying illness.
6. Wildlife Security
Protection Assistant for Wildlife Security is an AI system that is being used to evaluate information about poaching activity to create a patrol route for conservationists to help prevent poaching attacks.
The system is continually being provided with more data, such as locations of traps and sightings of animals, which helps it to become smarter.
This predictive analysis enables patrol units to identify areas that poachers are likely to visit.
8 Machine Learning Examples In SEO
1. Content Quality
Machine learning models can be trained to improve the quality of website content by predicting what both users and search engines would prefer to see.
The model can be trained on the most important insights, including search volume and traffic, conversion rate, internal links, and word count.
A content quality score can then be generated for each page, which will help inform where optimizations need to be made and can be particularly useful for content audits.
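As a rough sketch of what such a model could look like (the features, data, and audit scores below are entirely hypothetical):

```python
# A rough content quality model sketch (hypothetical features, data, and scores).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Columns: search volume, monthly traffic, conversion rate, internal links, word count
features = np.array([
    [1000, 350, 0.020, 12, 1800],
    [500,   40, 0.005,  2,  400],
    [2400, 900, 0.035, 25, 2600],
    [150,   10, 0.001,  1,  300],
])
quality_scores = np.array([72, 35, 88, 20])  # Hypothetical scores from a past content audit

model = RandomForestRegressor(random_state=42).fit(features, quality_scores)

new_page = np.array([[800, 120, 0.010, 5, 900]])
print(model.predict(new_page))  # Predicted quality score for the new page
```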
2. Natural Language Processing
Natural Language Processing (NLP) uses machine learning to reveal the structure and meaning of text. It analyzes text to understand the sentiment and extract key information.
NLP focuses on understanding context rather than just words. It is more about the content around keywords and how they fit together into sentences and paragraphs, than keywords on their own.
The overall sentiment is also taken into account, as it refers to the feeling behind the search query. The types of words used within the search help to determine whether it is classified as having a positive, negative, or neutral sentiment.
The key areas of importance for NLP are:
Entity – Words representing tangible objects such as people, places, and things that are identified and evaluated.
Categories – Text separated into categories.
Salience – How relevant the entity is.
Google has a free NLP API demo that can be used to analyze how text is seen and understood by Google. This enables you to identify improvements to content.
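If you want to experiment locally, here is a minimal entity-extraction sketch using spaCy as a stand-in (this is not Google’s API; it assumes the ‘en_core_web_sm’ model has been downloaded):

```python
# A minimal entity-extraction sketch using spaCy as a stand-in for an NLP API.
# Requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google opened a new office in Zurich to expand its Search team.")

for ent in doc.ents:
    # ent.text is the entity, ent.label_ is its category (e.g., ORG, GPE)
    print(ent.text, ent.label_)
```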
Recommendations In The World Of NLP
NLP is also being used to review and understand anchor text that is used to link pages. Therefore, it is more important than ever to ensure anchor text is relevant and informative.
Ensure each page has a natural flow, with headings providing hierarchy and readability.
Answer the question the article targets as quickly as possible, so that users and search engines can find key information without too much effort.
Use correct spelling and punctuation to signal authority and trustworthiness.
3. Google’s Models
AI and machine learning are used throughout Google’s many products and services. Their most popular use in the context of search is to understand language and the intent behind search queries.
Previously, the search systems looked for matching words only, which didn’t even consider misspellings. Eventually, algorithms were created to find patterns that identified misspellings and potential typos.
There have been several systems introduced throughout the last few years after Google confirmed in 2016 its intention to become a machine learning first company.
RankBrain
The first of these was RankBrain, which was introduced in 2015 and helps Google to understand how different words are related to different concepts.
This enables Google to take a broad query and better define how it relates to real-world concepts.
Google’s systems learn from seeing words used in a query on the page, which it can then use to understand terms and match them to related concepts to understand what a user is searching for.
Neural Matching
Neural matching was launched in 2018 and introduced to local search in 2019.
This helps Google understand how queries relate to pages by looking at the content on a page, or a search query, and understanding it within the context of the page content or query.
Most queries made today make use of neural matching, and it is used in rankings.
BERT
BERT enables Google to understand how combinations of words express different meanings and intent by reviewing the whole sequence of words on a page.
BERT is now used in most queries, as it helps Google understand what a user is looking for to surface the best results related to the search.
MUM
MUM, which stands for Multitask Unified Model, was introduced in 2021 and is used to understand languages and variations in search terms.
This uses the latest advancements to find patterns in sentences and correlations between different words to understand nuanced questions – and even predict which words are likely to come next.
4. Predictive Prefetching
By combining historical website data on user behavior with the capabilities of machine learning, some tools can guess which page a user is likely to navigate to next and begin prefetching the necessary resources to load the page.
This is known as predictive prefetching and can enhance website performance.
Predictive prefetching can also apply to other scenarios, such as forecasting pieces of content or widgets that users are most likely to view or interact with and personalizing the experience based on that information.
5. Testing
Running SEO A/B tests is one of the most effective ways to prove the SEO impact of changes, and generating statistically significant results is possible with the use of machine learning algorithms and neural networks.
SearchPilot is an example of an SEO A/B testing platform that is powered by machine learning and neural network models.
Starting with a bucketing algorithm that creates statistically similar buckets of control and variant pages to perform tests on, a neural network model then forecasts expected traffic to the pages the test is being run on.
The neural network model, which is trained to account for any and all external influences such as seasonality, competitor activity, and algorithm updates, will also analyze the organic search traffic to the variant pages and identify how they perform against the control group throughout the test.
This also enables users to calculate whether any difference in traffic is statistically significant.
(Disclaimer: I work for SearchPilot.)
6. Internal Linking
Machine learning can help with internal linking in two ways:
Updating broken links: Machine learning can crawl your site to spot any broken internal links and then replace them with a link to the best alternative page (a minimal sketch of the broken-link check follows at the end of this section).
Suggesting relevant internal linking: These tools can leverage big data to suggest relevant internal links during the article creation process and over time.
The other internal linking task is an internal link audit. This includes analyzing the number of internal links to a page, the placement of the links together with the anchor text, and the overall crawl depth of the page.
Anchor text classification can also be performed to identify the phrases used most frequently in anchor text and categorize them based on topics and whether they are branded or non-branded terms.
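Here is the minimal broken-internal-link check mentioned above (a sketch, not a full crawler; it assumes the requests and beautifulsoup4 packages, and the URL is hypothetical):

```python
# A minimal sketch of a broken internal link check (not a full crawler).
# Assumes the `requests` and `beautifulsoup4` packages; the URL is hypothetical.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://www.example.com/blog/some-article/"
html = requests.get(page_url, timeout=10).text

for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
    target = urljoin(page_url, link["href"])
    if target.startswith("https://www.example.com"):  # internal links only
        status = requests.head(target, timeout=10, allow_redirects=True).status_code
        if status >= 400:
            print(f"Broken internal link: {target} ({status})")
```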
7. Image Captioning For Alt Text
As SEO pros, we understand the importance of image alt text. Alt text improves accessibility for people who use screen readers while also helping search engine crawlers understand the content of the page the images are placed on.
Language vision models can be used to automatically caption images, therefore providing content that can be used as alt text. Image captioning is used to describe what is shown within an image in a single sentence.
Two models are used for image captioning, each as important as the other. The image-based model starts by extracting features from the image, while the language-based model translates those features into a logical sentence.
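As a minimal sketch of automated captioning (using a publicly available vision-language model via the Hugging Face transformers pipeline; the image path is hypothetical):

```python
# A minimal image-captioning sketch using a vision-language model from Hugging Face.
# Assumes the `transformers` and `Pillow` packages; the image path is hypothetical.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("product-photo.jpg")  # Local file path or URL

# The generated sentence can be reviewed and then used as the image's alt text.
print(result[0]["generated_text"])
```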
If you’re interested in how machine learning can be used in daily SEO tasks, this article by Lazarina Stoy is a must-read – and if you would like to play around with some super interesting scripts, this collection of Colab notebooks from Britney Muller is the perfect place to start.
In Conclusion
Machine learning isn’t limited to just ChatGPT and BARD.
There are many practical applications for machine learning, both in the real world and specifically in the world of SEO – and these are likely just the beginning.
And while it will be vital to remain cognizant of the ethical questions associated with machine learning, it has exciting implications for the future of SEO.
Google has postponed the launch of Gemini, which was initially set for next week. According to sources for The Information, the delay is due to the AI’s inconsistent performance with non-English queries.
Why Was Google Gemini Delayed?
Google’s CEO, Sundar Pichai, reportedly canceled several events in California, New York, and Washington, where Gemini would be unveiled.
These events were crucial to Google’s strategy, marking what could have been its most significant product introduction of the year.
The delay highlights the intense competition in the AI sector, particularly against OpenAI, Microsoft, and Meta.
GPT-4 Outperformed Gemini In Multilingual Tasks
In certain respects, Gemini, positioned as comparable to OpenAI’s GPT-4, has reportedly fallen short in handling multilingual tasks effectively.
This shortcoming is particularly notable given Google’s global market presence and the importance of diverse language support in AI technologies.
Implications For Google’s AI Strategy
This development has implications beyond Gemini itself. Other Google products like Bard, Assistant, and Docs, which are expected to be enhanced by Gemini’s capabilities, may see those updates delayed.
This setback contrasts with the growing popularity of Microsoft Copilot, which recently announced integration with OpenAI’s latest features, including GPTs.
Traditionally a leader in AI, Google is racing to match the pace of innovation set by OpenAI. While Google Bard’s capabilities continuously increase, it still falls behind the more advanced features available to premium users of ChatGPT.
Despite the delay, Google remains committed to advancing Gemini, with Pichai expressing a focus on ensuring its competitiveness and state-of-the-art capabilities.
As the company refines Gemini, how it will reshape the landscape of conversational AI and compete with rapidly advancing rivals like OpenAI remains to be seen.
In its latest announcement, OpenAI introduced Data Partnerships, a collaborative effort aimed at enhancing the training of AI models, particularly as the world edges closer to achieving artificial general intelligence (AGI).
This initiative aims to ensure that AI systems deeply understand the vast array of subject matters, industries, cultures, and languages, a goal only possible through accessing a wide-ranging training dataset.
According to the company, the value of incorporating varied datasets into AI is immense, especially for AGI. By understanding nuanced human intentions, behaviors, and interactions, AGI can serve humanity in a more sophisticated and beneficial manner.
This development could be particularly noteworthy for in-house brand marketers, SEO professionals, and marketing agencies.
The potential of AI to comprehend and analyze complex marketing data and consumer behaviors could be significantly enhanced by integrating industry-specific datasets and domain-specific knowledge.
OpenAI Data Partnerships — an open call for organizations to work with us to represent their data in AI training: https://t.co/aTHBQBRddK
The call for large-scale datasets is not limited to any single modality. OpenAI is open to text, images, audio, and video.
They are particularly interested in data that captures human intentions, like long-form writing or conversations, rather than disconnected snippets. Not your Slack archives.
Data partners have the opportunity to influence the future of AI and stand to gain from AI models that are better attuned to their specific needs.
Organization types invited to participate include academic, research, religious, government, political, media, for-profit, non-profit, grassroots community, and individual.
Screenshot from OpenAI, November 2023
Interested partners must create an open-source dataset that can be utilized publicly for AI training or prepare private datasets for training proprietary AI models.
One approach plays a significant role in fostering an open ecosystem, while the other caters to organizations wishing to keep their data confidential.
To streamline the process, OpenAI is prepared to handle data in almost any form (digital or physical) and will assist with digitization and structuring using its advanced OCR and ASR technologies.
In addition, OpenAI is keen on protecting privacy and sensitive information, offering to work closely with partners to clean the data and ensure it is free from personal or third-party information.
As AI continues to intertwine with various sectors, the invitation from OpenAI presents a unique opportunity for the marketing industry.
By contributing to the creation of AGI that truly understands the market and its consumers, organizations can not only guide the trajectory of AI development but also reap the rewards of more sophisticated, domain-aware AI tools.
OpenAI’s Data Partnerships represent a critical step towards realizing AGI that is safe, beneficial, and comprehensively informed by “human society.”