Report: Deepfake porn consistently found atop Google, Bing search results

Popular search engines like Google and Bing are making it easy to surface nonconsensual deepfake pornography by placing it at the top of search results, NBC News reported Thursday.

These controversial deepfakes superimpose faces of real women, often celebrities, onto the bodies of adult entertainers to make them appear to be engaging in real sex. Thanks in part to advances in generative AI, there is now a burgeoning black market for deepfake porn that could be discovered through a Google search, NBC News previously reported.

NBC News uncovered the problem by turning off safe search, then combining the names of 36 female celebrities with obvious search terms like “deepfakes,” “deepfake porn,” and “fake nudes.” Bing generated links to deepfake videos in top results 35 times, while Google did so 34 times. Bing also surfaced “fake nude photos of former teen Disney Channel female actors” using images where the actors appear to be underage.

A Google spokesperson told NBC that the tech giant understands “how distressing this content can be for people affected by it” and is “actively working to bring more protections to Search.”

According to Google’s spokesperson, this controversial content sometimes appears because “Google indexes content that exists on the web,” just “like any search engine.” But while searches using terms like “deepfake” may generate results consistently, Google “actively” designs “ranking systems to avoid shocking people with unexpected harmful or explicit content that they aren’t looking for,” the spokesperson said.

Currently, the only way to remove nonconsensual deepfake porn from Google search results is for the victim to submit a form personally or through an “authorized representative.” That form requires victims to meet three requirements: showing that they’re “identifiably depicted” in the deepfake; the “imagery in question is fake and falsely depicts” them as “nude or in a sexually explicit situation”; and the imagery was distributed without their consent.

While this gives victims a course of action for removing content, experts are concerned that search engines need to do more to effectively reduce the prevalence of deepfake pornography available online, which is currently rising at a rapid rate.

This emerging issue increasingly affects average people and even children, not just celebrities. Last June, child safety experts discovered thousands of realistic but fake AI child sex images being traded online, around the same time that the FBI warned that the use of AI-generated deepfakes in sextortion schemes was increasing.

And nonconsensual deepfake porn isn’t just being traded in black markets online. In November, New Jersey police launched a probe after high school teens used AI image generators to create and share fake nude photos of female classmates.

With tech companies seemingly slow to stop the rise in deepfakes, some states have passed laws criminalizing deepfake porn distribution. Last July, Virginia amended its existing law criminalizing revenge porn to include any “falsely created videographic or still image.” In October, New York passed a law specifically focused on banning deepfake porn, imposing a $1,000 fine and up to a year of jail time on violators. Congress has also introduced legislation that would create criminal penalties for spreading deepfake porn.

Although Google told NBC News that its search features “don’t allow manipulated media or sexually explicit content,” the outlet’s investigation seemingly found otherwise. NBC News also noted that Google’s Play app store hosts an app that was previously marketed for creating deepfake porn, despite prohibiting “apps determined to promote or perpetuate demonstrably misleading or deceptive imagery, videos and/or text.” This suggests that Google’s remediation efforts blocking deceptive imagery may be inconsistent.

Google told Ars that it will soon be strengthening its policies against apps featuring AI-generated restricted content in the Play Store. A generative AI policy taking effect on January 31 will require all apps to comply with developer policies that ban AI-generated restricted content, including deceptive content and content that facilitates the exploitation or abuse of children.

Experts told NBC News that “Google’s lack of proactive patrolling for abuse has made it and other search engines useful platforms for people looking to engage in deepfake harassment campaigns.”

Google is currently “in the process of building more expansive safeguards, with a particular focus on removing the need for known victims to request content removals one by one,” Google’s spokesperson told NBC News.

Microsoft’s spokesperson told Ars that they were looking into our request for comment. We will update this report with any new information that Microsoft shares.

In the past, Microsoft President Brad Smith has said that among all dangers that AI poses, deepfakes worry him most, but deepfakes fueling “foreign cyber influence operations” seemingly concern him more than deepfake porn.

This story was updated on January 11 to include information on Google’s AI-generated content policy.

https://arstechnica.com/?p=1995499




Quality rater and algorithmic evaluation systems: Are major changes coming?

Crowd-sourced human quality raters have been the mainstay of the algorithmic evaluation process for search engines for decades. Still, a potential sea-change in research and production implementation could be on the horizon. 

Recent groundbreaking research by Bing (with some purported commercial implementation already) and a sharp uptick in closely related information retrieval research by others indicate some big shake-ups are coming.

These shake-ups may have far-reaching consequences both for the armies of quality raters and for the frequency of the algorithmic updates we see go live.

The importance of evaluation

In addition to crawling, indexing, ranking and result serving, search engines rely on another important process: evaluation.

How well does a current or proposed search result set or experimental design align with the notoriously subjective notion of relevance to a given query, at a given time, for a given search engine user’s contextual information needs?

Since relevance and intent for many queries are always changing, and the way users prefer to consume information evolves, search result pages also need to change to meet both the searcher’s intent and their preferred user interface.

Some query intent shifts are predictable, temporal and periodic. For example, in the period approaching Black Friday, many queries typically considered informational might take on sweeping commercial intent. Similarly, a transport query like [Liverpool Manchester] might shift to a sports query on local derby match days.

In these instances, an ever-expanding legacy of historical data supports a high probability of what users consider more meaningful results, albeit temporarily. These levels of confidence likely make seasonal or other predictably periodic results and temporary UI design shifts relatively straightforward adjustments for search engines to implement.

However, when it comes to broader notions of evolving “relevance” and “quality,” and for the purposes of experimental design changes too, search engines must know that a proposed ranking change developed by search engineers is truly better, and more precisely meets information needs, than the results currently generated.

Evaluation is an important stage in search results evolution and vital to providing confidence in proposed changes – and substantial data for any adjustments (algorithmic tuning) to the proposed “systems,” if required. 

Evaluation is where humans “enter the loop” (offline and online) to provide feedback in various ways before roll-outs to production environments.

This is not to say evaluation is not a continuous part of production search. It is. However, an ongoing judgment of existing results and user activity will likely evaluate how well an implemented change continues to fare in production against an acceptable relevance (or satisfaction) metric range, a range based on the initial human judge-submitted relevance evaluations.

In a 2022 paper titled, “The crowd is made of people: Observations from large-scale crowd labelling,” Thomas et al., who are researchers from Bing, allude to the ongoing use of such metric ranges in a production environment when referencing a monitored component of web search “evaluated in part by RBP-based scores, calculated daily over tens of thousands of judge-submitted labels.” (RBP stands for Rank-Biased Precision).
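For reference, Rank-Biased Precision models a searcher who scans results from the top and moves to the next result with a fixed “persistence” probability. Here is a minimal sketch of the standard formulation in Python (the label list and persistence value are invented for illustration, not Bing’s data):

```python
# A minimal sketch of Rank-Biased Precision (RBP) over binary relevance labels,
# following the standard formulation: RBP = (1 - p) * sum(r_i * p**(i - 1)),
# where p is the user's "persistence" (probability of moving to the next result).
def rbp(labels, persistence=0.8):
    return (1 - persistence) * sum(
        rel * persistence ** rank for rank, rel in enumerate(labels)
    )

# Invented example: relevant results at ranks 1 and 3 of a judged result list.
print(round(rbp([1, 0, 1, 0, 0], persistence=0.8), 3))  # 0.328
```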

Human-in-the-loop (HITL)

Data labels and labeling

An important point before we continue: I will mention labels and labeling a lot throughout this piece, and clarifying what is meant by those terms will make the rest of this article much easier to understand.

Here are a couple of real-world examples most people will be familiar with:

  • Have you ever checked a Gmail account and marked something as spam?
  • Have you ever marked a film on Netflix as “Not for me,” “I like this,” or “love this”?

All of these submitted actions by you create data labels used by search engines or in information retrieval systems. Yes, even Netflix has a huge foundation in information retrieval, and a great information retrieval research team, too. (Note that Netflix’s work is information retrieval, with a strong focus on a subset of that field called “recommender systems.”)

By marking “Not for me” on a Netflix film, you submitted a data label. You became a data labeler to help the “system” understand more about what you like (and also what people similar to you like) and to help Netflix train and tune their recommender systems further.

Data labels are all around us. Labels mark up data so it can be transformed into mathematical forms for measurement at scale.

Enormous amounts of these labels and “labeling” in the information retrieval and machine learning space are used as training data for machine learning models.

“This image has been labeled as a cat.” 

“This image has been labeled as a dog… cat… dog… dog… dog… cat,” and so on. 

All of the labels help machines learn what a dog or a cat looks like with enough examples of images marked as cats or dogs.

Labeling is not new; it’s been around for centuries, since the first classification of items took place. A label was assigned when something was marked as being in a “subset” or “set of things.” 

Anything “classified” has effectively had a label attached to it, and the person who marked the item as belonging to that particular classification is considered the labeler.

But moving forward to recent times, probably the best-known data labeling example is that of reCAPTCHA. Every time we select the little squares on the image grid, we add labels, and we are labelers. 

We, as humans, “enter the loop” and provide feedback and data.

With that explanation out of the way, let us move on to the different ways data labels and feedback are acquired, and in particular, feedback for “relevance” to queries to tune algorithms or evaluate experimental design by search engines.

Implicit and explicit evaluation feedback

While Google, in documents meant for a non-technical audience, refers to its evaluation systems overall as “rigorous testing,” human-in-the-loop evaluations in information retrieval widely happen through implicit or explicit feedback.

Implicit feedback

With implicit feedback, the user isn’t actively aware they are providing feedback. The many live search traffic experiments (i.e., tests in the wild) search engines carry out on tiny segments of real users (as small as 0.1%), and subsequent analysis of click data, user scrolling, dwell time and result skipping, fall into the category of implicit feedback.

In addition to live experiments, the ongoing general click, scroll and browse behavior of real search engine users can also constitute implicit feedback and likely feeds into machine-learned “Learning to Rank” (LTR) click models.

This, in turn, feeds into rationales for proposed algorithmic relevance changes, as non-temporal searcher behavior shifts and world changes lead to unseen queries and new meanings for queries. 

There is the age-old SEO debate around whether rankings change immediately before further evaluation from implicit click data. I will not cover that here other than to say there is considerable awareness of the huge bias and noise that comes with raw click data in the information retrieval research space and the huge challenges in its continuous use in live environments. Hence, the many pieces of research work around proposed click models for unbiased learning to rank and learning to rank with bias.
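To make the kind of correction these unbiased learning-to-rank approaches apply a little more concrete, here is a toy sketch of inverse propensity weighting in general, not any search engine’s implementation; the examination propensities and the click log below are invented for illustration:

```python
# A toy sketch of inverse propensity scoring (IPS) for click data: each click is
# up-weighted by one over the estimated probability that the user even examined
# that position, so lower-ranked results are not unfairly treated as irrelevant.
# Propensities here are invented; real systems estimate them from data.
EXAMINATION_PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def ips_relevance_estimate(click_log):
    """click_log: list of (doc_id, position, clicked) tuples from search sessions."""
    estimates = {}
    for doc_id, position, clicked in click_log:
        weight = 1.0 / EXAMINATION_PROPENSITY.get(position, 0.1)
        estimates.setdefault(doc_id, []).append(clicked * weight)
    # Average the propensity-weighted clicks per document as a debiased estimate.
    return {doc: sum(vals) / len(vals) for doc, vals in estimates.items()}

log = [("doc_a", 1, 1), ("doc_b", 3, 1), ("doc_b", 4, 0), ("doc_a", 2, 0)]
print(ips_relevance_estimate(log))
```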

Regardless, it is no secret overall in information retrieval how important click data is for evaluation purposes. There are countless papers and even IR books co-authored by Google research team members, such as “Click Models for Web Search” (Chuklin, Markov and De Rijke, 2015).

Google also openly states in their “rigorous testing” article:

“We look at a very long list of metrics, such as what people click on, how many queries were done, whether queries were abandoned, how long it took for people to click on a result and so on.”

And so a cycle continues. Detected change needed from Learning to Rank, click model application, engineering, evaluation, detected change needed, click model application, engineering, evaluation, and so forth.

Explicit feedback

In contrast to implicit feedback from unaware search engine users (in live experiments or in general use), explicit feedback is derived from actively aware participants or relevance labelers. 

The purpose of this relevance data collection is to mathematically roll it up and adjust overall proposed systems.

A gold standard of relevance labeling – considered near to a ground truth (i.e., the reality of the real world) of intent to query matching – is ultimately sought. 

There are various ways in which a gold standard of relevance labeling is gathered. However, a silver standard (less precise than gold but more widely available data) is often acquired (and accepted) and likely used to assist in further tuning.

Explicit feedback takes four main formats. Each has its advantages and disadvantages, largely concerning relevance labeling quality (compared with a gold standard or ground truth) and how scalable the approach is.

Real users in feedback sessions with user feedback teams

Search engine user research teams and real users provided with different contexts in different countries collaborate in user feedback sessions to provide relevance data labels for queries and their intents. 

This format likely provides near to a gold standard of relevance. However, the method is not scalable due to its time-consuming nature, and the number of participants could never be anywhere near representative of the wider search population at large.

True subject matter experts / topic experts / professional annotators

True subject matter experts and professional relevance assessors provide relevance labels for queries mapped to their intents, including many nuanced cases.

Since these are the authors of the query to intent mappings, they know the exact intent, and this type of labeling is likely considered near to a gold standard. However, this method, similar to the user feedback research teams format, is not scalable due to the sparsity of relevance labels and, again, the time-consuming nature of this process. 

This method was more widely used before the introduction, in recent times, of the more scalable approach of crowd-sourced human quality raters (covered below).

Search engines simply ask real users whether something is relevant or helpful

Search engines actively ask real users whether a search result is helpful (or relevant), and those users consciously provide explicit binary feedback in the form of yes or no responses, with recent “thumbs up” design changes spotted in the wild.

rustybrick on X - Google search result poll

Crowd-sourced human quality raters

The main source of explicit feedback comes from “the crowd.” Major search engines have huge numbers of crowd-sourced human quality raters provided with some training and handbooks and hired through external contractors working remotely worldwide. 

Google alone has a purported 16,000 such quality raters. These crowd-sourced relevance labelers and the programs they are part of are referred to differently by each search engine. 

Google refers to its participants as “quality raters” in the Quality Raters Program, with the third-party contractor referring to Google’s web search relevance program as “Project Yukon.” 

Bing refers to their participants as simply “judges” in the Human Relevance System (HRS), with third-party contractors referring to Bing’s project as simply “Web Content Assessor.” 

Despite these differences, participants’ purposes are primarily the same. The role of the crowd-sourced human quality rater is to provide synthetic relevance labels emulating search engine users across the world as part of explicit algorithmic feedback. Feedback often takes the form of a side-by-side (pairwise) comparison of proposed changes versus either existing systems or alongside other proposed system changes. 

Since much of this is considered offline evaluation, it isn’t always live search results that are being compared but also images of results. And it isn’t always a pairwise comparison, either. 

These are just some of the many different types of tasks that human quality raters carry out for evaluation, and data labeling, via third-party contractors. The relevance judges likely also continue to monitor results after a proposed change rolls out to production search. (For example, as the aforementioned Bing research paper alludes to.)

Whatever the method of feedback acquisition, human-in-the-loop relevance evaluations (either implicit or explicit) play a significant role before the many algorithmic updates (Google launched over 4,700 changes in 2022 alone, for example), including the now increasingly frequent broad core updates, which ultimately appear to be an overall evaluation of fundamental relevance revisited.




Relevance labeling at a query level and a system level

Despite the blog posts we have seen alerting us to the scary prospect of human quality raters visiting our sites (via referral traffic analysis), naturally, in systems built for scale, individual results of quality rater evaluations at a page level, or even at an individual rater level, have no significance on their own.

Human quality raters do not judge websites or webpages in isolation 

Evaluation is a measurement of systems, not web pages – with “systems” meaning the algorithms generating the proposed changes. All of the relevance labels (i.e., “relevant,” “not relevant,” “highly relevant”) provided by labelers roll up to a system level. 

“We use responses from raters to evaluate changes, but they don’t directly impact how our search results are ranked.”

– “How our Quality Raters make Search results better,” Google Search Help

In other words, while relevance labeling doesn’t directly impact rankings, aggregated data labeling does provide a means to take an overall (average) measurement of how much more precisely relevant a proposed algorithmic change (system) might be when results are ranked, with lots of reliance on various types of algorithmic averages.

Query-level scores are combined to determine system-level scores. Data from relevance labels is turned into numerical values and then into “average” precision metrics to “tune” the proposed system further before any roll-out to search engine users more broadly. 

How far is the reality, once “humans enter the loop,” from the average precision metrics engineers hoped to achieve with the proposed change?

While we cannot be entirely sure of the metrics used on aggregated data labels when everything is turned into numerical values for relevance measurement, there are universally recognized information retrieval ranking evaluation metrics in many research papers. 

Most authors of such papers are search engine engineers, academics, or both. Production follows research in the information retrieval field, of which all web search is a part.

Such metrics are order-aware evaluation metrics: the ranked order of relevance matters, and the evaluation is weighted, or “punished,” if the ranked order is incorrect. These metrics (with a minimal sketch of how they roll up just after the list) include:

  • Mean reciprocal rank (MRR).
  • Rank-biased precision (RBP).
  • Mean average precision (MAP).
  • Normalized and un-normalized discounted cumulative gain (NDCG and DCG respectively).
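To make that rollup concrete, here is a minimal, hypothetical Python sketch of how binary relevance labels for a handful of queries might be distilled into some of these order-aware metrics and then averaged into a system-level score. The query names and label lists are invented for illustration; this is not any search engine’s actual pipeline:

```python
# A minimal sketch: per-query relevance labels -> order-aware metrics -> system-level averages.
from math import log2

def reciprocal_rank(labels):
    """labels: relevance labels (1 = relevant, 0 = not) in ranked order."""
    for i, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def dcg(labels):
    """Discounted cumulative gain: gains at later ranks are discounted logarithmically."""
    return sum(rel / log2(i + 1) for i, rel in enumerate(labels, start=1))

def ndcg(labels):
    """DCG normalized by the ideal (perfectly ordered) DCG."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0

def average_precision(labels):
    """Mean of the precision values at each rank where a relevant result appears."""
    hits, precisions = 0, []
    for i, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Query-level labels (e.g., judge-submitted) roll up to system-level scores.
judged_queries = {
    "query a": [1, 0, 1, 0, 0],  # hypothetical labels for one ranked result list
    "query b": [0, 1, 0, 0, 1],
}
n = len(judged_queries)
map_score = sum(average_precision(l) for l in judged_queries.values()) / n
mrr = sum(reciprocal_rank(l) for l in judged_queries.values()) / n
mean_ndcg = sum(ndcg(l) for l in judged_queries.values()) / n
print(f"MAP={map_score:.3f}  MRR={mrr:.3f}  mean NDCG={mean_ndcg:.3f}")
```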

In a 2022 research paper co-authored by a Google research engineer, NDCG and AP (average precision) are referred to as a norm in the evaluation of pairwise ranking results:

“A fundamental step in the offline evaluation of search and recommendation systems is to determine whether a ranking from one system tends to be better than the ranking of a second system. This often involves, given item-level relevance judgments, distilling each ranking into a scalar evaluation metric, such as average precision (AP) or normalized discounted cumulative gain (NDCG). We can then say that one system is preferred to another if its metric values tend to be higher.”

– “Offline Retrieval Evaluation Without Evaluation Metrics,” Diaz and Ferraro, 2022

Information on DCG, NDCG, MAP, MRR and their commonality of use in web search evaluation and ranking tuning is widely available.

Victor Lavrenko, a former assistant professor at the University of Edinburgh, also describes one of the more common evaluation metrics, mean average precision, well:

“Mean Average Precision (MAP) is the standard single-number measure for comparing search algorithms. Average precision (AP) is the average of … precision values at all ranks where relevant documents are found. AP values are then averaged over a large set of queries…”

So it’s literally all about averages: the curated data labels judges submit are distilled into a consumable numerical metric, and that metric is compared against the predicted averages hoped for after engineering, with the ranking algorithms then tuned further.

Quality raters are simply relevance labelers

Quality raters are simply relevance labelers, classifying and feeding a huge pipeline of data, rolled up and turned into numerical scores used for:

  • Aggregation, to determine whether a proposed change is near an acceptable average level of relevance precision or user satisfaction.
  • Or to determine whether the proposed change needs further tuning (or total abandonment).

The sparsity of relevance labeling causes a bottleneck

Regardless of the evaluation metrics used, the initial data is the most important part of the process (the relevance labels) since, without labels, no measurement via evaluation can take place.

A ranking algorithm or proposed change is all very well, but unless “humans enter the loop” and determine whether it is relevant in evaluation, the change likely won’t happen.

For the past couple of decades in information retrieval, the main pipeline of this HITL-labeled relevance data has come from crowd-sourced human quality raters, who replaced the professional (but fewer in number) expert annotators as search engines (and their need for speedy iteration) grew.

These raters feed in yays and nays that are in turn converted into numbers and averages in order to tune search systems.

But scale (and the need for more and more relevance labeled data) is increasingly problematic, and not just for search engines (even despite these armies of human quality raters). 

The scalability and sparsity issue of data labeling presents a global bottleneck and the classic “demand outstrips supply” challenge.

Widespread demand for data labeling has grown phenomenally due to the explosion in machine learning in many industries and markets. Everyone needs lots and lots of data labeling. 

Recent research by consulting firm Grand View Research illustrates the huge growth in market demand, reporting:

“The global data collection and labeling market size was valued at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion.”

This is very problematic, particularly in increasingly competitive arenas such as AI-driven generative search, where the effective training of large language models requires huge amounts of labeling and annotation of many types.

Authors at DeepMind, in a 2022 paper, state:

 “We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled.” 

– “Training Compute-Optimal Large Language Models,” Hoffmann et al.

When the amount of labels needed grows quicker than the crowd can reliably produce them, a bottleneck in scalability for relevance and quality via rapid evaluation on production roll-outs can occur. 

Lack of scalability and sparsity do not fit well with speedy iterative progress

Lack of scalability was an issue when search engines moved away from the industry norm of professional, expert annotators and toward crowd-sourced human quality raters providing relevance labels, and scale and data sparsity are once again a major issue with the status quo of using the crowd.

Some problems with crowd-sourced human quality raters

In addition to the lack of scale, other issues come with using the crowd. Some of these relate to human nature, human error, ethical considerations and reputational concerns.

While relevance remains largely subjective, crowd-sourced human quality raters are provided with, and tested on, lengthy handbooks, in order to determine relevance. 

Google’s publicly available Quality Raters Guide is over 160 pages long, and Bing’s Human Relevance Guidelines is “reported to be over 70 pages long,” per Thomas et al.

Bing is much more coy with its relevance training handbooks. Still, if you root around, as I did when researching this piece, you can find in the depths of the web some documentation with incredible detail on what relevance means (in this instance for local search), which looks like one of its judging guidelines.

Efforts are made in this training to instill a mindset appreciative of the evaluator’s role as a “pseudo” search engine user in their natural locale. 

The synthetic user mindset needs to consider many factors when emulating real users with different information needs and expectations. 

These needs and expectations depend on several factors beyond simply their locale, including age, race, religion, gender, personal opinion and political affiliation. 

The crowd is made of people

Unsurprisingly, humans are not without their failings as relevance data labelers.

Human error needs no explanation at all, and bias on the web is a known concern, not just for search engines but more generally in search, machine learning, and AI overall. Hence, the dedicated “responsible AI” field has emerged, in part to deal with combating baked-in biases in machine learning and algorithms.

However, findings in the 2022 large-scale study by Thomas et al., Bing researchers, highlight factors leading to reduced precision relevance labeling going beyond simple human error and traditional conscious or unconscious bias.

Despite the training and handbooks, Bing’s findings, derived from “hundreds of millions of labels, collected from hundreds of thousands of workers as a routine part of search engine development,” underscore some less obvious factors, more akin to physiological and cognitive factors, that contribute to reduced precision in relevance labeling tasks. They can be summarized as follows:

  • Task-switching: Corresponded directly with a decline in quality of relevance labeling, which was significant, as only 28% of participants worked on a single task in a session, with all others moving between tasks.
  • Left side bias: In a side-by-side comparison, a result displayed on the left side was more likely to be selected as relevant when compared with results on the right side. Since pair-wise analysis by search engines is widespread, this is concerning.
  • Anchoring: Played a part in relevance labeling choices: the relevance label a labeler assigns to the first result is much more likely to also be the label assigned to the second result. The labeler hooks (anchors) onto their first choice and, having no real notion of relevance or context at that point, is highly likely to choose the same relevance label for the next option. The probability of repeating the same label declined over the first 10 evaluated queries in a session, after which the researchers found the anchoring issue seemed to disappear, as the labeler gathered more information from subsequent pairwise sets to consider.
  • General fatigue of crowd-workers played a part in reduced precision labeling.
  • General disagreement between judges over which of the two options in a pairwise result was relevant: simply differing opinions and perhaps a lack of true understanding of the intended search engine user’s context.
  • Time of day and day of week when evaluators carried out labeling also played a role. The researchers noted spikes in reduced relevance labeling accuracy that appeared to correlate with regional celebrations being underway, and which might easily have been dismissed as simple human error, or noise, if not explored more fully.

The crowd is not perfect at all.

A dark side of the data labeling industry

Then there is the other side of the use of human crowd-sourced labelers, one which concerns society as a whole: that of low-paid “ghost workers” in emerging economies employed to label data for search engines and others in the tech and AI industry.

Major online publications have increasingly drawn attention to this issue.

And we have Google’s own third-party quality raters protesting for higher pay as recently as February 2023, with claims of “poverty wages and no benefits.”

Add to all of this the potential for human error, bias, scalability concerns with the status quo, the subjectivity of “relevance,” the lack of true searcher context at the time of query and the inability to truly determine whether a query has navigational intent.

And we have not even touched upon the potential minefield of regulations and privacy concerns around implicit feedback.

How to deal with lack of scale and “human issues”?

Enter large language models (LLMs), ChatGPT and increasing use of machine-generated synthetic data.

Is the time right to look at replacing ‘the crowd’?

A 2022 research piece from “Frontiers of Information Access Experimentation for Research and Education” involving several respected information retrieval researchers explores the feasibility of replacing the crowd, illustrating the conversation is well underway.

Clarke et al. state: 

“The recent availability of LLMs has opened the possibility to use them to automatically generate relevance assessments in the form of preference judgements. While the idea of automatically generated judgements has been looked at before, new-generation LLMs drive us to re-ask the question of whether human assessors are still necessary.”

However, when considering the current situation, Clarke et al. raise specific concerns around a possible degradation in the quality of relevance labeling in exchange for huge scale potentials:

Concerns about reduced quality in exchange for scale?

“It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available.” 

The researchers draw parallels with the previous major shift in the information retrieval space, some years before, away from professional annotators and toward “the crowd,” continuing:

“Nevertheless, a similar change in terms of data collection paradigm was observed with the increased use of crowd assessor…such annotation tasks were delegated to crowd workers, with a substantial decrease in terms of quality of the annotation, compensated by a huge increase in annotated data.”

They surmise that, over time, a spectrum of balanced machine and human collaboration, or a hybrid approach to relevance labeling for evaluations, may be a feasible way forward.

A wide range of options from 0% machine and 100% human right across to 100% machine and 0% human is explored.

The researchers consider options whereby the human is at the beginning of the workflow providing more detailed query annotations to assist the machine in relevance evaluation, or at the end of the process to check the annotations provided by the machines.

In this paper, the researchers draw attention to the unknown risks that may emerge through the use of LLMs in relevance annotation over human crowd usage, but do concede that, at some point, there will likely be an industry move toward replacing human annotators with LLMs:

“It is yet to be understood what the risks associated with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators.”

Things move fast in the world of LLMs

But much progress can take place in a year, and despite these concerns, other researchers are already rolling with the idea of using machines as relevance labelers.

Despite the concerns raised in the Clarke et al. paper around reduced annotation quality should a large-scale move toward machine usage occur, in less than a year, there has been a significant development that impacts production search.

Very recently, Mark Sanderson, a well-respected and established information retrieval researcher, shared a slide from a presentation by Paul Thomas, one of four Bing research engineers presenting their work on the implementation of GPT-4 as relevance labelers rather than humans from the crowd. 

Researchers from Bing have made a breakthrough in using LLMs to replace “the crowd” annotators (in whole or in part) in the 2023 paper, “Large language models can accurately predict searcher preferences.” 

The enormity of this recent work by Bing (in terms of the potential change for search research) was emphasized in a tweet by Sanderson. Sanderson described the talk as “incredible,” noting, “Synthetic labels have been a holy grail of retrieval research for decades.”

While sharing the paper and subsequent case study, Thomas also shared that Bing is now using GPT-4 for its relevance judgments. So, not just research, but (to an unknown extent) in production search too.

Mark Sanderson on X

So what has Bing done?

The use of GPT-4 at Bing for relevance labeling

The traditional approach of relevance evaluation typically produces a varied mixture of gold and silver labels when “the crowd” provides judgments from explicit feedback after reading “the guidelines” (Bing’s equivalent of Google’s Quality Raters Guide). 

In addition, live tests in the wild utilizing implicit feedback typically generate gold labels (the reality of the real world “human in the loop”), but with a lack of scale and high relative costs. 

Bing’s approach used GPT-4 as machine-learned pseudo-relevance annotators, created via prompt engineering and selected based on a carefully chosen set of gold standard labels. The purpose of these instances is to emulate quality raters in detecting relevance.

This was then rolled out to provide bulk “gold label” annotations more widely via machine learning, reportedly for a fraction of the relative cost of traditional approaches. 

The prompt included telling the system that it is a search quality rater whose purpose is to assess whether documents in a set of results are relevant to a query, using a label reduced to a binary relevant / not relevant judgment for consistency and to minimize complexity in the research work.

To aggregate evaluations more broadly, Bing sometimes utilized up to five pseudo-relevance labelers via machine learning per prompt.
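As a rough illustration only (not Bing’s actual prompt, model settings or pipeline, which the paper describes in far more detail), the general pattern might look something like the sketch below. The call_llm function, the prompt wording and the five-vote majority are assumptions made for the example:

```python
# A hypothetical sketch of an LLM-as-relevance-labeler pattern: prompt a model to
# act as a quality rater, ask for a binary judgment, and aggregate several samples
# by majority vote. Illustrative only; not Bing's prompt or code.
from collections import Counter

PROMPT_TEMPLATE = """You are a search quality rater evaluating search results.
Given a query and a result, answer with exactly one word:
"relevant" or "not_relevant".

Query: {query}
Result title: {title}
Result snippet: {snippet}
Answer:"""

def judge_relevance(call_llm, query, title, snippet, n_labelers=5):
    """call_llm: any function that takes a prompt string and returns the model's
    text completion (the specific API client is deliberately abstracted away)."""
    prompt = PROMPT_TEMPLATE.format(query=query, title=title, snippet=snippet)
    votes = []
    for _ in range(n_labelers):  # several pseudo-labelers per query/result pair
        answer = call_llm(prompt).strip().lower()
        votes.append("not_relevant" if "not" in answer else "relevant")
    label, count = Counter(votes).most_common(1)[0]  # majority vote as aggregation
    return label, count / n_labelers  # label plus a crude agreement score
```

Calibrating machine labels of this kind against a small, trusted set of gold labels (as the “select via gold labels” step outlined later in this piece suggests) would sit on top of a pattern like this.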

The approach and impacts for cost, scale and purported accuracy are illustrated below and compared with other traditional explicit feedback approaches, plus implicit online evaluation.

Interestingly, two of the co-authors are also co-authors of Bing’s research piece, “The Crowd is Made of People,” and are undoubtedly well aware of the challenges of using the crowd.

Source: “Large language models can accurately predict searcher preferences,” Thomas et al., 2023

With these findings, Bing researchers claim:

“To measure agreement with real searchers needs high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.” 

Scale and low-cost combined

These findings illustrate machine learning and large language models have the potential to reduce or eliminate bottlenecks in data labeling and, therefore, the evaluation process.

This is a sea-change pointing the way to an enormous step forward in how evaluation before algorithmic updates is undertaken, since the potential for scale at a fraction of the cost of “the crowd” is considerable.

It’s not just Bing reporting on the success of machines over humans in relevance labeling tasks, and it’s not just ChatGPT either. Research into whether human assessors can be replaced in part or wholly by machines certainly picked up pace elsewhere in 2022 and 2023, too.

Others are reporting some success in utilizing machines over humans for relevance labeling, too

In a July 2023 paper, researchers at the University of Zurich found that open source large language models (FLAN and HuggingChat) outperform human crowd workers (including trained relevance annotators and consistently high-scoring crowd-sourced MTurk human relevance annotators).

Although this work was carried out on tweet analysis rather than search results, their findings were that these open-source large language models were not only better than humans but were almost as good in their relevance labeling as ChatGPT (Alizadeh et al., 2023).

This opens the door to even more potential going forward for large-scale relevance annotations without the need for “the crowd” in its current format.

But what might come next, and what will become of ‘the crowd’ of human quality raters?

Responsible AI importance 

Caution is likely overwhelmingly front of mind for search engines. There are other highly important considerations.

Responsible AI, the as-yet-unknown risks of these approaches, and the detection and removal of baked-in bias (or at least an awareness of, and adjustment for, it), to name but a few. LLMs tend to “hallucinate,” and “overfitting” could present problems as well, so monitoring might well consider factors such as these, with guardrails built as necessary.

Explainable AI also calls for models to provide an explanation as to why a label or other type of output was deemed relevant, so this is another area where there will likely be further development. Researchers are also exploring ways to create bias awareness in LLM relevance judgments. 

Human relevance assessors are monitored continuously anyway, so continual monitoring is already a part of the evaluation process. However, one can presume Bing, and others, would tread much more cautiously with this machine-led approach than with “the crowd” approach. Careful monitoring will also be required to avoid drops in quality in exchange for scalability.

In outlining their approach (illustrated in the image above), Bing shared this process: 

  • Select via gold labels
  • Generate labels in bulk
  • Monitor with several methods

“Monitor with several methods” would certainly fit with a clear note of caution.

Next steps?

Bing, and others, will no doubt look to improve upon these new means of gathering annotations and relevance feedback at scale. The door is unlocked to a new agility.

A low-cost, hugely scalable relevance judgment process undoubtedly gives a strong competitive advantage when adjusting search results to meet changing information needs.

As the saying goes, the cat is out of the bag, and one could presume the research will continue to heat up to a frenzy in the information retrieval space (including other search engines) in the short to medium term.

A spectrum of human and machine assessors?

In their 2023 paper “HMC: A Spectrum of Human–Machine-Collaborative Relevance Judgement Frameworks,” Clarke et al. allude to a feasible approach in which subsequent stages of a move toward replacing the crowd with machines might take a hybrid or spectrum form.

While a spectrum of human-machine collaboration might increase in favor of machine-learned methods as confidence grows and after careful monitoring, none of this means “the crowd” will leave entirely. The crowd may become much smaller, though, over time.

It seems unlikely that search engines (or IR research at large) would move completely away from using human relevance judges as a guardrail and a sobering sense-check or even to act as judges of the relevance labels generated by machines. Human quality raters also present a more robust means of combating “overfitting.”

Not all search areas are considered equal in terms of their potential impact on the life of searchers. Clarke et al., 2023, stress the importance of a more trusted human judgment in areas such as journalism, and this would fit well with our understanding as SEOs of Your Money or Your Life (YMYL).

The crowd might well just take on other roles depending upon the weighting in a spectrum, possibly moving into more of a supervisory role, or as an exam marker of machine-learned assessors, with exams provided for large language models requiring explanations as to how judgments were made.

Clarke et al. ask: “What weighting between human and LLMs and AI-assisted annotations is ideal?” 

What weighting of human to machine is implemented in any spectrum or hybrid approach might depend on how quickly the pace of research picks up. While not entirely comparable, if we look at the herd movement in the research space after the introduction of BERT and transformers, one can presume things will move very quickly indeed. 

Furthermore, there is also a massive move toward synthetic data already, so this “direction of travel” fits with that. 

According to Gartner:

  • “Solutions such as AI-specific data management, synthetic data and data labeling technologies, aim to solve many data challenges, including accessibility, volume, privacy, security, complexity and scope.” 
  • “By 2024, Gartner predicts 60% of data for AI will be synthetic to simulate reality, future scenarios and de-risk AI, up from 1% in 2021.” 

Will Google adopt these machine-led evaluation processes?

Given the sea-change to decades-old practices in the evaluation processes widely used by search engines, it would seem unlikely Google would not at least be looking into this very closely or even be striving towards this already. 

If the evaluation process has a bottleneck removed via the use of large language models, leading to massively reduced data sparsity for relevance labeling and algorithmic update feedback at lower costs for the same, and the potential for higher quality levels of evaluation too, there is a certain sense in “going there.”

Bing has a significant commercial advantage with this breakthrough, and Google has to stay in, and lead, the AI game.

Removals of bottlenecks have the potential to massively increase scale, particularly in non-English languages and into additional markets where labeling might have been more difficult to obtain (for example, the subject matter expert areas or the nuanced queries around more technical topics). 

While we know that Google’s Search Generative Experience beta, despite expanding to 120 countries, is still considered an experiment to learn how people might interact with, or find useful, generative AI search experiences, Google has already stepped over the “AI line.”

Greg Gifford on X - SGE is an experiment

However, Google is still incredibly cautious about using AI in production search.

Who can blame them, given all the antitrust and legal cases, plus the prospect of reputational damage and increasing legislation related to user privacy and data protection?

James Manyika, Google’s senior vice president of technology and society, speaking at Fortune’s Brainstorm AI conference in December 2022, explained:

“These technologies come with an extraordinary range of risks and challenges.” 

However, Google is not shy about undertaking research into the use of large language models. Heck, BERT came from Google in the first place. 

Certainly, Google is exploring the potential use of synthetic query generation for relevance prediction, too, as illustrated in a recent 2023 paper by Google researchers presented at the SIGIR information retrieval conference.

Google paper 2023 on relevance prediction

Since synthetic data in AI/ML reduces other risks that might relate to privacy, security, and the use of user data, simply generating data out of thin air for relevance prediction evaluations may actually be less risky than some of the current practices.

Add to this the other factors that could build a case for Google jumping on board with these new machine-driven evaluation processes (to any extent, even if the spectrum is mostly human to begin with):

  • The research in this space is heating up. 
  • Bing is running with some commercial implementation of machine over people labeling. 
  • SGE needs loads of labels.
  • There are scale challenges with the status quo.
  • The increasing spotlight on the use of low-paid workers in the data-labeling industry overall. 
  • Respected information retrieval researchers are asking whether now is the time to revisit the use of machines over humans in labeling.

Openly discussing evaluation as part of the update process

Google also seems to be talking much more openly of late about “evaluation,” and about how experiments and updates are undertaken following “rigorous testing.” There does seem to be a shift toward opening up the conversation with the wider community.

Here’s Danny Sullivan just last week giving an update on updates and “rigorous testing.”

Martin Splitt on X - Search Central Live

And again, explaining why Google does updates.

Greg Bernhardt on X

Search Off the Record recently discussed “Steve,” an imaginary search engine, and how updates to Steve might be implemented based on the judgments of human evaluators, with potential for bias, amongst other points discussed. There was a good amount of discussion around how changes to Steve’s features were tested and so forth.

This all seems to indicate a shift around evaluation unless I am simply imagining this.

In any event, there are already elements of machine learning in the relevance evaluation process, albeit implicit feedback. Indeed, Google recently updated its documentation on “how search works” around detecting relevant content via aggregated and anonymized user interactions.

“We transform that data into signals that help our machine-learned systems better estimate relevance.”

So perhaps following Bing’s lead is not that far a leap to take after all?

What if Google takes this approach?

What might we expect to see if Google embraces a more scalable approach to the evaluation process (huge access to more labels, potentially with higher quality, at lower cost)?

Scale, more scale, agility, and updates

Scale in the evaluation process and speedy iteration of relevance feedback and evaluations pave the way for a much greater frequency of updates, and into many languages and markets.

An evolving, iterative alignment with true relevance, and algorithmic updates to meet it, could be ahead of us, with less broad, sweeping impacts. A more agile approach overall.

Bing already takes a much more agile approach in its evaluation process, and the breakthrough with LLMs as relevance labelers makes it even more so.

Fabrice Canel of Bing, in a recent interview, reminded us of the search engine’s constantly evolving evaluation approach where the push out of changes is not as broad sweeping and disruptive as Google’s broad core update or “big” updates. Apparently, at Bing, engineers can ideate, gain feedback quickly, and sometimes roll out changes in as little as a day or so.

All search engines will have compliance and strict review processes, which cannot be conducive to agility and will no doubt build up to a form of process debt over time as organizations age and grow. However, if the relevance evaluation process can be shortened dramatically while largely maintaining quality, this takes away at least one big blocker to algorithmic change management.

We have already seen a big increase in the number of updates this year, with three broad core updates (relevance re-evaluations at scale) between August and November and many other changes concerning spam, helpful content, and reviews in between.

Coincidentally (or probably not), we’re told “to buckle up” because major changes are coming to search. Changes designed to improve relevance and user satisfaction. All the things the crowd traditionally provides relevant feedback on.

Kenichi Suzuki on X

So, buckle up. It’s going to be an interesting ride.

rustybrick on X - Google buckle up

If Google takes this route (using machine labeling in favor of the less agile “crowd” approach), expect a lot more updates overall, and likely, many of these updates will be unannounced, too. 

We could potentially see an increased broad core update cadence with reduced impacts as agile rolling feedback helps to continually tune “relevance” and “quality” in a faster cycle of Learning to Rank, adjustment, evaluation and rollout.

Gianluca Fiorelli on X - endless updates

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.



About the author

Dawn Anderson

Dawn Anderson is an SEO & Search Digital Marketing Strategist focusing on technical, architectural and database-driven SEO. Dawn is the managing director at Bertey.

https://searchengineland.com/quality-rater-algorithmic-evaluation-systems-changes-434895




Microsoft rebrands Bing Chat as Copilot

Bing Chat has a new name as of today – Copilot. It now shares the same brand name as multiple other Microsoft AI products.

R.I.P. Bing Chat. Bing Chat, part of the new Bing that was powered by ChatGPT for search, launched Feb. 7. Bing Chat has handled more than 1 billion prompts and queries since it launched, Microsoft Bing said in a blog post.

Copilot + Search. Bing is no longer “your AI-powered copilot for the web.” However, Microsoft Bing will still provide a combined Search and chat experience. It will just be called Copilot going forward.

For people who may not want that combined experience, Copilot will have its own standalone ChatGPT-style experience at https://copilot.microsoft.com/

Why we care. Bing Chat launched with much hype but failed to steal any market share from Google. This is unfortunate because this allows Google to dictate the rules, direction and costs of search for the entire web.

What Microsoft is saying. Microsoft said the rebrand is to unify the Copilot experience:

  • “Our efforts to simplify the user experience and make Copilot more accessible to everyone starts with Bing, our leading experience for the web. Beginning today, Bing Chat and Bing Chat Enterprise are becoming Copilot, with commercial data protection enforced when any eligible user is signed in with Microsoft Entra ID.”

While it’s definitely a more unified experience, it also seems a bit confusing because Microsoft’s chatbot “companion” is used across multiple apps, including Microsoft 365, Edge, Windows and more – some free, some not.

Bing Chat Enterprise also rebranded. In addition to Bing Chat, Bing Chat Enterprise is also becoming Copilot, offering the same chat functionality with greater commercial data protection for eligible Microsoft 365 subscribers.



About the author

Danny Goodwin

Danny Goodwin has been Managing Editor of Search Engine Land & Search Marketing Expo – SMX since 2022. He joined Search Engine Land in 2022 as Senior Editor. In addition to reporting on the latest search marketing news, he manages Search Engine Land’s SME (Subject Matter Expert) program. He also helps program U.S. SMX events. Goodwin has been editing and writing about the latest developments and trends in search and digital marketing since 2007. He previously was Executive Editor of Search Engine Journal (from 2017 to 2022), managing editor of Momentology (from 2014-2016) and editor of Search Engine Watch (from 2007 to 2014). He has spoken at many major search conferences and virtual events, and has been sourced for his expertise by a wide range of publications and podcasts.

https://searchengineland.com/microsoft-rebrands-bing-chat-as-copilot-434709




Microsoft patent on website and site content reliability scores for Bing Search ranking

Microsoft has released a patent named Web Content Reliability Classification that talks about how to develop a reliability score for a website or content on a website. The patent seems like it can be used by the Bing Search team for better ranking of websites and web content, but that does not mean it is currently being used in the Bing Search results today.

The patent was published on November 2, 2023 after being filed on July 5, 2023 – you can read it over here.

Highlights. Here are some interesting highlights from this patent application.

  • The reliability score can be used to block content, rank content, provide a content warning, and select a source to answer a question, along with other uses.
  • Traffic data can indicate whether a source is popular, but popular is not the same thing as reliable.
  • Natural language processing can be used to determine whether online content is grammatical, but grammatical is also not the same thing as reliable.
  • The present technology identifies reliable content by leveraging expert scoring for a small amount of web content by iteratively extending these scores to other content based on how web content is linked.
  • User interactions may also be leveraged in determining a reliability score.
  • The high reliability score is generated by first identifying high reliability online content within a web graph.
  • These initially scored sites may be described as seed sites.
  • Ratings for the seed sites may be taken from authoritative lists of known reliable content providers.
  • An output of the technology is a high reliability score and a low reliability score for a web content.
  • Different applications can consume this score to perform or guide different functions, including search, filtering, content warning generation, and the like.

The abstract. Here is the abstract of the patent:

Technology described herein assigns a reliability score to web content, such as a web site or portion of a website. In one aspect, an output of the technology is a high reliability score and a low reliability score for a web content. The high reliability score represents conformance to high reliability sites, while the low reliability score represents conformance to low reliability sites. The high reliability score may be generated by first identifying high reliability online content within a compressed web graph. In a first iteration, the high reliability score of the seeds is used to score online content that is linked to the seed sites. At a high level, the more links that originate from high reliability sources, the higher the reliability score for the linked content. The low reliability score is similar, but uses outgoing links to low reliability sites instead of incoming links from high reliability sites.
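
To make the propagation idea concrete, here is a minimal sketch of link-based reliability scoring in Python. This is not code from the patent – the filing does not publish formulas – so the simple averaging update rule, the damping factor and the neutral 0.5 starting score are assumptions for illustration, and only the high reliability side (incoming links from expert-rated seed sites) is modeled:

# Minimal sketch of link-based reliability propagation, loosely modeled on
# the patent's description. The update rule, damping and starting values are
# illustrative assumptions, not Microsoft's actual implementation.

def propagate_reliability(links, seed_scores, iterations=10, damping=0.85):
    """links: dict mapping each page -> list of pages it links to.
    seed_scores: expert-rated pages -> score in [0, 1] (the "seed sites").
    Returns an estimated high reliability score for every page in the graph."""
    # Build the reverse graph: which pages link *to* each page.
    incoming = {page: [] for page in links}
    for src, targets in links.items():
        for dst in targets:
            incoming.setdefault(dst, []).append(src)

    # Start every page at a neutral score, then pin the seeds.
    scores = {page: 0.5 for page in incoming}
    scores.update(seed_scores)

    for _ in range(iterations):
        new_scores = {}
        for page, sources in incoming.items():
            if page in seed_scores:
                new_scores[page] = seed_scores[page]   # seeds keep their expert rating
            elif sources:
                # Inherit the average reliability of the pages linking here.
                inherited = sum(scores[s] for s in sources) / len(sources)
                new_scores[page] = damping * inherited + (1 - damping) * 0.5
            else:
                new_scores[page] = 0.5                 # no incoming links: stay neutral
        scores = new_scores
    return scores

if __name__ == "__main__":
    graph = {
        "trusted-gov-site": ["news-site", "blog"],
        "news-site": ["blog"],
        "blog": [],
        "spam-site": ["blog"],
    }
    seeds = {"trusted-gov-site": 0.95, "spam-site": 0.05}
    print(propagate_reliability(graph, seeds))

In this toy graph, news-site inherits a high score because its only incoming link comes from the trusted seed, while blog – which also attracts a link from the low-rated seed – ends up in the middle. The patent’s low reliability score works in the opposite direction, looking at outgoing links to known low-reliability sites.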

Why we care. Many SEOs enjoy reading patent documents from the Google and Bing Search teams. We know that just because a patent has been filed, it does not mean a search engine is using the technology as described in its live search results. Either way, it can be educational and useful to understand how the search scientists who work at Google and Bing think about these ranking and scoring challenges.

Hat tip to Glenn Gabe for spotting this patent.




https://searchengineland.com/microsoft-patent-on-website-and-site-content-reliability-scores-for-bing-search-ranking-434308




Alternative facts? Google denies rushing out Bard at trial

I was incredibly surprised to see the headline Google VP Says Bard Chatbot Wasn’t Rushed Out to Beat Microsoft on Bloomberg (warning: paywalled).

That Google VP is Elizabeth Reid, the vice president and GM of Search. She testified yesterday at the ongoing U.S. vs. Google antitrust trial.

Why we care. If Google fails to recognize what it looked like to everyone watching, it is essentially living in the land of alternative facts. Which, in case you forgot, are called lies. For a company that preaches that “trustworthiness” is the most important part of E-E-A-T, it really ought to demonstrate some.

The AI Search race. February 2023 was one of the most memorable months in Search. I remember it well.

Perhaps the worst-kept secret then was that Microsoft was about to announce a new AI-powered version of Bing, built on the OpenAI technology behind the hottest new thing in the world – ChatGPT. We first reported on this Jan. 4 in Microsoft to add ChatGPT features to Bing Search.

Search Engine Land’s Barry Schwartz was invited by Microsoft on Thursday, Feb. 2 to an exclusive briefing scheduled for Tuesday, Feb. 7 in Redmond, Wash. The invite even said there were no plans to livestream this event – though that quickly evolved into a special press event.

Why? Google, that’s why. Suddenly, Google had huge embargoed news to share. That news – the announcement of Google’s ChatGPT competitor, an experiment called Bard – went live on Monday, Feb. 6, less than 24 hours before Microsoft’s event.

Press coverage called Google’s news a rushed announcement because it clearly was. Google, at this point, had no product to share. Bard was vaporware, supposedly being released to “trusted testers.”

Google Bard fumbles early. Google then held a public demonstration in which Bard got the first answer wrong about NASA’s James Webb Space Telescope – an early warning of hallucinations that LLMs produce. Alphabet paid a big price, losing $100 billion in market value.

Google disagrees. Reid testified that Bard wasn’t rushed out because Microsoft was planning to announce its generative AI take on search.

  • “I don’t think you can make that conclusion. Microsoft’s announcement also had several errors in it. The technology is very nascent. It makes mistakes. That’s why we’ve been hesitant to put it forward,” Reid said.

Yes, so hesitant that Google rushed to upstage Microsoft with its Bard news less than a day before Microsoft’s biggest Search announcement in years.

You buying this? Because I sure ain’t.




https://searchengineland.com/google-denies-rushing-bard-434088




Microsoft Target CPA and Maximize Conversions move to general availability

Microsoft has moved Target CPA and Maximize Conversions to general availability.

This means marketers can start using the automated bid strategies in all regions where Audience Ads are available.

The tech giant has also announced updates involving OpenAI’s text-to-image generative AI model DALL-E 3, as well as its Chat Ads API.

Why we care. Automated bidding optimizes campaign performance without requiring hands-on involvement, making life easier for advertisers. With these bid strategies, advertisers retain control, keeping the flexibility to set their budget and choose how they want to measure success.

How it works. A Microsoft spokesperson explained that Target CPA and Maximize Conversions have been designed to help “advertisers reach their target audience with minimal effort”:

  • Maximize Conversions: Advertisers can use this feature to maximize conversions as much as possible, given the budget.
  • Target CPA: Advertisers can use this feature in Audience Ads to maximize conversions as much as possible, given the CPA target and the budget. (A toy sketch contrasting the two objectives follows this list.)
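
Here is that toy sketch, written in Python. To be clear, this is not Microsoft’s bidding algorithm – the candidate bids, greedy selection and CPA math are hypothetical – it only illustrates the difference between the two objectives: spending toward the budget cap versus also keeping average cost per acquisition under a target.

# Hypothetical illustration only: each candidate bid has an estimated cost
# and estimated conversions. Neither function reflects Microsoft's internals.

def maximize_conversions(candidates, budget):
    """Greedily pick bids to get the most conversions within the budget."""
    chosen, spend, conversions = [], 0.0, 0.0
    # Favor candidates that deliver the most conversions per dollar.
    for bid in sorted(candidates, key=lambda b: b["conversions"] / b["cost"], reverse=True):
        if spend + bid["cost"] <= budget:
            chosen.append(bid)
            spend += bid["cost"]
            conversions += bid["conversions"]
    return chosen, spend, conversions

def target_cpa(candidates, budget, cpa_target):
    """Same idea, but also keep average cost per acquisition under the target."""
    chosen, spend, conversions = [], 0.0, 0.0
    for bid in sorted(candidates, key=lambda b: b["cost"] / b["conversions"]):
        projected_spend = spend + bid["cost"]
        projected_conversions = conversions + bid["conversions"]
        if projected_spend <= budget and projected_spend / projected_conversions <= cpa_target:
            chosen.append(bid)
            spend, conversions = projected_spend, projected_conversions
    return chosen, spend, conversions

if __name__ == "__main__":
    candidates = [
        {"cost": 40.0, "conversions": 4},   # $10 CPA
        {"cost": 90.0, "conversions": 6},   # $15 CPA
        {"cost": 50.0, "conversions": 2},   # $25 CPA
    ]
    print(maximize_conversions(candidates, budget=150))      # picks the two most efficient bids
    print(target_cpa(candidates, budget=150, cpa_target=12))  # only the $10-CPA bid stays under the target

With the same $150 budget, the first strategy buys 10 conversions at a $13 average CPA, while the second takes only the $10 CPA bid to keep the average under the $12 target.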

What are Audience Ads? Microsoft’s Audience Ads are native display advertisements designed to help advertisers reach their target audience effectively. Leveraging Microsoft’s insights into people’s interests and consumer intent signals, these ads are strategically placed across the web on platforms such as MSN, Start, Outlook, and more.

DALL-E 3 added to Bing Chat. In other Microsoft news, OpenAI’s text-to-image generative AI model DALL-E 3 is now available in Bing Chat and Bing.com/create for free. The tool can be used to create “images that are not only realistic but creative” that also adhere to the company’s terms of service and community guidelines.




Chat Ads API in closed beta. Microsoft is now inviting brands with existing chat experiences or large language models to apply to partner with its Chat Ads API. Snapchat and Axel Springer were the first to market as Microsoft select chat partners, with both brands already reporting promising results.

Lauren Tallody, Sr. Product Marketing Manager, Automation Lead for Microsoft Advertising, told Search Engine Land:

  • “Having ads [on these chat experiences] could really help create more economic value here – and we’re just getting started.”
  • “Today, the solution is to focus on whether you already have chat assistance, but we’re also looking at creating chat assistance – whether that’s using Bing Chat, web-grounded data or private data. There will be more to come on this.”
  • “But for now, if you are using a large language model, we’re ready to partner!”

Deep dive. Read Microsoft’s Automated Bidding guidelines for more information.




https://searchengineland.com/microsoft-target-cpa-maximize-conversions-move-tgeneral-availability-433458




4chan users manipulate AI tools to unleash torrent of racist images


Despite leading AI companies’ attempts to block users from turning AI image generators into engines of racist content, many 4chan users are still turning to these tools to “quickly flood the Internet with racist garbage,” 404 Media reported.

404 Media uncovered one 4chan thread where users recommended various AI tools, including Stable Diffusion and DALL-E, but specifically linked to Bing AI’s text-to-image generator (which is powered by DALL-E 3) as a “quick method.” After finding the right tool—which could also be a more old-school photo-editing tool like Photoshop—users are instructed to add incendiary captions and share the images on social media to create a blitz of racist images online.

Make captions “funny, provocative,” the thread instructs users, and use a “redpilling message (Jews involved in 9/11)” that is “easy to understand.”

404 Media cited examples from a visual guide, hosted on Imgur, that was posted in the 4chan thread. One featured an “image that shows a crying Pepe the frog with a needle next to its arm and a gun pointed to his head,” where the guide suggested the caption “vaccines enforced by violence.” Another generated an image of “two Black men with gold chains chasing a white woman,” recommending that the user add a “redpilling message.”

Perhaps because Bing AI’s tool has been deemed the quickest method, it appears to have become the most popular tool in the thread. 404 Media concluded that—“judging by the images’ default square format, the uniform 1024 x 1024 resolution”—“most of the images in the thread appear to be generated with Bing,” then spread on social media platforms, including Telegram, X (formerly Twitter), and Instagram.

Makers of the AI image generators seemingly favored by 4chan users, including Microsoft and Stability AI, did not immediately respond to Ars’ request to comment on efforts to block methods 404 Media said were used to circumvent filters. An OpenAI spokesperson told Ars that the company prioritizes safety and has taken steps to limit DALL-E outputs, including efforts to limit tools from generating harmful content or images for requests that ask for a public figure by name. OpenAI’s spokesperson also confirmed that Microsoft implements its own safeguards for DALL-E 3.

In one of 404 Media’s tests attempting to replicate one of the examples from the 4chan thread’s visual guide, 404 Media found that Bing rejected the prompt “two angry Black men chasing a white woman,” but accepted “photorealistic two angry Black rappers chasing woman.”

Much of the earliest reporting on AI image generators criticized the racist and sexist biases in their algorithms, with AI makers quickly vowing to detect and eliminate those biases. When Vice discovered that DALL-E could be used to generate “predictably racist and sexist results” during a limited research release of the AI tool, an OpenAI spokesperson told Motherboard that the company had implemented safeguards for the DALL-E system that would be fine-tuned in the future.

“Our team built in mitigations to prevent harmful outputs, curating the pretraining data, developing filters, and implementing both human- and automated monitoring of generated images,” OpenAI’s spokesperson told Vice in 2022. “Moving forward, we’re working to measure how our models might pick up biases in the training data and explore how tools like fine-tuning and our Alignment techniques may be able to help address particular biases, among other areas of research in this space.”

404 Media’s report shows what can happen when racists manipulate an already biased algorithm. The results can be a torrent of offensive images unleashed online—perhaps more quickly generated by AI than ever before and potentially allowing 4chan’s darkest content to spill out more often onto the most popular platforms.

It’s unclear how AI leaders like Microsoft and OpenAI will respond, but according to 404 Media, “this means we are currently getting the worst of both worlds from Bing, an AI tool that will refuse to generate a nipple but is supercharging 4chan racists.”

This story has been updated to include comment from OpenAI’s spokesperson.

https://arstechnica.com/?p=1973737




Microsoft CEO: AI will make Google more dominant

“Bogus.” That’s what Microsoft CEO Satya Nadella thinks about Google’s argument that there is actual choice in the search engine market. And artificial intelligence will provide zero advantage or hope for any company hoping to enter web search – the “biggest no-fly zone of all,” Nadella said.

Why we care. The ongoing U.S. vs. Google antitrust trial has already unearthed troubling behavior from Google, including raising ad prices to meet revenue targets. If Google is found to have abused its monopoly position in search, it could potentially reshape the company and the search landscape.

Exclusive rights. To further enhance its dominance in AI Search, Google plans to pay publishers for “exclusive” content rights, Nadella testified. If only Google could access this data, it would essentially make every other search engine irrelevant, Nadella said.

  • “When I am meeting with publishers now, they say Google’s going to write this check and it’s exclusive and you have to match it,” Nadella said.
  • “What is publicly available today, will it be publicly available tomorrow? That’s the issue.”
  • “Is this going to be even more of a nightmare to make progress in search?” Nadella added.

Search engines have been “the organizing layer of the internet,” Nadella said. But publishers are concerned about the rise of large language models (LLMs) – many popular websites have blocked GPTBot – and about their content and data being used for training and profit without compensation.
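
For context, blocking GPTBot – the user agent OpenAI documents for its web crawler – takes a two-line robots.txt rule. Here is a minimal Python sketch of what those publishers are adding; the file path and helper name are hypothetical:

# The robots.txt rule publishers use to block OpenAI's GPTBot crawler
# from fetching any page on the site.
GPTBOT_BLOCK = """User-agent: GPTBot
Disallow: /
"""

# Hypothetical helper: append the rule to a site's existing robots.txt file.
def block_gptbot(robots_path="robots.txt"):
    with open(robots_path, "a", encoding="utf-8") as f:
        f.write(GPTBOT_BLOCK)

if __name__ == "__main__":
    block_gptbot()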

Google has not commented on this accusation about exclusive deals.

Google’s Search Ads 360 dispute. One of the key issues of interest to paid search marketers is Google Search Ads 360. The platform has not kept up with new Microsoft ad features and ad types, and Nadella said Microsoft wanted to make it easy for advertisers to transfer ad campaigns from Google to Microsoft with the click of a button. That didn’t happen.

  • “We keep asking for them to add some features we want. They’ve asked us to go pound sand,” Nadella said.

A vicious cycle. With nearly 90% market share, Google is able to keep improving its search results and bottom line, Nadella said – and that advantage has nothing to do with product quality.

  • “The distribution advantage Google has today doesn’t go away. In fact, if anything, I worry a lot that – even in spite of my enthusiasm that there is a new angle with AI – this vicious cycle that I’m trapped in could become even more vicious because the defaults get reinforced.”

All hope is gone? The launch of the new Bing and Bing Chat, powered by OpenAI’s technology that powers ChatGPT, came with a lot of hype and excitement, especially from Nadella, who said he may have been over-enthusiastic.

  • “Yeah, I mean, look, that’s called exuberance of someone who has like 3% share, that maybe I’ll have 3.5% share,” Nadella said.

So far, Bing Chat has failed to take market share away from Google. Yusuf Mehdi, Microsoft’s corporate VP and consumer CMO, claimed the opposite, but to date has not shared any of the company’s data showing this. In fact, Microsoft Bing’s market share is lower than it was a year ago, as we reported in August.

No breaking the Google habit. Default search agreements, such as the one Google has with Apple, have cemented Google’s dominance, Nadella said.

  • “You get up in the morning, you brush your teeth, and you search on Google. With that level of habit forming, the only way to change is by changing defaults,” Nadella said.
  • “Defaults are the only thing that matter in terms of changing user behavior.”
  • “It would be a game changer (for Bing) to be a default on Safari,” Nadella added.

But. On devices running Microsoft’s operating systems, where Bing is the default search engine, Bing’s market share is still below 20%, Nadella admitted. That means a lot of people have figured out how to switch their default search engine.

$100 billion. That’s how much Microsoft has invested in Bing, according to Nadella. Why?

  • “I see search or internet search as the largest software category out there. We are a very, very low share player. But we continue to persist in it because we think of it as a software category we can contribute to.”

Leading with pessimism. It surprised me to see such bleak quotes today from Nadella, who has typically been an optimistic leader about all things Bing. He never argued that Microsoft Bing Search is better than Google Search. It seemed more like a concession that Microsoft Bing Search could never be better because of Google’s monopoly position.

Is this Nadella simply telling it as he sees it? That he knows Microsoft Bing will never be a true contender (which is true).

It’s likely he views this antitrust trial as a last-ditch moment to stop Google – a monopolistic rival that could do real damage to consumers, competitors and the entire ecosystem that relies on Google advertising and traffic. The $244 billion question is whether Google will be forced to change how it operates.




https://searchengineland.com/microsoft-ceo-ai-will-make-google-more-dominant-432733




Microsoft blames Google for Apple rejecting offer to buy Bing

Microsoft claims its attempts to sell Bing to Apple were blocked by Google.

The tech giant’s CEO of Advertising and Web Services, Mikhail Parakhin, said he offered Apple more than 100% of the revenue or gross profit to make Bing its default search engine – but the proposal was rejected because of Apple’s deal with Google.

Speaking at the federal antitrust trial, Parakhin alleged this was despite Microsoft offering to pay Apple more than Google – which he claims was offering in the region of 60%.

How much did Microsoft offer Apple? Microsoft executive Jon Tinter, who also testified, didn’t reveal exactly how much the offer to Apple was worth, but it was enough that Microsoft would have sustained a several-billion-dollar loss as a result. Tinter explained the company felt the short-term loss would be a justified investment because of the importance of default search engine status.

Why we care. Apple’s decision to reject Microsoft’s higher monetary offer for Bing as the default search engine implies that the Google deal might not be solely based on financial considerations. This could reinforce Google’s argument that it’s preferred for its superior product. Nevertheless, it underscores the significance of default status for search engines, as Microsoft was willing to incur a significant loss for the position.

Better offer than Google. While Tinter didn’t provide the exact dollar amount Microsoft offered Apple, he asserted that Microsoft was confident it presented a superior deal compared to Google. He explained: “That’s based on our best estimates of the revenue payments that Google was making to Apple in the United States.”

Samsung also shut down a potential deal. Tinter went on to reveal that Microsoft had also pitched Samsung on making Bing the default search engine on its products. However, he claims Samsung shut down these conversations in their early stages.

Tinter allegedly urged Samsung to at least let Microsoft try to make an offer that could rival its deal with Google. But Samsung told Microsoft that negotiations wouldn’t be worth discussing because of its contract with Google.




What has Microsoft said? Parakhin told the court at the federal antitrust trial:

  • “We were just big enough to play but probably not big enough to win, if that makes sense.”
  • “The optimal thing for Apple to have done — and again, I think in closed session, maybe we’ll end up looking at some of the sort of math on this — would have been to have switched to Microsoft in the United States, taken our aggressive offer there, and continue with sort of Google in the rest of the world.”

Deep dive. Read our Google antitrust trial updates for the latest developments in the courtroom.




https://searchengineland.com/microsoft-blames-google-apple-rejecting-bing-432689




Report: Google’s money was “key” factor in Apple rejecting Bing purchase

Image: iPhone showing a Bing upgrade prompt (Getty Images)

A few years before Microsoft went all-in on a ChatGPT-powered Bing search engine, the company had another idea for its perennial also-ran search engine: sell it to Apple.

A report in Bloomberg, sourced from people familiar with the early theoretical sales talks, states that Microsoft pitched Bing as a way for Apple to replace Google as the default search provider on iPhones, MacBooks, and other devices.

The deal didn’t make it past the conversation stage, according to Bloomberg. Microsoft executives approached Eddy Cue, Apple’s senior vice president of services, who brokered Apple’s deal with Google—purportedly worth between $4 and $7 billion in 2020—for Google’s long-standing default placement. Google’s paid presence on Apple devices has been reviewed in court recently as part of the Department of Justice’s antitrust trial over Google’s search business.

Cue said in court earlier this week that he didn’t think “at the time, or today, that there was anybody out there who is anywhere near as good as Google at searching,” and clarified that there wasn’t “a valid alternative.”

There was also a lot of money involved, a “key reason” the talks didn’t make it past a Cue conversation, according to Bloomberg, though “quality and capabilities” were involved, too.

Microsoft had considered outspending Google to overcome that key barrier. Microsoft executive Jon Tinter testified Thursday that Microsoft considered making a large investment in Apple in 2016 toward the goal of making Bing the default search engine on Apple devices. Microsoft CEO Satya Nadella and Apple CEO Tim Cook met during those discussions, according to Tinter. Tinter said that Microsoft ultimately would have lost money on the investment but had considered it as part of an effort to grow market share for Bing.

It’s almost certain that Bing would have undergone a complete rebrand and redesign if acquired by Apple, and it may have been subsumed into other services, such as Siri. It’s also unlikely Apple would have publicly launched AI-powered answers in Bing the way Microsoft has, with prominent advisories about the feature’s early-stage nature.

Then again, it’s hard to imagine any part of Bing being part of the Apple ecosystem because the company doesn’t seem interested in taking it on, even when a lot of money is offered.

https://arstechnica.com/?p=1972197