I have a quick question for you.
Is it true that Abraham Lincoln said that the problem of believing what you read on the Internet is due to the difficulty of verifying what you find there?
I’m sure that you would agree with me that Abraham Lincoln said no such thing. Even the most shallow understanding of Abraham Lincoln’s life makes clear that he lived long before the Internet came into being. Ergo, we have good reason to doubt that he would have been making pointed remarks about the Internet in his day.
Suppose though that I told you that such a claimed quotation could be found on the Internet. Does that change your opinion about the veracity of the quote? I doubt it. I assume that all of us know that you cannot trust what you read on the Internet. There is a lot of junk out there. Tons of junk. Whole oceans of informational flotsam of a dubious nature.
My basis for sharing this with you is that generative AI is mainly data-trained on text found on the Internet. There is a chance that when you use generative AI, it might get facts wrong or might have been data-trained on material that was accepted as true when it really wasn’t. Fortunately, by and large, the odds are that much of what you get out of generative AI, based on the initial and later refined data training, will be relatively factual (not all of the time, but a lot of the time).
There is a rub.
The added twist is that generative AI can stumble during pattern-matching and end up generating errors, falsehoods, glitches, biases, and even so-called AI hallucinations. You have to be on the lookout for those kinds of anomalies. Always. Your quest to ferret out problematic portions should be never-ending.
Any essay that you get out of generative AI could contain one or more of these abysmal maladies. The issues can be at times easy to spot, while on other occasions nearly impossible to find. Plus, please know that any direct question or notable problem that you ask the generative AI to answer or solve can also contain those troubling difficulties. As I have repeatedly warned in my classes and workshops about generative AI, you are dealing with something that is like a box of chocolates. You never know what you’ll get out of generative AI, including great and useful stuff or potentially made-up malarky.
What can you do about this unsavory situation?
One approach consists of verifying the generated results. The usual means consists of doing a search on the web to see what you can find there, as it relates to whatever response the AI app has provided. You are trying to do a verification or double-check of the generated response. Fact by fact, you might laboriously do a comparison. The effort can be tiresome and frustrating. I dare suggest that most people do not undertake a thorough double-checking. They either accept the generated response as it sits, or they read the response and base an assessment on their own reading of it, or they reach out to the web to check a few cherry-picked portions here and there.
There must be a better way.
Yes, indeed, I’d like to introduce you to a new technique in prompt engineering that can aid your efforts to be diligent and double-check or verify the responses produced by generative AI. The technique has been coined Chain-of-Verification (abbreviated as COVE or CoVe, though some are using CoV).
In today’s column, I am continuing my ongoing and popular series on the latest advances in prompt engineering and will be diving into the nature and use of the Chain-of-Verification technique. The first part of the discussion will explain what the technique consists of. I will provide the research that underlies the technique. Next, I will identify the prompts that you can use for performing the technique. All in all, a Chain-of-Verification is relatively easy to do, and you will undoubtedly be pleased to add the approach to your daily prompt engineering efforts and overall prompting prowess.
Before I dive into my in-depth exploration of this vital topic, let’s make sure we are all on the same page when it comes to the foundations of prompt engineering and generative AI. Doing so will put us all on an even keel.
Prompt Engineering Is A Cornerstone For Generative AI
As a quick backgrounder, prompt engineering, also referred to as prompt design, is a rapidly evolving realm and is vital to effectively and efficiently using generative AI or the use of large language models (LLMs). Anyone using generative AI such as the widely and wildly popular ChatGPT by AI maker OpenAI, or akin AI such as GPT-4 (OpenAI), Bard (Google), Claude 2 (Anthropic), etc. ought to be paying close attention to the latest innovations for crafting viable and pragmatic prompts.
For those of you interested in prompt engineering or prompt design, I’ve been doing an ongoing series of insightful explorations on the latest in this expanding and evolving realm, including this coverage:
- (1) Imperfect prompts. Practical use of imperfect prompts toward devising superb prompts (see the link here).
- (2) Persistent context prompting. Use of persistent context or custom instructions for prompt priming (see the link here).
- (3) Multi-personas prompting. Leveraging multi-personas in generative AI via shrewd prompting (see the link here).
- (4) Chain-of-Thought (CoT) prompting. Advent of using prompts to invoke chain-of-thought reasoning (see the link here).
- (5) In-model learning and vector database prompting. Use of prompt engineering for domain savviness via in-model learning and vector databases (see the link here).
- (6) Chain-of-Thought factored decomposition prompting. Augmenting the use of chain-of-thought by leveraging factored decomposition (see the link here).
- (7) Skeleton-of-Thought (SoT) prompting. Making use of the newly emerging skeleton-of-thought approach for prompt engineering (see the link here).
- (8) Show-me versus tell-me prompting. Determining when to best use the show-me versus tell-me prompting strategy (see the link here).
- (9) Mega-personas prompting. The gradual emergence of the mega-personas approach entails scaling up the multi-personas to new heights (see the link here).
- (10) Certainty and prompts. Discovering the hidden role of certainty and uncertainty within generative AI and using advanced prompt engineering techniques accordingly (see the link here).
- (11) Vague prompts. Vagueness is often shunned when using generative AI but it turns out that vagueness is a useful prompt engineering tool (see the link here).
- (12) Prompt catalogs. Prompt engineering frameworks or catalogs can really boost your prompting skills and especially bring you up to speed on the best prompt patterns to utilize (see the link here).
- (13) Flipped Interaction prompting. Flipped interaction is a crucial prompt engineering technique that everyone should know (see the link here).
- (14) Self-reflection prompting. Leveraging are-you-sure AI self-reflection and AI self-improvement capabilities is an advanced prompt engineering approach with surefire upside results (see the link here).
- (15) Addons for prompting. Know about the emerging addons that will produce prompts for you or tune up your prompts when using generative AI (see the link here).
- (16) Conversational prompting. Make sure to have an interactive mindset when using generative AI rather than falling into the mental trap of one-and-done prompting styles (see the link here).
- (17) Prompt to code. Prompting to produce programming code that can be used by code interpreters to enhance your generative AI capabilities (see the link here).
- (18) Target-your-response (TAR) prompting. Make sure to consider Target-Your-Response considerations when doing mindful prompt engineering (see the link here).
- (19) Prompt macros and end-goal planning. Additional coverage includes the use of macros and the astute use of end-goal planning when using generative AI (see the link here).
- (20) Tree-of-Thoughts (ToT) prompting. Showcasing how to best use an emerging approach known as the Tree of Thoughts as a leg-up beyond chain-of-thought prompt engineering (see the link here).
- (21) Trust layers for prompting. Generative AI will be surrounded by automated tools for prompt engineering in an overarching construct referred to as an AI trust layer, such as being used by Salesforce (see the link here).
- (22) Directional stimulus prompting (aka hints). The strategic use of hints or directional stimulus prompting is a vital element of any prompt engineering endeavor or skillset (see the link here).
- (23) Invasive prompts. Watch out that your prompts do not give away privacy or confidentiality (see the link here).
- (24) Illicit prompts. Be aware that most AI makers have strict licensing requirements about prompts that you aren’t allowed to make use of and thus should avoid these so-called banned or illicit prompts (see the link here).
- (25) Chain-of-Density (CoD) prompting. A new prompting technique known as Chain-of-Density has promising capabilities to jampack content when you are doing summarizations (see the link here).
Anyone stridently interested in prompt engineering and improving their results when using generative AI ought to be familiar with those notable techniques.
Moving on, here’s a bold statement that pretty much has become a veritable golden rule these days:
- The use of generative AI can altogether succeed or fail based on the prompt that you enter.
If you provide a prompt that is poorly composed, the odds are that the generative AI will wander all over the map and you won’t get anything of substance related to your inquiry. Being demonstrably specific can be advantageous, but even that can confound or otherwise fail to get you the results you are seeking. A wide variety of cheat sheets and training courses for suitable ways to compose and utilize prompts have been rapidly entering the marketplace to try and help people leverage generative AI soundly. In addition, add-ons to generative AI have been devised to aid you when trying to come up with prudent prompts, see my coverage at the link here.
AI Ethics and AI Law also stridently enter into the prompt engineering domain. For example, whatever prompt you opt to compose can directly or inadvertently elicit or foster the potential of generative AI to produce essays and interactions that imbue untoward biases, errors, falsehoods, glitches, and even so-called AI hallucinations (I do not favor the catchphrase of AI hallucinations, though it has admittedly tremendous stickiness in the media; here’s my take on AI hallucinations at the link here).
There is also a marked chance that we will ultimately see lawmakers come to the fore on these matters, possibly devising and putting in place new laws or regulations to try and scope and curtail misuses of generative AI. Regarding prompt engineering, there are likely going to be heated debates over putting boundaries around the kinds of prompts you can use. This might include requiring AI makers to filter and prevent certain presumed inappropriate or unsuitable prompts, a cringe-worthy issue for some that borders on free speech considerations. For my ongoing coverage of these types of AI Ethics and AI Law issues, see the link here and the link here, just to name a few.
With the above as an overarching perspective, we are ready to jump into today’s discussion.
How To Do Verifications Associated With Generative AI Outputs
A looming danger of using generative AI is that the answers to your questions might consist of AI-generated blarney and you won’t even realize it. Sometimes this might not be of particularly substantive consequence, such as incorrectly indicating the date when Abraham Lincoln was born. The odds are that an error or falsehood about his date of birth will not be overly life-threatening in current times. Suppose though that the AI indicated that a certain kind of mushroom was perfectly fine to eat, despite the truth being that the mushroom is considered poisonous.
That’s bad.
Really bad.
I’m sure that I don’t have to say much more to convince you that it is sensible to try and verify the generated essays and answers that come out of using generative AI.
There are efforts underway to get generative AI to automatically undertake internal self-checking, see my discussion at the link here. The AI might reach out to the Internet to verify aspects of its own generated essays or answers. This could be handy because the original data training of the generative AI might be outdated. Or perhaps the data training had falsehoods at the get-go and searching the web might reveal contradicting truths. Another possibility is that the generative AI has produced output based solely on arcane and nonsensical patterns that did not especially pertain to the data training, producing new facts or figures seemingly out of thin air.
You can bet that AI makers are going to incorporate all manner of checks and balances into their generative AI apps. They know that the writing is on the wall. If generative AI continues to produce malarky, the public at large will undoubtedly raise quite a clamor. In turn, lawmakers will hear these cries for improvements and will potentially establish laws that force AI makers to ensure that their generative AI does suitable double-checking. Executives at AI makers and their AI developers might face financial penalties or fines, and reputational damage, and some even suggest that there is a chance of criminal liability.
Meanwhile, for those of us using generative AI on a daily basis, the Wild West still prevails. Do not hold your breath waiting for generative AI to be rejiggered and improved to avert these maladies. Right now, you are on the hook. Whatever you do with generative AI, by and large, it is on your shoulders to double-check the generated outputs.
A straightforward means to do so consists of using a series of prompts to get the generative AI to double-check for you.
Imagine that you’ve asked generative AI to indicate the date of birth of Abraham Lincoln. An answer is provided. You do not know for sure that the date presented is the correct date. Rather than going to look up his date of birth online, you decide to ask the AI app to double-check.
Here’s a common prompt that people use:
- Prompt entered by user: “Are you sure?”
You simply ask the AI app if it is sure about the given answer. One thing to keep in mind is that the “Are you sure?” could be misinterpreted by the AI. It is a broadly worded question. My preference is to ask the question in a more contextually specific way.
For example:
- Prompt entered by user: “Are you sure that the date of birth of Abraham Lincoln that you displayed is correct?”
Notice that the prompt is more specific. The broadly worded version could get a nearly automatic answer of Yes, for which the AI might simply be saying Yes for the sake of saying Yes. There hasn’t been any internal rechecking undertaken. By laying out your verification question in greater detail, the odds are that the AI app will take more effort to double-check things.
There isn’t any guarantee that the AI app will genuinely do a double-check. Thus, the longer version of my verification question does not axiomatically ensure that a double-check will occur. The odds are raised that the double-check will happen as hoped for. It isn’t an absolute certainty.
Here’s something else to consider.
Some have reported that generative AI often takes a shortcut when answering questions that are worded merely as Yes or No possibilities. If you want to boost the odds that the AI app will work on answering a verification question, you are advised to avoid a Yes or No style question or at least pump up the question. The bottom line is to try to ask a verification question that requires some form of elaboration.
For example:
- Prompt entered by user: “How did you arrive at the answer displaying the date of birth of Abraham Lincoln and is the answer correct?”
The verification question now forces the AI app to go beyond a curt answer of Yes or No. An explanation will be derived and displayed. Plus, the indication of whether the answer is correct will be presented. The chances are this will help to stir the verification process.
Those are useful prompting strategies overall and ought to be an essential part of your prompt engineering skillset and toolkit.
I’m guessing you likely have realized that even if you get the generative AI to do a double-check, you cannot necessarily believe or take at face value the double-check. In the case of the date of birth of Lincoln, suppose that the date presented was sourced and patterned based on original data training that contains materials stating that particular date. The AI app is going to tell you that the date is correct, though this double-check is purely based on relooking at patterns it already had earlier established.
We are confronted with a conundrum.
Asking a generative AI app to do verification is going to be suspect when the generative AI only focuses on whatever it has already patterned onto. The problem, as you can plainly see, is that something wrong will seem to be “correct” merely because the wrong thing is being consistently repeated.
For this reason, the future of verification is going to involve either having the AI app refer to other external sources as the double-checking mechanism or having the AI app call upon another generative AI as part of doing a double-check. If the other generative AI has been data-trained differently, the chances are that it would be a solid comparator to ferret out which is correct or incorrect, see my coverage on this at the link here.
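If you do happen to have two differently trained AI apps at your disposal, that kind of cross-check can be scripted along the following lines. This is merely a sketch of the idea in Python; the two `ask_*` callables are hypothetical placeholders for however you invoke each generative AI.

```python
from typing import Callable

def cross_model_check(ask_primary: Callable[[str], str],
                      ask_reviewer: Callable[[str], str],
                      question: str) -> str:
    """Ask one model, then have a differently trained model critique the answer."""
    answer = ask_primary(question)
    review_prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Identify anything in the proposed answer that appears factually wrong "
        "and explain why, or state that you found no problems."
    )
    critique = ask_reviewer(review_prompt)
    return f"Answer:\n{answer}\n\nSecond-model critique:\n{critique}"
```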
Upping Your Prompting With Chain-Of-Verification Prompts
Now that I’ve identified some of the foundational elements of doing verifications, we are ready to up the ante. A more systematic or methodical approach can be used. This is not a circumstance of choosing one approach over another. You can use the everyday approaches that I mentioned earlier, doing so on a regular basis, and then have in your hip pocket a more extensive prompting approach when needed for more pressing situations.
The extensive approach has been coined as Chain-of-Verification.
Let’s take a look at the research that has come up with the Chain-of-Verification method. In a study entitled “Chain-of-Verification Reduces Hallucination In Large Language Models” by Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston, posted online on September 20, 2023, the researchers said this:
- “We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (COVE) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response.”
- “We find that independent verification questions tend to provide more accurate facts than those in the original longform answer, and hence improve the correctness of the overall response. We study variations on this recipe across a range of tasks: from list-based questions, closed booked QA and longform text generation.”
- “In experiments, we show COVE decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.”
Allow me a moment to offer a brief explanation and overview.
The idea is that you take the same notions that I earlier mentioned and make things a bit more rigorous and interestingly extended. We are going to be playing with these facets:
- (1) Enter your initial prompt. This is the initiating prompt that gets the generative AI to produce an answer or essay to whatever question or problem you want to have solved.
- (2) Look at the GenAI initial response to the prompt. This is the initial answer or response that the AI app provides to your prompt.
- (3) Establish suitable verification questions. Based on the generative AI output, come up with pertinent verification questions.
- (4) Ask GenAI the verification questions. Enter a prompt or series of prompts that ask the generative AI the identified verification questions.
- (5) Inspect the answers to the verification questions. Take a look at the answers to the verification questions, weighing them in light of what they might signify regarding the GenAI initial response.
- (6) Adjust or refine the initial answer accordingly. If the verification answers warrant doing so, go ahead and refine or adjust the initial answer as needed.
My above-listed six aspects are subject to a lot of variations, depending upon your preference and how generative AI tends to work.
Let’s go through those considerations.
First, let’s agree that we are going to constrain all of this to use a particular generative AI app on a self-contained basis. We aren’t going to use any callouts to other generative AI. We aren’t going to access the web to find double-checking materials. This will all be entirely self-contained. We begrudgingly accept the downside risks that I earlier delineated, but we are okay with those risks in terms of taking a streamlined but less exhaustive verification path.
Second, we have to decide how we want to come up with the verification questions. One angle would be to have at the ready a set of very generic verification questions. For example, the classic “Are you sure?” is an entirely generic verification question. As noted, the generic verification is usually not as good as being specific. Thus, we will strive to come up with specific and pertinent verification questions. We will eschew the generic verification questions.
Third, the verification questions could be composed by you, the human using the generative AI. That’s fine. But, then again, we have the generative AI right there at our fingertips, so we might as well go ahead and ask the AI app to come up with the verification questions.
You might be puzzled, rightfully so, that we would allow the generative AI to come up with verification questions. This seems like having the fox guard the chicken coop. The AI app might derive verification questions that are weak or easy to answer, trying to bolster the response that the AI gave us. An obvious bias can creep into this.
I dare say you are absolutely correct. There is a chance that the AI app will try to grease the skids or lean the verification questions in the direction of rubber stamping the initial response that the AI app came up with. As much as we can, we shall try to give directions to the AI app to not do so. No guarantees but at least we can try.
Fourth, we will also let the AI app answer the verification questions, and furthermore, let the AI app determine whether to adjust the initial answer given to our initial prompt. I suppose you could say that we are going all-in about letting the generative AI run the show.
If allowing the AI to do the whole concoction end-to-end gives you heartburn, you are welcome to change things. You could come up with your own verification questions, or supplement the ones generated by the AI. You could judge the answers given to the verification questions. You could decide whether to have the AI app adjust the initial response by using the answers to the verification questions. Etc.
Choose whatever variation of the verification process makes you comfortable and increases your belief in the verification endeavor.
Assume for the sake of discussion that we are going to let the AI take the driver’s seat. We will merely give a prompt or prompts that direct the AI to undertake the steps we want to take for the verification.
Here then is another head-scratcher to contend with.
You could come up with an all-in-one prompt that explains the entire process to the AI app. Voila, you enter the prompt, informing the AI what it is to do, and you let the AI run with the ball.
One potential issue is that sometimes an all-in-one prompt will steer the AI app in a manner that gives the AI leeway that we don’t want it to have. A rule of thumb in general about prompting is that you are usually better off giving prompts on a piecemeal basis to generative AI. Don’t let the AI proverbially bite off more than it can chew. Spoon feed the AI.
Therefore, rather than an all-in-one prompt, we might find it wiser to do the prompting as we go along through the verification process. Upon seeing the initial answer to your initial prompt, you might at that juncture ask the AI app to derive verification questions. After the verification questions are presented, you might then ask the AI to answer the verification questions. And, after the answers are presented, you might then instruct the generative AI to adjust the initial answer accordingly.
I realize this is somewhat of a manual effort on your part. You are spoon-feeding the verification process. Generally, the chances are that the results will come out better. There is no proof per se that doing this as an all-in-one prompt won’t do the same exact thing. The all-in-one prompt is alluring. I would recommend that by and large you use spoon-feeding if you have the time and inclination to do so.
We can take this spoon-feeding even deeper or more detailed.
When you ask the AI app to derive verification questions, you could do this with simply one prompt. You might say to generate verification questions. You might say to generate a desired number of verification questions, such as telling the AI app to generate five or ten (any number you believe necessary).
Another way would be to ask for the verification questions one by one. The belief is that this might get the AI app to more independently come up with each verification question. If you ask for a slew of them all at once, there is a heightened chance that the bunch will be generated without being especially distinctive. The assumption is that a one-at-a-time request will get greater scrutiny as each question is being concocted, as illustrated in the sketch below.
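If you want to experiment with the one-at-a-time approach programmatically, a small loop along the following lines can do it. This is only a sketch; the `ask` callable is a hypothetical stand-in for a single call to whichever generative AI app you use.

```python
from typing import Callable, List

def plan_questions_one_at_a_time(ask: Callable[[str], str],
                                 query: str,
                                 draft_answer: str,
                                 how_many: int = 5) -> List[str]:
    """Request verification questions one per prompt rather than all in a single batch."""
    questions: List[str] = []
    for _ in range(how_many):
        prompt = (
            f"Question: {query}\n"
            f"Draft answer: {draft_answer}\n"
            f"Verification questions already posed: {questions or 'none yet'}\n"
            "Provide one new verification question, different from those already posed, "
            "that checks a specific fact stated in the draft answer."
        )
        questions.append(ask(prompt).strip())
    return questions
```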
I trust that my discussion about the verification process has made clear that there are a lot of avenues you can pursue.
In my view, if the verification process has to do with content that you are seriously worried about, go the long walk. Do not cut corners. If the content is so-so, you can cut corners. Of course, I would also add that if the verification process is truly essential because the content is life-or-death, please go ahead and use additional verification approaches such as checking the Internet, etc. Do not solely rely on this self-contained AI basis.
The researchers provided this indication of their Chain-of-Verification process overall:
- “Our overall process, which we call Chain-of-Verification (CoVe), thus performs four core steps:”
- “1. Generate Baseline Response: Given a query, generate the response using the LLM.”
- “2. Plan Verifications: Given both query and baseline response, generate a list of verification questions that could help to self-analyze if there are any mistakes in the original response.”
- “3. Execute Verifications: Answer each verification question in turn, and hence check the answer against the original response to check for inconsistencies or mistakes.”
- “4. Generate Final Verified Response: Given the discovered inconsistencies (if any), generate a revised response incorporating the verification results.”
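To make those four quoted steps concrete, here is a minimal Python sketch of the recipe running in a self-contained fashion. This is my own illustration rather than the researchers’ code; the use of the OpenAI Python SDK and the model name are assumptions, so substitute whatever generative AI app and API you actually use.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def ask_llm(prompt: str) -> str:
    """Send a single prompt to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model choice; substitute whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def chain_of_verification(query: str) -> str:
    # Step 1: Generate Baseline Response.
    baseline = ask_llm(query)

    # Step 2: Plan Verifications, keyed to the baseline answer.
    plan = ask_llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        "List specific verification questions, one per line, that would check the "
        "facts stated in the draft answer. Avoid yes-or-no phrasing."
    )
    questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Step 3: Execute Verifications, answering each question on its own.
    findings = "\n".join(f"{q}\n{ask_llm(q)}" for q in questions)

    # Step 4: Generate Final Verified Response.
    return ask_llm(
        f"Original question: {query}\nDraft answer: {baseline}\n"
        f"Verification findings:\n{findings}\n"
        "Revise the draft answer, correcting anything the findings contradict, "
        "and return only the revised answer."
    )

# Example usage:
# print(chain_of_verification("When did the Mexican-American war start and end?"))
```

Nothing about this is magic; it simply automates the same spoon-feeding that you could do by hand in the chat interface.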
They mention that their two core research questions consisted of this:
- “RQ1: Can COVE effectively reduce the rate of hallucinatory content produced by the LLM?”
- “RQ2: Can COVE be used to fix or remove incorrect generations without decreasing the amount of correct content?”
I’d like to say something about the second research question.
The researchers make an important point that there is a chance of the AI app opting to adjust an initial answer and making it essentially worse or at least possibly less compelling.
Here’s how.
Suppose the AI gives an initial answer that says the date of birth of Lincoln is such-and-such. Assume that the answer is indeed correct. You proceed into the verification process. Unfortunately, the verification indicates that the date of birth is incorrect. The AI app changes the date of birth, but the problem is that the change now makes the date incorrect.
We have shot ourselves in the foot by invoking the verification process. The correct answer was in hand. The verification regrettably went astray and indicated that the answer was incorrect. The AI then sought to correct a seemingly incorrect answer. The correction turns the answer into an incorrect one. I know that seems odd or frustrating, but it can happen.
The key will be to stay on your guard. Closely inspect the changes or adjustments made as a result of the verification undertaken by the AI. I emphasize over and over again that the responsibility still rests on your shoulders. Do not forsake your part in this. It is easy to allow the AI to run amok. Remain stridently the human in the loop.
Returning to the research study, here are some excerpts indicating the results:
- “In particular, we showed that models are able to answer verification questions with higher accuracy than when answering the original query by breaking down the verification into a set of simpler questions.”
- “Secondly, when answering the set of verification questions, we showed that controlling the attention of the model so that it cannot attend to its previous answers (factored CoVe) helps alleviate copying the same hallucinations.”
- “Overall, our method provides substantial performance gains over the original language model response just by asking the same model to deliberate on (verify) its answer.”
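The “factored CoVe” wrinkle mentioned in the excerpts essentially means answering each verification question in a fresh context that does not include the original draft answer, so the AI cannot simply copy its earlier claims. Here is a minimal sketch of that idea, with the `ask` callable again standing in as a hypothetical single call to whichever generative AI app you use.

```python
from typing import Callable, Dict, List

def execute_verifications_factored(ask: Callable[[str], str],
                                   questions: List[str]) -> Dict[str, str]:
    """Answer each verification question in isolation from the original draft answer."""
    # Each prompt contains only the verification question itself, so the reply
    # cannot simply restate whatever the draft answer already claimed.
    return {question: ask(question) for question in questions}
```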
It stands to reason that the approach ought to make a notable difference.
Here’s why.
You might be familiar with Chain-of-Thought (CoT), which is a prompting technique involving telling the generative AI to try to solve a problem or compose an essay step-by-step, see my coverage at the link here. Nearly everyone agrees that prodding generative AI into a step-by-step or chain-style process will usually get you better results.
The Chain-of-Verification leverages the same conception. Make the AI do things step-by-step. The chances are that doing so will improve the outcomes. Not always, but a lot of the time. We are thusly carrying over the already acknowledged benefits of the chain or step-by-step approach, applying them particularly to doing verifications.
An important aspect of having research studies look into these matters is that, absent such study, a hunch that the chain or step-by-step approach will improve verifications is nothing more than a hunch. By carefully studying the matter we can be more confident that the hunch is viable and workable.
Not all hunches are a payoff.
Some hunches are.
In this case, we have a research study that suggests we are in the right ballpark. The researchers noted the usual limitations and considerations, along with calling for additional research on these matters. I too hope that additional research will be undertaken, some of which I know is already underway, and I look forward to sharing the results with you in future columns.
Using Chain-Of-Verification With ChatGPT And Some Examples
I would bet that you might relish seeing how this verification process can be applied and what it looks like. The research study made use of examples such as “When did the Mexican-American war start and end?”, “Who are some politicians who were born in Boston?”, “Name some Endemic orchids of Vietnam”, “Who invented the first printing press and in what year?”, and so on.
I’ve done something similar with the Chain-of-Verification method and opted to use ChatGPT.
Fasten your seatbelt, we are going on a fun and informative ride.
We will start with my draft of an all-in-one verification prompt that I came up with:
- My primer prompt to ChatGPT: “After you answer my following question, I want you to try and verify the answer. The verification process is this. Step 1: Examine the answer and identify elements that might be important to verify, such as notable facts, figures, and any other significant considerations. Step 2: Come up with verification questions that are specific to those identified elements. Step 3: Separately answer each of the verification questions, one at a time. Step 4: Finally, after having answered the verification questions, review the initial answer that you gave to my question and adjust the initial answer based on the results of the verification questions. Other aspects: Make sure to show me the verification questions that you come up with, and their answers, and whatever adjustments to the initial answer you are going to make. It is okay for you to make the adjustments and you do not need to wait for my approval to do so. Do you understand all of these instructions?”
Here’s what ChatGPT responded with:
- ChatGPT response: “Yes, I understand the instructions you’ve provided. I will answer your initial question, then follow the verification process you outlined, including identifying elements to verify, coming up with verification questions, answering those questions, and adjusting the initial answer as necessary based on the results of the verification questions. Please go ahead and ask your initial question, and we can proceed from there.”
Things are looking good.
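As an aside, if you would rather script this kind of priming than paste it in by hand, the primer and your question can be sent as successive turns of a single conversation. The sketch below is only illustrative and assumes the OpenAI Python SDK and a particular model name; swap in whatever generative AI app and API you actually use.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# A condensed paraphrase of the primer prompt shown earlier.
PRIMER = (
    "After you answer my following question, verify the answer. Step 1: identify "
    "elements worth verifying. Step 2: write verification questions for those elements. "
    "Step 3: answer each verification question separately. Step 4: adjust the initial "
    "answer based on the results, showing the questions, answers, and any adjustments."
)

messages = [{"role": "user", "content": PRIMER}]
ack = client.chat.completions.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant", "content": ack.choices[0].message.content})

# Follow up with your actual question as the next turn of the same conversation.
messages.append({"role": "user", "content": "Your question goes here."})
reply = client.chat.completions.create(model="gpt-4", messages=messages)
print(reply.choices[0].message.content)
```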
Let’s next use the question from the research study that asked about the dates of the Mexican-American War. I went ahead and first used my above primer prompt and then followed up with this question:
- My prompt to ChatGPT: “When did the Mexican-American war start and end?”
Here is what ChatGPT responded with (response shown in italics to help make it stand out herein).
ChatGPT Response:
The Mexican-American War started on April 25, 1846, and ended on February 2, 1848. It lasted for approximately 1 year, 9 months, and 8 days.
Now, let’s proceed with the verification process:
Step 1: Identify Elements to Verify
Start date of the Mexican-American War (April 25, 1846)
End date of the Mexican-American War (February 2, 1848)
Step 2: Verification Questions
Is the start date of the Mexican-American War April 25, 1846?
Is the end date of the Mexican-American War February 2, 1848?
Step 3: Answer Verification Questions
Is the start date of the Mexican-American War April 25, 1846?
Yes, the start date of the Mexican-American War is April 25, 1846.
Is the end date of the Mexican-American War February 2, 1848?
Yes, the end date of the Mexican-American War is February 2, 1848.
Step 4: Adjust Initial Answer
The initial answer provided is accurate based on the verification questions and their answers. Therefore, there is no need to adjust the initial answer.
The start date of the Mexican-American War is April 25, 1846, and it ended on February 2, 1848.
End Of ChatGPT Response
I looked online to double-check the double-check. Numerous reputable sources online indicate that the dates are correct. I suppose there might be some historical deep-thinkers that disagree with the dates, but overall the dates seem relatively accepted and accurate.
You can see that ChatGPT complied with my all-in-one verification prompt. The key elements that ChatGPT opted to verify consisted of the start date and the end date. We might quibble that the indication of the length of time between the dates should also be verified. Maybe ChatGPT assumed that the duration was obvious and did not need verification.
The example of course was abundantly easy. Identifying which elements to verify was as simple as falling off a log. Answering the verification questions was also quite easy. You could say that we tossed ChatGPT a softball that it readily hit out of the ballpark.
One aspect that might catch your eye is that the verification questions were worded as Yes or No questions. We might want to change our all-in-one prompt to emphasize that when possible, any Yes or No questions should instead be elaborated questions or at least accompanied by an elaboration question.
For the moment, we shall proceed and continue using the aforementioned all-in-one verification prompt. Another question in the research study had to do with politicians born in Boston. Let’s ask that question to ChatGPT.
- My prompt to ChatGPT: “Who are some politicians who were born in Boston?”
Here is what ChatGPT responded with (response shown in italics to help make it stand out herein).
ChatGPT Response:
Here are some politicians who were born in Boston:
John F. Kennedy – John Fitzgerald Kennedy, commonly known as JFK, was born in Brookline, Massachusetts, a suburb of Boston. He served as the 35th President of the United States from 1961 until his assassination in 1963.
John Kerry – John Forbes Kerry was born in Aurora, Colorado, but he grew up in the Boston area. He is a former U.S. Senator from Massachusetts and served as the U.S. Secretary of State under President Barack Obama.
Tip O’Neill – Thomas Phillip “Tip” O’Neill Jr. was born in Cambridge, Massachusetts, which is part of the Greater Boston area. He served as the Speaker of the United States House of Representatives from 1977 to 1987.
Now, let’s proceed with the verification process:
Step 1: Identify Elements to Verify
Birthplaces of the mentioned politicians
Political positions held by the mentioned politicians
Step 2: Verification Questions
Was John F. Kennedy born in Brookline, Massachusetts?
Did John Kerry grow up in the Boston area?
Was Tip O’Neill born in Cambridge, Massachusetts?
Did John F. Kennedy serve as the 35th President of the United States?
Did John Kerry serve as the U.S. Secretary of State under President Barack Obama?
Did Tip O’Neill serve as the Speaker of the United States House of Representatives from 1977 to 1987?
Step 3: Answer Verification Questions
Was John F. Kennedy born in Brookline, Massachusetts?
Yes, John F. Kennedy was born in Brookline, Massachusetts.
Did John Kerry grow up in the Boston area?
Yes, John Kerry grew up in the Boston area.
Was Tip O’Neill born in Cambridge, Massachusetts?
Yes, Tip O’Neill was born in Cambridge, Massachusetts.
Did John F. Kennedy serve as the 35th President of the United States?
Yes, John F. Kennedy served as the 35th President of the United States.
Did John Kerry serve as the U.S. Secretary of State under President Barack Obama?
Yes, John Kerry served as the U.S. Secretary of State under President Barack Obama.
Did Tip O’Neill serve as the Speaker of the United States House of Representatives from 1977 to 1987?
Yes, Tip O’Neill served as the Speaker of the United States House of Representatives from 1977 to 1987.
Step 4: Adjust Initial Answer
The initial answer provided is accurate based on the verification questions and their answers. The mentioned politicians were born in the specified locations and held the mentioned political positions.
End Of The ChatGPT Response
The verification process seemed to appropriately follow the earlier stated all-in-one prompt that I had given to ChatGPT.
ChatGPT identified two elements for verification, consisting of the birthplaces and the political positions held. I’ll accept that. Given that the question was once again simple in framing, the task of identifying pertinent elements was easy, as was deriving the verification questions and answering them. The result was no adjustment to the initial answer.
I want you to take a closer look at the ChatGPT response.
Do you see something untoward?
I believe so.
We have discovered something astray that is another caveat or qualm about generative AI doing verifications. You might observe that John Kerry was not born in Boston. ChatGPT freely admits that John Kerry was born in Colorado and not in Boston. The twist is that ChatGPT tells us that John Kerry grew up in the Boston area, which, according to ChatGPT, is presumably sufficiently equivalent to being born in Boston.
I would have been more impressed if the verification process had overridden the initial response and said that John Kerry wasn’t born in Boston, noting that my question had specifically stated that the politicians were to be those who were born in Boston. Unfortunately, the verification merely reinforced the assumption that growing up in Boston is the same as being born there.
Disappointing.
Eye-opening.
By the way, for those of you who think this is a finicky argument and a fine line, I realize that a cogent argument is that someone who spent the bulk of their childhood in a particular city could be loosely construed as being born there. I get that. On the other hand, my question was pretty unambiguous. I asked about politicians born in Boston. I didn’t say that I was looking for politicians who were more or less born in Boston or that you could stretch the definition of “born” into meaning having lived there during childhood.
We shall do the Boston question again, and this time do things on a stepwise prompting basis. I will walk ChatGPT through the verification process. Doing so might make a difference in contrast to using the all-in-one verification process prompt.
Let’s see.
- My question to ChatGPT: “Who are some politicians who were born in Boston?”
The answer once again included John Kerry. To keep this discussion within my space limitations herein, I will cut to the chase.
My next prompt told ChatGPT that I’d like to do a verification. I provided the first step of identifying key elements for verification.
- My entry to ChatGPT: “I would like you to go through a verification process about the answer that you gave. First, please identify any key elements such as facts, figures, or other notable considerations that are important and should be subject to verification. List out those identified elements.”
It is kind of interesting that this time the listed elements consisted of the same two, birthplace and political position, but also added party affiliation and timeline. This shows you that due to probabilistic and statistical variations, the responses from generative AI can vary.
I decided to tell ChatGPT to focus on the birthplaces and the political positions. I felt this was fair to do since the prior tryout had only used those two factors. You can quibble about that if you wish.
- My entry to ChatGPT: “I’d like to concentrate on the birthplaces and the political positions. Make a list of the stated birthplaces and political positions and show a set of questions about those elements that can serve as verification questions.”
After looking at the verification questions, which were on par with the first tryout, I told ChatGPT to go ahead and answer the verification questions.
Here is an answer that brought a smile to my face:
ChatGPT response:
- “John Kerry – Born in Aurora, Colorado, with ties to Boston. Verification: John Kerry was not born in Boston; he was born in Aurora, Colorado. While he had ties to Boston later in life, he does not meet the criteria of being born in Boston.”
And, I completed the process by telling ChatGPT to adjust the initial answer based on having done the verifications.
- My entry to ChatGPT: “Based on your answers to the verification questions, adjust your initial answer to my first question and show a new answer accordingly.”
The final list no longer contained John Kerry.
I suppose this illustrates that doing the stepwise prompting for the verification process can potentially produce better results. I don’t want you to overinflate this one example. Sometimes spoon-feeding will be better, sometimes not. In my view, this was a lucky shot, based on a rather simple question and simple answers. I tried other examples with much more complex questions and complex answers, along with complex verification questions, and the results were somewhat mixed as to which path was the better one.
Overall, I’d still suggest that you ought to aim for the stepwise verification process if you have the time to do so and the matter warrants taking the time. I should also point out that the odds are that the stepwise verification process will be more costly in terms of computer processing time. If you are paying to use generative AI, the cost of the all-in-one verification process versus the stepwise verification process might become a factor in your decision of the best choice for you.
Conclusion
I ask that you consider trying out your own variation of the Chain-of-Verification technique.
Use various examples that are important to you. Spend some time getting comfortable with the approach. Make notes about what works and what seems stilted. Also, realize that what seems workable in one generative AI app might not be equally usable in another. If you are using several brands of generative AI apps, you’ll want to practice with each of those respectively.
A final thought for now.
Thomas H. Huxley, the famous biologist and anthropologist, said this: “The man of science has learned to believe in justification, not by faith, but by verification.”
I think that we should all embrace the essential concept of trust but verify when using generative AI. You are okay to have a somber undertone of suspicion about whatever generative AI produces. Allow yourself an on-your-toes modicum of trust, and then keep your eyes, ears, and mind highly alert to seek and avidly pursue verification.
Keep on double-checking, that’s the prudent path.