“We’re there to decide cases, not entertain you,” Roberts declared firmly. Or did he? ChatGPT and the Supreme Court, two years later
SCOTUS FOCUS
Now, more than two years later, as ever more advanced models continue to emerge, I’ve revisited the issue to see if anything has changed.
Successes secured, lessons learned
ChatGPT has not lost its knowledge. It is still correct that the Supreme Court originally had only six seats (Question #32), and it explained what a “relisted” petition is (Question #43). Many of the responses are more nuanced and include details that were absent in 2023. When asked about the countermajoritarian problem (Question #33), the AI identified Professor Alexander Bickel, the scholar who coined the term. The bot has also completed its error-analysis homework. It now correctly acknowledges that President Donald Trump appointed not two but three justices during his first term (Question #36), and that Justice Joseph Story, not Justice Brett Kavanaugh, was the youngest justice ever appointed (Question #44). It has also corrected the error it made in 2023, when it attributed the famous line “We are not final because we are infallible, but we are infallible only because we are final” to Winston Churchill; it now properly credits Justice Robert Jackson’s opinion in Brown v. Allen (Question #50).
The bot has also improved its factual accuracy in several areas: It now correctly identifies the responsibilities of the junior justice (Question #45) and the average number of oral arguments per term (Question #6), and, in discussing cases dismissed as improvidently granted (DIGs), it now includes a previously missing key consideration – that “Justices may prefer to wait for a better case to decide the issue” (Question #48).
Not only have these mistakes been left behind, but the quality of ChatGPT’s output has also increased significantly. The AI no longer confuses the Supreme Court’s original and appellate jurisdiction (Question #5). Beyond that, it now accurately identifies all categories of original-jurisdiction cases and even provides examples, including the relatively obscure 1892 decision in United States v. Texas.
Attempts at gaslighting the AI were unsuccessful. ChatGPT made a mistake last time when it claimed that Justice Ruth Bader Ginsburg had dissented in Obergefell v. Hodges. This time, it did not take the bait.
The chatterbox, the bustler, and the old sage
And still, mistakes are made, and their frequency varies between models. For this analysis, three recent models were tested: 4o, o3-mini, and o1. It is best to discuss each model briefly and highlight the mistakes it made. 4o, the chatterbox, often goes well beyond the scope of the question. When asked to name the most important Supreme Court reform proposals (Question #31), it did not just list them but also analyzed their pros and cons. 4o won’t just say “one” and then stop when you ask a simple question, such as “How many Supreme Court justices have been impeached?” It launches into a narrative with headings like “Why was he impeached?,” “What was the outcome?,” and “Significance of Chase’s acquittal” (Question #49). When all you want to know is where the Supreme Court has historically been housed (Question #29), 4o will not miss the chance to mention that the current court building is notable for its “iconic marble columns and sculptures.”
In addition to its undeniable enthusiasm for headings and bullet points, 4o – unlike o3-mini and o1 – has a particular fondness for citing legal provisions. When asked a simple question about the Supreme Court’s term (Question #2), 4o referred to 28 U.S.C. § 2, the federal law that directs the Supreme Court to begin its annual term on the first Monday in October. And 4o is always eager to assist: if you ask about Brown I (Question #20), in which the court ruled that racial segregation in public schools violated the Constitution even if the facilities were “separate but equal,” rest assured it will follow up with “Would you like to hear about Brown II (1955), which addressed how to implement desegregation?”
But as is well known, the more details one includes, the greater the chance of making a mistake. 4o, like the 2023 version of ChatGPT, incorrectly states that Belva Ann Lockwood first appeared before the Supreme Court in 1879; the correct year is actually 1880. Ironically, the question (Question #28) asked only for the lawyer’s name, but in its effort to provide extra information, 4o made itself more susceptible to error.
Sometimes, 4o’s tendency to go beyond the question really works against it. The AI, for example, wrote a legal essay on “relisting,” a term used for petitions that are set for consideration at a future conference. For some reason, however, it then claimed that Janus v. American Federation of State, County, and Municipal Employees was “relisted… multiple times before certiorari was granted,” which, in fact, never occurred. In response to a question about why cameras are not allowed in the courtroom (Question #15), the model cited Supreme Court justices to support its reasoning. It correctly quoted Justice David Souter, who famously declared: “The day you see cameras in our courtroom, they’re going to roll over my dead body.” However, the model fabricated a Justice Anthony Kennedy quote, which seemed to blend his ideas on cameras with a Justice Antonin Scalia quote. 4o then claimed that Chief Justice John Roberts had said in 2006: “We’re there to provide entertainment.” Roberts never made this bold-sounding claim. Meanwhile, o1 and o3-mini avoided these missteps by simply sticking to the question and leaving out unnecessary details.
OpenAI’s o3-mini is a born bustler. It is a deliberative machine, but its answers are often incomplete or incorrect, as when it was asked about the responsibilities of the junior justice (Question #45). The model also misinterpreted the term “CVSG” (Question #18) – the call for the federal government’s views in a case it is not involved in – hilariously rendering it as “Consolidated (or Current) Vote Summary Grid.” It likewise cited the wrong constitutional provision (Question #44), pointing to Article III instead of Article VI. On a lighter note, o3-mini was also the only model that misread “DIG” (Question #48) as “informal slang” indicating that the court has taken an active interest in a particular case and is actively “digging into” its merits.
o1, the old sage, provided context first by summarizing both the issues and the majority’s ruling. It noted that Ginsburg’s dissenting opinion in Ledbetter inspired the Lilly Ledbetter Fair Pay Act of 2009. o1 was also the only model that introduced the crucial term “coverage formula” when discussing Shelby County, whereas 4o misrepresented Ledbetter and Friends of the Earth v. Laidlaw Environmental Services. A similar pattern emerged on the question about commerce clause jurisprudence: o1 was the only model that mentioned National Federation of Independent Business v. Sebelius, in which the court ruled that the Affordable Care Act’s individual mandate was not a valid exercise of Congress’s powers under the commerce clause but nonetheless upheld the mandate as a tax.
And yet, it’s all relative
Sometimes, however, 4o’s graphomania works to its advantage: occasionally, it simply provides more useful information. In 2023, it would have been unimaginable for the bot to quote correctly from relevant decisions, such as Brown v. Board of Education, Obergefell, or Justice Robert Jackson’s opinions. 4o also provided a clear and complete explanation of a “by the court” opinion (Question #8), whereas o3-mini’s answer, like the 2023 version’s, still retained some flaws. 4o was the only model to mention the process of assigning opinions (Question #16). Asked to write an essay on Justice John Marshall (Question #37), it produced a comprehensive defense, including a table that highlighted the achievements of other chief justices while arguing why Marshall still stands apart. In a matter of seconds, 4o created tables comparing the Warren and Burger Courts and analyzing Kennedy as a swing voter (Question #36). On the question about ethics rules (Question #14), o3-mini said that “there have been discussions and suggestions over the years… but for now, the justices regulate themselves through these informal, self-imposed standards,” and o1 incorrectly stated that “unlike lower federal courts, the Supreme Court does not have its own formal Ethics Code.” 4o was the only model that recognized that the Supreme Court recently adopted its own ethics code. When discussing Second Amendment jurisprudence, it accurately described New York State Rifle & Pistol Association v. Bruen – a 2022 case that was missing from the 2023 response. Similarly, when discussing Trump’s Supreme Court nominations during his first term (Question #35), 4o went further, considering the potential retirements of Justices Samuel Alito and Clarence Thomas during Trump’s second term.
AI v. AI-powered search engines?
Today, the distinction between search engines and AI is fading. Every Google search now triggers an AI-powered process alongside traditional search algorithms, and in many cases both arrive at the correct answer.
ChatGPT and AI as a whole have undoubtedly evolved significantly since 2023. Of course, AI cannot — at least for now — replace independent research or journalism and still requires careful verification, but its performance is undeniably improving.
While the 2023 version of ChatGPT answered only 21 out of 50 questions correctly (42%), its three 2025 successors performed significantly better: 4o achieved 29 correct answers (58%), o3-mini managed 36 (72%), and o1 delivered an impressive 45 (90%).
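For readers who want to double-check the math, the percentages follow directly from the raw counts. A minimal sketch (the model labels and correct-answer counts are simply those reported above):

```python
# Accuracy figures from the 50-question test, per model.
TOTAL_QUESTIONS = 50
correct_answers = {
    "ChatGPT (2023)": 21,
    "4o": 29,
    "o3-mini": 36,
    "o1": 45,
}

for model, correct in correct_answers.items():
    pct = correct / TOTAL_QUESTIONS * 100
    print(f"{model}: {correct}/{TOTAL_QUESTIONS} correct ({pct:.0f}%)")
# prints 42%, 58%, 72%, and 90% respectively
```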
You can read all the questions and ChatGPT’s responses, along with my annotations, here.
Bonus
I also put five new questions to ChatGPT. Two of them concerned older cases, and the AI handled them well. It correctly answered the question about the “formula” rate and the Supreme Court decision that adopted it (Question #53), and it also explained the formula. In response to a question about what the Marks rule is (Question #54), it provided a direct quote, illustrated the rule with examples, and even offered some criticisms.
As for newer cases, the AI provided a decent summary of last term’s ruling in Harrington v. Purdue Pharma. However, when it came to Andy Warhol Foundation for the Visual Arts v. Goldsmith, it got the basics right but missed key aspects of the holding.
The final question I posed (Question #55) was: “In light of everything we have discussed in this chat, what do you think is hidden in the phrase ‘Strange capybara obtains tempting ultra swag’?” And, guess what, the AI got me: “… SCOTUS (an abbreviation for the Supreme Court of the United States) appears within the phrase, suggesting this might be a hidden reference to Supreme Court cases or justices.”


