When a lawyer searches in a legal database, that single search box is like a lure: Put in your search terms and rely on the excellence of the search algorithms to catch the right fish.
At first glance, the various legal research databases seem similar. They all promote their natural language searching, so when keywords go into the search box, researchers expect relevant results. Researchers would also expect the results to be roughly similar no matter which legal database they use. After all, the algorithms are all trying to solve the same problem: translating a specific query into relevant results.
The reality is much different. In a comparison of six legal databases—Casetext, Fastcase, Google Scholar, Lexis Advance, Ravel and Westlaw—when researchers entered the identical search in the same jurisdictional database of reported cases, there was hardly any overlap in the top 10 cases returned in the results. Only 7 percent of the cases were in all six databases, and 40 percent of the cases each database returned in the results set were unique to that database. It turns out that when you give six groups of humans the same problem to solve, the results are a testament to the variability of human problem-solving. If your starting point for research is a keyword search, the divergent results in each of these six databases will frame the rest of your research in a very different way.
SEEING IS BELIEVING
It is easy to forget that the algorithms returning search results are entirely human constructs. Humans made choices about how the algorithms would work, and those choices become the biases and assumptions built into research systems. Bias, for an algorithm, simply means a preference built into a computer system. While researchers can’t know the specific choices those humans made, we can identify the variables at work in creating legal research algorithms.
Search grammar: Which terms are automatically stemmed (reduced to their root form) and which are not, which synonyms are automatically added, which legal phrases are recognized without quotation marks, how numbers are treated, and how the number of word occurrences in a document determines results—these are examples of search grammar.
Term count: If your search has six words and only five words are in a document, the algorithm can be set to include or exclude the five-term document.
Proximity: The algorithm is preset to determine how close search terms have to be to each other to be returned in the top results.
Machine learning: The programmers decide whether to include instructions that allow the algorithm to “learn” from the data in the database and make predictions.
Prioritization: Relevance ranking is one form of prioritizing that emphasizes certain things at the expense of others. U.S. Supreme Court cases, newer cases or well-cited cases may get a relevance boost.
Network analysis: The extent to which the algorithm uses citation analysis to find and order results is a human choice.
Classification and content analysis: Database providers with full classification systems and access to secondary sources to mine may be programming their algorithms to utilize that value-added content.
Filtering: Decisions about what content to include and exclude from a database affect results. These decisions may be based on copyright or other access issues.
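The effect of these design choices can be made concrete with a toy example. The sketch below is purely illustrative—the case names, citation counts, and scoring formulas are invented for this article and do not reflect any vendor's actual algorithm. It shows how two different sets of choices (a strict term-count filter versus partial matching with a citation boost) can rank the same three documents differently for the same query.

```python
# Hypothetical data and scoring weights, for illustration only.
QUERY = ["landlord", "negligence", "notice"]

# Toy "cases": each has a set of terms and a citation count.
CASES = {
    "Case A": {"terms": {"landlord", "negligence", "notice", "repair"}, "citations": 5},
    "Case B": {"terms": {"landlord", "negligence", "duty"}, "citations": 120},
    "Case C": {"terms": {"negligence", "notice", "foreseeability"}, "citations": 40},
}

def score_strict(case):
    """Design choice 1: require every query term (a term-count filter);
    ignore citations entirely."""
    matched = sum(t in case["terms"] for t in QUERY)
    return matched if matched == len(QUERY) else 0

def score_cited(case):
    """Design choice 2: allow partial matches, but boost well-cited
    cases (a simple prioritization/network-analysis choice)."""
    matched = sum(t in case["terms"] for t in QUERY)
    return matched + case["citations"] / 100

def rank(score_fn):
    """Order the cases from highest score to lowest."""
    return sorted(CASES, key=lambda name: score_fn(CASES[name]), reverse=True)

print(rank(score_strict))  # Case A, the only full match, comes first
print(rank(score_cited))   # Case B's citations push it to the top
```

Same query, same three cases—yet the two configurations put different cases at the top, which is the divergence the six-database study observed at scale.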
Once these decisions have been made and the code has been implemented, legal researchers don’t know how those human choices are affecting search results. But the choices matter to what a researcher sees in the results set. Code is law, as Lawrence Lessig famously said in his 1999 book, Code and Other Laws of Cyberspace.
Susan Nevelow Mart is an associate professor and the director of the law library at the University of Colorado Law School in Boulder.
This article was published in the March 2018 issue of the ABA Journal with the title “Results May Vary: Which database a researcher uses makes a difference.”