Running Regex Searches With a Grep Utility

By Sean O'Shea posted 02-01-2021 12:05

A grep utility is a tool which can be used to run regular expression (regex) searches through the text of multiple files. If you have a set of hundreds or thousands of PDFs or text files, and you want to extract text from these files which matches a regex pattern, a grep utility can run the search quickly, and export the matching text to a new file.

PowerGrep is a widely used grep utility with a graphical interface. A single user license can be purchased for less than $200. The open-source grep utility grepWin has fewer features, and lacks the support documentation provided for PowerGrep by its developer, Jan Goyvaerts. However, the no-frills layout of grepWin may allow novice users to run simple searches with less trouble.

A grep utility can help you collect email addresses from thousands of email messages; extract information listed in different boxes on hundreds of invoices to separate into columns on a spreadsheet; or simply prepare a list of the beginning and ending Bates numbers of a set of non-consecutively produced documents.

In grepWin, begin by selecting the folder with the files you want to review in the 'Search in' box. In this example, the grep utility is used to search through text files of several dozen decisions of the Supreme Court of the United States. [Note that grepWin will not search through PDF files, but PowerGrep can.]

You can set a filter to only search through files with file names matching a particular pattern, or choose to exclude certain directories.
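For readers who want to see the same idea outside grepWin, here is a minimal Python sketch of filtering by file name and excluding directories. It is not how grepWin works internally; the function name `select_files`, the folder name `drafts`, and the sample file names are all made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def select_files(root: Path, name_glob: str = "*.txt",
                 excluded_dirs: Iterable[str] = ()) -> List[Path]:
    """Return files under root matching name_glob, skipping excluded folders."""
    excluded = set(excluded_dirs)
    selected = []
    for path in sorted(root.rglob(name_glob)):
        # Folder names between root and the file itself.
        parent_names = path.relative_to(root).parts[:-1]
        if path.is_file() and not excluded.intersection(parent_names):
            selected.append(path)
    return selected


# Build a small throwaway folder (hypothetical names) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "drafts").mkdir()
    (root / "opinion_1.txt").write_text("410 U. S. 113", encoding="utf-8")
    (root / "opinion_2.pdf").write_text("not searched", encoding="utf-8")
    (root / "drafts" / "opinion_3.txt").write_text("5 U. S. 137", encoding="utf-8")

    names = [p.name for p in select_files(root, "*.txt", excluded_dirs=["drafts"])]

print(names)
```

Only the text file outside the excluded folder is selected; the PDF is skipped by the name filter and the draft by the directory exclusion.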

If you are running a regular expression search, select the radio button for 'Regex search'. A regex describes a pattern rather than a fixed string, so a single search can match many strings whose characters vary. In this example, the regex search is designed to find citations to United States Reports:

[0-9]{3}\sU\.\sS\.\s[0-9]{3}

This regex search will find any legal citation which has a three-digit volume number, followed by 'U. S.', followed by a three-digit page number.
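To try the pattern outside grepWin, the sketch below runs it with Python's `re` module. This is a different regex engine from the one grepWin uses, although this simple pattern is written the same way in both. The sample sentence is made up for illustration.

```python
import re

# Same pattern as above: three digits, "U. S.", three digits.
CITATION = re.compile(r"[0-9]{3}\sU\.\sS\.\s[0-9]{3}")

# Hypothetical sample text, not taken from the files searched in this post.
sample = ("See Roe v. Wade, 410 U. S. 113 (1973), and "
          "Marbury v. Madison, 5 U. S. 137 (1803).")

matches = CITATION.findall(sample)
print(matches)  # ['410 U. S. 113']
```

The second citation is missed because its volume number has only one digit, which is the limitation the broader pattern later in this post addresses.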

The first part of the search, '[0-9]{3}', matches any digit in the range listed in the square brackets '[]', repeated exactly the number of times given in the curly brackets '{}'. '\s' matches a single whitespace character, such as a blank space. Since a period is used in regex syntax to match any character, a literal period in the text we're searching for must be 'escaped' by preceding it with a backslash, '\'. If you want to search for citations to United States Reports which refer to higher or lower volume or page numbers, the regex search pattern can be modified this way:

([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})(\sU\.\sS\.\s)([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})

A pipe character, '|', is used as a Boolean 'OR' operator to separate alternative search terms.

Parentheses '()' can be used to group alternative search terms, and each group can later be referred to by its number. So, the first part of the search '([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})' looks for a sequence of 4, 3, 2, or 1 digits, as does the last part.
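As a quick check of how the three groups behave, here is a sketch using Python's `re` module (again, not grepWin's own engine). It also compares the pattern with a shorter spelling, `[0-9]{1,4}`, which is my own suggestion rather than part of the original search. The sample sentence is made up.

```python
import re

# The pattern from this post, with its three parenthesized groups.
WIDE = re.compile(
    r"([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})(\sU\.\sS\.\s)"
    r"([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})"
)
# Assumed shorter equivalent: {1,4} means "one to four digits".
COMPACT = re.compile(r"([0-9]{1,4})(\sU\.\sS\.\s)([0-9]{1,4})")

# Hypothetical sample text.
sample = "Compare 5 U. S. 137 with 410 U. S. 113 and 17 U. S. 316."

wide_hits = [m.group(0) for m in WIDE.finditer(sample)]
compact_hits = [m.group(0) for m in COMPACT.finditer(sample)]

# The three groups of the first match: volume, reporter, page.
parts = WIDE.search(sample).groups()

print(wide_hits)
print(parts)
```

On this sample both spellings find the same three citations, and the groups of the first match are the volume, the ' U. S. ' reporter text, and the page.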

One of the limitations of grepWin is that it will not export only the matches for a search. (This is possible with PowerGrep.) We can get around this shortcoming by replacing each match with the same matched text wrapped in delimiters. In the 'Replace with/Capture format' box enter:

~\1\2\3~

… the tildes are used as delimiters, and between them each parenthesized part of the regex pattern is referred to by its number, preceded by a backslash. We enter '\1\2\3' because there are three parts of the regex find pattern which are enclosed in parentheses, so the replacement puts the full citation back between the tildes. Click on 'Replace' and grepWin will find the regex pattern and edit the text files to include the delimiters.
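The same replacement can be tried on a single string in Python, which leaves the source files untouched. This is only a sketch: Python's `re.sub` happens to accept the same '\1\2\3' notation, but it is not grepWin's engine, and the sample line is made up.

```python
import re

PATTERN = re.compile(
    r"([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})(\sU\.\sS\.\s)"
    r"([0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1})"
)

# Hypothetical line of text from an opinion.
line = "as we held in Miranda v. Arizona, 384 U. S. 436, the"

# Wrap each match in tildes; \1, \2 and \3 put the three groups back.
marked = PATTERN.sub(r"~\1\2\3~", line)
print(marked)  # as we held in Miranda v. Arizona, ~384 U. S. 436~, the
```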

With the delimiters entered in the text files:

. . . it will be easy to import them into Excel, so the legal citations can be parsed out into a new separate column.

Before proceeding, click 'Search' in grepWin, so the results include the version of the text files edited with the replace function. Export the results by clicking on the small arrow at the bottom right of the screen:

The results will be saved in a text file like this, with each line beginning with the path for the source file, followed by the line of text on which the regex pattern appears:

This text file can then be parsed in Excel. Go to the Data tab and select From Text/CSV and choose the exported text file. In the drop-down menu in the middle of the import dialog box, select a ~ as the delimiter:

. . . complete the import by clicking on 'Load', and you will now have each of the searched-for citations in a separate column.
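If you would rather skip Excel, the tilde-delimited lines can also be split in a few lines of Python. This is a sketch under assumptions: the helper name `citations_from_line` is made up, and the sample line only approximates a grepWin export, whose exact layout may differ.

```python
from typing import List


def citations_from_line(line: str, delimiter: str = "~") -> List[str]:
    """Return the text found between pairs of delimiters on one line."""
    fields = line.split(delimiter)
    # Fields alternate: outside, inside, outside, inside, ...
    return fields[1::2]


# Hypothetical exported line: source path, then the matching line of text.
exported = (r"C:\cases\miranda.txt: held in ~384 U. S. 436~ "
            "and again in ~410 U. S. 113~ that")

cites = citations_from_line(exported)
print(cites)  # ['384 U. S. 436', '410 U. S. 113']
```

Taking every second field assumes the tilde never appears in the source text for any other reason, which is the same assumption the Excel import makes.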

It will be easy to copy the cites over to the Lexis Get & Print tool, or Westlaw Find & Print.

Note that when reviewing files with a grep utility, you may need to account for irregularities in the source files, such as unexpected line breaks.
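One way to handle an unexpected line break, sketched here in Python rather than in grepWin, is to collapse every run of whitespace into a single space before searching. The sample text is made up, and whether this is appropriate depends on your files.

```python
import re

CITATION = re.compile(r"[0-9]{3}\sU\.\sS\.\s[0-9]{3}")

# Hypothetical text where a line break falls in the middle of a citation.
text = "Miranda v. Arizona, 384 U. S.\n436 (1966)"

# Searching line by line misses the citation split across two lines.
per_line_hits = [hit for line in text.splitlines()
                 for hit in CITATION.findall(line)]

# Collapse all whitespace runs (including line breaks) to one space.
normalized = re.sub(r"\s+", " ", text)
normalized_hits = CITATION.findall(normalized)

print(per_line_hits)    # []
print(normalized_hits)  # ['384 U. S. 436']
```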

For those interested, please visit https://www.litigationsupporttipofthenight.com/ for more tips and information.

Disclaimer: The views expressed in this blog post are those of Sean O'Shea and do not reflect the views or opinions of his employer. All content provided in this blog post is for informational purposes only. Sean O'Shea makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. Sean O'Shea will not be liable for any errors or omissions in this information nor for the availability of this information. He will not be liable for any losses, injuries, or damages from the display or use of this information. This policy is subject to change at any time. Sean O'Shea is not an attorney, and nothing posted by him should be construed as legal advice. Sean O'Shea does not provide confirmation that any e-discovery technique or conduct is compliant with legal, regulatory, contractual or ethical requirements.
