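Using Grep & Regular Expressions to Search for Text Patterns in Linux

Introduction

The grep command is one of the most useful commands in a Linux terminal environment. The name grep stands for "global regular expression print". This means that you can use grep to check whether the input it receives matches a specified pattern. This seemingly trivial program is extremely powerful; its ability to filter input based on complex rules makes it a popular link in many command chains.

In this tutorial, you will explore the grep command's options, and then you'll dive into using regular expressions to do more advanced searching.

Prerequisites

To follow along with this guide, you will need access to a computer running a Linux-based operating system. This can either be a virtual private server which you've connected to with SSH or your local machine. Note that this tutorial was validated using a Linux server running Ubuntu 20.04, but the examples given should work on a computer running any version of any Linux distribution.

If you plan to use a remote server to follow this guide, we encourage you to first complete our Initial Server Setup guide. Doing so will set you up with a secure server environment (including a non-root user with sudo privileges and a firewall configured with UFW) which you can use to build your Linux skills.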

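Basic Usage

In this tutorial, you'll use grep to search the GNU General Public License version 3 for various words and phrases.

If you're on an Ubuntu system, you can find the file in the /usr/share/common-licenses folder. Copy it to your home directory (the command assumes that your home directory is your current working directory):

    cp /usr/share/common-licenses/GPL-3 .

If you're on another system, use the curl command to download a copy:

    curl -o GPL-3 https://www.gnu.org/licenses/gpl-3.0.txt

You'll also use the BSD license file in this tutorial. On Ubuntu, you can copy it to your home directory with the following command:

    cp /usr/share/common-licenses/BSD .

If you're on another system, create the file with the following command: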

    cat << 'EOF' > BSD
    Copyright (c) The Regents of the University of California.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:
    1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
    3. Neither the name of the University nor the names of its contributors
    may be used to endorse or promote products derived from this software
    without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
    ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    SUCH DAMAGE.
    EOF

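Now that you have the files, you can start working with grep.

In its most basic form, you use grep to match literal patterns within a text file. This means that if you pass grep a word to search for, it will print out every line in the file containing that word.

Execute the following command to use grep to search for every line that contains the word GNU:

    grep "GNU" GPL-3

The first argument, GNU, is the pattern you're searching for, while the second argument, GPL-3, is the input file you wish to search.

The resulting output will be every line containing the pattern text: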

Output
    GNU GENERAL PUBLIC LICENSE
    The GNU General Public License is a free, copyleft license for
    the GNU General Public License is intended to guarantee your freedom to
    GNU General Public License for most of our software; it applies also to
    Developers that use the GNU GPL protect your rights with two steps:
    "This License" refers to version 3 of the GNU General Public License.
    13. Use with the GNU Affero General Public License.
    under version 3 of the GNU Affero General Public License into a single
    ...
    ...
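If you would like to try a literal search without the license file, the following is a minimal sketch that builds a small throwaway file first. The sample lines and the variable names (sample, matches, found_bsd) are made up for illustration and are not part of the tutorial's files.

```shell
# Build a small, hypothetical sample file so the example does not depend on GPL-3.
sample=$(mktemp)
printf '%s\n' \
  'GNU GENERAL PUBLIC LICENSE' \
  'Everyone is permitted to copy and distribute verbatim copies' \
  'The GNU General Public License is a free, copyleft license' > "$sample"

# Print every line that contains the literal string GNU.
matches=$(grep "GNU" "$sample")
printf '%s\n' "$matches"

# grep exits with status 0 when at least one line matched and 1 when none did,
# so it can drive an if statement; -q suppresses the normal output.
if grep -q "BSD" "$sample"; then found_bsd=yes; else found_bsd=no; fi
echo "BSD found: $found_bsd"

rm -f "$sample"
```

The exit status is what makes grep convenient inside scripts and command chains, in addition to the printed lines.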

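On some systems, the pattern you searched for will be highlighted in the output.

Common Options

By default, grep will search for the exact specified pattern within the input file and return the lines it finds. You can make this behavior more useful by adding some optional flags to grep.

If you want grep to ignore the case of your search parameter and match upper- and lower-case variations alike, you can specify the -i or --ignore-case option.

Search for each instance of the word license (with upper, lower, or mixed cases) in the same file as before with the following command:

    grep -i "license" GPL-3

The results contain LICENSE, license, and License: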

Output
    GNU GENERAL PUBLIC LICENSE
    of this license document, but changing it is not allowed.
    The GNU General Public License is a free, copyleft license for
    The licenses for most software and other practical works are designed
    the GNU General Public License is intended to guarantee your freedom to
    GNU General Public License for most of our software; it applies also to
    price. Our General Public Licenses are designed to make sure that you
    (1) assert copyright on the software, and (2) offer you this License
    "This License" refers to version 3 of the GNU General Public License.
    "The Program" refers to any copyrightable work licensed under this
    ...
    ...

If there was an instance with LiCeNsE, that would have been returned as well.
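To see the effect of -i in isolation, here is a small sketch that counts matching lines with and without the flag. The five sample lines are hypothetical, and -c (count matching lines) is a standard grep option that this tutorial does not otherwise cover.

```shell
# Hypothetical sample: four spellings of "license" plus one unrelated line.
sample=$(mktemp)
printf '%s\n' 'LICENSE' 'this license document' 'Public License' 'LiCeNsE' 'no match here' > "$sample"

# -c prints the number of matching lines instead of the lines themselves.
exact=$(grep -c "license" "$sample")       # case-sensitive search
any_case=$(grep -ci "license" "$sample")   # case-insensitive search

echo "case-sensitive: $exact, case-insensitive: $any_case"

rm -f "$sample"
```

Only the all-lowercase spelling matches the case-sensitive search, while -i picks up every variation.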

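If you want to find all lines that do not contain a specified pattern, you can use the -v or --invert-match option.

Search for every line that does not contain the word the in the BSD license with the following command:

    grep -v "the" BSD

You'll receive this output: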

Output
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    are met:
    may be used to endorse or promote products derived from this software
    without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
    ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    ...
    ...

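Since you did not specify the "ignore case" option, the last two lines were returned as not containing the word the: they contain only the uppercase THE, which the case-sensitive pattern does not match.

It is often useful to know the line number on which a match occurs. You can do this by using the -n or --line-number option. Re-run the previous example with this flag added:

    grep -vn "the" BSD

This will return the following text: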

Output
    2:All rights reserved.
    3:
    4:Redistribution and use in source and binary forms, with or without
    6:are met:
    13: may be used to endorse or promote products derived from this software
    14: without specific prior written permission.
    15:
    16:THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
    17:ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    ...
    ...

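Now you can reference the line number if you want to make changes to every line that does not contain the. This is especially handy when working with source code.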
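The interaction between -v, -n, and -i can be sketched on a three-line sample. The lines below are borrowed loosely from the BSD text, but the file itself and the variable names are made up for this illustration.

```shell
# Hypothetical three-line sample: only line 2 contains a lowercase "the".
sample=$(mktemp)
printf '%s\n' \
  'All rights reserved.' \
  'notice, this list of conditions and the following disclaimer.' \
  'ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS BE LIABLE' > "$sample"

# -v inverts the match and -n prefixes each printed line with its line number.
inverted=$(grep -vn "the" "$sample")
printf '%s\n' "$inverted"

# Adding -i makes the uppercase THE count as a match too, so line 3 is dropped.
inverted_nocase=$(grep -vin "the" "$sample")
printf '%s\n' "$inverted_nocase"

rm -f "$sample"
```

Single-letter options can be combined behind one dash, which is why -vn and -vin work here.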

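Regular Expressions

In the introduction, you learned that grep stands for "global regular expression print". A "regular expression" is a text string that describes a particular search pattern.

Different applications and programming languages implement regular expressions slightly differently. In this tutorial you will only be exploring a small subset of the way that grep describes its patterns.

Literal Matches

In the previous examples in this tutorial, when you searched for the words GNU and the, you were actually searching for basic regular expressions which matched the exact string of characters GNU and the. Patterns that exactly specify the characters to be matched are called "literals" because they match the pattern literally, character-for-character.

It is helpful to think of these as matching a string of characters rather than matching a word. For example, the pattern the also matches inside the words other and Neither. This will become a more important distinction as you learn more complex patterns.

All alphabetical and numerical characters (as well as certain other characters) are matched literally unless modified by other expression mechanisms.

Anchor Matches

Anchors are special characters that specify where in the line a match must occur to be valid.

For instance, using anchors, you can specify that you only want to know about the lines that match GNU at the very beginning of the line. To do this, you could use the ^ anchor before the literal string.

Run the following command to search the GPL-3 file and find lines where GNU occurs at the very beginning of a line:

    grep "^GNU" GPL-3

This command will return the following two lines: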

Output
    GNU General Public License for most of our software; it applies also to
    GNU General Public License, you may choose any version ever published

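Similarly, you use the $ anchor at the end of a pattern to indicate that the match will only be valid if it occurs at the very end of a line.

This command will match every line ending with the word and in the GPL-3 file:

    grep "and$" GPL-3

You'll receive this output: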

Output
    that there is no warranty for this free software. For both users' and
    The precise terms and conditions for copying, distribution and
    License. Each licensee is addressed as "you". "Licensees" and
    receive it, in any medium, provided that you conspicuously and
    alternative is allowed only occasionally and noncommercially, and
    network may be denied when the modification itself materially and
    adversely affects the operation of the network or violates the rules and
    provisionally, unless and until the copyright holder explicitly and
    receives a license from the original licensors, to run, modify and
    make, use, sell, offer for sale, import and otherwise run, modify and
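The two anchors can be tried side by side on a small sample. This is a sketch only: the sample lines are hypothetical, and the patterns are written in single quotes (rather than the double quotes used above) so that the shell never tries to interpret the $ character.

```shell
# Hypothetical sample; the last printf argument produces an empty line.
sample=$(mktemp)
printf '%s\n' \
  'GNU General Public License' \
  'the GNU GPL' \
  'copying, distribution and' \
  'terms and conditions' \
  '' > "$sample"

starts=$(grep -c '^GNU' "$sample")   # lines that begin with GNU
ends=$(grep -c 'and$' "$sample")     # lines that end with and
blank=$(grep -c '^$' "$sample")      # both anchors together match empty lines

echo "start: $starts, end: $ends, blank: $blank"

rm -f "$sample"
```

Combining both anchors with nothing between them is a common way to find (or, with -v, to drop) empty lines.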

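Matching Any Character

The period character (.) is used in regular expressions to mean that any single character can exist at the specified location.

For example, to match anything in the GPL-3 file that has two characters and then the string cept, you would use the following pattern:

    grep "..cept" GPL-3

This command returns the following output: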

Output
    use, which is precisely where it is most unacceptable. Therefore, we
    infringement under applicable copyright law, except executing it on a
    tells the user that there is no warranty for the work (except to the
    License by making exceptions from one or more of its conditions.
    form of a separately written license, or stated as exceptions;
    You may not propagate or modify a covered work except as expressly
    9. Acceptance Not Required for Having Copies.
    ...
    ...

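This output has instances of both accept and except and variations of the two words. The pattern would also have matched z2cept if that was found as well.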
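As a quick check of how the period behaves, the sketch below feeds a few hypothetical words to grep through a pipe instead of reading a file. The word list and variable names are invented for the example.

```shell
# One word per line; grep reads standard input when no file is given.
words='accept
except
z2cept
concept
cept
xcept'

# Each . must be filled by exactly one character, so cept and xcept
# (zero and one leading characters) do not match.
dot_matches=$(printf '%s\n' "$words" | grep '..cept')
dot_count=$(printf '%s\n' "$words" | grep -c '..cept')

printf '%s\n' "$dot_matches"
echo "matching lines: $dot_count"
```

Note that concept matches as well, because the pattern only needs to be found somewhere within the line.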

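Bracket Expressions

By placing a group of characters within brackets ([ and ]), you can specify that the character at that position can be any one character found within the bracket group.

For example, to find the lines that contain too or two, you would specify those variations succinctly by using the following pattern:

    grep "t[wo]o" GPL-3

The output shows that both variations exist in the file, along with longer words that contain them, such as network and tools: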

Output
    your programs, too.
    freedoms that you received. You must make sure that they, too, receive
    Developers that use the GNU GPL protect your rights with two steps:
    a computer network, with no transfer of a copy, is not conveying.
    System Libraries, or general-purpose tools or generally available free
    Corresponding Source from a network server at no charge.
    ...
    ...

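Bracket notation gives you some interesting options. You can have the pattern match anything except the characters within a bracket by beginning the list of characters within the brackets with a ^ character.

This example is like the pattern .ode, but will not match the pattern code:

    grep "[^c]ode" GPL-3

Here's the output you'll receive: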

Output
    1. Source Code.
    model, to give anyone who possesses the object code either (1) a
    the only significant mode of use of the product.
    notice like this when it starts in an interactive mode:

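Notice that the second line returned does, in fact, contain the word code. This is not a failure of the regular expression or grep. Rather, this line was returned because earlier in the line, the pattern mode, found within the word model, was matched. The line was returned because there was an instance that matched the pattern. Likewise, the first line matches because the uppercase C in Code is not a lowercase c.

Another helpful feature of brackets is that you can specify a range of characters instead of individually typing every available character.

This means that if you want to find every line that begins with a capital letter, you can use the following pattern:

    grep "^[A-Z]" GPL-3

This expression returns the following output: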

Output
    GNU General Public License for most of our software; it applies also to
    States should not allow patents to restrict development and use of
    License. Each licensee is addressed as "you". "Licensees" and
    Component, and (b) serves only to enable use of the work with that
    Major Component, or to implement a Standard Interface for which an
    System Libraries, or general-purpose tools or generally available free
    Source.
    User Product is transferred to the recipient in perpetuity or for a
    ...
    ...

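Due to some legacy sorting issues, it is often more accurate to use POSIX character classes instead of character ranges like you just used. In some locales, a range such as [A-Z] follows the locale's collation order and can match characters other than the 26 uppercase ASCII letters.

Discussing every POSIX character class is beyond the scope of this guide, but one example that accomplishes the same task as the previous example uses the [:upper:] character class within a bracket selector:

    grep "^[[:upper:]]" GPL-3

The output will be the same as before.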
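The three bracket forms covered in this section (a character list, a negated list, and a POSIX class) can be compared on a short word list. This is a minimal sketch with a hypothetical list of words and illustrative variable names.

```shell
# Hypothetical word list, one word per line.
words='too
two
tao
code
Code
mode
model'

# A character list: the middle character may be w or o.
either=$(printf '%s\n' "$words" | grep -c 't[wo]o')

# A negated list: any character except a lowercase c before ode.
# Code still matches because the comparison is case-sensitive.
not_c=$(printf '%s\n' "$words" | grep -c '[^c]ode')

# A POSIX character class inside the brackets: lines starting with an uppercase letter.
upper=$(printf '%s\n' "$words" | grep -c '^[[:upper:]]')

echo "t[wo]o: $either, [^c]ode: $not_c, ^[[:upper:]]: $upper"
```

Note the double brackets in the last pattern: the outer pair forms the bracket expression and the inner [:upper:] names the class.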

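Repeat Pattern Zero or More Times

Finally, one of the most commonly used meta-characters is the asterisk, or *, which means "repeat the previous character or expression zero or more times".

To find each line in the GPL-3 file that contains an opening and closing parenthesis, with only letters and single spaces in between, use the following expression. Because grep uses basic regular expressions by default, the unescaped parentheses here match literal parenthesis characters:

    grep "([A-Za-z ]*)" GPL-3

You'll get the following output: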

Output
    Copyright (C) 2007 Free Software Foundation, Inc.
    distribution (with or without modification), making available to the
    than the work as a whole, that (a) is included in the normal form of
    Component, and (b) serves only to enable use of the work with that
    (if any) on which the executable work runs, or a compiler used to
    (including a physical distribution medium), accompanied by the
    (including a physical distribution medium), accompanied by a
    place (gratis or for a charge), and offer equivalent access to the
    ...
    ...
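The "zero or more" part is easy to overlook, so here is a small sketch that includes an empty pair of parentheses. The sample lines are hypothetical and only loosely modeled on the license text.

```shell
# Hypothetical sample covering letters, an empty pair, a digit, and no parentheses.
sample=$(mktemp)
printf '%s\n' \
  'Copyright (C) 2007' \
  'place (gratis or for a charge), and' \
  'empty parentheses ()' \
  'version (3) of' \
  'no parentheses here' > "$sample"

# In a basic regular expression, ( and ) are literal characters.
# [A-Za-z ]* allows any number of letters and spaces between them, including none.
star=$(grep -c '([A-Za-z ]*)' "$sample")

echo "matching lines: $star"

rm -f "$sample"
```

The line with (3) is not counted because a digit is outside the bracket list, while () is counted because zero repetitions are allowed.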

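So far you've used periods, asterisks, and other characters in your expressions, but sometimes you need to search for those characters specifically.

Escaping Meta-Characters

There are times where you'll need to search for a literal period or a literal opening bracket, especially when working with source code or configuration files. Because these characters have special meaning in regular expressions, you need to "escape" these characters to tell grep that you do not wish to use their special meaning in this case.

You escape characters by using the backslash character (\) in front of the character that would normally have a special meaning.

For instance, to find any line that begins with a capital letter and ends with a period, use the following expression which escapes the ending period so that it represents a literal period instead of the usual "any character" meaning:

    grep "^[A-Z].*\.$" GPL-3

You'll see the following output: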

Output
    Source.
    License by making exceptions from one or more of its conditions.
    License would be to refrain entirely from conveying the Program.
    ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
    SUCH DAMAGES.
    Also add information on how to contact you by electronic and paper mail.
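The difference an escape makes can be sketched by counting matches for the same text with and without the backslash. The sample lines are hypothetical. The last search uses -F (treat the pattern as a fixed string), a standard grep option that this tutorial does not cover, shown here only as an alternative to escaping.

```shell
# Hypothetical sample lines, fed to grep through a pipe.
lines='example.com
exampleXcom
Source.
Source'

# Unescaped: the . matches any character, so exampleXcom matches too.
unescaped=$(printf '%s\n' "$lines" | grep -c 'example.com')

# Escaped: \. matches only a literal period.
escaped=$(printf '%s\n' "$lines" | grep -c 'example\.com')

# -F disables regular expression syntax entirely, so no escaping is needed.
fixed=$(printf '%s\n' "$lines" | grep -cF 'example.com')

# The tutorial's pattern: starts with a capital letter, ends with a literal period.
ends_dot=$(printf '%s\n' "$lines" | grep -c '^[A-Z].*\.$')

echo "unescaped: $unescaped, escaped: $escaped, fixed: $fixed, ends with period: $ends_dot"
```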

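Now let's look at other regular expression options.

Extended Regular Expressions

The grep command supports a more extensive regular expression language by using the -E flag or by calling the egrep command instead of grep. Newer releases of GNU grep treat egrep as obsolescent and print a warning when it is used, so grep -E is the safer choice in new scripts.

These options open up the capabilities of "extended regular expressions". Extended regular expressions include all of the basic meta-characters, along with additional meta-characters to express more complex matches.

Grouping

One of the most useful abilities that extended regular expressions open up is the ability to group expressions together to manipulate or reference as one unit.

To group expressions together, wrap them in parentheses. If you would like to use parentheses for grouping without using extended regular expressions, you can escape them with the backslash to enable this functionality. This means that the following three expressions are functionally equivalent:

    grep "\(grouping\)" file.txt
    grep -E "(grouping)" file.txt
    egrep "(grouping)" file.txt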
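The equivalence can be checked on a small sample, together with what happens when the parentheses are left unescaped in a basic expression. This is a sketch with a hypothetical sample file standing in for file.txt. It leaves out egrep and uses only grep and grep -E.

```shell
# Hypothetical sample: one line has literal parentheses around the word.
sample=$(mktemp)
printf '%s\n' 'a grouping test' 'literal (grouping) text' 'nothing' > "$sample"

# Basic syntax with escaped parentheses: a group around the word grouping.
bre=$(grep -c '\(grouping\)' "$sample")

# Extended syntax: unescaped parentheses form the same group.
ere=$(grep -cE '(grouping)' "$sample")

# Basic syntax with unescaped parentheses: they are literal characters,
# so only the line that really contains (grouping) matches.
literal=$(grep -c '(grouping)' "$sample")

echo "escaped basic: $bre, extended: $ere, unescaped basic: $literal"

rm -f "$sample"
```

In other words, the meaning of plain and escaped parentheses is swapped between the two syntaxes.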

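Alternation

Similar to how bracket expressions can specify different possible choices for single character matches, alternation allows you to specify alternative matches for strings or expression sets.

To indicate alternation, use the pipe character |. These are often used within parenthetical grouping to specify that one of two or more possibilities should be considered a match.

The following will find either GPL or General Public License in the text:

    grep -E "(GPL|General Public License)" GPL-3

The output looks like this: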

Output
    The GNU General Public License is a free, copyleft license for
    the GNU General Public License is intended to guarantee your freedom to
    GNU General Public License for most of our software; it applies also to
    price. Our General Public Licenses are designed to make sure that you
    Developers that use the GNU GPL protect your rights with two steps:
    For the developers' and authors' protection, the GPL clearly explains
    authors' sake, the GPL requires that modified versions be marked as
    have designed this version of the GPL to prohibit the practice for those
    ...
    ...

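Alternation can select between more than two choices by adding additional choices within the selection group separated by additional pipe (|) characters.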
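A short sketch of two-way and three-way alternation follows. The four sample lines and the third alternative, BSD, are made up for illustration.

```shell
# Hypothetical sample lines, one per line.
text='the GNU GPL
General Public License
Lesser GPL
BSD license'

# Two alternatives inside one group.
two_way=$(printf '%s\n' "$text" | grep -cE '(GPL|General Public License)')

# A third alternative is added with one more pipe character.
three_way=$(printf '%s\n' "$text" | grep -cE '(GPL|General Public License|BSD)')

echo "two alternatives: $two_way, three alternatives: $three_way"
```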

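Quantifiers

Like the * meta-character that matches the previous character or character set zero or more times, there are other meta-characters available in extended regular expressions that specify the number of occurrences.

To match a character zero or one times, you can use the ? character. This makes the character or character set that came before optional, in essence.

The following matches copyright and right by putting copy in an optional group:

    grep -E "(copy)?right" GPL-3

You'll receive this output: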

Output
    Copyright (C) 2007 Free Software Foundation, Inc.
    To protect your rights, we need to prevent others from denying you
    these rights or asking you to surrender the rights. Therefore, you have
    know their rights.
    Developers that use the GNU GPL protect your rights with two steps:
    (1) assert copyright on the software, and (2) offer you this License
    "Copyright" also means copyright-like laws that apply to other kinds of
    ...

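The + character matches an expression one or more times. This is almost like the * meta-character, but with the + character, the expression must match at least once.

The following expression matches the string free plus one or more characters that are not white space characters:

    grep -E "free[^[:space:]]+" GPL-3

You'll see this output: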

Output
    The GNU General Public License is a free, copyleft license for
    to take away your freedom to share and change the works. By contrast,
    the GNU General Public License is intended to guarantee your freedom to
    When we speak of free software, we are referring to freedom, not
    have the freedom to distribute copies of free software (and charge for
    you modify it: responsibilities to respect the freedom of others.
    freedoms that you received. You must make sure that they, too, receive
    protecting users' freedom to change the software. The systematic
    of the GPL, as needed to protect the freedom of users.
    patents cannot be used to render the program non-free.
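The difference between ?, +, and * can be sketched on two short word lists. The lists and variable names are hypothetical. The last search swaps + for * only to show the contrast and is not a command from this tutorial.

```shell
# Hypothetical word lists, one word per line.
rights='Copyright
copyright
right
rig'
frees='free
freedom
free,'

# ? makes the (copy) group optional; Copyright matches through its "right" part.
optional=$(printf '%s\n' "$rights" | grep -cE '(copy)?right')

# + requires at least one non-space character after free, so a bare "free" fails.
one_or_more=$(printf '%s\n' "$frees" | grep -cE 'free[^[:space:]]+')

# * also accepts zero characters after free, so every line matches.
zero_or_more=$(printf '%s\n' "$frees" | grep -cE 'free[^[:space:]]*')

echo "?: $optional, +: $one_or_more, *: $zero_or_more"
```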

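Specifying Match Repetition

To specify the number of times that a match is repeated, use the brace characters ({ and }). These characters let you specify an exact number, a range, or an upper or lower bound on the number of times an expression can match.

Use the following expression to find all of the lines in the GPL-3 file that contain three vowels in a row:

    grep -E "[AEIOUaeiou]{3}" GPL-3

Each line returned has a word with three consecutive vowels: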

Output
    changed, so that their problems will not be attributed erroneously to
    authors of previous versions.
    receive it, in any medium, provided that you conspicuously and
    give under the previous paragraph, plus a right to possession of the
    covered work so as to satisfy simultaneously your obligations under this

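To match any words that have between 16 and 20 characters, use the following expression. Because the pattern is not anchored to word boundaries, a word longer than 20 letters would also match, since it contains a run of 20 letters:

    grep -E "[[:alpha:]]{16,20}" GPL-3

Here's this command's output: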

Output
    certain responsibilities if you distribute copies of the software, or if
    you modify it: responsibilities to respect the freedom of others.
    c) Prohibiting misrepresentation of the origin of that material, or

Only lines containing words within that length are displayed.
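Both brace forms can be tried on a short word list, as sketched below. The sample text is hypothetical. The final command adds -o, which prints only the matched text instead of the whole line. That option is supported by GNU and BSD grep but is not part of this tutorial, so treat it as an optional extra.

```shell
# Hypothetical sample lines, one per line.
words='previous versions
conspicuously
cat
responsibilities
misrepresentation'

# {3}: exactly three vowels in a row (iou in previous, uou in conspicuously).
three_vowels=$(printf '%s\n' "$words" | grep -cE '[AEIOUaeiou]{3}')

# {16,20}: a run of 16 to 20 letters.
long_lines=$(printf '%s\n' "$words" | grep -cE '[[:alpha:]]{16,20}')

# -o prints just the part of the line that matched.
long_words=$(printf 'certain responsibilities if you distribute\n' | grep -oE '[[:alpha:]]{16,20}')

echo "three vowels: $three_vowels, long words: $long_lines, extracted: $long_words"
```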

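Conclusion

grep is useful in finding patterns within files or within the file system hierarchy, so it's worth spending time getting comfortable with its options and syntax.

Regular expressions are even more versatile, and can be used with many popular programs. For instance, many text editors implement regular expressions for searching and replacing text.

Furthermore, most modern programming languages use regular expressions to perform procedures on specific pieces of data. Once you understand regular expressions, you'll be able to transfer that knowledge to many common computer-related tasks, from performing advanced searches in your text editor to validating user input.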

Source:
https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux