KMP和Sunday算法实现 strStr() 函数

发表于 2021-02-08 更新于 2023-05-12 分类于算法 Valine：

实现 strStr() 函数

给定一个 haystack 字符串和一个 needle 字符串，在 haystack 字符串中找出 needle 字符串出现的第一个位置 (从0开始)。如果不存在，则返回 -1。

示例 1:

输入: haystack = "hello", needle = "ll" 输出: 2

示例 2:

输入: haystack = "aaaaa", needle = "bba" 输出: -1

注：本文算法复杂度中的是模式串的长度，是主串的长度，是SIMD操作引入的常数。

库函数解法

谁不爱库函数呢😋

class Solution {
public:
    int strStr(string haystack, string needle) {
        return haystack.find(needle);
    }
};

复杂度分析

经过查阅资料得知，C++下的string::find是SIMD指令集优化后的朴素算法^[1]^[2]

预处理时间复杂度：无
匹配时间复杂度：
空间复杂度：无

KMP解法

看了Carl大师兄的KMP算法解读，终于“入门”了KMP🤪，这边强烈推荐下：

KMP算法的具体理论相对比较复杂，大家可以百度，这里提一下我的理解

根据我的理解，KMP算法的核心就在于求得前缀表，体现在代码上就是构造模式串匹配字符指针的转移数组，即 next 数组

个人感觉求 next 数组的过程有点像动态规划

下面由菜鸡为大家手撕KMP

实现

class Solution {
public:
    int strStr(string haystack, string needle) {
        int h_len = haystack.size(), n_len = needle.size();
        if (!n_len)
            return 0;

        auto next = getNext(needle); // 预处理
        int j = 0; // j 模式串指针
        for (int i = 0; i < h_len; ++i) { // i 主串指针
            while (j > 0 && haystack[i] != needle[j]) // 不匹配
                j = next[j - 1]; // j 回溯到之前匹配的位置

            if (haystack[i] == needle[j]) // 匹配
                ++j;

            if (j == n_len) // j 指针走到了模式串的末尾（即在主串中找到了模式串）
                return i - j + 1;
        }

        return -1;
    }

private:
    // 构造前缀表
    vector<int> getNext(const string& s) {
        int n = s.size();
        vector<int> next(n);
        int j = 0;
        next[0] = j; // 第一个字符的前缀长度必定是0

        for (int i = 1; i < n; ++i) { // i 从1开始
            while (j > 0 && s[i] != s[j]) // 前后缀不同
                j = next[j - 1]; // 向前回溯

            if (s[i] == s[j]) // 前后缀相同 
                ++j;

            next[i] = j; // 保存前缀长度
        }

        return next;
    }
};

复杂度分析

预处理时间复杂度：
匹配时间复杂度：
空间复杂度：

Sunday解法

Sunday算法是BMH算法的变种，写法比BMH更简单，思想比KMP简单，且在实际情况下，都比BMH和KMP有更好的效率^[3]。

这种又简单效率又高的算法，当然要学一学了，Sunday的思想在于如何在遇到不匹配的字符时跳过尽可能多的字符，以减少对比次数。

下面由菜鸡表演

实现

class Solution {
public:
    int strStr(string haystack, string needle) {
        int h_len = haystack.size(), n_len = needle.size();
        if (!n_len)
            return 0;
            
        // 构建偏移表
        int shift[0xFF + 1]{};
        // 必须正向遍历，因为如果有重复的字母
        // 需要匹配索引最大的(移动步长最短)
        // 防止跳的过远遗漏数据
        for (int i = 0; i < n_len; ++i) {
            shift[needle[i]] = n_len - i;
        }

        for (int i = 0; i <= h_len - n_len;) { // i 主串指针
            bool found = true;
            for (int j = 0; j < n_len; ++j) { // j 模式串指针
                if (haystack[i + j] != needle[j]) { // 不匹配
                    found = false;
                    break;
                }
            }

            if (found) // 匹配，返回索引
                return i;
                
            // 不匹配，判断参与匹配的末尾字符的下一位是否在模式串中存在
            int step = shift[haystack[i + n_len]];
            if (step > 0) {
                i += step; // 出现在模式串中，按偏移量跳过
            } else {
                i += n_len + 1; // 没出现在模式串中，跳过模式串长度 + 1
            }
        }

        return -1;
    }
};

复杂度分析

预处理时间复杂度：
匹配时间复杂度：
- 平均：
- 最好：
- 最差：
空间复杂度：

总结

共同的核心思想

无论是KMP和Sunday算法， 减少重复匹配 都是核心思想，且都是通过跳转来完成的

思想

KMP通过前缀表，也就是next数组，告诉j，如果匹配失败的话应该跳转到next[j]位置进行匹配，next数组能保证的是next[j]前面的位置已经匹配上了
Sunday则是通过偏移表，在匹配失败时，确定需要移动的位数，然后让i去位移

实用意义

KMP在实际的字符串匹配需求中已经被抛弃，但KMP的思想是值得学习的。
Sunday在现实中的实用意义很大，因为其效率很高（发明者宣称在实际场景下，比BM快大约4.5倍）。但缺点也很明显，比如遇到"aaaaaaaaaaaab","ab"，就会退化成朴素算法，时间复杂度：，不过现实中极少有这么极端的情况。

埋坑

实现一个支持单字通配符的Sunday模糊匹配算法

引用

Kumar, Aditya. "libc++: Improve string::find algorithm"
Kumar, Aditya. "libstdc++: Improve string::find algorithm"
Hume; Sunday (1991). "Fast String Searching"
Wiki. "String-searching algorithm"

更改

[2021-02-23] 更新Sunday算法代码，将偏移表从unordered_map改为原生数组，unordered_map过重，直接用数组当做特殊的哈希表即可