一次遍历找出“出现次数最多的子串”_统计出现次数最多的子串-CSDN博客

本文链接：https://blog.csdn.net/rushman/article/details/2328713

昨天，我写了一篇关于求出现次数最多的子字符串的算法及实现。网友 yy8354 对原文所述算法的效率提出置疑，这使我有了更进一步的想法。

原文，对出现次数最多的子串作出了一些归纳（详情见原文）。然而，进一步的思考，我们会有更多的发现。

设 RS 为所有符合条件的子串的集合。则在结果集 RS 中，必然存一个子集 SS，且 SS 满足：

SS 中的所有字符串都不是 OS* 中任一字符串的子串，
OS* 中任一字符串均为 SS 中某一字符串的子串，
所有的字符在 SS 的各个字符串中总共只出现一次。

_{* 其中 OS 为 SS 的补集。}

以此，重新设计算法：

扫描输入的每个字符，记录每个字符 C 的出现次数，以及对该字符的后续字符 NC 的关联次数。
扫描统计数据中每个字符的出现次数，找出最大值 M。
再次扫描统计数据，把出现次数满足 M 的字符 C 放入集合 RS 中。
如果 C 存在同样满足 M 的后续字符 NC，则把该 NC 作为 C 的关联数据保存至 RS 中。

至此，集合 RS 已经唯一确定了满足最大出现次数的所有字符串。

程序实现：

 
    
   #pragma 
    warning(disable : 4786) 
   
#include  
   < 
   iostream 
   > 
   
#include  
   < 
   string 
   > 
   
#include  
   < 
   map 
   > 
   

 
   using 
    std::cout;
 
   using 
    std::endl;
 
   using 
    std:: 
   string 
   ;
 
   using 
    std::map;

 
   struct 
    CHARINFO;

typedef map 
   < 
   char 
   ,  
   int 
   > 
    CUSED;
typedef map 
   < 
   char 
   , CHARINFO 
   > 
    TEXTINFO;
typedef map 
   < 
   char 
   ,  
   int 
   > 
    RESUSET;

 
   struct 
    CHARINFO 
   ... 
   {
    int count;
    CUSED cused;
    CHARINFO():count(0),cused()...{}
    int operator ++(int)...{return count++;}
} 
   ;

 
   int GetNextChar(const RESUSET& resuSet, char cc)...{
    RESUSET::const_iterator iter;
    iter = resuSet.find(cc);
    if(iter != resuSet.end())...{
        return iter->second;
    }
    return -1;
}

void main(void)...{
    TEXTINFO txtInfo;
    TEXTINFO::iterator iter;

    char cc, pc;
    pc = getc(stdin);
    for(txtInfo[pc].count++; !feof(stdin); pc = cc)...{
        cc = getc(stdin);
        if(cc == EOF) break;
        txtInfo[cc]++;
        txtInfo[pc].cused[cc]++;
    }

/**///  找出最大出现次数
    int maxCount;
    for(maxCount = 0, iter = txtInfo.begin(); iter != txtInfo.end(); iter++)...{
        if(maxCount < iter->second.count)maxCount = iter->second.count;
    }

/**///  生成结果
    RESUSET resuSet;
    CUSED::iterator cuseIter;
    for(iter = txtInfo.begin(); iter != txtInfo.end(); iter++)...{
        if(iter->second.count < maxCount)continue;
        resuSet[iter->first] = -1;
        for(cuseIter  = txtInfo[iter->first].cused.begin();
            cuseIter != txtInfo[iter->first].cused.end();
            cuseIter++)
        ...{
            if(cuseIter->second < maxCount)continue;
            resuSet[iter->first] = cuseIter->first;
        }
    }

/**///  输出结果
    cout<<endl<<"The result substrings :"<<endl;
    cout<<"-------------------------------------"<<endl;
    
    RESUSET::iterator riter;
    string item;
    int resultCount = 0;
    int nc;
    for(riter = resuSet.begin(); riter != resuSet.end(); riter++)...{
        item = string(1, riter->first);
        resultCount++;
        cout<<'"'<<item<<'"'<<endl;
        for(nc = GetNextChar(resuSet, riter->first); nc != -1; nc = GetNextChar(resuSet, nc))...{
            item += char(nc);
            resultCount++;
            cout<<'"'<<item<<'"'<<endl;
        }
    }

    cout<<"-------------------------------------"<<endl;
    cout<<"Total : "<<resultCount<<endl;
}
 
  

测试输入：

abcdefg
bcdef
hijklmnopqrstabcdefg

输出：

The result substrings :
-------------------------------------
"
"
"b"
"bc"
"bcd"
"bcde"
"bcdef"
"c"
"cd"
"cde"
"cdef"
"d"
"de"
"def"
"e"
"ef"
"f"
-------------------------------------
Total : 16

再次感谢 yy8354，尽管我不赞同对原算法时间复杂度为O(n*n)的结论。