How do I iterate over the words of a string?

Asked 2023-09-20 20:18:01 View 671,902

How do I iterate over the words of a string composed of words separated by whitespace?

Note that I'm not interested in C string functions or that kind of character manipulation/access. I prefer elegance over efficiency. My current solution:

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main() {
    string s = "Somewhere down the road";
    istringstream iss(s);

    do {
        string subs;
        iss >> subs;
        cout << "Substring: " << subs << endl;
    } while (iss);
}
  • Dude... Elegance is just a fancy way to say "efficiency-that-looks-pretty" in my book. Don't shy away from using C functions and quick methods to accomplish anything just because it is not contained within a template ;) - anyone
  • while (iss) { string subs; iss >> subs; cout << "Substring: " << sub << endl; } - anyone
  • @Eduardo: that's wrong too... you need to test iss between trying to stream another value and using that value, i.e. string sub; while (iss >> sub) cout << "Substring: " << sub << '\n'; - anyone
  • Various options in C++ to do this by default: cplusplus.com/faq/sequences/strings/split - anyone
  • There's more to elegance than just pretty efficiency. Elegant attributes include low line count and high legibility. IMHO Elegance is not a proxy for efficiency but maintainability. - anyone

Answers

I use this to split string by a delimiter. The first puts the results in a pre-constructed vector, the second returns a new vector.

#include <string>
#include <sstream>
#include <vector>
#include <iterator>

template <typename Out>
void split(const std::string &s, char delim, Out result) {
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        *result++ = item;
    }
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, std::back_inserter(elems));
    return elems;
}

Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:

std::vector<std::string> x = split("one:two::three", ':');

Answered   2023-09-20 20:18:01

  • In order to avoid it skipping empty tokens, do an empty() check: if (!item.empty()) elems.push_back(item) - anyone
  • How about the delim contains two chars as ->? - anyone
  • @herohuyongtao, this solution only works for single char delimiters. - anyone
  • @JeshwanthKumarNK, it's not necessary, but it lets you do things like pass the result directly to a function like this: f(split(s, d, v)) while still having the benefit of a pre-allocated vector if you like. - anyone
  • Caveat: split("one:two::three", ':') and split("one:two::three:", ':') return the same value. - anyone

For what it's worth, here's another way to extract tokens from an input string, relying only on standard library facilities. It's an example of the power and elegance behind the design of the STL.

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string sentence = "And I feel fine...";
    istringstream iss(sentence);
    copy(istream_iterator<string>(iss),
         istream_iterator<string>(),
         ostream_iterator<string>(cout, "\n"));
}

Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.

vector<string> tokens;
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter(tokens));

... or create the vector directly:

vector<string> tokens{istream_iterator<string>{iss},
                      istream_iterator<string>{}};

Answered   2023-09-20 20:18:01

  • Is it possible to specify a delimiter for this? Like for instance splitting on commas? - anyone
  • @Jonathan: \n is not the delimiter in this case, it's the deliminer for outputting to cout. - anyone
  • This is a poor solution as it doesn't take any other delimiter, therefore not scalable and not maintable. - anyone
  • Actually, this can work just fine with other delimiters (though doing some is somewhat ugly). You create a ctype facet that classifies the desired delimiters as whitespace, create a locale containing that facet, then imbue the stringstream with that locale before extracting strings. - anyone
  • @Kinderchocolate "The string can be assumed to be composed of words separated by whitespace" - Hmm, doesn't sound like a poor solution to the question's problem. "not scalable and not maintable" - Hah, nice one. - anyone

A possible solution using Boost might be:

#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

This approach might be even faster than the stringstream approach. And since this is a generic template function it can be used to split other types of strings (wchar, etc. or UTF-8) using all kinds of delimiters.

See the documentation for details.

Answered   2023-09-20 20:18:01

  • Speed is irrelevant here, as both of these cases are much slower than a strtok-like function. - anyone
  • And for those who don't already have boost... bcp copies over 1,000 files for this :) - anyone
  • Warning, when given an empty string (""), this method return a vector containing the "" string. So add an "if (!string_to_split.empty())" before the split. - anyone
  • @Ian Embedded developers aren't all using boost. - anyone
  • as an addendum: I use boost only when I must, normally I prefer to add to my own library of code which is standalone and portable so that I can achieve small precise specific code, which accomplishes a given aim. That way the code is non-public, performant, trivial and portable. Boost has its place but I would suggest that its a bit of overkill for tokenising strings: you wouldnt have your whole house transported to an engineering firm to get a new nail hammered into the wall to hang a picture.... they may do it extremely well, but the prosare by far outweighed by the cons. - anyone
#include <vector>
#include <string>
#include <sstream>

int main()
{
    std::string str("Split me by whitespaces");
    std::string buf;                 // Have a buffer string
    std::stringstream ss(str);       // Insert the string into a stream

    std::vector<std::string> tokens; // Create vector to hold our words

    while (ss >> buf)
        tokens.push_back(buf);

    return 0;
}

Answered   2023-09-20 20:18:01

  • You can also split on other delimiters if you use getline in the while condition e.g. to split by commas, use while(getline(ss, buff, ',')). - anyone
  • I don't understand how this got 400 upvotes. This is basically the same as in OQ: use a stringstream and >> from it. Exactly what OP did even in revision 1 of the question history. - anyone

An efficient, small, and elegant solution using a template function:

template <class ContainerT>
void split(const std::string& str, ContainerT& tokens,
           const std::string& delimiters = " ", bool trimEmpty = false)
{
   std::string::size_type pos, lastPos = 0, length = str.length();
   
   using value_type = typename ContainerT::value_type;
   using size_type = typename ContainerT::size_type;
   
   while (lastPos < length + 1)
   {
      pos = str.find_first_of(delimiters, lastPos);
      if (pos == std::string::npos)
         pos = length;

      if (pos != lastPos || !trimEmpty)
         tokens.emplace_back(value_type(str.data() + lastPos,
               (size_type)pos - lastPos));

      lastPos = pos + 1;
   }
}

I usually choose to use std::vector<std::string> types as my second parameter (ContainerT)... but list<...> may sometimes be preferred over vector<...>.

It also allows you to specify whether to trim empty tokens from the results via a last optional parameter.

All it requires is std::string included via <string>. It does not use streams or the boost library explicitly but will be able to accept some of these types.

Also since C++-17 you can use std::vector<std::string_view> which is much faster and more memory-efficient than using std::string. Here is a revised version which also supports the container as a return type:

#include <vector>
#include <string_view>
#include <utility>
    
template < typename StringT,
           typename DelimiterT = char,
           typename ContainerT = std::vector<std::string_view> >
ContainerT split(StringT const& str, DelimiterT const& delimiters = ' ', bool trimEmpty = true, ContainerT&& tokens = {})
{
    typename StringT::size_type pos, lastPos = 0, length = str.length();

    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == StringT::npos)
            pos = length;

      if (pos != lastPos || !trimEmpty)
            tokens.emplace_back(str.data() + lastPos, pos - lastPos);

        lastPos = pos + 1;
    }

    return std::forward<ContainerT>(tokens);
}

Care has been taken not to make any unneeded copies.

This will allow for either:

for (auto const& line : split(str, '\n'))

Or:

auto& lines = split(str, '\n');

Both returning the default template container type of std::vector<std::string_view>.

To get a specific container type back, or to pass an existing container, use the tokens input parameter with either a typed initial container or an existing container variable:

auto& lines = split(str, '\n', false, std::vector<std::string>());

Or:

std::vector<std::string> lines;
split(str, '\n', false, lines);

Answered   2023-09-20 20:18:01

  • I'm quite a fan of this, but for g++ (and probably good practice) anyone using this will want typedefs and typenames: typedef ContainerT Base; typedef typename Base::value_type ValueType; typedef typename ValueType::size_type SizeType; Then to substitute out the value_type and size_types accordingly. - anyone
  • For those of us for whom the template stuff and the first comment are completely foreign, a usage example cmplete with required includes would be lovely. - anyone
  • Ahh well, I figured it out. I put the C++ lines from aws' comment inside the function body of tokenize(), then edited the tokens.push_back() lines to change the ContainerT::value_type to just ValueType and changed (ContainerT::value_type::size_type) to (SizeType). Fixed the bits g++ had been whining about. Just invoke it as tokenize( some_string, some_vector ); - anyone
  • Apart from running a few performance tests on sample data, primarily I've reduced it to as few as possible instructions and also as little as possible memory copies enabled by the use of a substring class that only references offsets/lengths in other strings. (I rolled my own, but there are some other implementations). Unfortunately there is not too much else one can do to improve on this, but incremental increases were possible. - anyone
  • That's the correct output for when trimEmpty = true. Keep in mind that "abo" is not a delimiter in this answer, but the list of delimiter characters. It would be simple to modify it to take a single delimiter string of characters (I think str.find_first_of should change to str.find_first, but I could be wrong... can't test) - anyone

Here's another solution. It's compact and reasonably efficient:

std::vector<std::string> split(const std::string &text, char sep) {
  std::vector<std::string> tokens;
  std::size_t start = 0, end = 0;
  while ((end = text.find(sep, start)) != std::string::npos) {
    tokens.push_back(text.substr(start, end - start));
    start = end + 1;
  }
  tokens.push_back(text.substr(start));
  return tokens;
}

It can easily be templatised to handle string separators, wide strings, etc.

Note that splitting "" results in a single empty string and splitting "," (ie. sep) results in two empty strings.

It can also be easily expanded to skip empty tokens:

std::vector<std::string> split(const std::string &text, char sep) {
    std::vector<std::string> tokens;
    std::size_t start = 0, end = 0;
    while ((end = text.find(sep, start)) != std::string::npos) {
        if (end != start) {
          tokens.push_back(text.substr(start, end - start));
        }
        start = end + 1;
    }
    if (end != start) {
       tokens.push_back(text.substr(start));
    }
    return tokens;
}

If splitting a string at multiple delimiters while skipping empty tokens is desired, this version may be used:

std::vector<std::string> split(const std::string& text, const std::string& delims)
{
    std::vector<std::string> tokens;
    std::size_t start = text.find_first_not_of(delims), end = 0;

    while((end = text.find_first_of(delims, start)) != std::string::npos)
    {
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delims, end);
    }
    if(start != std::string::npos)
        tokens.push_back(text.substr(start));

    return tokens;
}

Answered   2023-09-20 20:18:01

  • The first version is simple and gets the job done perfectly. The only change I would made would be to return the result directly, instead of passing it as a parameter. - anyone
  • The output is passed as a parameter for efficiency. If the result were returned it would require either a copy of the vector, or a heap allocation which would then have to be freed. - anyone
  • @AlecThomas: Even before C++11, wouldn't most compilers optimise away the return copy via NRVO? (+1 anyway; very succinct) - anyone
  • Out of all the answers this appears to be one of the most appealing and flexible. Together with the getline with a delimiter, although its a less obvious solution. Does the c++11 standard not have anything for this? Does c++11 support punch cards these days? - anyone
  • @LearnCocos2D Please don't alter the meaning of a post with an edit. This behaviour is by design. It is identical behaviour to Python's split operator. I'll add a note to make this clear. - anyone

This is my favorite way to iterate through a string. You can do whatever you want per word.

string line = "a line of text to iterate through";
string word;

istringstream iss(line, istringstream::in);

while( iss >> word )     
{
    // Do something on `word` here...
}

Answered   2023-09-20 20:18:01

  • Is it possible to declare word as a char? - anyone
  • Sorry abatishchev, C++ is not my strong point. But I imagine it would not be difficult to add an inner loop to loop through every character in each word. But right now I believe the current loop depends on spaces for word separation. Unless you know that there is only a single character between every space, in which case you can just cast "word" to a char... sorry I cant be of more help, ive been meaning to brush up on my C++ - anyone
  • if you declare word as a char it will iterate over every non-whitespace character. It's simple enough to try: stringstream ss("Hello World, this is*@#&$(@ a string"); char c; while(ss >> c) cout << c; - anyone
  • I don't understand how this got 140 upvotes. This is basically the same as in OQ: use a stringstream and >> from it. Exactly what OP did even in revision 1 of the question history. - anyone

This is similar to Stack Overflow question How do I tokenize a string in C++?. Requires Boost external library

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
    string text = "token  test\tstring";

    char_separator<char> sep(" \t");
    tokenizer<char_separator<char>> tokens(text, sep);
    for (const string& t : tokens)
    {
        cout << t << "." << endl;
    }
}

Answered   2023-09-20 20:18:01

  • Does this materialize a copy of all of the tokens, or does it only keep the start and end position of the current token? - anyone

I like the following because it puts the results into a vector, supports a string as a delim and gives control over keeping empty values. But, it doesn't look as good then.

#include <ostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;

vector<string> split(const string& s, const string& delim, const bool keep_empty = true) {
    vector<string> result;
    if (delim.empty()) {
        result.push_back(s);
        return result;
    }
    string::const_iterator substart = s.begin(), subend;
    while (true) {
        subend = search(substart, s.end(), delim.begin(), delim.end());
        string temp(substart, subend);
        if (keep_empty || !temp.empty()) {
            result.push_back(temp);
        }
        if (subend == s.end()) {
            break;
        }
        substart = subend + delim.size();
    }
    return result;
}

int main() {
    const vector<string> words = split("So close no matter how far", " ");
    copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
}

Of course, Boost has a split() that works partially like that. And, if by 'white-space', you really do mean any type of white-space, using Boost's split with is_any_of() works great.

Answered   2023-09-20 20:18:01

  • Finally a solution that is handling empty tokens correctly at both sides of the string - anyone

The STL does not have such a method available already.

However, you can either use C's strtok() function by using the std::string::c_str() member, or you can write your own. Here is a code sample I found after a quick Google search ("STL string split"):

void Tokenize(const string& str,
              vector<string>& tokens,
              const string& delimiters = " ")
{
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos     = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}

Taken from: http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html

If you have questions about the code sample, leave a comment and I will explain.

And just because it does not implement a typedef called iterator or overload the << operator does not mean it is bad code. I use C functions quite frequently. For example, printf and scanf both are faster than std::cin and std::cout (significantly), the fopen syntax is a lot more friendly for binary types, and they also tend to produce smaller EXEs.

Don't get sold on this "Elegance over performance" deal.

Answered   2023-09-20 20:18:01

  • I'm aware of the C string functions and I'm aware of the performance issues too (both of which I've noted in my question). However, for this specific question, I'm looking for an elegant C++ solution. - anyone
  • @Nelson LaQuet: Let me guess: Because strtok is not reentrant? - anyone
  • @Nelson don't ever pass string.c_str() to strtok! strtok trashes the input string (inserts '\0' chars to replace each foudn delimiter) and c_str() returns a non-modifiable string. - anyone
  • @Nelson: That array needs to be of size str.size() + 1 in your last comment. But I agree with your thesis that it's silly to avoid C functions for "aesthetic" reasons. - anyone
  • @paulm: No, the slowness of C++ streams is caused by facets. They're still slower than stdio.h functions even when synchronization is disabled (and on stringstreams, which can't synchronize). - anyone

Here is a split function that:

  • is generic
  • uses standard C++ (no boost)
  • accepts multiple delimiters
  • ignores empty tokens (can easily be changed)

    template<typename T>
    vector<T> 
    split(const T & str, const T & delimiters) {
        vector<T> v;
        typename T::size_type start = 0;
        auto pos = str.find_first_of(delimiters, start);
        while(pos != T::npos) {
            if(pos != start) // ignore empty tokens
                v.emplace_back(str, start, pos - start);
            start = pos + 1;
            pos = str.find_first_of(delimiters, start);
        }
        if(start < str.length()) // ignore trailing delimiter
            v.emplace_back(str, start, str.length() - start); // add what's left of the string
        return v;
    }
    

Example usage:

    vector<string> v = split<string>("Hello, there; World", ";,");
    vector<wstring> v = split<wstring>(L"Hello, there; World", L";,");

Answered   2023-09-20 20:18:01

  • You forgot to add to use list: "extremely inefficient" - anyone
  • @XanderTulip, can you be more constructive and explain how or why? - anyone
  • @XanderTulip: I assume you are referring to it returning the vector by value. The Return-Value-Optimization (RVO, google it) should take care of this. Also in C++11 you could return by move reference. - anyone
  • This can actually be optimized further: instead of .push_back(str.substr(...)) one can use .emplace_back(str, start, pos - start). This way the string object is constructed in the container and thus we avoid a move operation + other shenanigans done by the .substr function. - anyone
  • @zoopp yes. Good idea. VS10 didn't have emplace_back support when I wrote this. I will update my answer. Thanks - anyone

I have a 2 lines solution to this problem:

char sep = ' ';
std::string s="1 This is an example";

for(size_t p=0, q=0; p!=s.npos; p=q)
  std::cout << s.substr(p+(p!=0), (q=s.find(sep, p+1))-p-(p!=0)) << std::endl;

Then instead of printing you can put it in a vector.

Answered   2023-09-20 20:18:01

  • it's only a two-liner because one of those two lines is huge and cryptic... no one who actually has to read code ever, wants to read something like this, or would write it. contrived brevity is worse than tasteful verbosity. - anyone
  • You can even make it a one liner by putting everything on a single line! Isn't that wonderful? - anyone

Yet another flexible and fast way

template<typename Operator>
void tokenize(Operator& op, const char* input, const char* delimiters) {
  const char* s = input;
  const char* e = s;
  while (*e != 0) {
    e = s;
    while (*e != 0 && strchr(delimiters, *e) == 0) ++e;
    if (e - s > 0) {
      op(s, e - s);
    }
    s = e + 1;
  }
}

To use it with a vector of strings (Edit: Since someone pointed out not to inherit STL classes... hrmf ;) ) :

template<class ContainerType>
class Appender {
public:
  Appender(ContainerType& container) : container_(container) {;}
  void operator() (const char* s, unsigned length) { 
    container_.push_back(std::string(s,length));
  }
private:
  ContainerType& container_;
};

std::vector<std::string> strVector;
Appender v(strVector);
tokenize(v, "A number of words to be tokenized", " \t");

That's it! And that's just one way to use the tokenizer, like how to just count words:

class WordCounter {
public:
  WordCounter() : noOfWords(0) {}
  void operator() (const char*, unsigned) {
    ++noOfWords;
  }
  unsigned noOfWords;
};

WordCounter wc;
tokenize(wc, "A number of words to be counted", " \t"); 
ASSERT( wc.noOfWords == 7 );

Limited by imagination ;)

Answered   2023-09-20 20:18:01

Here's a simple solution that uses only the standard regex library

#include <regex>
#include <string>
#include <vector>

std::vector<string> Tokenize( const string str, const std::regex regex )
{
    using namespace std;

    std::vector<string> result;

    sregex_token_iterator it( str.begin(), str.end(), regex, -1 );
    sregex_token_iterator reg_end;

    for ( ; it != reg_end; ++it ) {
        if ( !it->str().empty() ) //token could be empty:check
            result.emplace_back( it->str() );
    }

    return result;
}

The regex argument allows checking for multiple arguments (spaces, commas, etc.)

I usually only check to split on spaces and commas, so I also have this default function:

std::vector<string> TokenizeDefault( const string str )
{
    using namespace std;

    regex re( "[\\s,]+" );

    return Tokenize( str, re );
}

The "[\\s,]+" checks for spaces (\\s) and commas (,).

Note, if you want to split wstring instead of string,

  • change all std::regex to std::wregex
  • change all sregex_token_iterator to wsregex_token_iterator

Note, you might also want to take the string argument by reference, depending on your compiler.

Answered   2023-09-20 20:18:01

  • This would have been my favourite answer, but std::regex is broken in GCC 4.8. They said that they implemented it correctly in GCC 4.9. I am still giving you my +1 - anyone
  • This is my favorite with minor changes: vector returned as reference as you said, and the arguments "str" and "regex" passed by references also. thx. - anyone
  • Raw strings are pretty useful while dealing with regex patterns. That way, you don't have to use the escape sequences... You can just use R"([\s,]+)". - anyone

Using std::stringstream as you have works perfectly fine, and do exactly what you wanted. If you're just looking for different way of doing things though, you can use std::find()/std::find_first_of() and std::string::substr().

Here's an example:

#include <iostream>
#include <string>

int main()
{
    std::string s("Somewhere down the road");
    std::string::size_type prev_pos = 0, pos = 0;

    while( (pos = s.find(' ', pos)) != std::string::npos )
    {
        std::string substring( s.substr(prev_pos, pos-prev_pos) );

        std::cout << substring << '\n';

        prev_pos = ++pos;
    }

    std::string substring( s.substr(prev_pos, pos-prev_pos) ); // Last word
    std::cout << substring << '\n';

    return 0;
}

Answered   2023-09-20 20:18:01

  • This only works for single character delimiters. A simple change lets it work with multicharacter: prev_pos = pos += delimiter.length(); - anyone

If you like to use boost, but want to use a whole string as delimiter (instead of single characters as in most of the previously proposed solutions), you can use the boost_split_iterator.

Example code including convenient template:

#include <iostream>
#include <vector>
#include <boost/algorithm/string.hpp>

template<typename _OutputIterator>
inline void split(
    const std::string& str, 
    const std::string& delim, 
    _OutputIterator result)
{
    using namespace boost::algorithm;
    typedef split_iterator<std::string::const_iterator> It;

    for(It iter=make_split_iterator(str, first_finder(delim, is_equal()));
            iter!=It();
            ++iter)
    {
        *(result++) = boost::copy_range<std::string>(*iter);
    }
}

int main(int argc, char* argv[])
{
    using namespace std;

    vector<string> splitted;
    split("HelloFOOworldFOO!", "FOO", back_inserter(splitted));

    // or directly to console, for example
    split("HelloFOOworldFOO!", "FOO", ostream_iterator<string>(cout, "\n"));
    return 0;
}

Answered   2023-09-20 20:18:01

Heres a regex solution that only uses the standard regex library. (I'm a little rusty, so there may be a few syntax errors, but this is at least the general idea)

#include <regex.h>
#include <string.h>
#include <vector.h>

using namespace std;

vector<string> split(string s){
    regex r ("\\w+"); //regex matches whole words, (greedy, so no fragment words)
    regex_iterator<string::iterator> rit ( s.begin(), s.end(), r );
    regex_iterator<string::iterator> rend; //iterators to iterate thru words
    vector<string> result<regex_iterator>(rit, rend);
    return result;  //iterates through the matches to fill the vector
}

Answered   2023-09-20 20:18:01

There is a function named strtok.

#include<string>
using namespace std;

vector<string> split(char* str,const char* delim)
{
    char* saveptr;
    char* token = strtok_r(str,delim,&saveptr);

    vector<string> result;

    while(token != NULL)
    {
        result.push_back(token);
        token = strtok_r(NULL,delim,&saveptr);
    }
    return result;
}

Answered   2023-09-20 20:18:01

  • strtok is from the C standard library, not C++. It is not safe to use in multithreaded programs. It modifies the input string. - anyone
  • Because it stores the char pointer from the first call in a static variable, so that on the subsequent calls when NULL is passed, it remembers what pointer should be used. If a second thread calls strtok when another thread is still processing, this char pointer will be overwritten, and both threads will then have incorrect results. mkssoftware.com/docs/man3/strtok.3.asp - anyone
  • as mentioned before strtok is unsafe and even in C strtok_r is recommended for use - anyone
  • strtok_r can be used if you are in a section of code that may be accessed. this is the only solution of all of the above that isn't "line noise", and is a testament to what, exactly, is wrong with c++ - anyone
  • strtok is evil. It treats two delimiters as a single delimiter if there is nothing between them. - anyone

C++20 finally blesses us with a split function. Or rather, a range adapter. Godbolt link.

#include <iostream>
#include <ranges>
#include <string_view>

namespace ranges = std::ranges;
namespace views = std::views;

using str = std::string_view;

auto view =
    "Multiple words"
    | views::split(' ')
    | views::transform([](auto &&r) -> str {
        return str(r.begin(), r.end());
    });

auto main() -> int {
    for (str &&sv : view) {
        std::cout << sv << '\n';
    }
}

Answered   2023-09-20 20:18:01

Using std::string_view and Eric Niebler's range-v3 library:

https://wandbox.org/permlink/kW5lwRCL1pxjp2pW

#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"
#include "range/v3/algorithm.hpp"

int main() {
    std::string s = "Somewhere down the range v3 library";
    ranges::for_each(s  
        |   ranges::view::split(' ')
        |   ranges::view::transform([](auto &&sub) {
                return std::string_view(&*sub.begin(), ranges::distance(sub));
            }),
        [](auto s) {std::cout << "Substring: " << s << "\n";}
    );
}

By using a range for loop instead of ranges::for_each algorithm:

#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"

int main()
{
    std::string str = "Somewhere down the range v3 library";
    for (auto s : str | ranges::view::split(' ')
                      | ranges::view::transform([](auto&& sub) { return std::string_view(&*sub.begin(), ranges::distance(sub)); }
                      ))
    {
        std::cout << "Substring: " << s << "\n";
    }
}

Answered   2023-09-20 20:18:01

  • Yepp, the range for based looks better - I agree - anyone

The stringstream can be convenient if you need to parse the string by non-space symbols:

string s = "Name:JAck; Spouse:Susan; ...";
string dummy, name, spouse;

istringstream iss(s);
getline(iss, dummy, ':');
getline(iss, name, ';');
getline(iss, dummy, ':');
getline(iss, spouse, ';')

Answered   2023-09-20 20:18:01

So far I used the one in Boost, but I needed something that doesn't depends on it, so I came to this:

static void Split(std::vector<std::string>& lst, const std::string& input, const std::string& separators, bool remove_empty = true)
{
    std::ostringstream word;
    for (size_t n = 0; n < input.size(); ++n)
    {
        if (std::string::npos == separators.find(input[n]))
            word << input[n];
        else
        {
            if (!word.str().empty() || !remove_empty)
                lst.push_back(word.str());
            word.str("");
        }
    }
    if (!word.str().empty() || !remove_empty)
        lst.push_back(word.str());
}

A good point is that in separators you can pass more than one character.

Answered   2023-09-20 20:18:01

Short and elegant

#include <vector>
#include <string>
using namespace std;

vector<string> split(string data, string token)
{
    vector<string> output;
    size_t pos = string::npos; // size_t to avoid improbable overflow
    do
    {
        pos = data.find(token);
        output.push_back(data.substr(0, pos));
        if (string::npos != pos)
            data = data.substr(pos + token.size());
    } while (string::npos != pos);
    return output;
}

can use any string as delimiter, also can be used with binary data (std::string supports binary data, including nulls)

using:

auto a = split("this!!is!!!example!string", "!!");

output:

this
is
!example!string

Answered   2023-09-20 20:18:01

  • I like this solution because it allows the separator to be a string and not a char, however, it is modifying in place the string, so it is forcing the creation of a copy of the original string. - anyone

I've rolled my own using strtok and used boost to split a string. The best method I have found is the C++ String Toolkit Library. It is incredibly flexible and fast.

#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>

const char *whitespace  = " \t\r\n\f";
const char *whitespace_and_punctuation  = " \t\r\n\f;,=";

int main()
{
    {   // normal parsing of a string into a vector of strings
        std::string s("Somewhere down the road");
        std::vector<std::string> result;
        if( strtk::parse( s, whitespace, result ) )
        {
            for(size_t i = 0; i < result.size(); ++i )
                std::cout << result[i] << std::endl;
        }
    }

    {  // parsing a string into a vector of floats with other separators
        // besides spaces

        std::string s("3.0, 3.14; 4.0");
        std::vector<float> values;
        if( strtk::parse( s, whitespace_and_punctuation, values ) )
        {
            for(size_t i = 0; i < values.size(); ++i )
                std::cout << values[i] << std::endl;
        }
    }

    {  // parsing a string into specific variables

        std::string s("angle = 45; radius = 9.9");
        std::string w1, w2;
        float v1, v2;
        if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
        {
            std::cout << "word " << w1 << ", value " << v1 << std::endl;
            std::cout << "word " << w2 << ", value " << v2 << std::endl;
        }
    }

    return 0;
}

The toolkit has much more flexibility than this simple example shows but its utility in parsing a string into useful elements is incredible.

Answered   2023-09-20 20:18:01

This answer takes the string and puts it into a vector of strings. It uses the boost library.

#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

Answered   2023-09-20 20:18:01

I made this because I needed an easy way to split strings and C-based strings. Hopefully someone else can find it useful as well. Also, it doesn't rely on tokens, and you can use fields as delimiters, which is another key I needed.

I'm sure there are improvements that can be made to even further improve its elegance, and please do by all means.

StringSplitter.hpp:

#include <vector>
#include <iostream>
#include <string.h>

using namespace std;

class StringSplit
{
private:
    void copy_fragment(char*, char*, char*);
    void copy_fragment(char*, char*, char);
    bool match_fragment(char*, char*, int);
    int untilnextdelim(char*, char);
    int untilnextdelim(char*, char*);
    void assimilate(char*, char);
    void assimilate(char*, char*);
    bool string_contains(char*, char*);
    long calc_string_size(char*);
    void copy_string(char*, char*);

public:
    vector<char*> split_cstr(char);
    vector<char*> split_cstr(char*);
    vector<string> split_string(char);
    vector<string> split_string(char*);
    char* String;
    bool do_string;
    bool keep_empty;
    vector<char*> Container;
    vector<string> ContainerS;

    StringSplit(char * in)
    {
        String = in;
    }

    StringSplit(string in)
    {
        size_t len = calc_string_size((char*)in.c_str());
        String = new char[len + 1];
        memset(String, 0, len + 1);
        copy_string(String, (char*)in.c_str());
        do_string = true;
    }

    ~StringSplit()
    {
        for (int i = 0; i < Container.size(); i++)
        {
            if (Container[i] != NULL)
            {
                delete[] Container[i];
            }
        }
        if (do_string)
        {
            delete[] String;
        }
    }
};

StringSplitter.cpp:

#include <string.h>
#include <iostream>
#include <vector>
#include "StringSplit.hpp"

using namespace std;

void StringSplit::assimilate(char*src, char delim)
{
    int until = untilnextdelim(src, delim);
    if (until > 0)
    {
        char * temp = new char[until + 1];
        memset(temp, 0, until + 1);
        copy_fragment(temp, src, delim);
        if (keep_empty || *temp != 0)
        {
            if (!do_string)
            {
                Container.push_back(temp);
            }
            else
            {
                string x = temp;
                ContainerS.push_back(x);
            }

        }
        else
        {
            delete[] temp;
        }
    }
}

void StringSplit::assimilate(char*src, char* delim)
{
    int until = untilnextdelim(src, delim);
    if (until > 0)
    {
        char * temp = new char[until + 1];
        memset(temp, 0, until + 1);
        copy_fragment(temp, src, delim);
        if (keep_empty || *temp != 0)
        {
            if (!do_string)
            {
                Container.push_back(temp);
            }
            else
            {
                string x = temp;
                ContainerS.push_back(x);
            }
        }
        else
        {
            delete[] temp;
        }
    }
}

long StringSplit::calc_string_size(char* _in)
{
    long i = 0;
    while (*_in++)
    {
        i++;
    }
    return i;
}

bool StringSplit::string_contains(char* haystack, char* needle)
{
    size_t len = calc_string_size(needle);
    size_t lenh = calc_string_size(haystack);
    while (lenh--)
    {
        if (match_fragment(haystack + lenh, needle, len))
        {
            return true;
        }
    }
    return false;
}

bool StringSplit::match_fragment(char* _src, char* cmp, int len)
{
    while (len--)
    {
        if (*(_src + len) != *(cmp + len))
        {
            return false;
        }
    }
    return true;
}

int StringSplit::untilnextdelim(char* _in, char delim)
{
    size_t len = calc_string_size(_in);
    if (*_in == delim)
    {
        _in += 1;
        return len - 1;
    }

    int c = 0;
    while (*(_in + c) != delim && c < len)
    {
        c++;
    }

    return c;
}

int StringSplit::untilnextdelim(char* _in, char* delim)
{
    int s = calc_string_size(delim);
    int c = 1 + s;

    if (!string_contains(_in, delim))
    {
        return calc_string_size(_in);
    }
    else if (match_fragment(_in, delim, s))
    {
        _in += s;
        return calc_string_size(_in);
    }

    while (!match_fragment(_in + c, delim, s))
    {
        c++;
    }

    return c;
}

void StringSplit::copy_fragment(char* dest, char* src, char delim)
{
    if (*src == delim)
    {
        src++;
    }
        
    int c = 0;
    while (*(src + c) != delim && *(src + c))
    {
        *(dest + c) = *(src + c);
        c++;
    }
    *(dest + c) = 0;
}

void StringSplit::copy_string(char* dest, char* src)
{
    int i = 0;
    while (*(src + i))
    {
        *(dest + i) = *(src + i);
        i++;
    }
}

void StringSplit::copy_fragment(char* dest, char* src, char* delim)
{
    size_t len = calc_string_size(delim);
    size_t lens = calc_string_size(src);
    
    if (match_fragment(src, delim, len))
    {
        src += len;
        lens -= len;
    }
    
    int c = 0;
    while (!match_fragment(src + c, delim, len) && (c < lens))
    {
        *(dest + c) = *(src + c);
        c++;
    }
    *(dest + c) = 0;
}

vector<char*> StringSplit::split_cstr(char Delimiter)
{
    int i = 0;
    while (*String)
    {
        if (*String != Delimiter && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (*String == Delimiter)
        {
            assimilate(String, Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return Container;
}

vector<string> StringSplit::split_string(char Delimiter)
{
    do_string = true;
    
    int i = 0;
    while (*String)
    {
        if (*String != Delimiter && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (*String == Delimiter)
        {
            assimilate(String, Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return ContainerS;
}

vector<char*> StringSplit::split_cstr(char* Delimiter)
{
    int i = 0;
    size_t LenDelim = calc_string_size(Delimiter);

    while(*String)
    {
        if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (match_fragment(String, Delimiter, LenDelim))
        {
            assimilate(String,Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return Container;
}

vector<string> StringSplit::split_string(char* Delimiter)
{
    do_string = true;
    int i = 0;
    size_t LenDelim = calc_string_size(Delimiter);

    while (*String)
    {
        if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (match_fragment(String, Delimiter, LenDelim))
        {
            assimilate(String, Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return ContainerS;
}

Examples:

int main(int argc, char*argv[])
{
    StringSplit ss = "This:CUT:is:CUT:an:CUT:example:CUT:cstring";
    vector<char*> Split = ss.split_cstr(":CUT:");

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

Will output:

This
is
an
example
cstring

int main(int argc, char*argv[])
{
    StringSplit ss = "This:is:an:example:cstring";
    vector<char*> Split = ss.split_cstr(':');

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

int main(int argc, char*argv[])
{
    string mystring = "This[SPLIT]is[SPLIT]an[SPLIT]example[SPLIT]string";
    StringSplit ss = mystring;
    vector<string> Split = ss.split_string("[SPLIT]");

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

int main(int argc, char*argv[])
{
    string mystring = "This|is|an|example|string";
    StringSplit ss = mystring;
    vector<string> Split = ss.split_string('|');

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

To keep empty entries (by default empties will be excluded):

StringSplit ss = mystring;
ss.keep_empty = true;
vector<string> Split = ss.split_string(":DELIM:");

The goal was to make it similar to C#'s Split() method where splitting a string is as easy as:

String[] Split = 
    "Hey:cut:what's:cut:your:cut:name?".Split(new[]{":cut:"}, StringSplitOptions.None);

foreach(String X in Split)
{
    Console.Write(X);
}

I hope someone else can find this as useful as I do.

Answered   2023-09-20 20:18:01

Here's another way of doing it..

void split_string(string text,vector<string>& words)
{
  int i=0;
  char ch;
  string word;

  while(ch=text[i++])
  {
    if (isspace(ch))
    {
      if (!word.empty())
      {
        words.push_back(word);
      }
      word = "";
    }
    else
    {
      word += ch;
    }
  }
  if (!word.empty())
  {
    words.push_back(word);
  }
}

Answered   2023-09-20 20:18:01

  • I believe this could be optimized a bit by using word.clear() instead of word = "". Calling the clear method will empty the string but keep the already allocated buffer, which will be reused upon further concatenations. Right now a new buffer is created for every word, resulting in extra allocations. - anyone

Recently I had to split a camel-cased word into subwords. There are no delimiters, just upper characters.

#include <string>
#include <list>
#include <locale> // std::isupper

template<class String>
const std::list<String> split_camel_case_string(const String &s)
{
    std::list<String> R;
    String w;

    for (String::const_iterator i = s.begin(); i < s.end(); ++i) {  {
        if (std::isupper(*i)) {
            if (w.length()) {
                R.push_back(w);
                w.clear();
            }
        }
        w += *i;
    }

    if (w.length())
        R.push_back(w);
    return R;
}

For example, this splits "AQueryTrades" into "A", "Query" and "Trades". The function works with narrow and wide strings. Because it respects the current locale it splits "RaumfahrtÜberwachungsVerordnung" into "Raumfahrt", "Überwachungs" and "Verordnung".

Note std::upper should be really passed as function template argument. Then the more generalized from of this function can split at delimiters like ",", ";" or " " too.

Answered   2023-09-20 20:18:01

  • There have been 2 revs. That's nice. Seems as if my English had to much of a "German". However, the revisionist did not fixed two minor bugs maybe because they were obvious anyway: std::isupper could be passed as argument, not std::upper. Second put a typename before the String::const_iterator. - anyone
  • std::isupper is guaranteed to be defined only in <cctype> header (the C++ version of the C <ctype.h> header), so you must include that. This is like relying we can use std::string by using the <iostream> header instead of the <string> header. - anyone

What about this:

#include <string>
#include <vector>

using namespace std;

vector<string> split(string str, const char delim) {
    vector<string> v;
    string tmp;

    for(string::const_iterator i; i = str.begin(); i <= str.end(); ++i) {
        if(*i != delim && i != str.end()) {
            tmp += *i; 
        } else {
            v.push_back(tmp);
            tmp = ""; 
        }   
    }   

    return v;
}

Answered   2023-09-20 20:18:01

  • This is the best answer here, if you only want to split on a single delimiter character. The original question wanted to split on whitespace though, meaning any combination of one or more consecutive spaces or tabs. You have actually answered stackoverflow.com/questions/53849 - anyone

I like to use the boost/regex methods for this task since they provide maximum flexibility for specifying the splitting criteria.

#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main() {
    std::string line("A:::line::to:split");
    const boost::regex re(":+"); // one or more colons

    // -1 means find inverse matches aka split
    boost::sregex_token_iterator tokens(line.begin(),line.end(),re,-1);
    boost::sregex_token_iterator end;

    for (; tokens != end; ++tokens)
        std::cout << *tokens << std::endl;
}

Answered   2023-09-20 20:18:01