Splitting Strings in C++

November 13, 2010 12:00

I can't tell you how many times I've come across problems with splitting strings in various languages. Whether it's splitting a string on all whitespace and getting only the text, splitting a string on commas and periods, or breaking up a string by line breaks, I've done it all. Sometimes there's an easy way to do what I need in the language I'm working in, but often there's not. I end up hacking together something that works in my particular situation and being satisfied enough to move on.

This has happened so many times that I finally decided to put together a simple method that does it all. It takes a string, splits it on any of the delimiters the caller provides, fills an array with the resulting pieces, and returns how many pieces it found. I haven't benchmarked it against existing library routines yet to see whether my solution is slower, and by how much, but I'll get to that another time. On to the solution:

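#include <cstring>

bool splitterContains(char toCheck, char* splitters, int numSplitters);

// Splits stringToSplit on any character in splitters and stores each word in toModify.
// Each stored word is allocated with new[]; the caller must delete[] every entry.
// toModify must have room for every word found; no bounds check is done here.
int splitString(char* stringToSplit, char* splitters, int numSplitters, char** toModify) {
   int size = 0;
   int curIndex = 0;
   bool parsingWord = false;
   // A word can never be longer than the input, so this size always fits.
   std::size_t bufSize = std::strlen(stringToSplit) + 1;
   char* curWord = new char[bufSize];
   for (int i = 0; stringToSplit[i] != '\0'; i++) {
      if (splitterContains(stringToSplit[i], splitters, numSplitters)) {
         if (parsingWord) {
            curWord[curIndex] = '\0';   // terminate the finished word
            toModify[size++] = curWord;
            curWord = new char[bufSize];
            curIndex = 0;
         }
         parsingWord = false;
      }
      else {
         parsingWord = true;
         curWord[curIndex++] = stringToSplit[i];
      }
   }
   if (parsingWord) {
      curWord[curIndex] = '\0';   // terminate the last word
      toModify[size++] = curWord;
   }
   else
      delete[] curWord;   // the last buffer was never used
   return size;
}

// Returns true if toCheck is one of the numSplitters delimiter characters.
bool splitterContains(char toCheck, char* splitters, int numSplitters) {
   for (int i = 0; i < numSplitters; i++)
      if (toCheck == splitters[i])
         return true;
   return false;
}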
The input parameters are as follows:
char* stringToSplit
   -Pointer to the null-terminated string to split
char* splitters
   -Pointer to the array of delimiter characters
int numSplitters
   -The number of delimiters in splitters
char** toModify
   -A pointer to an array of char pointers that receives the resulting words. An example of initialization for this parameter would be char* toModify[16]; the array must be large enough to hold every word found.
return value
   -Returns the number of words stored in toModify after the algorithm has run.
The method may be a tad dirty for now, but this is my initial solution. I'll do some benchmarking soon and try to make my algorithm as fast as, or faster than, the current library algorithms. Following is some sample input/output for the program. Enjoy!
char testString[] = "This 	 is   		  my		   test 	 	 string"; //Full of tabs and spaces between words
char* result[16];
char splitters[] = {' ', '\t'};
int size = splitString(testString, splitters, 2, result);
for (int i = 0; i < size; i++)  
   cout << result[i] << endl;

Output:
This
is
my
test
string

char testString[] = "This,,.....;is,another;test.string";
char* result[16];
char splitters[] = {',', '.',';'};
int size = splitString(testString, splitters, 3, result);
for (int i = 0; i < size; i++)  
   cout << result[i] << endl;

Output:
This
is
another
test
string
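
For comparison, here is a minimal sketch of the same idea (split on any character from a set of delimiters, skipping empty pieces) written with std::string and std::vector, so there is no manual memory to manage. The name splitToVector is hypothetical and not part of the source above, and this version has not been benchmarked against splitString.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): splits `input` on any
// character in `delimiters` and skips empty tokens, as splitString does.
std::vector<std::string> splitToVector(const std::string& input,
                                       const std::string& delimiters) {
    std::vector<std::string> tokens;
    // Find the first character that is not a delimiter.
    std::string::size_type start = input.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        // The token ends at the next delimiter, or at the end of the string.
        std::string::size_type end = input.find_first_of(delimiters, start);
        tokens.push_back(input.substr(start, end - start));
        // Skip the run of delimiters that follows this token.
        start = input.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

Calling splitToVector("This,,.....;is,another;test.string", ",.;") should give the same five words as the second sample above. The trade-off is one std::string allocation per word in exchange for not having to size the result array up front.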

Nov.15.2010
04:15

Adam Fairbanks

How about

string[] result = Regex.Split(value, "\r\n\.,; ");


Nov.15.2010
04:18

Adam Fairbanks

The r n and period within the quotes should have a backslash before them (the blog removed the slashes).


Nov.15.2010
09:12

Andrew Woolston

Funny you should mention that - I actually did some benchmarking originally, using regular expressions to split on any whitespace in a string. Running on around 350,000 strings, it took around 3 seconds. Using this self-written method, it took roughly half as long.

I know that most if not all languages have support for doing something like this; I just felt like seeing how well an algorithm like this would compare to the actual libraries in use.
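
A comparison like this could be reproduced with a small timing harness. The following is only a sketch, assuming a C++11 compiler: timeSplits, TimingResult and countTokens are hypothetical names, and any splitter (the method from the post, or a regex-based one) can be plugged in as the callable. It is not the harness that produced the numbers above.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in splitter: counts the tokens in `input` separated by
// any character in `delimiters`.
std::size_t countTokens(const std::string& input, const std::string& delimiters) {
    std::size_t count = 0;
    std::string::size_type start = input.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        ++count;
        std::string::size_type end = input.find_first_of(delimiters, start);
        start = input.find_first_not_of(delimiters, end);
    }
    return count;
}

struct TimingResult {
    std::size_t totalTokens;  // summed so the work is not optimized away
    double milliseconds;      // wall-clock time for the whole batch
};

// Runs `splitFn` (any callable taking a std::string and returning a token
// count) over every input and measures the elapsed wall-clock time.
template <typename SplitFn>
TimingResult timeSplits(const std::vector<std::string>& inputs, SplitFn splitFn) {
    auto begin = std::chrono::steady_clock::now();
    std::size_t total = 0;
    for (const std::string& s : inputs)
        total += splitFn(s);
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = end - begin;
    return {total, elapsed.count()};
}
```

For example, timeSplits(inputs, [](const std::string& s) { return countTokens(s, " \t"); }) times the stand-in splitter over a vector of strings. The results depend heavily on compiler flags and input shape, so each candidate should be run several times on the same data before drawing conclusions.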


Nov.15.2010
18:58

Andrew Woolston

By the way, thanks for pointing out the slash issue. That's my own algorithm messing up. I'll get that formatting issue taken care of too so line breaks are actually line breaks.