[C++] Regular expressions (regex)

무한 나무 2024. 5. 26. 01:11

#include <regex>

Regular-expression-based searching and replacing in strings.

Regular expression symbols

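.	Matches any single character. For example, a.c → "abc" | "a2c" | "a&c", etc.
*	The preceding element repeats zero or more times. For example, a* → "" | "a" | "aa", etc.
+	The preceding element repeats one or more times. For example, a+ → "a" | "aa", etc. ("" does not match)
?	The preceding element appears zero times or once. For example, a? → "" | "a".
\\d	A digit character, 0 through 9. (The regex itself is \d; it is written \\d inside a C++ string literal.)
\\w	A word character: a letter, a digit, or an underscore (_).
\\b	A word boundary. \\b ~~~ \\b marks where a word starts and where it ends.
[xxx..]	Any one of the characters listed inside the brackets. For example, [abc] → "a" | "b" | "c".
[^xxx..]	Any one character not listed inside the brackets. For example, [^abc] → any character except "a", "b", and "c".
^x	A string or line that starts with "x". For example, ^a → a string that starts with 'a'.
x$	A string or line that ends with "x". For example, a$ → a string that ends with 'a'.
{m,n}	The preceding element repeats at least m and at most n times (no space inside the braces). e.g. a{1,3} → "a" | "aa" | "aaa"
{n}	The preceding element repeats exactly n times. e.g. a{2} → "aa"
{n,}	The preceding element repeats n or more times. e.g. a{1,} → "a" | "aa" | "aaa", etc. (same as a+)
()	Grouping: treats the enclosed pattern as a single unit. e.g. (\\d+001*)+ Capture group: the text matched by the parenthesized part is picked out and stored separately (read back through the match results of regex_match or regex_search).
|	Alternation (or).
\	Escape character. e.g. \. stands for a literal period, because an unescaped '.' means any single character. C++ string literals also have their own escapes: \\, \a, \b, \n, \f, \r, \t, \v (backslash, alert, backspace, newline, form feed, carriage return, horizontal tab, vertical tab), which is why a regex backslash is written as \\ in an ordinary string literal.
[x-y]	A character range. e.g. [a-c] → any one of a, b, c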
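A few rows of the table can be checked directly with a small sketch. The helper names fullMatch and partialMatch are hypothetical and only wrap std::regex_match and std::regex_search with the default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches `expr`.
bool fullMatch(const std::string& expr, const std::string& text) {
    return std::regex_match(text, std::regex(expr));
}

// Hypothetical helper: true when some part of `text` matches `expr`.
bool partialMatch(const std::string& expr, const std::string& text) {
    return std::regex_search(text, std::regex(expr));
}
```

For instance, fullMatch("a{2}", "aaa") is false while fullMatch("a{2,}", "aaa") is true, and partialMatch("a$", "banana") is true while partialMatch("^a", "banana") is false.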

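[Examples]

Take the expression "(100+1+|01)+". (Do not put spaces around |; a space inside a regex is matched literally.)

100+ : "10" followed by one or more 0s
1+ : one or more 1s
01 : the literal string "01"
| : either the 100+1+ pattern or the 01 pattern
()+ : the grouped pattern above repeated one or more times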
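As a rough check of the breakdown above, the expression can be tried against a few strings. The function name matchesSignal is made up for this sketch; regex_match is used so the whole string has to fit the pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` consists only of repetitions
// of "100+1+" or "01".
bool matchesSignal(const std::string& text) {
    static const std::regex pattern("(100+1+|01)+");
    return std::regex_match(text, pattern);
}
```

With this sketch, "1001", "01", and "100101" match, while "10", "1000", and the empty string do not.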

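A regular expression for a mobile phone number written as <3-digit prefix, 4-digit middle number, 4-digit last number>

"\\b010-\\d{4}-\\d{4}\\b"

\\b ~ \\b : word boundaries, so the number is not part of a longer word or digit run
010- : the literal text "010-"
- : the literal "-" character
\\d{4} : exactly four digit characters

"01[0-6]{1}-\\d{3,4}-\\d{4}"

01 : the literal text "01"
[0-6]{1} : one digit character from 0 to 6 ({1} can be omitted)
\\d{3,4} : three or four digit characters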
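The two phone-number patterns can be combined into one sketch that pulls a number out of surrounding text. The function name findPhoneNumber is hypothetical, and merging the \\b anchors with the 01[0-6] form is an assumption made for this example, not something the patterns above require.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first phone number found in `text`,
// or an empty string when there is none.
std::string findPhoneNumber(const std::string& text) {
    static const std::regex pattern("\\b01[0-6]-\\d{3,4}-\\d{4}\\b");
    std::smatch match;
    return std::regex_search(text, match, pattern) ? match.str() : std::string();
}
```

Because of the trailing \\b, an input such as "010-1234-56789" yields an empty string: the four-digit group would have to stop in the middle of a digit run, where there is no word boundary.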

Regular expressions library (since C++11) - cppreference.com

std::regex pattern("Get|GetValue");
std::cmatch m;
std::regex_search ("GetValue", m, pattern); // returns true, and m[0] contains "Get"
std::regex_match ("GetValue", m, pattern); // returns true, and m[0] contains "GetValue"
std::regex_search ("GetValues", m, pattern); // returns true, and m[0] contains "Get"
std::regex_match ("GetValues", m, pattern); // returns false

regex

The regular expression object

The first step is to define a regular expression object.

std::regex pattern("db-\\d*-log\\.txt");

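Several regular expression grammars are available.

ECMAScript: closest to the grammar used by JavaScript and the .NET languages. (default)
basic: POSIX basic regular expressions (BRE).
extended: POSIX extended regular expressions (ERE).
awk: like extended, but with more escapes for non-printing characters.
grep: like basic, but a newline (\n) character also separates alternatives.
egrep: like extended, but a newline character also separates alternatives.

Several flags can be applied as well.

icase: ignores case when matching.
nosubs: treats marked sub-expressions (the parts in parentheses) as unmarked, so no sub-matches are stored.
optimize: constructing the regex object takes a little longer, but operations that use it run faster.
collate: uses locale-sensitive collation for ranges such as [a-z].
(Zero or more flags can be combined with one grammar to specify how the regex engine behaves.)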

// \d belongs to the ECMAScript grammar; the POSIX-based grep grammar does not define it.
std::regex pattern("db-\\d*-log\\.txt", std::regex::ECMAScript | std::regex::icase);
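To see what a flag changes in practice, here is a small sketch that builds the same file-name pattern with and without icase. The function name isLogFile and its ignoreCase parameter are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: checks a file name against "db-<digits>-log.txt",
// optionally ignoring case.
bool isLogFile(const std::string& name, bool ignoreCase) {
    // Exactly one grammar, optionally combined with flags via operator|.
    const auto flags = ignoreCase
        ? (std::regex::ECMAScript | std::regex::icase)
        : std::regex::ECMAScript;
    const std::regex pattern("db-\\d*-log\\.txt", flags);
    return std::regex_match(name, pattern);
}
```

With ignoreCase set, "DB-123-LOG.TXT" is accepted; without it, only the lowercase form matches.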

regex_match

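Matching a string.

A function that checks whether the entire string matches the regular expression. (returns bool)

<Example: checking whether file names follow the "db-(digits)-log.txt" format>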

#include <iostream>
#include <regex>
#include <string>
#include <vector>

using namespace std;

int main() 
{
  // File names to check.
  vector<string> file_names = {"db-123-log.txt", "db-124-log.txt",
                               "not-db-log.txt", "db-12-log.txt",
                               "db-12-log.jpg"};
                               
  regex pattern("db-\\d*-log\\.txt");
  
  for (const auto &file_name : file_names) 
  {
    // std::boolalpha prints bool values as false/true instead of 0/1.
    cout << file_name << ": " << std::boolalpha << regex_match(file_name, pattern) << endl;
  }
}

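Extracting parts of a matched string

How to pull out and store part of a string that matches the regular expression:

Wrap the part of the regular expression you want in a () capture group.
Pass an smatch variable to receive the results.

<Example: extracting the middle and last numbers from phone numbers>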

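#include <iostream>
#include <regex>
#include <string>
#include <vector>

using namespace std;

int main() {
    vector<string> phone_numbers = { "010-1234-5678", "000-123-4567",
                                     "011-1234-5567", "010-12345-6789",
                                     "123-4567-8901", "010-1234-567" };

    regex pattern("01[0-6]{1}-(\\d{3,4})-(\\d{4})");
    smatch match;  // holds the match results as std::string sub-matches

    for (const auto& number : phone_numbers) {
        if (regex_match(number, match, pattern)) {
            // match[0] is the whole match; match[1] and match[2] are the capture groups.
            for (size_t i = 0; i < match.size(); i++) {
                cout << "Match : " << match[i].str() << endl;
            }
            cout << "-----------------------\n";
        }
    }

    // Calling match.str() with no argument returns match[0], the entire matched string.
}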

regex_search

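Searching a string.

A function that checks whether some part of the string matches the regular expression. (returns bool)

<Example: checking whether a string contains a number and extracting it>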

#include <iostream>
#include <string>
#include <regex>

std::string extractNumbers(const std::string& input) {
    std::regex pattern("\\d+");  // Regular expression to match one or more digits
    std::smatch match;
    std::string result;

    // Search the input string for numbers
    if (std::regex_search(input, match, pattern)) {
        result = match.str();  // Extract the first matched number
    }

    return result;
}

int main() {
    std::string input = "The price is 42 dollars";
    std::string numbers = extractNumbers(input);
    
    if (!numbers.empty()) {
        std::cout << "Found number: " << numbers << std::endl;
    } else {
        std::cout << "No numbers found." << std::endl;
    }

    return 0;
}

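Searching a string repeatedly.

Calling regex_search() again on the same string simply returns the same first match every time.

To resume the search after the previous match, call match.suffix() and search the string it returns.

match.suffix() returns an ssub_match object that covers the original string from just after the match to the end.
An ssub_match object has a conversion operator to string.

<Example: finding and extracting every number in a string>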

#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> extractNumbers(const std::string& input) {
    std::regex pattern("\\d+");  // Regular expression to match one or more digits
    std::smatch match;
    std::vector<std::string> result;
    std::string check = input;  // Working copy that shrinks after each match

    // Keep searching the remaining text until no more numbers are found
    while (std::regex_search(check, match, pattern)) {
        result.push_back(match.str());
        check = match.suffix();  // Continue right after the previous match
    }

    return result;
}

int main() {
    std::string input = "The price is 42 or 75 dollars";
    std::vector<std::string> numbers = extractNumbers(input);

    if (!numbers.empty()) {
        for (const std::string& str : numbers)
            std::cout << "Found number: " << str << std::endl;
    }
    else {
        std::cout << "No numbers found." << std::endl;
    }

    return 0;
}

regex_iterator

An iterator makes repeated searching more convenient.

A default-constructed sregex_iterator represents the end of the sequence.

#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> extractNumbers(const std::string& input) {
    std::regex pattern("\\d+");  // Regular expression to match one or more digits
    std::sregex_iterator currentMatch(input.begin(), input.end(), pattern);
    std::sregex_iterator lastMatch;  // Default-constructed: end of sequence

    std::vector<std::string> matches;

    while (currentMatch != lastMatch) {
        matches.push_back(currentMatch->str());
        ++currentMatch;
    }

    return matches;  // Return every matched number
}

int main() {
    std::string input = "There are 42 apples and 23 oranges";
    std::vector<std::string> numbers = extractNumbers(input);

    if (!numbers.empty()) {
        for (const std::string& str : numbers)
            std::cout << "Found number: " << str << std::endl;
    }
    else {
        std::cout << "No numbers found." << std::endl;
    }

    return 0;
}
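Each element that an sregex_iterator yields is a full smatch, so capture groups remain available while iterating. The sketch below assumes a made-up "key=number" input format and a hypothetical helper name, extractPairs, to show the idea.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects every "key=number" pair found in `input`.
std::vector<std::pair<std::string, int>> extractPairs(const std::string& input) {
    // static: the regex must outlive the iterators that refer to it.
    static const std::regex pattern("(\\w+)=(\\d+)");
    std::vector<std::pair<std::string, int>> pairs;

    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& match = *it;
        // match[1] is the key, match[2] the digits; std::stoi throws
        // std::out_of_range if the number does not fit in an int.
        pairs.emplace_back(match[1].str(), std::stoi(match[2].str()));
    }
    return pairs;
}
```

Calling extractPairs("apples=42, oranges=23") gives two pairs, ("apples", 42) and ("oranges", 23); an input with no "key=number" text gives an empty vector.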