regex - Regular Expression for Digits and Special Characters - C#
I use html-agility-pack to extract information from websites. In the process I get the data in the form of strings, and I use that data in my program.
Sometimes the data includes multiple details in a single string. For example, the name of a movie comes back as "dog eats dog (2012) (2012)". The name should have been "dog eats dog (2012)" rather than the first one.
The above is one example of many. To correct the issue I tried using the string.Distinct() method to remove the duplicates in the string. In the above example it returns "dog eats (2012)". That solved the initial problem by removing the second (2012), but created a new one by changing the actual title.
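To see why Distinct() damages the title, here is a small sketch. It assumes the call was applied to the whitespace-separated words of the string (the post does not show the actual code), since that is what reproduces the "dog eats (2012)" result quoted above. The name DistinctWords is hypothetical.

```csharp
using System;
using System.Linq;

public static class DistinctDemo
{
    // Hypothetical helper: splits on spaces and keeps only the first
    // occurrence of each word, wherever it appears in the string.
    public static string DistinctWords(string s) =>
        string.Join(" ", s.Split(' ').Distinct());

    public static void Main()
    {
        // The second "dog" is dropped along with the second "(2012)".
        Console.WriteLine(DistinctWords("dog eats dog (2012) (2012)"));
        // prints: dog eats (2012)
    }
}
```

Distinct() has no notion of "adjacent" or "at the end", so it cannot tell a repeated year apart from a word that legitimately appears twice in the title.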
I thought the problem could be solved with a regex, but I have no idea how to use one here. As far as I know, if I use a regex it will tell me whether there are duplicate items in the string, according to the regex defined in my code.
But how do I remove it? There can be a string like "meme 2013 (2013) (2013)". Here the actual title is "meme 2013", followed by the year (2013) and a duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of a method to remove the duplicate substring.
The duplicate year comes at the end of the string. So what regex should I use to determine that a string has two years in it, like (2012) (2012)?
If I can correctly identify that the string contains a duplicate, then maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is a better way, please let me know.
Thanks.
The right regex is "( \(\d{4}\))\1+".
string pattern = @"( \(\d{4}\))\1+";
s = new Regex(pattern).Replace(s, "$1"); // needs: using System.Text.RegularExpressions;
Example here: https://repl.it/evcy/2
Explanation:
Capture one " (dddd)" block, then remove all the identical ones that follow it.
( \(\d{4}\)) is the capture, and \1+ finds a non-empty sequence of copies of the captured block.
Finally, the replacement swaps the initial block and its copies for the initial block alone.
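As a rough, self-contained sketch of the same idea, the pattern can be wrapped in a small helper and run against both titles from the question. The class and method names (TitleCleaner, RemoveDuplicateYear, HasDuplicateYear) are made up for illustration; only the pattern and the "$1" replacement come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // One " (dddd)" block followed by one or more identical copies of it.
    private static readonly Regex RepeatedYear = new Regex(@"( \(\d{4}\))\1+");

    // Hypothetical helper: collapses a run of repeated year blocks into one.
    public static string RemoveDuplicateYear(string title) =>
        RepeatedYear.Replace(title, "$1");

    // Hypothetical helper: the yes/no check the question asks about.
    public static bool HasDuplicateYear(string title) =>
        RepeatedYear.IsMatch(title);

    public static void Main()
    {
        Console.WriteLine(RemoveDuplicateYear("dog eats dog (2012) (2012)")); // dog eats dog (2012)
        Console.WriteLine(RemoveDuplicateYear("meme 2013 (2013) (2013)"));    // meme 2013 (2013)
        Console.WriteLine(RemoveDuplicateYear("meme 2013 (2013)"));           // unchanged
        Console.WriteLine(HasDuplicateYear("meme 2013 (2013) (2013)"));       // True
    }
}
```

The bare "2013" in "meme 2013" is left alone because the pattern requires the parentheses. The pattern is not anchored, so it would also collapse a repeated year in the middle of a string; appending $ to the pattern should restrict it to the end of the string if that matters.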