Comment Spam and False Positives
Every successful social media system I have ever been involved with has had to tackle the problem of spam. It is one of the signs that you are successful. When the spammers start targeting you, it is a sign you have arrived.
Over the years Disqus has had to fight comment spam and they've done a pretty good job of it. Their spam filters catch most of the comment spam. Occasionally one gets through and I manually delete it, most often via email with a reply with just the word "delete" in it (without the quotes).
In the past month, I've noticed a significant uptick in the amount of comment spam being targeted at the AVC comment threads. More is coming in and more is getting through. I asked the Disqus team about this a few weeks ago and they told me they are seeing a significant uptick in spam across all of their communities and they are dedicating additional development resources to fighting it.
One of the costs of tightening up the spam filters is you get false positives. And thanks to Harry Demott, I noticed this morning that a bunch of legit comments by AVC regulars had been marked as spam. I just went in and manually approved those comments and notified Disqus of this issue. I suspect they tightened something up in the past week a bit too much.
If you have been having trouble getting a comment to post in the past few days, this is likely the source of the issue. If it continues to happen, please let me know via email. I will make sure to visit the spam page in my Disqus moderation panel regularly for the next few days to make sure this isn't continuing to happen. And I am confident that Disqus will get this fixed in short order.
I have a spammer who visits my blog every day from China. Given the timing of posting the spam comments, it doesn’t look like a bot, yet the spammer never notices that he’s blacklisted and that all comments from his IP are rejected by the computer.
Evan….same story with a guy from China on my blog. Might even be the same guy.
Sorry, actually that was me, using IP over Avian Carrier via a Chinese proxy. With the distance, I didn’t expect so many to survive, so I deployed a bit of redundancy. Turns out, the birds all hitched a ride on a floating shipping container full of Oreos that had been dumped overboard into a fast-moving current, and they all made most of the journey quite quickly. They seem to have taken differing amounts of time for the last bit of the journey, though, so the comments are trickling in. It should stop in a few days.
hehehe, avian carrier last mile providers could be cheaper than cable
The term ‘false positive’ is correct.
Several problems in computing are essentially the same as statistical ‘hypothesis testing’. Examples include looking for spam, monitoring for selected, known problems, and monitoring for problems never seen before.
Usually, given a test, we can select the rate of false positives (false alarms, Type I error).
Then the ‘best’ detector gives that rate of false positives and also gives the lowest rate of false negatives (missed detections of actual spam).
Under some reasonable assumptions, how to construct the best detector has been known since the 1930s and is called the Neyman-Pearson lemma.
The argument for Neyman-Pearson goes: Think of false positives as money, and think of investing all that money and getting the greatest ROI for it. Then sort all the investing opportunities into descending order of return, start investing at the beginning of the sorted list, and continue until we run out of money. There are also calculus proofs. A fully general proof is from the Hahn decomposition from the Radon-Nikodym theorem.
Curiously, Neyman-Pearson can result in a knapsack problem, known to be NP-complete.
But in practice we do not always have all the data Neyman-Pearson requires.
There are many hypothesis tests.
Perhaps one way to do spam filtering is to borrow from some of the distribution-free (make no assumptions about probability distributions) tests. Good starts include:
Sidney Siegel, ‘Nonparametric Statistics for the Behavioral Sciences’, McGraw-Hill, New York.
E. L. Lehmann, ‘Nonparametrics: Statistical Methods Based on Ranks’, ISBN 0-8162-4994-6, Holden-Day, San Francisco.
Siegel has long been popular in the social sciences. Lehmann was long a leader in hypothesis test statistics.
the way I explain classification to non-mathy types is a specific gaussian example. consider two bell curves. if they overlap you’ll have to tolerate false alarms or leakers
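A minimal sketch of the two-bell-curve picture, assuming (purely for illustration) that some spam score is Gaussian with mean 0 for legitimate comments and mean 2 for spam, both with standard deviation 1. These numbers are invented, not taken from Disqus or any real filter; the point is only that moving the threshold trades false alarms against leakers.

```python
from statistics import NormalDist

# Hypothetical score distributions (illustrative assumptions only).
ham = NormalDist(mu=0.0, sigma=1.0)   # legitimate comments
spam = NormalDist(mu=2.0, sigma=1.0)  # spam comments


def error_rates(threshold):
    """Return (false_alarm_rate, leaker_rate) when scores above threshold are flagged."""
    false_alarms = 1.0 - ham.cdf(threshold)  # legitimate comments flagged as spam
    leakers = spam.cdf(threshold)            # spam that slips through
    return false_alarms, leakers


for t in (0.0, 1.0, 2.0, 3.0):
    fa, leak = error_rates(t)
    print(f"threshold={t:.1f}  false alarms={fa:.3f}  leakers={leak:.3f}")
```

Because the curves overlap, no threshold drives both rates to zero: raising it cuts false alarms and lets more spam leak, and lowering it does the reverse.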
Yes, take the two Gaussian curves.
Uh, a ‘Gaussian’ curve is the most important case of a ‘bell curve’ with a big bump in the middle and long tails on each side. Since the curve is a ‘probability density’, the area under the curve is 1.
Of the two Gaussian curves, let one curve be the distribution of some measurement from the posts when there is no spam, and let the other curve be the distribution when there is spam.
We pick a false alarm rate (probability of Type I error) we are willing to tolerate, say, 1%.
Neyman-Pearson says how to pick the 1% to get the best spam detection rate we can for only 1% false positives (false alarms).
Virginia is throwing a big picnic where she will cook brand A of hot dogs. She has $100 to spend on hot dogs. She checks prices on brand A at ten stores. She starts at the store with the cheapest price (most brand A hot dogs per dollar, best ratio of hot dogs to dollars) and buys all they have. Then she goes to the store with the next cheapest price and buys them out also. When she has spent all her $100, then she has all the hot dogs she could get for her $100 and stops buying (uh, right, in principle have to assume she spends exactly her $100 or we get into a knapsack problem which is NP-complete; absurd detail). Simple. And that’s the idea behind the proof of Neyman-Pearson for best hypothesis testing.
Here is how the intuitive explanation of Neyman-Pearson goes for your two Gaussian curves:
We regard the 1% as money. We are buying real estate under the curves; this real estate is where we will claim spam and raise an alarm.
We want to get the best ROI we can in this real estate buying.
So, we want to start buying where we get the best ratio we can of probability of spam to non-spam.
So, we take the ratio of the Gaussian (following your assumption) of the spam (what we want to buy) to the Gaussian of the non-spam (what we are paying for what we are buying), just the obvious thing to do, just as for the hot dogs.
We start buying real estate where this ratio is the highest. We keep buying until we have 1% of the non-spam messages, that is, until we have spent all our money.
The usual proof in a junior level course in mathematical statistics is in terms of calculus; can write a quite general proof using the Hahn decomposition, but that result needs prerequisites a bit beyond juniors!
If the data is discrete, then the situation is still closer to buying hot dogs, and we f’get about calculus. Yes, with discrete data, we can find some points where the probability of spam is positive and the probability of non-spam is zero. So, our ratio would be positive infinity. Right: Free hot dogs. No problem: Just like Virginia, we start our buying with all the free hot dogs we can get.
It is also possible, and, really, more common to do such tests without assuming a distribution, e.g., Gaussian. If the test makes no assumptions about the distribution, then it is ‘distribution-free’. The two books I mentioned are all about distribution-free tests. Right: These are also called ‘non-parametric’ because we don’t need ‘parameters’ such as mean and variance of a Gaussian.
Your Gaussian distribution was likely for data with just real number values. So, we were using just one dimensional data.
Can also do tests, and apply Neyman-Pearson, where the data is in several dimensions.
Yes, Neyman-Pearson wants the two distributions, and in several dimensions that is asking a bit much in practice (uh, a ‘distribution’ is something we believe exists but often can’t see in much detail!).
A Gaussian curve does not promise to be very relevant for spam detection; distribution-free would likely be more relevant; and so would several dimensions.
Tests that are both distribution-free AND multi-dimensional, and possibly relevant to better spam detection, require next semester!
Uh, we could have a finite group of measure preserving transformations and sum over the group. We could apply a classic result of Ulam (as in Teller-Ulam) to show we have better than a ‘trivial’ test. In the case of a continuous density, we should be able to get a cute ‘best possible’ asymptotic result. Why asymptotic? Because in the limit as the amount of data grows to infinity, what happens becomes simple and, thus, is an approximation to what happens with large amounts of data that we couldn’t know without more assumptions. Asymptotic results are especially relevant now that large amounts of data are so common.
With infinite groups of measure preserving transformations, we would likely be into ergodic theory: Liouville, Poincaré, Hopf, von Neumann, Birkhoff (the elder), and a nice proof by Garcia.
In http://www-history.mcs.st-a… Ulam is on the left and von Neumann on the right. The guy in the middle is just a physicist!
Poincaré did the ‘coffee cup’ theorem where pour cream into coffee, note where the cream is, stir, and if keep stirring long enough then the cream will return as close as we please to where it started. So, there’s a ‘glob’ of fixed volume wandering around and has to keep coming back closer and closer to where it started. Cute. Physicists, apparently now for black hole ‘information’, like the connection with the classic Liouville theorem.
As some humorous posts in this thread exemplify, what is ‘spam’ can be a bit tough to detect just using keywords.
Or, what is really spam has to do with meaning, and keywords make hash out of meaning. So, for processing text, for spam and more, it would be good to avoid just keywords. How to do that would be another course!
Uh, for Disqus, already have some data that promises to be good for spam detection: just user ID! Then, for more, with a spam voting button (‘user contributed content’) have a semi-automatic way to be more clear about the ‘validity’ of the spam voting. Then often also have user ID of people clicking on the button!
For new Disqus users, hmm …!
So, we have a problem, spam detection. We have quite a lot of data. A question is how to process that data to detect spam with low rates of both false positives (false alarms, Type I error) and false negatives (missed detections of spam, Type II error).
The processing is necessarily mathematically something, understood or not, powerful or not. For more powerful processing, sometimes some math can help. E.g., with the right assumptions, Neyman-Pearson tells us how to do the best possible processing. Yup, such work isn’t computer programming or computer science. Instead, it’s applied math. So, we have an example of how some applied math can be important for some work in ‘information technology entrepreneurship’. Is this the only such example? Nope!
As we continue in computer applications, to exploit the cheap cores, bytes, bandwidth, infrastructure software, we want to ‘automate’ work. Here we want the ‘automation’ to do high quality work, e.g., so we don’t have to get in and do things by hand again.
So, my view is that applied math has to be the ‘Moore’s law’ for the rest of this century: Often math lets us tell what we have just from some scribbles on paper.
Do very many people believe this? Nope! That’s a problem, and the flip side is an opportunity!
“Scribbles”?
It is still an unending source of surprise for me how a few scribbles on a blackboard or on a piece of paper can change the course of human affairs.
Stanislaw Ulam
Sorry computer science people.
But I repeat myself. A 200 foot yacht on Long Island Sound would be more convincing evidence. Back to it.
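For the discrete case, the hot dog buying argument above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `neyman_pearson_region`, the link-count outcomes, and every probability below are hypothetical, and the sketch simply stops at the first outcome it cannot afford rather than randomizing at the boundary as a full treatment of the lemma would.

```python
import math


def neyman_pearson_region(p_ham, p_spam, budget):
    """Greedily pick outcomes to flag as spam, best likelihood ratio first.

    p_ham and p_spam map each outcome to its probability under each hypothesis.
    budget is the false-alarm rate we are willing to spend.
    Only outcomes listed in p_spam are considered for flagging.
    """
    def ratio(outcome):
        ham = p_ham.get(outcome, 0.0)
        spam = p_spam.get(outcome, 0.0)
        if ham == 0.0:
            return math.inf if spam > 0.0 else 0.0  # the "free hot dogs"
        return spam / ham

    flagged, spent, detected = [], 0.0, 0.0
    for outcome in sorted(p_spam, key=ratio, reverse=True):
        cost = p_ham.get(outcome, 0.0)
        if spent + cost > budget + 1e-12:
            break  # out of money; a full treatment would randomize here
        flagged.append(outcome)
        spent += cost
        detected += p_spam[outcome]
    return flagged, spent, detected


# Hypothetical distributions of "number of links in a comment".
p_ham = {"0 links": 0.90, "1 link": 0.09, "2 links": 0.01, "3+ links": 0.0}
p_spam = {"0 links": 0.10, "1 link": 0.20, "2 links": 0.30, "3+ links": 0.40}

flagged, spent, detected = neyman_pearson_region(p_ham, p_spam, budget=0.01)
print(flagged, round(spent, 4), round(detected, 4))
```

With a 1% false-alarm budget, the sketch first takes the outcome that costs nothing under the non-spam distribution, then the next best ratio, and stops when the following outcome would overspend.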
What always amazes me about spammers of any kind is the effort they put in to get around systems (email and comment). What a waste of energy; just do something positive instead.
So true. My Chinese spammer probably isn’t making much with his spam. I could easily teach him some online skills that would make him a decent wage and we’d all benefit. Unfortunately, he keeps spending 5 minutes every day visiting my blog with nothing to show for it.
a favorite issue of mine because of how it relates to the collapse of everything……spammers are decentralized, if disqus counters with a centralized defense……i think they will lose. this basic issue of decentralized attack on centralized systems will be a huge issue, particularly for you fred IMHO, in light of your portfolio. i think federated systems will allow for decentralized defense.
also, i imagine disqus already knows this, but reputation presumably is an easy way to reduce the trigger of false positives…..i.e. i have over 1,000 comments here in fredland, i am one of the most prolific fredland commenters of all time, a hall of fame candidate likely to be inducted on my first ballot appearance. i didn’t just play the game. i changed it.
lol anyway factoring reputation will reduce false positives IMHO
i knew something was messed up when i saw one of your comments in the spam filter
decentralized to me means allowing all of you to identify and mark spam
IMHO to generate max returns, spammers will look for where a system is centralized, and design attack strategies accordingly. but on the bright side just imagine how badly crapple is going to get embarrassed with their extremely centralized, big brother approach…..of course they will use spammers/hackers to justify even more draconian policies….i.e. “we need to be big brother for reasons of national/network security” (just like they are doing with flash, although those who do not have YOUNGSTER tattooed on their forehead realize they just want to make sure all money goes through them instead of through flash apps)….all crapple needs to do is make sure they roll out a steady stream of slick propaganda to sell the masses on their tyrannical empire…..will history repeat? how far are we from false flag attacks from the empire of $teve? perhaps if they were more transparent and had a system of checks and balances it would be easier to trust them…….
that’s distributed moderation, and it was a killer feature of friendfeed
I would actually support this change. Sort of like a reddit vote up / vote down system, except behind the scenes. You could set a threshold where you are notified when a comment receives > 5 spam votes, and leave the rest to Disqus’ platform to separate the wheat from the chaff.
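A rough sketch of the behind-the-scenes threshold idea, assuming hypothetical names throughout (`SpamVotes`, `flag`, string IDs); nothing here reflects an actual Disqus API. Each user's vote counts once per comment, and the moderator is notified only at the moment a comment goes past 5 spam votes.

```python
from collections import defaultdict

SPAM_VOTE_THRESHOLD = 5  # from the comment above: notify at more than 5 spam votes


class SpamVotes:
    """Tracks reader spam votes per comment (illustrative only)."""

    def __init__(self, threshold=SPAM_VOTE_THRESHOLD):
        self.threshold = threshold
        self._voters = defaultdict(set)  # comment_id -> set of user ids

    def flag(self, comment_id, user_id):
        """Record one vote; return True if the moderator should be notified now."""
        voters = self._voters[comment_id]
        before = len(voters)
        voters.add(user_id)  # a repeat vote by the same user changes nothing
        after = len(voters)
        # True only on the vote that crosses the threshold, so one notification.
        return before <= self.threshold < after


votes = SpamVotes()
for i in range(7):
    if votes.flag("comment-42", f"user{i}"):
        print(f"notify moderator after vote {i + 1}")
```

Counting distinct voters rather than raw clicks is a small guard against one person hammering the button, though it does nothing against spammers with many accounts.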
i love it, that is exactly what we need
it is what twitter needs too btw
I know I overfollow people (I like knowing about lots of things, food, books, business, tech, beauty, random people), that being said, I still want to know what makes for a spam follower
is there no way for you to whitelist regular commenters?
another thing i’d love to see disqus implement (from the old bbs days): individual kill files. not for any a vc commenters, of course, but it might be useful on *other* disqus blogs.
related: would love to see a vc commenters highlighted when i’m on other blogs. i sometimes recognize them on chris dixon’s blog.
The same spam flood on my tiny blog:
- Chinese characters in the name (probably hard to read by spam filters)
- “Legit”-looking, generic comment: “thank you for sharing”
- Dodgy link behind the name of the poster or in the comment
Cheap handbags cheap handbags here. Want to meet cool singles in your area?
Whoops, looks like I got past your filters. ;)
Seriously, it’s right that when you get spam it means you’ve arrived. Can’t wait. 😉
Aaaaaah…I thought I’d been censored 🙂
i don’t censor comments unless they are spam, porn, or hate speech
people have said some awful things about me in the comments and i let those comments stand
I wouldn’t hold it against you if you filtered some of the “me too” comments as long as you advertised that you did so.
I think the hardest spam to filter is that which is created by a human and individualized, but includes a link and smells of a “bait and switch”; i.e. as soon as the comment is approved the link turns into something spammy.
I think “me too’s” are actually interesting data. You can roughly quantify sentiment/support/opposition with Me Too’s.
When a whole bunch of people who don’t usually come out to speak come out of the shadows and say “Me Too”, that’s info you’d want to know.
I second this, it gets the shy people to come out. Also, the linguistics of me toos are sort of funny: many people don’t say the exact same thing, and sentiment analysis will get to the point where we start rating how much support those me toos are expressing on, say, a scale of 1-10. I would rather encourage than block.
I am curious about how big this problem is actually and whether there are good measures maintained by the security industry about the ebb and flow of spam tides. Akismet seems to do a near-perfect job on my WordPress blog, and I almost never go to check for false positives these days…
I’ve had failure problems with Akismet.
What good is a spam filter if you have to read through flagged comments?
I’m surprised spammers still bother to target comments.
The ultimate spam filter: charge 10 cents per comment.
Totally agree regarding having to read through spam filters. I do it even in Gmail and I usually find a couple of false positives each time.
I’ve read many people defending the micropayment system to prevent spam, but for email. I don’t like it (call me cheap!), but I recognize it could be a valid approach.
Both for comments and for email I would prefer tight filters but with a notification to the writer/sender if it doesn’t go through and a way to prove it’s not spam (something like a captcha). It would be a pain when you have to prove it, but knowing that if you didn’t receive a notification everything was fine is also great. With the current state of things you can’t assume that your email arrived in the inbox, so for important things you need to check. And that is a bigger pain.
I was just going to say the same thing. Charge $x per comment and if the moderator keeps it up for more than y days, refund the payment in full.
There are very few problems that don’t have a free-market solution, especially once micro-payments become the “norm”.
pffft. other way around, commenters should be getting a piece of the ad revenue (preferably via fredbucks….honest money for honest comments). blog stars are nothing without their fans! nothing!
I’m building my brand and following down here “below the fold” on AVC. It won’t be long before I lay down the “equity or I take my followers with me” gauntlet. shhhhh
LOL.
Planning an LBO of AVC, Andy?
Enough TLAs! (three-letter acronyms)
Since we all (or several of us) are being silly, perhaps this is the time to suggest an AVC franchise? It would not satisfy the need for blog domination being expressed by some, but a reasonable compromise?
this is one of them andy
you can’t charge for free speech or it isn’t free
they’ve tried this in email spam filtering and it was a massive failure that consumed tens of millions of dollars of wasted investment
the answer is reputation, like the kind of reputation everyone earns here every day, combined with a highly distributed crowdsourced spam detection and reporting system
People pay to publish their speech every day.
The email spam is probably a good lesson, I agree. I’m still not giving up on the “super-verified disqus” profile with credit-card attached, and the ability of a blog moderator to ding your card for comments that contain spam, or are really stupid or offensive.
I’d put that on my blog. I’d love it. I’d set up a pool where the best comment of the month (as judged by ME and ME alone) would receive the “taxes” collected from the spammers and generally stupid comments (again, as judged by ME).
Simple implementations of reputation can be defeated when spammers set up their own blogs and mark spam as ham, and ham as spam. There is no such thing as universal reputation.
How ’bout just make a commenter watch a quick ad and answer a question about it?
Then you are creating some valuable attention that is free to user and can charge out….and keeps spammers away. A trifecta.
Obviously for us Comment Trolls it would become annoying pretty quickly.
But never forget there are free ways to ‘monetize’.
If you have people pay for community norms, oddly, people feel less of a need to help and obey community norms- it becomes a transactional issue- why do it, when one can pay to perform a good/bad behavior.
I’ve seen interesting alternatives to cash proposed for this use.
Consider providing a computational task with a particular associated cost in terms of, for example, CPU cycles. Require verifiable completion of the calculation before accepting the comment. This raises the costs of comment spamming due to multiplication by the spam volume, without raising the costs to an unacceptable level for the average user. Find a way to leverage those compute cycles either for the common good or for profit, all the better.
I wonder whether or not the “cost” could be acceptable on mobile platforms while still, in aggregate, being expensive for high-volume spamming. Hmm… the wheels are spinning about mashing up mechanical turk and comment spam protection. 😉
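One way the computational-task idea is commonly sketched is a Hashcash-style proof of work: the commenter searches for a nonce whose hash falls below a target, and the site verifies it with a single hash. The function names `solve` and `verify` and the 12-bit difficulty are assumptions chosen for illustration; a real deployment would also need to bind the puzzle to a server-issued challenge so that solutions cannot be precomputed or reused.

```python
import hashlib
from itertools import count


def _digest_value(comment, nonce):
    """SHA-256 of the comment text plus nonce, as a big integer."""
    data = f"{comment}:{nonce}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def solve(comment, difficulty_bits):
    """Search for a nonce whose hash has at least difficulty_bits leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _digest_value(comment, nonce) < target:
            return nonce


def verify(comment, nonce, difficulty_bits):
    """Cheap check done by the site: a single hash."""
    return _digest_value(comment, nonce) < (1 << (256 - difficulty_bits))


comment = "Great post, Fred!"
nonce = solve(comment, difficulty_bits=12)   # roughly 4,000 hashes on average
print(nonce, verify(comment, nonce, difficulty_bits=12))
```

Each extra difficulty bit doubles the expected work for the sender while verification stays at one hash, which is the asymmetry the comment is after: negligible for one commenter, costly when multiplied by spam volume.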
I use Disqus and have noticed some spam coming in from people trying to increase their link-backs (sort of a black hat SEO technique?). They really have been getting more clever about it with how they hide their links. I mostly moderate my comments via e-mail, like you, so I was just getting the text of their message and not the actual HTML. Is there a feature I could enable that would e-mail me the HTML of the message content instead? That would be pretty good.
I had always thought that comment systems were set up to create all links as no-follow so they would not improve seo. Isn’t it like that? (my seo knowledge is quite basic, so excuse me if I’ve said something stupid)
That’s right, and that’s how blogs are set up by default, Fernando.
Hey everyone, sorry if your comments were caught up in the spam filter recently. Fred pointed this out to me and we’re taking a look.
Once I get some more details, I may want to write something longer about how it all works. Briefly though, this is likely the reason:
- There is an influx of spammers on a specific site
- A moderator will mark all of these as spam (as he should)
- If comment spam is not typical, this influx of new items marked as spam will tell the spam filters a lot of information. Common traits between posts may be picked up. Sometimes the wrong common trait is focused on
- That’s where false positives usually come in. A wrong common trait will be noteworthy to the filter and it’ll be suspicious of otherwise good comments
This behavior can be improved with tweaking, which we always do with more information like this.
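The failure mode described above can be reproduced with a toy word-presence naive Bayes filter. This is emphatically not how Disqus's filter works (the source does not say); `TinyFilter`, the training comments, and the burst size are all made up to show how a sudden batch of moderator-marked spam can teach a filter to distrust an innocent shared trait such as "thanks for sharing".

```python
import math
from collections import Counter


class TinyFilter:
    """Toy naive Bayes over word presence; an illustration, not Disqus's filter."""

    def __init__(self):
        self.docs = {"spam": 0, "ham": 0}
        self.words = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.docs[label] += 1
        self.words[label].update(set(text.lower().split()))

    def spam_log_odds(self, text):
        """Positive means the filter leans towards spam (only present words count)."""
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in set(text.lower().split()):
            # Laplace-smoothed chance that a comment of each class contains the word.
            p_spam = (self.words["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.docs["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score


f = TinyFilter()
for text in ["great post fred", "i disagree with this", "interesting point about spam"]:
    f.train(text, "ham")

# A moderator marks a sudden burst of near-identical spam.
for _ in range(20):
    f.train("thanks for sharing cheap handbags", "spam")

print(f.spam_log_odds("thanks for sharing this fred"))  # leans spam: a false positive
print(f.spam_log_odds("i disagree with fred"))          # leans legitimate
```

The legitimate "thanks for sharing this fred" scores as spam because the burst made "thanks", "for", and "sharing" look like spam traits; more legitimate training data containing those words, or a reputation signal for the author, would pull the score back.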
Any interest in a distributed reader flagging system suggested earlier in the thread?
I see you guys have a mouse over flag button bottom left, is this already used?
WHY am I suddenly not allowed to post on a number of the Independent’s pages? I’ve done absolutely nothing to break the rules. Please help.
This may foster too much democracy, but how about empowering some commenters to mark the spammers like other websites?
Thanks for the clarification. Glad I wasn’t the only one experiencing issues.
Well thank you Fred, Daniel, and Harry.
Kind of interesting, you know, it shows sort of the zen complexities of computing- “If a tree falls in the forest and no one hears it, did it fall?”
I’m sort of coming to a conclusion that “computing” must be “noticed” to take place on some level- if it is abstracted out to the background, we don’t care and don’t worry about the problem. Certain problems become more and more relevant as they become more and more “in our perception” and more “human” or “affecting humans”.
Hence why the worry about spam.
Must think on this.
Maybe Disqus needs to integrate with Mollom (a spam filtering solution)?
What I don’t understand is what economic value spammers get from adding comments to nofollow blogs? http://blog.oddhead.com/200…
I don’t know either. Traffic?
You can call me Pascal. ;)
PEG
http://card.biz/peg
An 85 y.o. woman was recently quoted as saying she would consider marrying again if the guy was handy with tools and had ED. Someone in the group thought that was contradictory. Oh, T, sometimes you bring out the worst in me.
Happy to, Donna!