A Comment About Comment Spam
The AVC blog has been hit with a rash of comment spam in the past several weeks. Every day we take down at least five or ten of these annoying things. And some days, we take down a lot more.
The “we” is me, William, and Shana. They’ve been helping me moderate and manage the comments here at AVC for several years now. It’s an unpaid and mostly unnoticed job, and I really appreciate their help with this chore.
I honestly don’t understand why these spammers do this. The Disqus system has its own filters and very little of it actually gets through and into the comments. And the stuff that does get through gets taken down by us.
These past few weeks have been particularly bad. I am not sure if the problem is an increase in spam attacks or some sort of change in the Disqus filters. But whatever is causing the increase in spam, it’s pissing me off.
If you leave comments here regularly, you will certainly have experienced getting a reply that is spam. So you know how I feel about this stuff.
One of the many reasons I moved away from Typepad comments and to Disqus back in 2007 was to manage comment spam. I could not manage it on Typepad and the old AVC comments were full of the stuff. That’s why I left all of those old comments behind on Typepad when I moved to WordPress. I am not going to let that happen again.
The fight against The Internet Axis Of Evil never stops but it does change. When I wrote that post a decade ago, it was mostly about email spam and viruses. I don’t think so much about those scourges now. We’ve moved on to getting hacked and DDOS’ed and worse. The good thing about the Internet is you can do what you want on it. The bad thing about the Internet is you can do what you want on it.
So if you’ve noticed an increase in comment spam recently, you are correct in your observation. We are taking it down as fast as it comes in. Maybe the spammers will move on to somewhere more fruitful for them. Or maybe Disqus will tighten their filters a bit. Either way, we are going to keep this bar clean of that stuff. I promise you that.
I would only disagree on one point. The work of the team is noticed (at least in part). And appreciated. I presume the best way we can help is to flag, or is there more we can do?
Flagging it helps. It sends an email to moderators and there’s a threshold after which it gets removed automatically.
Can we flag spam comments from our email inboxes? (real question, wasn’t paying attention to all of the email options on the ones I received.)
I don’t believe you can, but the moderators can (that’s how we get most of them). But to do that, you’d need to have it configured to receive *every* comment from AVC, which is a lot of daily emails.
I wouldn’t wish that on any of you!!! Thanks, William.
That’s an important post. My friend makes $9,426 a week just by taking surveys http://m.youtube.com/watch?…
Charlie – you’re badass. Tell your friend he’s a badass too 🙂
I am laughing pretty hard right now
A sample punch on the face makes more than 10000 words.
reddit to the rescue?
WordPress with akismet – it works almost 100% perfect filtering out spam. I understand you invest in disqus, but if it really is pissing you off, you should really try my recommendation.
Disqus also works with Akismet.
what’s Akismet Jim?
A company founded by the same guy who founded WordPress, to address comment spam. This: http://akismet.com/
that’s like saying you should abandon your child and adopt another one who is prettier. i would never ever do that.
Maybe don’t abandon your child, but go meet that other “prettier” one and see how and why it behaves so well in the SPAM school, and then you could teach your own child those lessons… That’s what I meant about “trying”, not “moving”.
you can’t run two comment systems simultaneously. they both use Akismet so that’s not the answer
you can use them together
Perhaps one of the parents should suggest the children play nicely together, and maybe share their Legos. I haven’t been in WP world for 7 years now, but Akismet kicked ass back then. I did operate a free MU based platform and could not afford licensing Akismet for it – ended up writing my own solution – but I think Disqus could work out a licensing and data sharing deal. Akismet could become better when provided the Disqus data, assuming Disqus drives more comment engagement, and user collaboration in marking spam.
disqus uses akismet
“Honey, you aren’t as pretty as Dave’s daughter ( I wish you were) but that’s ok I will always love you”.
Perfect segue to have Disqus chime in. Is this important to them, and what is the plan to address it, if at all?
You have a career in politics ahead of you Jim ;) Since we are on the topic–can you answer if there are plans to remove the most annoying bug that Disqus has, the inability to remove a graphic once posted? This has stopped me from posting pics on any Disqus site.
I’ll have to get back to you on that one later once I have a chance to review with Product Mgmt on west coast.
Please post to the string. If this is a feature, it is quite evil.
Not an intended feature.
of course i know that! I just wish Disqus was more transparent. posting about the issues with comment spam before it becomes a groundswell, posting about this weird can’t delete bug, posting about your vision for product growth. in communities, people are amazingly forgiving to those that share (with smarts of course). amazing not so, when communications is responsive. my two cents.
Thanks, I agree with you more than you agree with yourself, as @JLM:disqus likes to say.
100% on that one. Must be a crapload of code they have to unravel to fix that one or some system design that prevents an easy fix. Other reason? Bigger problems or fish to fry with the amount of labor units they currently employ.
[I work at Disqus with Jim] This is something we’d like to improve, but as of today, you can delete an image by editing your comment and then removing the image URL. Keep in mind that you’ll need to have two or more characters ‘left over’ after you’ve removed the image URL. There are some nuances to this, so let me know if you have any questions.
Good info. I will try this. You (as in Disqus) need to start sharing these tidbits with the community at large. Or do you somewhere that I’m not noticing?
Some of this stuff is sprinkled throughout our help docs, but I don’t believe there’s one place where we have ‘tips and tricks.’
[I work for Disqus] A couple of months ago the service we used to supplement our spam checking was acquired and stopped providing spam checking services. 🙁 They were doing a great job but we had to switch to a new service provider for which we’re still working out the kinks. We’ve received feedback from other sites that are being affected, which has led to an increased concern from our Product and Engineering teams. We’ve recently made some additional changes to our spam checking logic in an attempt to get spam back under control. Appreciate your patience!
Sorry, but if you are in the business of providing a commenting platform, spam detection should be a top priority, a big part of your business. It should probably be one of your core competencies. I know for sure that one of the main selling points of Gmail was, and still is, its ability to handle spam. Even they are feeling some pressure; I’ve been noticing many false positives recently, and even the occasional spam message getting through. The same cloud platforms and technologies that are supporting terrific innovation are now also available to spammers. For some reason, they attract smart people too. Anywhere there’s a vulnerability, they will go.
I agree with you and it is a big priority.
Thanks, and let me be clear, I didn’t want to bash your hard work. It’s just that the way you worded your comment, it seemed like you relied (maybe a little too much) on a third party for what should be a core competency. That’s something that you need to make absolutely clear.
No idea of the specifics of how Disqus approaches it, but the nature of detecting spam is that the larger the body of data you have, the better you can do. As such it can make good technical sense for industry players to collaborate via a neutral third party.
You’re actually right when you said “It’s just that the way you worded your comment”, and that’s the reason that companies have PR departments to package things and don’t want employees speaking off message. I mean, imagine if you were in the hospital, had a bad outcome, and the nurse said: “A couple of months ago the service we used to supplement our infection control checking was acquired and stopped providing infection control checking services. 🙁 They were doing a great job but we had to switch to a new service provider for which we’re still working out the kinks.” Thanks! So let me know when I will be off the IV, Trapper John.
I try to be a straight talker. Not spinning a message here. Hope that’s OK.
Reading comments is a learning experience for some people. Quite possible that people reading Disqus comments aren’t aware of the downside of “not spinning” and being honest. I’m glad you are honest and don’t spin, obviously, and I definitely feel your personal pain at having to be on the front lines of the shit that I (and others like Arnold today) say about the company you work for. Otoh, one of the things that has always bugged me is how the higher-ups (who don’t feel the pain, like the airline execs vs. counter clerks) sometimes don’t take things as seriously as they should (because they aren’t feeling the pain). So your job (being on the front lines) is to effectively relay this to the management (which I’m sure you are doing) to let them know that issues that are raised about Disqus are real and need to be addressed. There are cases where you can air dirty laundry and cases where you shouldn’t. I wouldn’t want someone just hatched recently to think that it’s always good to not spin and air dirty laundry, because it’s not.
That’s ok! But I’m not talking about spinning messages or any other kind of PR BS. Please take it as well intentioned feedback regarding your original message 😉
Understood, by all means! Thanks.
I know an expert in Bayesian statistics if you need help… he does do consulting work, and probably could help you reverse-engineer your previous service.
Thanks. Will let you know.
You guys run a comments site. Spam (and the elimination of it) is at the core of Disqus’ business. It seems like the team should be creating its own best-in-class spam filter instead of renting someone else’s, no?
You’re right, spam comments can poison the conversations we’re helping publishers nurture. In the context of 1M comments a day (on peak days), it’s a small number of comments. But that doesn’t mean it’s not important; it is. So, we don’t farm this out. We complement our tech with tech that sees and analyzes spam across other platforms – bigger data set, more learnings. See my colleague, @Joe Dudas:disqus ‘s, comment elsewhere in this thread.
it is really complicated to do that. They should buy one. Most of these sorts of outfits (math first) tend to be hired guns who disappear quickly (I seem to know a lot of that type) or sucked into wallstreet, or into bigco. As a result, lots of these systems are good for maybe 5 years
[Daniel from Disqus] Using external services combined with in-house abuse detection has been our route for some time. There’s no perfection, sad to say, and sometimes the bad guys find a way to cut through the net for brief periods. It’s super important, like you said, and we feel the same way.
Is it really compulsory? Do all car makers need to make their own tyres? We all know cars don’t run without tyres :-). P.S. This was a statement made by an investor when I said “I wish to make my own lens for the imaging product”… well, in my case it was wise to procure the lens rather than spend another 2-3 years on optics research… and maybe fail at it later or produce something costlier.
You might be able to use some extra helpers @disqushelp on this one?
and I meant to attach an explanatory screen cap of my recent Twitter exchange on this topic…if it’s not there and you need it LMK.
Not sure I understand.
(sorry, I tried to attach a screen cap of my twitter exchange with @disqushelp, looks like it didn’t work.) When I asked via twitter about spam comments (at USV.com and pointsandfigures.com), they asked me to send them links. The weekend passed before I received that response, asking me to send links to the spam. I was pretty sure the mods had taken the spam down by then…
Social media is not our main support channel. It’s monitored, but publishers know to hit us up through their Disqus dashboards for the most responsive support.
Disqus Twitter profile is not quite this clear!
Hey @annelibby:disqus, @disqushelp is a good place to start for quick questions, but contacting our Support team directly is definitely the best bet for technical issues (this spam situation, included). Like others in this thread have mentioned, we’re making some changes on our end to address the increase of spam you’ve been seeing. If you still have links, please pass them along. Otherwise, we do have a list of examples we’re working with currently—so no problem if they’re not still up!
Thanks, Amanda. I’m not reading the thread today real time, just seeing the responses in my inbox. Here’s what I also asked @disqushelp: can you find the spam comments if I give you my disqus ID (in my case, my real name) and the blog name? If it hadn’t been for the affinity I feel for Disqus due to my engagement here, I wouldn’t have bothered reporting it at all. Too much friction, not enough time in the day! So I’m looking for the easiest way for me to be able to be a helpful citizen of Disqus… Thank you so much.
What was the old service that you used?
I don’t remember, but I think they were acquired.
There are just a few comments posted now, and a few are spam. We like to applaud the advent of intelligent assistants which chat and email as part of business processes. In the same way, spam must be passing some kind of Turing test now. It will probably get worse, ever harder to detect. Imagine the day when spammers learn the techniques of “media training”. Public figures like politicians and artists learn to take any question and answer with what they want to talk about – even if it’s a totally different subject. Spam robots may grow to the point where they may pass as if they’re actively participating in the discussion, while cleverly inserting their messages in the thread. I believe that the best way to handle this is some kind of reputation system, which requires long term identities. I know this was already tested and didn’t work very well due to privacy concerns and people’s desire for anonymity. At some point this balance may turn again, who knows?
Every user on Disqus has a reputation score. Please see my comment in the thread for a more detailed explanation.
disqus has a reputation system. all the regulars here have high reputations. but we can’t block all the comments from low reputations. which all the spammers have.
I wonder how Hacker News, WordPress, and YouTube deal with it. Auto flagging should work. Ironic… there is actually a spammy comment in this thread.
That’s one of the regular commenters trying to be funny, evidently.
Thanks…I now realize. 🙂
What is the point of comment spam? I get some too and the filters can’t catch it all. If one of these bogus comments gets through, what’s the benefit to the spammer? They are trying to get links in some cases, but won’t Google filter them out anyway since they must be a bogus site? I have never been able to figure out the point of all this digital defecation.
Math works in their favor. Cheap machine time, tons of spam, somebody is gonna click somewhere, sometime. If one leaves a million spam links for a few cents (do the math, you’ll see it’s possible), that’s still cheaper than actual advertising – and that’s not considering that some products may not be legally advertised at all.
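The back-of-envelope math above can be sketched as follows. Every figure here is a hypothetical placeholder (the cost of the run, the click rate, the revenue per click), not a measured value; the only point is that a tiny response rate can cover a near-zero cost.

```python
def spam_campaign_profit(links_posted, cost_total, click_rate, revenue_per_click):
    """Return (expected_clicks, expected_profit) for one spam run.

    All inputs are illustrative assumptions, not real-world measurements.
    """
    expected_clicks = links_posted * click_rate
    expected_profit = expected_clicks * revenue_per_click - cost_total
    return expected_clicks, expected_profit


# Hypothetical numbers: one million links for 50 cents of machine time,
# one click per 100,000 links, 10 cents earned per click.
clicks, profit = spam_campaign_profit(
    links_posted=1_000_000,
    cost_total=0.50,
    click_rate=1 / 100_000,
    revenue_per_click=0.10,
)
print(f"expected clicks: {clicks:.0f}, expected profit: ${profit:.2f}")
```

Changing any one assumption (for example, filters removing most links before anyone sees them) shifts the result, which is why better filtering raises the spammer's effective cost per click.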
Okay I guess that makes sense. I see the links in my spam comments are typically for some sketchy product or service. So I guess they get random clicks which at scale can mount into something. Still the whole thing seems pretty soul crushing. I guess every type needs to figure out ways to make $.
free backlinks, basically
I’m not sure how the backlinks work, because Disqus automatically adds rel="nofollow" to links. Maybe spammers are just paid to spam, and they don’t actually care what happens to the link.
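For readers unfamiliar with the attribute, here is a minimal sketch of what adding rel="nofollow" to comment links could look like. It is not Disqus's implementation; the class and function names are made up for illustration, and it uses only Python's standard `html.parser` module. It is also not a sanitizer: it rewrites the `rel` attribute and nothing else.

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit HTML, forcing rel="nofollow" on every <a> tag (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Drop any rel value the commenter supplied, then add our own.
            attrs = [(k, v) for k, v in attrs if k != "rel"]
            attrs.append(("rel", "nofollow"))
        rendered = "".join(
            f' {k}="{escape(v, quote=True)}"' if v is not None else f" {k}"
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def add_nofollow(comment_html):
    """Return comment_html with rel="nofollow" set on all links."""
    parser = NofollowRewriter()
    parser.feed(comment_html)
    parser.close()
    return "".join(parser.out)


print(add_nofollow('Visit <a href="http://example.com">my site</a>'))
```

The attribute asks search engines not to give the linked page ranking credit, which is why a nofollow link is a weak backlink at best.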
It should be a fun day…between wannabe spammers, spammers and jokers.
Don’t forget the trolls….well, yeah, please do.
Gee, it looks like my high school plane geometry teacher was from Norway!
$$$$$ Make thousands. Learn to code and hack in two minutes. Come to this website and learn how. http://www.Ihatespammers dot com
wannabe spammer alert 🙂
Spammers are a pain in the ass but maybe this keeps them busy so they don’t do something worse
A great point. We can reference “bootleggers who got into dealing drugs after prohibition ended”. A baseline of petty crime is probably important for an economy. The only issue is if it rises to a level that gets unmanageable or too detrimental. I’m sure some academic has studied this extensively. I’m reminded of an employee that used to use the company gas card to pay for his wife’s gas in addition to his gas. And how he used to take time from work to help his wife with her small business. In the end it ended up being much more economical than if I hadn’t allowed that (I just looked the other way) and instead he then came in and demanded a raise which would have been more expensive overall.
I have enjoyed your blog for months and have a bit of an odd request. Have you heard of the international scavenger hunt called GISHWHES? It’s going on this week and one of the items is to get a legitimate Term Sheet from a professionally-listed venture capital firm (with their contact information on their official letterhead) detailing their investment in your startup app called “Granny-Grinder” – Grinder for seniors. Would you be willing to play along? Understandably you are a busy man but I have always felt the answer is always no if you never ask the question. Again, I appreciate your blog. I’ve learned a lot. Thank you, Sara
Long time no see! I’ve noticed that Disqus has had a wider spam issue for a couple of weeks now, if not months, and it’s been happening ever since their “Community” update and profile changes. Nowadays I get more and more so-called followers who are just spam accounts following for no reason… It is indeed getting quite annoying and I hope the folks at Disqus will be able to fix the issue.
You have a community right here that would be happy to assist in spam removal. Perhaps Disqus should build a feature that allows blog owners to opt-in and empower some or all users to remove spam (maybe three flags from trusted members = removal?)
We have something. That’s how @wmoug:disqus and @ShanaC:disqus were deputized as moderators.
Thank you @JimHirshfield:disqus. Perhaps Fred might deputize a few dozen more, provided he can limit powers to this function.
Well, he has the flagging feature configured to hide a flagged comment if only one person flags it. Also, flagging emails him, William, and Shana. So, everyone’s efforts already play a role.
We’ve already discussed that a few months ago. Increasing the number of moderators would also increase the management and communications aspects, which isn’t something that was desirable. As @JimHirshfield:disqus said, any user can flag an errant spam comment, and it will be deleted automatically after 3 flags, and moderators get an email notification. I think all in all, this is working pretty well, and AVC is pretty spam free on the whole. We always catch the false positives or false negatives within minutes, if not a couple of hours, typically.
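The flag-and-threshold behavior described in this exchange can be sketched roughly as below. This is a toy model, not Disqus code: the `Comment` class, `flag_comment` function, and `notify` callback are hypothetical names, and the threshold is a parameter because the thread mentions both one flag and three flags.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    comment_id: str
    text: str
    flagged_by: set = field(default_factory=set)
    hidden: bool = False


def flag_comment(comment, user_id, threshold=3, notify=print):
    """Record a flag and hide the comment once `threshold` distinct users flag it.

    `notify` stands in for the email sent to moderators on each new flag.
    Returns True if the comment is hidden after this call.
    """
    if user_id in comment.flagged_by:
        return comment.hidden  # ignore repeat flags from the same user
    comment.flagged_by.add(user_id)
    notify(f"comment {comment.comment_id} flagged by {user_id}")
    if len(comment.flagged_by) >= threshold:
        comment.hidden = True
    return comment.hidden


spam = Comment("c1", "Make $73 every hour on the computer")
for user in ("alice", "bob", "carol"):
    flag_comment(spam, user)
print(spam.hidden)
```

Counting distinct users rather than raw flags is a design choice in this sketch: it keeps one person from removing a comment alone when the threshold is above one.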
Good points William. Thanks for considering the idea.
hang ’em high. none of it is ever native to the discussions or comments. a complete waste of their time.
Just after reading this post, I got an offer to “make $73 every hour on the computer” posted to a comment I made last week. Lol. Is Disqus just taking down the flagrant comments? Or also disabling user accounts that are clearly mostly spam / disabling the ability for associated IP addresses to create another account?
All the above. However, professional spammers know how to quickly change their IP addresses, email addresses, Disqus accounts, etc. But the good news is that once we see them on any Disqus site (among the ~3M sites that use Disqus) and they’re neutralized, they’re neutralized everywhere.
Sometimes I read my old comments and I think they are so incoherent they are spam.
As I wrote last year, spam is both a blessing and a curse. http://hegranes.com/blog/20… It means you’ve arrived, enough so for spammers to target. But it also means you now need to rid yourselves of it asap. There’s very little user-patience for these types of things. I’ve seen Twitter and SoundCloud struggle with this a lot, as well as Disqus. The networked nature of each makes for an easier target… But hopefully also a more cogent solution.
They post for SEO purposes. Thousands of people around the world make/made their living from nothing but posting these. Many have yet to realize that Google’s new algorithms from last year (Panda/Penguin etc.) penalize this. Plus, they have no incentive to lose their job, so they keep going. Additionally, if done correctly, the links can still help a site. As AVC becomes more popular, its worth in terms of SEO backlinks grows – as does the amount of spam posted. It is very hard to automate spam filters – users here commented (and I agree) that the Gmail spam box has many more false positives than before. If mighty Google is helpless, what’s a Disqus to do? Alas, I see more spam in AVC’s future – perhaps the solution is a business/investment opportunity in the making.
All links in comments are No Follow. And the spam/links are deleted in most cases within minutes if they do get posted – hardly there long enough for Google to index them even if they wanted to. So…not seeing how this is an SEO play.
Agreed, Jim. Should have nuanced by saying indirect SEO from actions people might take once on your site (e.g. interaction with social media etc.) and of course there are the advantages of direct traffic, peddling whatever it is on their site, etc. (and a side note that not all blogs are as active as AVC – comments can stay for a very long time, or in some cases, forever)
“So…not seeing how this is an SEO play.” In a spray and pray strategy the whole point is to not spend time on making individual decisions as to the validity of the approach. I’ll give you a real life example. We send out postal notices to customers overseas. We could easily a) stop doing that (we also send email) or b) individually determine if it made sense to send a particular invoice (they cost about $1 each to mail). Over time though it has been determined that it makes sense not to do either of those since sending the invoice seems to increase the chance of a bill getting paid and also build further customer good will (just to mention two things). For example let’s say you want to become a late night comedian. Hypothetically, there are 1000 networks (not just 10) that have late night venues. Spray and pray dictates that if you send a postal letter to 10,000 people at 1000 networks you will have better results than sending 1000 letters (one per network), many of which won’t end up in the right place (wrong guy, left job etc.), even at the risk of annoying some decision maker. The rest of the reasons are secret and will be contained in the ebook that I don’t ever plan on writing.
I agree with you. But none of that has anything to do with SEO which is what I was replying to.
Hi Jim, They could be reading 5 year old articles still in Google’s index that say “comment on a lot of blogs.” Information gets out in waves, and to some people, it never reaches them
I’m afraid I don’t understand.
> it’s pissing me off.From the movie, ‘Armageddon’ (1998):A.J.: You’re pissed. Okay, I can see that.Harry Stamper: No. You know what, A.J.? I’m notpissed. You’ve seen me pissed. This is way, waybeyond pissed, though.
Procedural question: Does anyone get value out of following someone on Disqus? If so how? I get value out of following on Twitter and Tumblr, but wonder about comments on blogs.
Yes, I do. Granted I’m biased, but it helps me discover what my favorite commenters are commenting on elsewhere. https://disqus.com/home/ is an evolving hub of my favorite conversations based on the people and communities I follow.
I’m the founder of Because, a startup that provides a platform for clear and concise conversation in sites’ and blogs’ comment sections. We’re addressing the spam issue with a lot of success by limiting all comments to 280 characters. Check us out in the WordPress directory: https://wordpress.org/plugi…
Respectfully, I’m not sure how well your approach addresses comment spam – there’s nothing in a spam comment’s nature that requires it to exceed your 280 character threshold.
Ironically, Nick Santillo’s comment is spam.
That won’t work for me unless you meant MORE than 280 characters :-))
Since we’re having fun, here’s some (what I hope is) anti-spam, i.e., unexpected good news. Some of you know Fred supported the chess program at IS318 (featured in “Brooklyn Castle”) last year, and later in the year, decided to do the same for PS/MS 282, which won the national chess championship in the U900 section in 2013. Today there was a feature in the NYTimes about “Unlocking the Truth”. The lead guitar player, Malcolm, is a PS/MS 282 student – he’s the one wearing the “black nerds unite” shirt. Better still, thanks to Fred’s foundation, CSNYC, Malcolm will be learning how to code in September. http://www.nytimes.com/2014…
We have the same problem in our products Q&A site. It gets a few hundred page views every day, yet the spammers think it’s worth working around our filters. It’s driving me crazy. Maybe that’s why they are doing it.
Well the good news is that with respect to email spam of course you are a bit hedged with your investment in returnpath. http://blog.returnpath.com/… http://www.returnpath.com/s…
@fredwilson:disqus Would it be worth $.25 per-comment to you to have humans verify a comment is spam-free within 1 minute of being posted? #mechanicalturk
Perhaps another option is a version of Luis von Ahn’s reCAPTCHA
Raise your hand if you love Captchas? They’re a real show-stopper IMHO.
I agree Captchas aren’t fun, but they do serve a good purpose.
Absolutely positively forget captchas. I’m amazed at the number of sites that implement captchas that don’t need to.
Captchas don’t work well enough on many sites. Even if you refresh the captcha multiple times, sometimes you can’t read it.
Yes, I agree many Captchas are hard to read. My suggestion above was a Captcha, required in order to submit a new comment, where the Turing test is whether or not another comment within the environment is spam. Run comments through this test three or four times and users themselves would identify spam comments to be removed.
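A rough sketch of how the votes from such a crowd-sourced check might be tallied is below. The function name, the majority rule, and the default of three required votes are all assumptions made for illustration; the suggestion above only says "three or four times".

```python
from collections import Counter


def review_verdict(votes, required_votes=3):
    """Decide a comment's fate from reviewer votes (True means "this is spam").

    Returns "pending" until enough votes are in, then "spam" on a strict
    majority of spam votes, otherwise "ok". Ties go to "ok" in this sketch.
    """
    if len(votes) < required_votes:
        return "pending"
    tally = Counter(votes)
    if tally[True] > tally[False]:
        return "spam"
    return "ok"


print(review_verdict([True]))
print(review_verdict([True, True, False]))
print(review_verdict([True, False, False]))
```

A plain majority rule like this can be gamed by coordinated accounts, so a real system would probably weight votes by reviewer reputation.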
Okay, I may have misunderstood you. That sounds like a good idea. But have you thought about the possibility of gaming it?
Okay, I may have misunderstood you. That idea sounds good. Have you thought about any issues with it, though? BTW, your site http://bothsider.com/ looks good visually, and the concept is interesting. Does it have any goal, other than just letting people post on both sides of an issue?
Thank you @vasudevram:disqus. The goals were lofty: a massive and robust discussion network that pulls people out of their like-minded worlds and into one with ideas and opinions on both sides of any issue, side-by-side. A dozen or two people really love it, but otherwise there’s not much engagement. Not sure whether it’s a chicken-and-egg problem, social media fatigue, or just not something people want to do. Also, because opinions are side-by-side, it may be fundamentally flawed for mobile (I’ll never ignore the “mobile first” advice again). Thanks again. I’d be happy to learn more from you: [email protected]
Thanks for the reply and details. I’ll write more off-thread.
Mark, there are already services that do that for UGC. CrowdFlower is one. And then there are a host of outsourced human moderation services that many of the largest publishers use. ($0.25 per comment is expensive in this context)
Thank you @JimHirshfield:disqus. I hope no one took my thought and left their job to start a company doing this in the 2 minutes between my comment and your reply.
Jim… maybe you should consider geo-blocking by country as an option in the Disqus account management (AWS recently introduced this as part of CloudFront).
This would kill a lot of good comments/commenters. And spammers know how to use VPNs to make it look like they’re elsewhere.
True – it’s a rather blunt instrument. But depending on the specifics of a blog, some bloggers might find it useful.
I hate comment spam. But really the main differences I see between comment spam and in-stream ads on facebook or twitter is the quality of the targeting.
Okay, we want a ‘spam’ detector. So, fundamentally and inescapably, we have two ways to be wrong: (1) false alarms, false positives, Type I error and (2) missed detections, false negatives, Type II error. The terminology Type I/II error is from the topic of ‘hypothesis testing’ in statistics, e.g., as in E. L. Lehmann, ‘Testing Statistical Hypotheses’; E. L. Lehmann, ‘Nonparametrics: Statistical Methods Based on Ranks’; and Sidney Siegel, ‘Nonparametric Statistics for the Behavioral Sciences’. The terminology false positives/negatives is heavily from testing in medicine. The usual outline of the mathematics is to consider when the situation is normal (usual). Then commonly we have enough information to do some probability calculations to ask, what is the probability of seeing the data we just saw? So, if the probability is really low, then either (1) the situation is normal (nothing wrong) and we just observed something rare or (2) what we observed is too rare to be believable and we conclude that something is wrong, that is, we have detected a problem in what we are looking at, e.g., spam. Maybe we consider a probability of 1%.
Then when our probability calculation says that, if the situation is normal, what we just observed has probability 1% or less, then the 1% is our false alarm rate or probability of Type I error. Commonly in practice we can adjust the false alarm rate. The rate of missed detections, probability of Type II error, is usually more difficult to consider, discuss, or determine. The attached graph shows the ‘generic’ situation of the trade-off between the rates of false alarms and missed detections and how a ‘better’ detector can yield a better trade-off. The best possible solution is well-known from the Neyman-Pearson result, e.g., at times used in US military radar target detection, but needs more information than is generally available in practice. Generally we know that more information should help, and in practice commonly we want to make good use of ‘multi-dimensional’ information. Alas, usually in practice, the probability calculation we would like to do, say, for the Neyman-Pearson best possible result, we can’t do.
So, in that case we look for techniques that are ‘distribution-free’ (sometimes called ‘non-parametric’ because we do not consider ‘parameters’ of probability distributions, say, mean and variance of the Gaussian distribution), that is, make no assumptions about probability distributions. Distribution-free techniques are most often based on ranks, that is, sort some results into ascending order and see where the case of interest appears in the sorted results. Commonly distribution-free techniques have a less good trade-off between rates of false alarms and missed detections, but in cases of a lot of data, i.e., now ‘big data’, there is a chance (maybe should prove some theorems) that the trade-off can be better than in the past. I’ve sent Fred a PDF of a paper that is for such detection (statistical hypothesis testing) that is, uniquely or nearly so, both multi-dimensional and distribution-free, actually, with a large collection of such tests. With meager assumptions that commonly hold well enough in practice, the techniques in the paper permit adjusting false alarm rate in small steps over a wide range and achieving that rate exactly. The paper was intended for the cases when there was a lot of data on the normal situation but little or nothing on the situations to be detected. So, the paper was for what in computer and network monitoring is called ‘zero day’ problems. But, with the large number of spam messages and the enormous number, ‘big data’, of non-spam messages, likely more can be done.
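To make the rank idea concrete, here is a minimal one-dimensional sketch. It is not the method from the paper mentioned above (which is described as multi-dimensional); it is the textbook rank-based test, and the "score" (for example, number of links in a comment) is a hypothetical stand-in for whatever feature a real filter would use.

```python
def rank_p_value(normal_scores, new_score):
    """Rank-based p-value for new_score against a sample of normal scores.

    Counts how many values, among the normal scores plus the new one,
    are at least as large as new_score. No distribution is assumed.
    """
    at_least = sum(1 for s in normal_scores if s >= new_score)
    return (at_least + 1) / (len(normal_scores) + 1)


def is_anomalous(normal_scores, new_score, false_alarm_rate=0.01):
    """Flag new_score when its rank-based p-value is at or below the chosen rate."""
    return rank_p_value(normal_scores, new_score) <= false_alarm_rate


# Hypothetical reference sample: scores from 199 known-good comments.
normal = list(range(199))
print(is_anomalous(normal, 500))  # larger than every normal score
print(is_anomalous(normal, 100))  # sits in the middle of the normal range
```

If the new item really is normal and exchangeable with the reference sample, the chance of flagging it is at most `false_alarm_rate`, which is the adjustable Type I error rate the comment describes. The smallest achievable rate is 1/(n+1), so finer control needs more normal data.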
I’ve had 3 spam replies from the fredwilson.vc comment from last week. Surprising that they weren’t caught by Disqus as they are all pretty obvious… Never had that happen before. Probably a high spam point across Disqus.
That’s why we’re talking about it. Disqus said they switched spam algo’s a couple of weeks ago, and are working to continue improving it.
Is it bot spam or just people posting garbage, or both?
Likely mostly bot, from what I see.
Tough crowd today
that’s ok. not a bad thing from time to time
WHY? COST OF SPAM ALMOST ZERO. BENEFIT HIGHER THAN THAT. END STATEMENT.
Do spammers taste better than top commenters?
THAT THING EAT AT 4AM AFTER 200 BEERS IN DIRTY RESTAURANT YOU WISH YOU NO REMEMBER IN MORNING? SPAMMER TASTE LIKE THAT.
Nothing like pounding a few hundred beers after a long day, I hear ya. But lay off the junk food, it’ll extinct ya.
exactly. ignore the spam. BFD.
If u could have DISQUS implement a paywall for commenting that used “crypto currencies” then we could say 0.000000010 coins would be needed to post a comment, possibly more… Whatever that small amount would be, it would add up for spammers. As for the people who comment, AVC could then pay back the crypto currency in some other form… The idea being that by using crypto currencies we could limit spammers. Some folks will say they don’t have any crypto currencies. AVC could tell them to create a crypto currency account & get a wallet, then they could deposit a certain amount that will allow, say, the first 10 comments to be posted by them using that. For further comments they would purchase the coins, & it would be a good way for people to start using crypto currencies. I am sure Mr. Wilson has been given pitches on such services already, a gazillion of them I bet…
Paywalls are bad for free flow of conversations. It’s an additional barrier and many people won’t comment. It literally turns free speech into paid speech.
It is not to be considered as a wall. When you comment you have to sign in to leave a comment, and that itself creates a wall. If you incorporate the crypto currency payment mechanism into the sign in process then you are not creating an extra layer. Why crypto currency? Cuz most folks won’t mind paying 0.00001 cent etc. for a comment, while for spammers it becomes an issue.
Yes, it is most definitely a wall. 1) not all have access to crypto – 2) at that low rate, it will not deter spam. Fail.
as I stated not everyone has crypto currency, but maybe they are given a starting base amount. The amount that is small enough for individuals but isn’t for Spammers will be defined in time. You will see it being used in many services in the coming years…
so how do you decide who is a Spammer and who is not; and how do you determine pocketbooks and ROI of Spammers?
Spammers will decide that for you. At a price point it doesn’t pay to spam. As for pocket books, everyone who surfs the net can get a crypto coin wallet for free, something like Multibit and AVC would not mind depositing a certain base amount in that pocket book. People would be willing to pay a small amount to comment and or email, this is something that many will accept reluctantly but it is going to happen. The mechanics are being worked out as we speak…
Not I. Gated speech is not free speech. Yes, spam is annoying. Ignore it. If you don’t feed it, it will go away. Better filters. Perhaps DISQUS should tweak. Slippery slope, pay-to-play.
There is nothing free to begin with. Also, if you haven’t read about “siren servers” you may want to, and know that as individuals what we are doing in the so called world of commenting and expressing ourselves at various sites/platforms and not getting compensated is something we will have to reckon with someday if not now. Getting to pay or be paid to participate in a forum is no different than when you go to listen to Mr. Wilson speak at a gathering, even if the payments go to a charity etc. Restricting who gets to participate (paywalls etc.) is something we as humans have done successfully and that model isn’t going away…
Your expectation to be “compensated” for your “comments”. At what value? At whose determination? We share thoughts, ideas, relationships – freely. Yes, it is *very* different than paying to hear Fred give a talk. You are not Fred. We are not Fred. Not to say we cannot be “Fred”. Paywalls are bullshit and I personally choose not to support that model.
It is an extra barrier for first time commenters, and for the first time it’s deployed for all commenters…and then for refilling the credit account every so often. I don’t have to log in currently every time I comment – I’m always logged in to Disqus (for the most part). It’s all extra steps that complicate the process for the avg person. And sophisticated spammers wouldn’t be put off by creating fake crypto currency accounts and continuing their bad behavior.
So, let’s start on designing a Disqus spam detector. I gave some generalities in this thread in http://avc.com/2014/08/a-co… For more, a good place to start is, for posts some humans have determined are spam, what data do we have on the posts? For a first step, take the set of all Disqus user IDs that have never been accused of spam and have the spam filter declare all new posts from those users not spam. For the other users, continue as below. So, for the data available for spam posts, a guess at the data available is user name, user’s IP address, time of day (especially in the time zone of the IP address), and the text of the post. Do we have all that? Is that all we have? Okay, suppose we have that, for a lot of spam posts. Now, consider all the IP addresses. Is that set very large? If not, then we have progress. If the set is large but is in just a few ranges, that is, can be ‘covered’ fairly accurately by not many IP address intervals, then again we have progress. Or, if we parse the IP addresses into the four bytes, can we get some relatively small numbers for some of the byte positions? For the user names that have been used for spam, blocking posts from those names is an obvious step toward a partial solution. To detect spam from spam posters who change their user names frequently, look at the user names the spam has come from and try to put the names into sets each of which has something in common in the name, IP address, IP address range, or IP address geographical location. If so, then progress. Consider the length of the posts, maybe time of day, especially in the time zone of the IP address geographical location. Then, with enough data, can set aside much of the classical statistics of hypothesis testing and just find a function, likely non-linear (that is, not just linear discriminant analysis from classic multi-variate statistics) that seems to separate spam posts from the rest. Then for rates of false alarms and missed detections, just apply the function to the data and count. For a really big collection of data, that should still give fairly accurate numbers for the rates. Else, do the traditional thing of partitioning the data into two parts, the ‘training’ data and the ‘test or validation’ data. Then determine the rates from the test data. Now some popular non-linear functions are from fitting parameters in neural networks, but other approaches are also possible.
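The "apply the function and count" step above can be sketched in a few lines of Python. Everything here is a toy assumption for illustration: the labeled posts are invented, the `looks_like_spam` rule (flag any post containing a link) stands in for whatever separating function one actually fits, and `split` is a plain random partition into training and test data.

```python
import random


def split(labeled_posts, test_fraction=0.5, seed=0):
    """Randomly partition labeled posts into (training, test) lists."""
    shuffled = labeled_posts[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def error_rates(classifier, labeled_posts):
    """Count errors on (text, is_spam) pairs.

    Returns (false_alarm_rate, missed_detection_rate):
    good posts flagged as spam, and spam posts let through."""
    good = [text for text, is_spam in labeled_posts if not is_spam]
    spam = [text for text, is_spam in labeled_posts if is_spam]
    false_alarms = sum(1 for text in good if classifier(text))
    missed = sum(1 for text in spam if not classifier(text))
    false_alarm_rate = false_alarms / len(good) if good else 0.0
    missed_detection_rate = missed / len(spam) if spam else 0.0
    return false_alarm_rate, missed_detection_rate


def looks_like_spam(text):
    """Toy stand-in for a fitted function: flag any post containing a link."""
    return "http" in text


# Invented examples, labeled by hand: (text, is_spam).
posts = [
    ("nice post", False),
    ("see http://avc.com", False),
    ("buy now http://x", True),
    ("cheap pills", True),
]

training, test = split(posts)
print(error_rates(looks_like_spam, posts))
```

In real use the function would be fitted on `training` only and the two rates read off `test`, so the counts are not flattered by data the function has already seen.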
“For a first step, take the set of all Disqus user IDs that have never been accused of spam and have the spam filter declare all new posts from those users not spam.”
So spammers act nice for 2 to 3 comments, get whitelisted, then act badly. :-/
“the data available is user name, user’s IP address, time of day (especially in the time zone of the IP address), and the text of the post.”
Yeah, but reading too much into IP address, time of day/zone isn’t that insightful if it’s a bot.
“blocking posts from those names is an obvious step toward a partial solution.”
This is done every day, every time. Pro spammers whip up new account names, email address, and IP addresses faster than you can say whack-a-mole.
“Consider the length of the posts…”
Spam comes in all lengths. Mostly one to three sentences, in my unscientific observations. But I don’t see how length is a signal one way or the other.
That said, your closing statements are on target. With enough data and sophisticated analysis, pattern matching of sorts, etc. This is all state of the current offerings. The challenge is that the signals like account name, email address, IP address are all fleeting. We capture those pretty quickly and shut them down across the network. But the mass spammers create new ones quickly and continue on with their business. These spammers don’t use their real IP addresses…it’s all masked. And blocking by IP address is risky because you might end up blocking a whole university or corporation because of one bad apple. So, the real data set to analyze is the actual comment text…looking for patterns and getting smarter therein.
Gee, what’s the secret to how you get those nifty vertical bars designating a quote? Need a secret handshake for that one?
> Spam comes in all lengths. Mostly one to three sentences, in my unscientific observations. But I don’t see how length is a signal one way or the other.
It doesn’t “signal one way or another” and doesn’t have to in order to be effective when used just ‘jointly’ with other data. The secret: Be ‘multi-dimensional’. Also, we’re being probabilistic which generally means that we are banning things out in the regions of low probability density of non-spam posts. Why “low”? Because our false alarm rate is our ‘money’ to invest, and for that money we want to reject the fewest good posts we can. So, per multi-dimensional unit of ‘area’ (or volume, if you will) to reject a post we want to be where the probability of a good post is low, that is, generally “out in the regions of low probability density of non-spam posts”. Making mathematics out of this general, intuitive view leads to the best possible result of Neyman and Pearson. That’s E. S. Pearson (son of K. Pearson) and J. Neyman, long at Berkeley, in work from the early 1930s. My proof of the Neyman-Pearson result is based on the Hahn decomposition based on the Radon-Nikodym result with a famous proof by von Neumann. Elementary proofs require some hard swallowing and a rugged stomach. A good example is, you have data on points, each with two coordinates, on a checker board and for each point want to determine if the point is on a red square or a black one. Well, each coordinate alone is just useless since exactly half the points on the board with that one coordinate are red and exactly half are black (advanced readers, yes we know about regular conditional probabilities). So, with this example, one coordinate is just useless, but the two coordinates taken together give a perfect answer.
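The checkerboard example is easy to try. Below is a small Python sketch under assumed conventions that are not in the comment: an 8 by 8 board of unit squares, a square is red when the integer parts of its two coordinates sum to an even number, and points are drawn uniformly at random. The rule that sees only x is one arbitrary choice; any rule using a single coordinate does about as well as a coin flip.

```python
import random


def square_color(x, y):
    """True color of the square containing the point (x, y)."""
    return "red" if (int(x) + int(y)) % 2 == 0 else "black"


def guess_from_x_only(x):
    """A rule that sees only one coordinate (an arbitrary choice of such a rule)."""
    return "red" if int(x) % 2 == 0 else "black"


def guess_from_both(x, y):
    """A rule that uses the two coordinates jointly."""
    return "red" if (int(x) + int(y)) % 2 == 0 else "black"


rng = random.Random(0)
points = [(rng.uniform(0, 8), rng.uniform(0, 8)) for _ in range(10_000)]

hits_x_only = sum(guess_from_x_only(x) == square_color(x, y) for x, y in points)
hits_joint = sum(guess_from_both(x, y) == square_color(x, y) for x, y in points)

accuracy_x_only = hits_x_only / len(points)  # close to one half
accuracy_joint = hits_joint / len(points)    # every point classified correctly

print(accuracy_x_only, accuracy_joint)
```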
[Yes, we ignore points on lines between squares since with any reasonable probability distribution (the technical term is that the distribution is ‘absolutely continuous with respect to Lebesgue measure’), those points have probability zero, and in probability we can, necessarily have to, ignore sets of probability zero. So, yes, events of probability zero actually can happen, but, still, we can and must ignore them.] Warm advice: Take very seriously the advantage of using several variables ‘jointly’, i.e., being ‘multi-dimensional’. The checkerboard example is dirt simple, crystal clear and easy to see, and right on target. Ignore that lesson at your peril. For more on being multi-dimensional in detection, I published a paper in ‘Information Sciences’. Fred has a PDF of the paper in one, likely exactly one, of his email messages of Date: Tue, 23 Jul 2013 15:08:09 -0400
> Yeah, but reading too much into IP address, time of day/zone isn’t that insightful if it’s a bot.
Then, IP address alone doesn’t catch bots: Not a surprise since this result is a special case of a more general empirical result we already know too well, no one thing catches all the cases of spam. But think again about that checkerboard: It may be that with several data variables, that is, being ‘multi-dimensional’, bots can still be detected with good rates of Type I/II error (see http://avc.com/2014/08/a-co… in this thread) even when time of day is included as one of the variables (dimensions). Since so far we don’t have enough in mathematical assumptions to do some derivations to tell for sure in advance, we have to use the TIFO method: try it and find out.
> This is done every day, every time. Pro spammers whip up new account names, email address, and IP addresses faster than you can say whack-a-mole.
The glass is half full, not half empty: You block them for a while.
And, with some slightly tricky work in letting their post appear to them, maybe don’t let them know for a while that they have been detected (maybe in Hacker News discussions called ‘hell banning’). And over time there may be a detectable and/or exploitable pattern to the changes they make. Yes, an advanced spammer can still get through, but likely not all spammers will be advanced, and maybe can block the less advanced ones. So, that would be progress. Again, can’t get a perfect detector and can’t get the best detector from just simple considerations and, instead, have to chip away at parts of the problem so that in the end get good rates for false positives/negatives.
> So spammers act nice for 2 to 3 comments, get white listed, then act badly. :-/
For most blogs, the “acting nice” and not being detected will require posting a genuine, at least somewhat thoughtful, post that is specific to the blog. At’s a lot’s uh trouble for the spammer! For some big time spammer and/or bot, posting at AVC and not looking silly takes some thought, even for ‘regulars’ at AVC, too much thought for a spammer. Who said anything about 2-3? Here at AVC I’ve posted 2081 times and growing. Although maybe Fred, Shana, William and nearly all AVC ‘regulars’ regard my posts as spam, I’ve not actually been blocked yet! More generally, we’re playing the ‘odds’ here, that is, are being probabilistic. As in my post http://avc.com/2014/08/a-co… on this thread with the graph of generics, a ‘perfect’ detector is asking a bit much. So, instead, and inescapably and necessarily, we are looking for a good combination of low false alarm rate and high detection rate, i.e., a good combination of Type I/II error. Putting all user IDs who have posted 100 or more posts at AVC without being detected for spam on a white list should really help the two rates in question.
So, first lift out the obvious baby and then worry about what else might be in the bathwater. In case of a false alarm, have an ‘appeal’ procedure where the user gives some identification, say, e-mail address which can be confirmed, and that user ID is white listed. If later get spam from the user ID, then they get black listed and wasted their effort in the appeal. A guess is that that won’t happen very often.
> And blocking by IP address is risky because you might end up blocking a whole university or corporation because of one bad apple.
So, don’t do that. Instead, be multi-dimensional. So, IP address would be at most just one of the dimensions considered. Then have a shot at blocking spam from that IP address but not the rest of the university. Looking at the text, consider also, among the dimensions, the ‘difficulty level’ of the vocabulary and what words are misspelled.
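The white list and appeal procedure described in this comment amount to a small set of rules, sketched below in Python. The record fields, the `status` function, and the status names are hypothetical; only the threshold of 100 clean posts and the white list, appeal, and black list steps come from the comment. Nothing here reflects how Disqus actually works.

```python
from dataclasses import dataclass

WHITELIST_MIN_POSTS = 100  # threshold suggested in the comment above


@dataclass
class UserRecord:
    clean_posts: int = 0             # posts never detected as spam
    flagged: bool = False            # detector has flagged this user at least once
    appeal_confirmed: bool = False   # user confirmed an e-mail address on appeal
    spam_after_appeal: bool = False  # spam seen after a successful appeal


def status(user):
    """Return 'blacklist', 'whitelist', or 'screen' (run the post through the detector)."""
    if user.spam_after_appeal:
        return "blacklist"  # the appeal effort was wasted
    if user.appeal_confirmed:
        return "whitelist"  # a false alarm was cleared by the appeal
    if not user.flagged and user.clean_posts >= WHITELIST_MIN_POSTS:
        return "whitelist"  # long clean history
    return "screen"


print(status(UserRecord(clean_posts=2081)))
print(status(UserRecord(clean_posts=3)))
print(status(UserRecord(flagged=True, appeal_confirmed=True, spam_after_appeal=True)))
```

Only users in the "screen" state would go on to the multi-dimensional detector, which is the point of lifting out the obvious cases first.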
Viruses = bad. Run-of-the-mill spam – IGNORE.
Nuisance comments fall into many different categories, is what I think you’re hitting at. Can’t just bucket it all as “spam”. Some spam is mass bot generated, some is just manual troublemakers targeting specific users or blogs, some is actually intentionally virulent or otherwise trollish behavior (not actually spam, but just as unwelcome in most cases). Motivations vary from commercial to grabbing attention for the sake of it, to sadistic tendencies.
agree. but it is WE who choose to give them power – or not. By spinning wheels on how to blockade, we waste time better spent in meaningful discourse. I can ignore an annoyance; therefore it fails to annoy.
All good, thanks. Vertical line on your excerpts is just HTML using the blockquote tag. Hell banning doesn’t work on users intent on causing trouble because they are very familiar with this practice and have a separate window to monitor it. For bots, it’s not relevant. For casual troublemakers, who cares; just as easy to shut them down, nothing gained from fooling them. I agree with you regarding the multiple-coordinates checkerboard analogy. I’m no programmer, stats guy…but I suspect this is part of the algos used.
I just get peeved when my pithy, golden-gem comments get filtered.
Making OUR OWN EVERYTHING… really worth the trouble? Why does Microsoft not make the best anti-virus package? Why does The Farmer not make the best retail chain? Why do BMW and Audi not make the best tyres? Why does Intel not make the best OS? Well, it depends on what your business is… and whether the tyre is more important than the carburetor/engine for your business.
thnx. if that is so, how do you account for the increase in obvious spam on avc, in many ways a marquee site for disqus?
I guess I understand, Joe. Your comment above implied that you don’t rely on third parties, but if the increase is dramatic when the 3rd party is down and spam is noticeably more prevalent, then I guess you do, kinda… We all rely on a host of third party pieces to run our businesses and unless you own them outright, you simply never control them.
The internet is forever.
Comments that you delete used to be deleted and were removed from the thread. Disqus(t) never gave a reason behind their change.