Dec 3, 2015

WE Festival 2016

The sixth annual WE Festival will take place in NYC on April 13th and 14th 2016. The Gotham Gal has built this event into the premier networking and learning event for women entrepreneurs. After five years of doing it in partnership with NYU, she has taken it over and is running it together with her sister Susan. I’ve had a front row seat to this process and I can tell you that they have taken it up a notch. I am confident that this year’s WE Festival will be the best yet.

As usual, the event starts with an evening conversation keynote (Rachael Ray is doing it this year) followed by a networking event. The next day is packed with talks, workshops, and networking events. The theme for this year’s event is Resilience, a great mantra for entrepreneurs of all genders.

If you’ve been to the WE Festival in the past, you will know what a special event it is. If you haven’t been to one before, check out this page for an idea of what the event is like.

The central idea of the WE Festival is showing women entrepreneurs that they aren’t alone and that there are many others just like them, some of them farther along, and some not yet even started. It is, at its core, a celebration of women entrepreneurs, thus the term festival in the name.

Attendance costs $100 if you are a student, $350 if you are a WE Festival alum, and $375 otherwise. If this sounds like something you’d like to attend, you can apply here.

#entrepreneurship

Comments (Archived):

aminTorres Dec 3, 2015

This is awesome. Not every city can be Gotham but every city should have a Gotham Gal.
JimHirshfield Dec 3, 2015

Gotham Gal for Mayor
1. jason wright Dec 3, 2015
  
  not wealthy enough.
William Mougayar Dec 3, 2015

The back story on Joanne’s blog yesterday about almost quitting it, and your convincing her to not do that was an interesting revelation.
1. jason wright Dec 3, 2015
  
  “…for a variety of reasons…”
2. fredwilson Dec 3, 2015
  
  it was all about the margaritas
awaldstein Dec 3, 2015

Love this event and have three people that should apply. What is not clear to me is how they decide whom to accept? Admission criteria are good to know when you apply for anything.
1. fredwilson Dec 3, 2015
  
  great feedback. i will suggest that they update the website with that
  1. awaldstein Dec 3, 2015
    
    Thanks!
  2. LE Dec 3, 2015
    
    It seems like wefestival is only open to women, although it doesn’t specifically say that anywhere that is obvious on the website other than “An event for Women Entrepreneurs”.
2. Ruth BT Dec 3, 2015
  
  That was my thought. I’d love to go but need to make sure it was the right fit to fly across the world! Yes, Australia has great events but since the majority of my business is in the USA, it’s about time I focused there more.
Twain Twain Dec 3, 2015

Here’s why the talents and special domain expertise of women (language & communication) really matter in product engineering, code and bottom-line revenues (search, data analytics, advertising). If global society wants to get to optimal “human potential and equality” as Zuckerberg and others envision, then NOW IS THE TIME TO INVEST IN WOMEN.
1. Michael Elling Dec 3, 2015
  
  Just had conversations with a friend over the past couple of days about this topic. She was pointing to all the “male” pundits concerns about AI becoming dangerous and the machines taking over. Which led me to think: a) just how are we going to build human nature into the machines, and b) don’t we, based simply on recent events, have more to fear in our own human nature than the machines? I mean if we really do create sentient machines (beings?) would not their own logical powers of deduction and introspection lead them to doubt our nature as optimal? As in the movie Her, they’ll most likely pack their bags and find a safe corner (or layer) of the universe and float around as software. Come to think of it, how do we know that hasn’t already happened and we’re not just some lab experiment to improve on a prior version somewhere?
  1. Twain Twain Dec 3, 2015
    
    This is in today’s TechCrunch article on AI: “If machines don’t feel remorse, then the very human moral instinct is eliminated.” Everyone in AI knows the machines are logical and autistic. That’s why they can’t understand meaning in human language because language is inherently emotional and subjective. The major flaw in Maths as a language is that it has a logical base inherited from Aristotle, Descartes and others. It has never had an emotional base (someone clever would need to invent it and do so in such a way that it coheres with the logical base). If the machines continue with this purely logical base, then they just act in the interests of zero-sum outcomes: optimizing data to the machine’s logical idea of a win, rather than contextualizing data to the human mind’s consideration (thinking with care) of what’s a win for the greater good. We’re all already in lab experiments involving AI. That’s what A/B testing is. It’s what emoji buttons do in social networks. It’s what Google changing its search results is about, etc.
    1. Michael Elling Dec 3, 2015
      
      Leaves me with lots of questions, especially as I don’t think we’ve built this internet or social media platforms well from a human nature perspective. Or maybe we have…
      1. Twain Twain Dec 3, 2015
        
        Here’s the thing. Our natural human intelligence is a fantastic integration of our mothers’ X DNA-code and our fathers’ Y DNA-code. How has the Internet (Global Brain) been set up and how does data and existing “Knowledge Graphs” get classified? By mathematical logic (i.e., binary methods that fit neatly into a probability distribution curve) rather than human nature and the coherency of our minds which are … QUANTUM phenomena.
      2. Michael Elling Dec 3, 2015
        
        Glad you brought us here. The evolution of data networks since 1984, but really the early 1990s, is consistent with voice networks from early 1880s to 1913. While we’ve learned a lot about monopoly & competition we’ve managed to forget important principles. Today’s Internet is a set of disconnected silos. We need to rethink this settlement free peering model & develop scalable, relatively open settlement exchanges that clear supply and demand both north-south (apps to infrastructure) & east-west (between networks & actors or users). Communication networks are inherently 2-way & of most value when real-time (although latency & long tail clearly can be used & are of value). The 1-way store & forward mathematicians didn’t model the complexity of a full duplex, everywhere (mobile & fixed), low cost, all the time 4K VoD & collaboration and IoT future. What’s missing are price signals & incentives (settlements) and mandated interconnection out to the edge to get us there. In other words an “internet” that resembles natural networks, such as the human body & mind.
      3. Twain Twain Dec 3, 2015
        
        A-ha, well … In this lifetime … this WILL happen … The economics part of the problem (supply-demand, monopoly pricing, signal:noise et al) is solvable once the basic building block of the mathematics is solved. And someone is solving that Quantum super-position which will re-engineer the Internet and everything else to be much more human, coherent and considerate.
      4. Lawrence Brass Dec 3, 2015
        
        There are powerful forces against graph shaped networks. Almost all the commercial value of today’s human networks comes from primitive forms of AI analyzing our human interactions with the sole purpose of influencing or bending your next buy decision, by means of injecting messages into the communication channel. This valuable activity is simpler to do in a centralized network where all the interactions flow through a single node. Interactions occurring exclusively at the outer edge are harder to track and analyze, so they become less valuable (harder to monetize). Advertising, rights management, spying, all these activities get more complicated in the presence of exclusive edge interactions. But why are you chatting about this? We should be looking for costumes and wigs to attend WE.
      5. Michael Elling Dec 3, 2015
        
        But what if everyone is on a unique demand curve? While everyone has talked about convergence, it’s been more around supply. A case can be made for an enormous amount of latent demand divergence that is waiting to be unleashed. Network theory confounds modern economic thinking; particularly that which leads to development of silos. It is neither completely centralized nor decentralized. There is no one organ that controls the body; they are all interdependent. I haven’t come up with a good term yet. Maybe centralized hierarchical networks?
      6. Stephen Voris Dec 3, 2015
        
        Everyone is different; lots of people are similar. The trick is figuring out which similarities and differences are the most important (not that I can claim to have this figured out). I tend to think of this in musical/poetic terms – theme and variation, rhyme, motifs that almost-repeat in different keys or stanzas.
      7. Michael Elling Dec 3, 2015
        
        Not just similar, but contextually connected. And that is missing in the current model (and the resulting messaging).
      8. Twain Twain Dec 4, 2015
        
        At a Deep Learning dinner the other night, one of the speakers said, “We used to say ‘Content is king’ but now it’s ‘Context is king.’” (@wmoug:disqus) It’s LOL when men plant their ownership stakes and claim masculinity for everything and completely abstract female contributions from the equations involving successful creations. Conversely, when misandrists respond by “cutting men back down to size” that’s also less productive than if both genders would give each other credit for our respective strengths and contributions. My position is that the greatest tech breakthroughs and successes are achieved via the INTERDEPENDENCY & FUSION of male and female talent. * https://www.startupgrind.co… That’s why kudos to @fredwilson:disqus and his wife for consistently crediting each other for their shared success. I’d argue the case that, “If content is king, context is queen and coherency is their progeny.” As time passes, this idea is validated by market events. There are huge volumes of content and super-fast servers (even Quantum ones). Male engineers have applied almost every quantitative tool known to Google, IBM Watson, FB, the big banks, consultancies & governments to correlate that content. This is what “Big Data” and Machine Learning do. Male engineers have tried to put that content into context by classifying if the bit of content refers to a person, place, price, friend-of-friend etc. on a Knowledge Graph. Yet Google itself has now admitted, “Meaning is something that has eluded Computer Science.” Why and how is it possible that Google can correlate huge quantities of content yet not be able to measure meaning? Because the code for context, language and meaning is baked inside… the female X chromosome, and that neuroscience research is only just emerging (@MsPseudolus:disqus, @lawrencebrass:disqus). Today’s neural nets, Deep Learning et al are based on neuroscience and academic research from the 1950s and 1960s.
They all have a male logic bias because 99% of academics in AI are male and they think in terms of mathematical logic because that’s the building block of the codebase. Of course, some random outlier women who happen to be mathematicians, engineers, product people, artists and multi-lingual might just be those atomic forces of motion that set us onto a different path to ENABLE MEANING IN COMPUTER SCIENCE, human+machine coherency and re-frame how the machines serve Humankind and our collective potential. Yes, so for truly intelligent context and more … invest in female knowhow.
      9. Michael Elling Dec 4, 2015
        
        Doesn’t tolerance of risk, especially as much of this computation is about predicting (or swaying) future events, factor in as well? And may it not account for a significant portion of the gap? Coming from the capital markets I’ve always been surprised that no one questioned the way risk was being minimized, namely just kicking it down the road with derivatives. Would a female dominated world have put us in such a leveraged position over the past 30 years? Maybe women intuitively know or expect that s–t will happen and men fool themselves or turn a blind eye to the eventuality. But in taking more risk, rather than less, the range of unintended outcomes, or aha moments, or innovation increases. And so by default men dominate? More a question than a statement.
      10. Twain Twain Dec 4, 2015
        
        Christine Lagarde, Head of IMF, in May 2015: “We must move from rule-based behavior to values-based.” * http://www.americanbanker.c… So here’s the thing: similar problems and blindspots exist in the financial sector as they do in tech and AI (and the Blockchain doesn’t solve these blindspots, by the way). It’s to do with 2 specific flaws in economics and computing: (1.) All models follow rules-based logic. (2.) Price has been applied as a proxy for values, even though, to be precise, price is quanta (numbers) and values are qualia (language), aka two different factors. The assumption that price has values baked in and then to “kick it down the road with derivatives” is one of the biggest fudges in economics ever. There are all sorts of issues re. risk management. It’s straightforward enough to model interest rate risk, right? Just do time series extrapolations, apply some gradient descent to work out where interest rates could top / bottom out and then co-correlate across countries for country risk. See? For quanta such as price, interest rates, yields, etc there is established and reasonable mathematical methodology. However, for qualia such as political risk, people risk, development risk, cultural risk etc… fudging (fuzzy logic of probability and derivatives) happens. I know this because … I sat my final maths exam on Thursday and by Monday I was working at http://www.risk.net, writing about… derivatives and risk mgmt systems.
After that, a Cambridge don recruited me into a hedge fund where we had several of those “hot” Deep Learning models the AI sector is now so enamored with. These experiences are why I know the utility and limitations in the “hot” AI and why not even Google can do meaning in Natural Language Processing. Compounding my knowhow is this: I spent several years co-investing in institutional trading platforms AND wrote some strategy papers arguing against mortgage CDOs with special attention to risk derivatives and prop trading. Well, if the bank had listened… they wouldn’t have ended up writing down $38+ BILLION in value. The issue has less to do with whether a female-dominated world would have let the machines go up to gearing of 60 times to 1. It has more to do with inventing and configuring tools such that QUANTA (huge volumes of content) and QUALIA (specific vectors of context) make coherent sense. It’s not a trivial problem for the financial sector and the technology sector (in particular AI) to solve. It gets to the very core of the differences between Maths as a language (quanta) and Natural Language (qualia) and how to integrate the two coherently. In any case, “mad artist-scientist” that I am, I wrote some Quantum notation for the computing code structure in 2009/10 and it makes sense.
Not even W3C has been able to write that notation, much less Google, the quantum physicists, neuroscientists and economists, haha. It solves the common problems and blindspots in finance and in tech, especially wrt AI & Natural Language. Unfortunately, the system won’t ship in time to do anything about the next crisis which is predicted to happen: * http://www.imd.org/research… I’ve taken more risk and had more guts than any guy out there to invent my systems. Only someone completely crazy (in a good way) would take apart legacy models that have been with us for 100s of years (Aristotle, Descartes, Bayes, Adam Smith, Dr Johnson’s dictionary, et al) and solve tech’s limitations long before… Google SVP says, “Meaning is something that has eluded computer science.” Yes, well, it doesn’t elude me and my inventions, which are laser-precise in a world of blunt and fuzzy logic.
      11. Twain Twain Dec 3, 2015
        
        Haha, Lawrence, spot on: guys would learn a lot from attending WE too.
    2. Stephen Voris Dec 3, 2015
      
      Just to play devil’s advocate for a moment – remorse is merely one part among several of moral instinct; if someone or something is missing remorse, but has disgust, sanctimony, and righteous indignation… we might not like the results, but it’s hard to say they don’t have human moral instincts. As for an emotional base for math, I’d be inclined towards vector spaces as a starting point; picturing a bunch of dimmer switches labeled “pride”, “fear”, “love”, and so on, elastic bands of varying tautness stretched between them to say which ones tend to move together and how much.
  2. Twain Twain Dec 3, 2015
    
    Tech tools are supposed to be able to measure, model, illuminate, inform and reflect us so that we can see and understand the who, what, when, where, how and WHY more clearly. Knowledge should enable us to, collectively, be more aware and then make changes for the betterment of global society (in economic models, in diversity & inclusion, in systems intelligence). @fredwilson:disqus and @wmoug:disqus have seen these slides of mine.
2. skyberrys Dec 4, 2015
  
  Thanks for the vote of confidence 🙂 The female computer engineers are silently sitting at their desks right now working on solutions for AI. I do think though, that the bottom line revenues reside in taking care of really basic and fundamental needs. Why lump advertising into your bottom line? Search I understand, it’s amazing how much more proficient society could become if we could immediately find answers to even somewhat simplistic pressing questions, like “What’s for dinner?” in ways that would benefit the largest number of humans. Data Analytics, ok that is essentially the building blocks of current AI solutions for finding answers to huge systems we do not understand. But advertising is somewhat misleading as a bottom line revenue. Why would your AI researcher care at all about making sure corporations can pay $$$ to influence the emotions of eyeballs? If I’m going to work on something that is supposed to result in optimal ‘human potential and equality’ and someone holds up a measuring stick that looks like advertising, I’ll fail all of the tests as fast as possible. Yes we need revenue, but what you are really asking for is an optimal system. Something that searches out the transactions that humans need to be engaging in on a daily basis and takes care to minimize barriers to those events from taking place. What does it look like? How does it work? It’s probably not even easily recognizable, maybe it’s an economics solution or a political action. Yes, somewhere in the tangle it does involve CMOS machines and ok quantum computers do seem like they could solve different sorts of problems, but is the answer neatly packaged to fit into the business model canvas? Not really. It looks much more like a neural net, and those are messy and complex.
  1. Twain Twain Dec 4, 2015
    
    Whether we like it or not as engineers, investor decisions are based on whether or not a platform has advertising potential. The business model canvas is a set of discrete 2D boxes whilst current NN involve scalar probabilistic weightings. Neither is necessarily the optimal structure to model “human potential and equality”.
    1. Lawrence Brass Dec 4, 2015
      
      Yes, we don’t have to like it… who does? Ads are like a skeleton in the closet, it is there but everyone avoids talking about it, at least not in direct terms. And, as you say, that is where most internet companies’ revenue, and indirectly valuation, comes from. I recall conversations with friends looong ago, we were amazed by cool companies like Google and Facebook and their growing valuations and did not understand how and when would these companies make ‘real’ money. In our brick and mortar mindset back then, money came from selling ‘something’ whether it was man-hours (here is another case of sexist language we need to fix), software products or web pages, this was before SaaS. I guess this will sound stupid now, especially to people in marketing and probably investors, but years later when we saw these companies running ads on their services, we were rather disappointed. So this was it, it was just like cheap open TV, the only difference was that the content was generated by the same people using the services.
    2. Stephen Voris Dec 4, 2015
      
      A thought – granted, one that I’ve been mulling for a while by now: why is advertising treated so differently from other forms of broadcast content? And, what makes corporations so special when it comes to advertising? (yes, these questions are partially rhetorical, but only partially) Paradoxically, a solution to disproportionate corporate influence might well be to make advertising more available – so that you don’t have to have the resources of a corporation to put your message in front of strangers (not to be confused with the “share” button, which I believe is for putting your message in front of non-strangers).
pointsnfigures Dec 3, 2015

Support networks for entrepreneurs are integral to startup ecosystems. In Chicago, we have a few good ones for female entrepreneurs and we have 30% female founded companies, highest in the world.
1. William Mougayar Dec 3, 2015
  
  Wow…where is that statistic from, on the 30%. Does it include data from other cities?
  1. pointsnfigures Dec 3, 2015
    
    The Compass Report on startups. Chicago was ranked #7 startup hub in the world, (NYC was #2, LA #4, Toronto was right there too)
Twain Twain Dec 3, 2015

Regardless of whether it’s in entrepreneurship or in corporate, women need to be more visible, vocal and sharing our special knowhow. AVC post today on WE Festival is an important reminder of this. Right before I popped by here to see what’s happening, where was I? Registering for the Deep Learning Summit. Let’s play “Count the number of female speakers.” Why do the machines not grasp the meaning in human language, as Google’s SVP of Search has admitted? Because they haven’t had “Mothers of AI” to teach and train them. The machines need modern-day Ada Lovelaces more than ever if global systems are to understand us, our meanings, our intents, our hopes & dreams, our values.
1. Carrie Dec 4, 2015
  
  $450 for student tickets… It was cheaper for me to fly to NYC from CA last year for WE5, and buy my WE5 $75 student ticket, and sleep on my co-founders couch for 4 days. I can’t even write a paper or submit a poster to get a cheaper ticket to this. But I could see Karpathy present!
Twain Twain Dec 3, 2015

Maya Angelou… WE Festival and others like it let women stand up and show+share why+how they’re making dents+differences to the startup universe and in global society. Thank you, the Wilsons, for being each other’s conscience and action engines.
Kirsten Lambertsen Dec 3, 2015

I can testify that this is indeed a unique and excellent event. The focus on interactivity is outstanding. I would encourage women at all stages of entrepreneurism to apply. The impression I got when I attended is that applicant selection is done with an eye towards creating a diverse group. I met everyone from college students to seasoned serial founders. I’d love to see them offer a couple of scholarships, if they don’t already. Entrepreneurs are often very poor 🙂
LE Dec 3, 2015

Attendance costs $100 if you are a student, $350 if you are a WE Festival alum, and $375 otherwise. I am noting that the prices that you are charging for admission are very reasonable given what it must cost to put this event together.
1. William Mougayar Dec 3, 2015
  
  Prob because of the sponsorships that help to subsidize it.
Alan Warms Dec 3, 2015

Fred is there a way the people can donate money to sponsor entrepreneurs who may not be able to afford the $350? I am confident your audience could put together a nice chunk of change to finance some just starting out folks to attend. Sign me up
1. Twain Twain Dec 3, 2015
  
  YES! THIS +googol times.
2. JoeK Dec 3, 2015
  
  This really encapsulates the hype corroding the startup ecosystem today. The expression “donate money to sponsor entrepreneurs”.
  1. NatM Dec 4, 2015
    
    Considering that past attendees are considered “alumni” the term “scholarship” seems appropriate here. Is the fact that money is donated to literature majors to cover their tuition indicative of hype in the humanities ecosystem?
3. LE Dec 3, 2015
  
  I had thought of something similar (and wrote to my daughter to see if she and a friend would like to attend for example). However the issue is also that any entrepreneur should have some skin in the game. So I don’t know if it’s a good idea to reduce the amount to $0. I think if you aren’t willing to invest $100 or even $80 of your own money, then you probably are not the right person to come to this particular meeting (emphasis “this particular meeting”). Additionally for networking purposes the crowd probably has to be carefully managed.
  1. Jess Bachman Dec 4, 2015
    
    I dunno, sometimes gifts can carry a lot more obligation than the price of the ticket. You know… like when your mother-in-law gives you that thing that has been passed down in the family and has no monetary value but you have to use it at every holiday anyways.
4. Nidhi Mevada Dec 3, 2015
  
  Yes there should be sponsors. It will be difficult for an outsider (non-US person); they have to pay travelling costs, accommodation and other things. Let me know if there are sponsors, I also want to attend.
sigmaalgebra Dec 3, 2015

Ah, just JPG, 6000 x 4000 pixels, 24 bit color image, name, address, phone, e-mail address, bio, marital status!!!!! Some favorite recipes, home decorating themes, child nurturing ideas. favorite movies. Good at home schooling kids? List of technical topics optional!!!!! :-)!!!!
1. Lawrence Brass Dec 3, 2015
  
  Um… like a social-business card? Or are you decorating your den for Christmas.. or just under the influence of ginger and cinnamon edibles?… or both? 🙂
  1. sigmaalgebra Dec 3, 2015
    
    Ah, now it’s December! So, anything she could do good with butter, sugar, flour, cinnamon, and ginger: TERRIFIC! E.g., Mom used to take left over pie dough in a flat circle, top it with butter, sugar, and cinnamon, and bake it! Sometimes she also made eggnog. Ah, somewhere recently saw a picture of Cinderella taking care of the mice, the little ones! Some of those Disney Cinderella drawings are nice! If Cinderella could also home school them, get them through the Bach Chaconne, e.g., Heifetz: https://www.youtube.com/wat… Heifetz Master Class: https://www.youtube.com/wat… (there, the student’s second effort is near the end of the center D major section and just before the climax of the piece) Hahn: https://www.youtube.com/wat… Lisitsa https://www.youtube.com/wat… and their Princeton math Ph.D. qualifying exams, so much the better! Heck, if she could just say what is the fastest way to get rid of duplicate lines in a big Web site log file, depending on the size of the file, the number of entries, and the fraction of different lines, that would also be okay.
And the candidates are: (1) Pull the log file into a text editor, sort the lines, have a little macro delete the duplicates and leave the rest, (2) have in main memory a heap data structure used as a priority queue that discards duplicates, read the log file once, stuff each entry into the priority queue, (3) do the same as (2) except use a version of a heap data structure with good locality of reference when using lots of virtual memory, (4) like (2) except use the Fagin, et al., extendible hashing, (5) sort the entries using a disk based sort-merge routine, on at least four hard disks, and then read the sorted file and write out the unique entries, (6) partition the log file, assign one core to each partition, borrow from sort-merge, have each computer discard duplicates as it finds them, and then sort, merge, and delete the results of all the cores into one file. Which is fastest depending on the nature of the log file? Make some reasonable assumptions and do some applied probability. Or, maybe she could take a resampling plan for a multi-dimensional, distribution-free, statistical hypothesis test anomaly detector and say how, for a given probability of Type I error the power of the test improves as the size of the input data grows. Or, she could say how to take a spreadsheet for multi-period project and/or financial planning over time, under uncertainty, upload it to the cloud, apply stochastic dynamic programming optimization with enough efficiency to work in the cloud, e.g., with multi-variate spline interpolation for the optimal value functions, scenario aggregation, certainty equivalent linear, quadratic, Gaussian approximation, and report the results.
Good way to keep the world’s largest cloud busy for a few years! Had an apple pie over the weekend and a pumpkin pie back near Turkey Day. On the Internet, recently found the phone number of one of the prettiest girls in my high school, called her, and got caught up on the gossip. Yesterday happened to see some of Deanna Durbin in Spring Parade, (1940) http://www.youtube.com/watc… she’s pretty and sings well, and some of Barbara Bonny, the one on the left, in the scene Presentation of the Rose at http://www.youtube.com/watc… where she is pretty and sings even better, and it’s about time for The Nutcracker. Anything she could do good with butter, sugar, flour, cinnamon, and ginger: TERRIFIC! Ah, before the cinnamon must come the typing; I’ve got some typing to do!
    1. Lawrence Brass Dec 4, 2015
      
      I am always amazed by the length of your replies and enjoy the contents too, thank you. My daughter is a musician and plays the violin, I will ask her to play that piece for me, if she can. I witnessed the long process she went through learning to play so I admire people devoted to music, only true love for music can explain that career choice. Of all the choices to solve the problem Cinderella won’t solve I like something between 2 and 3. It is funny reading you trashing C one day and the next day mentioning locality of reference. The best language I know to achieve decent locality of reference in a direct and straightforward way is C/C++, where you can *precisely* define the memory layouts of data structures. If you stack languages considering their abstraction level, C/C++ fits the exact slot where concepts like locality of reference have a meaning. Language wars are boring, but I read you yesterday or the day before and had this thorn piercing my foot. Back to the solution to the problem, I would suggest computing a non-cryptographic hash for each line. Save hash, line position within the file and line length into a data structure. Insert the hash into a binary (or better) tree, to eliminate duplicates. Save non dupes (line position and length) into an array. Using the array, copy the non dupes from the original file to the processed file. Fast and simple, probably I/O bound, not very multicore friendly. Sprinkle with cinnamon.
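The hash-and-offset steps in the comment above can be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not the commenter's code: `zlib.crc32` stands in for the unspecified non-cryptographic hash, a Python `set` stands in for the binary tree, and the function name `dedupe_by_hash` is hypothetical. As written it trusts the hash, so two different lines that happen to share a CRC32 value would be treated as duplicates.

```python
import zlib


def dedupe_by_hash(src_path, dst_path):
    """Copy src to dst, keeping the first line seen for each distinct hash.

    Assumption: CRC32 is used as the non-cryptographic hash, and a hash
    match is treated as a duplicate without comparing the actual lines.
    """
    seen = set()   # hashes seen so far (stands in for the tree)
    keep = []      # (offset, length) of each line to copy

    # Pass 1: record position and length of every non-duplicate line.
    with open(src_path, "rb") as src:
        offset = 0
        for line in src:
            h = zlib.crc32(line)
            if h not in seen:
                seen.add(h)
                keep.append((offset, len(line)))
            offset += len(line)

    # Pass 2: copy the kept lines from the original file by position.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for offset, length in keep:
            src.seek(offset)
            dst.write(src.read(length))

    return len(keep)
```

Both passes read the file sequentially, which fits the "probably I/O bound" remark; memory use grows with the number of distinct lines, not the file size.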
      1. sigmaalgebra Dec 4, 2015
        
        > My daughter is a musician and plays the violin, I will ask her to play that piece for me, if she can. Great congrats on having a daughter, especially one who cares enough to work hard enough to play violin. Nearly all violin students work on classical music, and there that Bach piece is known better than cinnamon at Christmas. Bach wrote six pieces for violin and six more for cello. Each big time violinist plays and records all the violin pieces, and same for cello. Each piece is in several parts. The Chaconne is the last part of the 2nd Partita. Still more popular is the 3rd Partita, especially the first part, the Preludio, mostly in E major but with some parts in A major. Heifetz: https://www.youtube.com/wat… I’d expect that, from your daughter’s practicing, you would know that piece to the last note. Your approach to the unique lines problem via hashing the string of each log file entry seems to have an issue: Duplicate hash values do not quite guarantee duplicate log file entry strings. So, with each duplicate pair of hash value and entry length, still have to check if the entry strings are actually duplicates. Would have to do this for each duplicated pair of hash value and entry length and for such a pair use the file position and lengths to extract the actual strings and compare them. So for each set of duplicated pairs, would have to go skipping through much of the file doing this check. This skipping could be hard on rotating disk, but might consider the situation as quite different and okay with the given log file on solid state disk drives. Okay, suppose, instead, do take the hash you have in mind, use the heap data structure, organize the heap so that each key is the pair of the hash value and the string, go chasing down the heap comparing only hash values; when find a duplicate hash value compare the full key of the hash value and the entry string.
As before with the heap, if the entry strings are equal, discard the one that is a candidate for the heap of unique strings.Just getting for each log file entry string, the hash value, string length, and string file position seems to take us close to the old idea of sorting by surrogates which, IIRC, was considered not so good because at the end of working with the surrogates still had to move the actual strings to where they belonged which ate up essentially all the speed advantages of working with just the short surrogates. With all the data on disk, for getting the unique strings where want them, the work of traditional sequential I/O sort-merge is ballpark the fastest way. Again, SSDs might let sorting by surrogates be a better idea now.Indeed, if have the given file on SSDs, the in the heap keep just your triple of hash value, string length, and file position, and when need the actual string, which will for at least each duplicate, just get that from the SSD — since are reading the data from SSD anyway, keeping the unique string values on the SSD instead in the heap in main memory might not be too slow.If want to have detailed control of memory usage, e.g., including page alignment, then in most languages just allocate a large array of type byte so that for some constant each array index plus the constant is the virtual memory address of the array component with that array index, and there apply any memory management technique want that doesn’t need a lot of compiler and OS support. So, don’t need to be limited to C/C++ to do that. 
There should be a way to get the virtual address of the first byte and, thus, where the both real and virtual page boundaries are, at least if know what the page size is.This use of an array for the space for dynamic storage management with using just array indices for addresses has lots of gray hairs, e.g., goes back to Fortran where could use a block COMMON for the array so that all parts of the program could address the dynamic memory.A few years ago I was coding up the W. Cunningham (Department of Combinatorics and Optimization, Waterloo) modification of the network simplex algorithm for the standard pharmaceutical salesman (detail men) problem, coding in Fortran to have portable code, and writing my own dynamic memory allocation starting with a block COMMON idea.But the locality of reference I had in mind was really just to reduce paging, and for that there is a modification of the usual heap approach that should not need really careful work with addresses, page alignment, etc. That is, will be working with a heap larger than main memory and, thus, will want to be local to maybe 100 pages at a time and not just one or two.Congrats to your daughter. If she knows the Bach pieces, definitely find something she would really like you to do for her and, then, do it! See your daughter smile — one of the greatest prizes life has to offer.
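        The "large byte array with indices as addresses" idea above can be sketched as follows, in Python for brevity. This is only an illustration under stated assumptions: `Arena`, `alloc`, `store`, and `load` are hypothetical names, the allocator is a simple bump allocator that never frees, and "page aligned" means aligned relative to the start of the array, since the sketch does not look up the array's real virtual address.

```python
class Arena:
    """Bump allocator over one preallocated bytearray; 'addresses' are array indices."""

    def __init__(self, size, page_size=4096):
        self.buf = bytearray(size)   # the one big block of storage
        self.page_size = page_size   # assumed page size, in bytes
        self.top = 0                 # index of the first free byte

    def alloc(self, nbytes, page_aligned=False):
        """Reserve nbytes and return the starting index."""
        start = self.top
        if page_aligned:
            # Round up to the next multiple of page_size (relative to index 0).
            start = -(-start // self.page_size) * self.page_size
        if start + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start

    def store(self, data, page_aligned=False):
        """Copy data into the arena and return its index."""
        addr = self.alloc(len(data), page_aligned)
        self.buf[addr:addr + len(data)] = data
        return addr

    def load(self, addr, nbytes):
        """Read nbytes back from index addr."""
        return bytes(self.buf[addr:addr + nbytes])


arena = Arena(3 * 4096)
a = arena.store(b"first log line")
b = arena.store(b"second", page_aligned=True)
```

        Data structures built on top of such an arena refer to each other by integer index rather than by pointer, which is what lets the same approach work in Fortran, VB, or C.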
      2. Lawrence Brass Dec 5, 2015
        
        There was a period in our lives when my daughter and I were a bit apart, she chose to live with her mom, far away from the city, so I missed many day-to-day details about her for a few years. Of course this is only a lame excuse for not having a memory of her playing the Chaconne... she probably did and I wasn’t listening. We see more of each other now that she is back in the city, those are precious moments we both cherish.

        Of course you are right about the issues in the approach, I totally missed hash collisions, which I am aware of. My bad. As you said, the strings corresponding to repeated hash-length pairs have to be checked on disk. I would do this on a second pass, whose purpose is to copy the unique strings to their final destination, disk to disk or disk to RAM, so the first pass can continue sequentially. Every time I need solutions for this type of generic problem I look around to avoid having to build the solutions myself, and it is very unusual to find what I want, the way I want it, in the language I want it, etc. Very often I end up doing things myself.
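        A rough sketch of the two-pass variant described above, in Python for brevity. The names `index_lines` and `copy_unique` are hypothetical, `zlib.crc32` is an assumed choice of hash, and in-memory `BytesIO` objects stand in for the files. The first pass reads sequentially and records (hash, length, position); the second pass re-reads the file and compares the actual bytes whenever a (hash, length) pair repeats, so collisions no longer drop distinct lines.

```python
import io
import zlib
from collections import defaultdict


def index_lines(src, hash_fn=zlib.crc32):
    """First pass: sequential read, one (hash, length, position) triple per line."""
    entries, pos = [], 0
    src.seek(0)
    for line in src:
        entries.append((hash_fn(line), len(line), pos))
        pos += len(line)
    return entries


def copy_unique(src, dst, entries):
    """Second pass: confirm candidate duplicates against the file, copy unique lines."""
    kept = defaultdict(list)   # (hash, length) -> positions of distinct lines kept
    for h, length, pos in entries:
        src.seek(pos)
        line = src.read(length)
        is_dup = False
        for other in kept[(h, length)]:   # same hash and length: compare bytes
            src.seek(other)
            if src.read(length) == line:
                is_dup = True
                break
        if not is_dup:
            kept[(h, length)].append(pos)
            dst.write(line)


# A deliberately useless hash forces every line to collide.
log = io.BytesIO(b"x\ny\nx\nz\n")
out = io.BytesIO()
copy_unique(log, out, index_lines(log, hash_fn=lambda line: 0))
```

        The seeks in the second pass are the disk head movement discussed in this thread; with a reasonable hash they happen only for true duplicates and the rare collision.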
      3. sigmaalgebra Dec 5, 2015
        
        > I would do this on a second pass

        Have to skip through much of the original file for each case of a duplicate pair of hash and length; that’s a lot of disk head movement, better done on an SSD.

        Hope your daughter plays the Chaconne for you. There is so much in it it’s not easy to like on a first hearing. So, to get ready, listen, first, to, say, the orchestral version where it’s much easier to hear everything:

        https://www.youtube.com/wat…

        Okay, if you like to use work from others, here is my heap routine, function obj_heap_insert in Visual Basic .NET:

            ' Function obj_heap_insert
            ' Object heap insert to use the heap algorithm to maintain
            ' a 'priority queue'.
            '
            ' So, suppose for positive integer n we have x(i) for i =
            ' 1, 2, ..., n and for positive integer m <= n we want the
            ' m largest of the x(i). So, we allocate an array y(j), j
            ' = 1, 2, ..., m.
            '
            ' Suppose we regard y as an ascending heap, e.g., so that
            ' y(1) <= y(j), j = 2, ..., k <= m where the number of
            ' locations of y in use so far is integer k where 0 <= k <=
            ' m.
            '
            ' Then for i = 1, 2, ..., n, we consider inserting x(i)
            ' into the heap.
            '
            ' If k < m, then we set k = k + 1 and set y(k) = x(i) and
            ' 'sift' to 'promote' the value in y(k) until we have a
            ' heap again.
            '
            ' If k = m, then we compare x(i) and y(1). If x(i) <= y(1),
            ' then we are done with x(i). Else, x(i) > y(1) and y(1) is
            ' not among the m largest and, to 'remove' value y(1) we
            ' set y(1) = x(i) and 'sift' the value in y(1) to create a
            ' heap again.
            '
            ' After i = n, y contains the m largest of x(i), i = 1, 2,
            ' ..., n.
            '
            ' The advantage of this routine is speed: When k = m, the
            ' effort to insert x(i) is proportional to log(m). When k
            ' < m, the effort to insert x(i) may be proportional just
            ' to log(k). The worst case would be when the array x was in
            ' ascending order since then x(i) would have to be inserted
            ' for each i. Similarly, in the case the array x is in
            ' descending order, after m inserts, no more inserts will
            ' be done. For the order of array x 'random', once a
            ' relatively large fraction of the m largest are in the
            ' heap, additional inserts become relatively rare.
            '
            ' This routine is to be 'polymorphic', that is, to work for
            ' essentially any user defined class my_class1 where
            '
            '     Dim m, n As Int32
            '     Dim x( n ) As my_class1
            '     Dim y( m ) As my_class1
            '
            ' This routine can build an ascending heap, as discussed
            ' above, or a descending heap. The only difference is how
            ' the comparisons are made in the class used for the
            ' interface.
            '
            ' The interface IComparer is used. If the usual class
            ' Comparer is used for this interface, then this function
            ' builds an ascending heap, that is, where upon return
            '
            '     y( 1 ) <= y( j ),
            '
            ' for j = 2, 3, ..., m.

            Function obj_heap_insert( _
                x As Object, _
                y() As Object, _
                ByRef k As Int32, _
                m As Int32, _
                compare As IComparer) As Int32

                Dim routine_name As String = "obj_heap_insert"
                Dim error_code As Int32
                Dim return_code As Int32 = 0
            '   Dim result_code As Int32
                Dim message_tag As Int32
                Dim i_father As Int32
                Dim i_child0 As Int32
                Dim i_child1 As Int32
                Dim i_child2 As Int32
                Dim i_2 As Int32 = 2

                Try
                    error_code = 1001
                    message_tag = 1001
                    If console_msg_level >= console_routine_messages2 Then _
                        Console.WriteLine( routine_name & " " & message_tag & _
                        ": Started ..." )

                    error_code = 1002
                    If m <= 0 Then
                        return_code = 1001
                        Goto out
                    End If
                    If k < 0 Then
                        return_code = 1002
                        Goto out
                    End If
                    If k > m Then
                        return_code = 1003
                        Goto out
                    End If
                    If y.GetUpperBound(0) < m Then
                        return_code = 1004
                        Goto out
                    End If

                    error_code = 1003
                    If k < m Then
                        i_child0 = k + 1
                        k = i_child0
                        error_code = 1004
                        Do  ' Sift value of x into correct position.
                            If i_child0 < 2 Then Exit Do
                            i_father = i_child0 \ i_2
                            error_code = 1005
                            If compare.Compare( x, y( i_father ) ) >= 0 Then Exit Do
                            y( i_child0 ) = y( i_father )
                            i_child0 = i_father
                        Loop  ' Sift value of x into correct position.
                        error_code = 1006
                        y( i_child0 ) = x
                        Goto out
                    End If

                    error_code = 1007
                    If compare.Compare( x, y( 1 ) ) <= 0 Then Goto out
                    i_father = 1
                    error_code = 1008
                    Do  ' Delete value of y( 1 ) and sift x to correct position
                        ' starting at location y( 1 )
                        If i_father > m \ i_2 Then Exit Do
                        i_child1 = i_father + i_father
                        If i_child1 < m Then
                            i_child2 = i_child1 + 1
                            error_code = 1009
                            If compare.Compare( y( i_child1 ), y( i_child2 ) ) < 0 Then
                                i_child0 = i_child1
                            Else
                                i_child0 = i_child2
                            End If
                        Else
                            i_child0 = i_child1
                        End If
                        error_code = 1010
                        If compare.Compare( x, y( i_child0 ) ) <= 0 Then Exit Do
                        y( i_father ) = y( i_child0 )
                        i_father = i_child0
                    Loop  ' Delete value of y( 1 ) and sift x to correct position
                          ' starting at location y( 1 )
                    y( i_father ) = x

                Catch
                    Console.WriteLine( " " )
                    message_tag = 1002
                    Console.WriteLine( routine_name & " " & message_tag & _
                        ": Error condition raised. " & vbCrLf & _
                        ": error_code = " & error_code & vbCrLf & _
                        ": err.number = " & err.number & vbCrLf & _
                        ": err.source = " & err.source & vbCrLf & _
                        ": err.description = " & err.description )
                    message_tag = 1003
                    Console.WriteLine( routine_name & " " & message_tag & _
                        ": Raising error condition with error_code = " & error_code )
                    err.raise(error_code, routine_name)
                End Try

            out:
                message_tag = 1004
                If console_msg_level >= console_routine_messages2 Then _
                    Console.WriteLine( routine_name & " " & message_tag & _
                    ": Returning." )

                Return return_code

            End Function  ' Function obj_heap_insert
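        For readers who want to try the "keep the m largest" idea from the routine's header comment without a .NET toolchain, here is a small sketch in Python. It is not a port of obj_heap_insert: it relies on Python's standard `heapq` module (a 0-based min-heap on a list) in place of the hand-written sift loops, and `heap_insert` and `m_largest` are hypothetical names chosen for this illustration.

```python
import heapq


def heap_insert(x, y, m):
    """Offer x to the min-heap y, which holds at most m items.

    While the heap is not full, x is simply pushed. Once it is full,
    x replaces the smallest kept item only if x is larger than it.
    """
    if len(y) < m:
        heapq.heappush(y, x)
    elif x > y[0]:
        heapq.heapreplace(y, x)   # pop the smallest, push x, restore the heap


def m_largest(xs, m):
    """Return the m largest items of xs in ascending order."""
    y = []
    for x in xs:
        heap_insert(x, y, m)
    return sorted(y)


top3 = m_largest([5, 1, 9, 3, 7, 9, 2], 3)
```

        As in the VB routine, the root of the heap is the smallest of the items kept so far, so most candidates are rejected with a single comparison once the heap holds large values.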