Invested in the stock market? Portfolio crashing? Glued to your monitor watching this train wreck in slow motion?
Try installing Wireshark and watching the packets traversing your network in real-time instead. You get all the thrills of real-time data, and you’ll actually learn something about all the packets passing through your computer. Best of all, it’s free and you can forget about all the money you’re losing (unless you had the foresight to invest in Gold, Platinum, Corn and Wheat).
Posted by admin, in Uncategorized on Mar 04 08
As of 11/15, all indexes are back on a regular update schedule. News is updated daily, Podcasts and Blogs are updated weekly, and the full crawl of 24,000+ sites is now updated incrementally on a continuous basis.
Emphasis in the search index continues to be on deep searches of sites that are useful to software developers.
In the queue for the next round of updates is better support for podcast and blog search, and performance improvements to the results clustering server.
You can access the search site directly HERE. If you want to bypass the Carrot2 Clustering Engine, go HERE.
Posted by admin, in Uncategorized on Nov 19 07
A few months ago I did a fairly positive review of Windows Vista Ultimate, but after more than 9 months of use I’ve reconsidered. At the moment I’m thinking the apex of Microsoft’s desktop operating systems was probably Windows XP SP2. Here are a couple of my observations:
Performance
Because we do practically daily builds of our software, I run VMware so that I can install and look at the latest features and functionality and, when required, do some smoke testing. I recently installed an XP virtual machine hosted on Vista and have been using it not only for our builds, but also as the host for some apps like Vongo, iTunes and Rhapsody that suffer from poor performance on Vista. This is obviously not a scientific test, but the XP virtual machine is fast and responsive, and isn’t plagued by the performance glitches I’ve seen on Vista.
Stability
I mysteriously got a blue screen error when booting Vista this week after applying a software patch. I was unable to get the OS to boot in any mode, and after some tinkering discovered the registry was corrupted. In particular, the entries for keyboard support were damaged and after playing with the system for an hour I gave up trying to get it running.
Giving up, I went out and bought a new (larger) drive, did a fresh install, and connected the old drive to the machine via an external enclosure to copy my data files onto the new one. The fresh install performs better than the one that had months of detritus on it, but I still get mysterious hangs when browsing between folders, and the performance of Outlook 2007 is really appalling.
Application Compatibility
Spotty. It’s hard to say whether Vista or the applications themselves are to blame, but even with Excel and Access I’ve had catastrophic, data-losing failures.
Aesthetics
Really, when you get right down to it, not much better than XP. Certainly not gorgeous, or to die for.
So my revised recommendation for Vista users with the RAM and disk space to spare is: go out and spring for a copy of VMware Workstation 6.2 (or your VM of choice) and install a guest instance of Windows XP. While you’re at it, maybe a Linux distribution too. That way, when you start yearning for the good ole days running Windows XP, you can jump right into it again without doing a complete OS install. Hey, and when SP1 for Vista comes out you can upgrade and toggle right back.
I’ve harped on the negatives, and Vista does have improvements. Too bad that, like many things Microsoft, it shows promise but also feels 85% complete. For the first time since 1987 I find myself recommending Macs to users without large legacy investments in Microsoft software.
Posted by admin, in Uncategorized on Nov 17 07
Well, it’s not hazy, hot and humid anymore. Here are some pics from the grounds of the HHH world headquarters.

[Three photos, each captioned: Stony Brook 11/2007]
Posted by admin, in Uncategorized on Nov 09 07
Connection to this site is a bit sporadic, but it’s the only webcast I’ve come across of a Chuck Prophet show. You can find it here. I think he’s great. Hopefully you will also.
Posted by admin, in Uncategorized on Oct 31 07
As I was building FindITAnswers, three software tools were critical to managing my spider indexes. Where spider exclusion rules act as a first line of defense for maintaining the quality of the index, a few simple utilities on the back end are also immensely valuable:
Merge Utility: Merges multiple indexes into one. This was an invaluable utility since FIA’s spider crawl was divided into 125 index segments. The indexes were organized around key platform vendor sites, and sites with similarly structured content. Using multiple indexes has lots of obvious advantages, including:
This was the Lucene utility I developed and, while inelegantly coded due to my superficial Java knowledge, it works as anticipated.
Kelvin Tan developed two small utilities for me that are also critical to increasing the relevance of search results. When you don’t have a team of astrophysicists building your search algorithms, tools that improve the quality of your indexes can help your search engine a lot:
De-duplication Utility: An unavoidable byproduct of using multiple spider crawls/index segments is that some pages get indexed more than once. Rather than checking for and suppressing dupes at search time, this simple utility looks at a merged index and deletes any duplicated pages.
Ad-Hoc Deletion Utility: This tool allows deletion of index records based on keywords, terms, wildcards and regular expressions, and can restrict the match to specific index fields. This is great for scrubbing pages that pollute search results, and for catching anything that got through the initial spider exclusion filters.
Combining the simple utilities above with a good database of URLs to crawl and well-planned spider exclusions can vastly improve the results your search engine delivers, because you are feeding it higher-quality indexes.
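To make the de-duplication step concrete, here is a minimal Python sketch. It is not the utility described above and does not use Lucene: the merged index is assumed to be a plain list of records with hypothetical `url` and `body` fields, and a page counts as a duplicate if either its normalized URL or its whitespace-normalized body text has been seen before.

```python
import hashlib


def normalize_url(url):
    """Drop any #fragment and a trailing slash so trivial URL variants compare equal."""
    return url.strip().split("#", 1)[0].rstrip("/")


def content_fingerprint(text):
    """Hash whitespace-normalized text so reflowed copies of a page collide."""
    collapsed = " ".join(text.split())
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each URL and for each body fingerprint."""
    seen_urls = set()
    seen_bodies = set()
    kept = []
    for record in records:
        url_key = normalize_url(record["url"])
        body_key = content_fingerprint(record["body"])
        if url_key in seen_urls or body_key in seen_bodies:
            continue  # duplicate of a record that was already kept
        seen_urls.add(url_key)
        seen_bodies.add(body_key)
        kept.append(record)
    return kept
```

Running this once over the merged index, rather than filtering at query time, mirrors the choice described above: the cost is paid once at index-build time instead of on every search.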
In Part 4 of this series, I’ll discuss clustering search results — and my experience having the Lingo3G Document Clustering Engine integrated with Lucene.
Posted by admin, in Search on Oct 14 07
There’s a lot to like in Office 2007, but the learning curve for the new UI is steep. If you’ve been a casual user of the apps you can probably quickly find the few features you’ve become accustomed to using, but if you live in the apps (like I do) prepare yourself for days of hunting for the new locations of your most used features.
Over time, the new ribbon bars do become handy time savers, but in the meantime prepare yourself for a big hit to your productivity. After a few months, I still find myself periodically going into brain lock trying to remember the location of a simple menu item or button.
There is hope. I recently installed Classic Menus for Office from www.addintools.com. Essentially this $19.95 utility gives you back your Office 2003 menu system.
I’ve found this tool to be a handy timesaver, and certainly worth the price. The utility adds a new Office menu called "Menus."
Selecting "Menus" will give you a ribbon bar containing Office 2003 style menus and menu bars. If you prefer to work with the Office 2007 ribbons, they are still available to you in their standard locations.
There are only three drawbacks worth mentioning. 1) Load time for your Office apps will increase, on my machine by 2-3 seconds; 2) While this app will certainly ease the hits to your productivity during the first few weeks of upgrading to Office 2007, in the long run it might actually keep you from discovering some of the suite’s nifty new features if you never explore the new UI; 3) I can’t pin this entirely on Classic Menus for Office, since I downloaded Microsoft’s Vista patches this week, but I have noticed some strange UI behaviors after installing the patches that seem to be related to patch/menu interactions.
If time is money to you, this is a pretty inexpensive solution to managing your migration to the new Office 2007 user experience.
Just when you’ve had about enough of every new-fangled website staking a claim to Web 2.0, you need to contend with Web 3.0.
The Wikipedia post on Web 3.0 talks about the “decomposition” of websites into discrete widgets, and the “data Web”, and the semantic web, but the far more interesting aspect of this is the seamless integration of Web and desktop computing experiences — and it’s happening right under our noses today. Perhaps the best examples I’ve seen are the desktop clients for Rhapsody and iTunes.
Both are desktop apps that are essentially wrappers around web-based content — and in case you’re wondering where the Web 3.0 is here — it’s in the enormous databases of music metadata, customer reviews, playlists, etc.
Both companies have done an outstanding job of creating user experiences that allow people to move seamlessly between local content and content in the cloud. While both apps have their flaws, they do an outstanding job of allowing users to navigate huge amounts of data to find the media they want. Consider for a minute how much easier it is to navigate terabytes of data on Rhapsody to find the music you’re looking for than it is to find out what you might want to watch on TV at any point in time.
Looking into the future, they both have an opportunity to build client platforms that expand beyond music (spoken word, and books) into other digital media — and ultimately create user interfaces for affinity groups and social networks. Music is a fascinating starting point for such an expansion since it’s extremely viral — and evocative in ways that movies and books are not. Music is intensely social, and Internet users have shown a proclivity for sharing it.
Rhapsody in particular has caused me to go out and buy lots of music from artists I probably never would have heard of had they not been recommended by Rhapsody’s profiler (two artists in heavy rotation in our house now are Keller Williams and Mike Doughty both via Rhapsody suggestions).
Desktop software isn’t dead, but Rhapsody and iTunes are great examples of the paradigm shift ISVs need to embrace to survive in a Web 3.0 world.
Posted by admin, in Uncategorized on Jul 25 07
We recently moved out of our Bay Area ranch, a house in which it was pretty easy to set up both wired and wireless networks, into a bigger house with lots of hard-to-wire rooms.
On a whim I bought a few Powerline Netgear XE102 Wall Plugged Ethernet Bridges. In the past it seemed unnatural to plug an ethernet cable into such a small device attached directly to an AC outlet, but I finally broke down and started testing this in our house. So far so good.
Powerline devices use your home’s electrical wiring to transmit data.
I have to admit I half expected my laptop to go up in smoke the first time I plugged it directly into one of these devices.
Setup was a snap. One XE102 is plugged into an AC outlet and connected to an ethernet switch in my office, with the other devices plugged in throughout the house and a mix of PCs and wireless access points connected to them.
So far the devices have performed flawlessly, and allowed me to forgo adding wired ethernet drops throughout our house.
Some reviews have noted that data transfer rates are low on these devices, but they’re higher than the throughput I’m getting from our cable internet provider.
Posted by admin, in Uncategorized on Jul 14 07
In addition to software-based factors that influence search, like the quality of the indexing and retrieval algorithms, vertical search has an advantage over broad-based search engines: you as the administrator can constrain the content you crawl, and thus use human QA to make up for the deficiencies of purely algorithmic search. If you weed out irrelevant content, you can go a long way towards improving the quality of results. Some of the factors that you can use to influence the quality of your index:
Sites you crawl
Paths you include and exclude
Pages you include and exclude
Utilities to arbitrarily delete documents from the index based on pattern matching
Let’s look at each of these.
Sites You Crawl
I developed my own seed database starting with a few thousand sites related to software development, IT and the environment. You have a number of options for spiders to build your crawl database. I ran my spider letting it do 5 hops from each initial URL. After the crawl was done, I took the crawl logs for the URLs in the first 5 hops and used them as the basis for a second crawl — also 5 hops deep.
The second crawl became the basis for the FIA database that has subsequently been enhanced with other sites added manually.
Every two or three months the spider database is updated by using the crawl logs to add the URLs found in hops 2-5 as new root URLs.
In practice each iteration of the crawl produces deeper results since the crawls are starting at progressively deeper root URLs. In each of these crawls the spider is allowed to harvest pages on sites external to the root URLs.
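The hop-limited crawl and the reseeding step can be sketched in a few lines of Python. This is an illustration only, not the spider used for FIA: `crawl`, `next_round_seeds` and the `get_links` callback are hypothetical names, and link extraction is left to the caller so the sketch does no network access.

```python
from collections import deque


def crawl(seed_urls, get_links, max_hops=5):
    """Breadth-first crawl; returns {url: hop at which it was first reached}."""
    hops = {url: 0 for url in seed_urls}
    queue = deque(hops)  # seeds are at hop 0
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # do not follow links past the hop limit
        for link in get_links(url):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return hops


def next_round_seeds(crawl_log, first_hop=2, last_hop=5):
    """Pick URLs first reached at hops first_hop..last_hop as the next roots."""
    return sorted(
        url for url, hop in crawl_log.items() if first_hop <= hop <= last_hop
    )
```

Because each round starts from URLs that were already several hops deep, repeated rounds reach progressively deeper pages, which is the effect described above.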
Paths and Pages to Include and Exclude
As you analyze the results of your spider crawl it will become obvious fairly quickly which sites, paths and documents you’ll want to exclude from your crawl database and your indexes.
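As a rough sketch of what such exclusions can look like, the Python below checks a URL against shell-style wildcard rules for hosts, page names and query strings. The rule lists and the `is_excluded` helper are hypothetical examples for illustration, not the configuration format of any particular spider.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule lists; a real list would grow much longer over time.
EXCLUDED_HOSTS = ["*digg.com", "*technorati.com"]
EXCLUDED_PAGES = ["login*", "privacy*", "aboutus*"]
EXCLUDED_QUERIES = ["*print=*"]


def is_excluded(url):
    """Return True if the URL matches any host, page or query exclusion rule."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    page = parts.path.rsplit("/", 1)[-1].lower()  # last path segment
    query = parts.query.lower()
    if any(fnmatchcase(host, rule) for rule in EXCLUDED_HOSTS):
        return True
    if any(fnmatchcase(page, rule) for rule in EXCLUDED_PAGES):
        return True
    return any(fnmatchcase(query, rule) for rule in EXCLUDED_QUERIES)
```

Note that a loose host rule like `*digg.com` also matches any other host that happens to end in the same characters, so tighter patterns may be worth the extra effort.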
Virtually every search platform has the capability to create rules excluding certain sites or documents. You’ll want to exclude commonly linked-to sites like digg, technorati, NY Times, Yahoo, etc. You’ll also want to add rules to your search engine spider to ignore prevalent documents like login*, privacy*, aboutus*, *print=*, etc. You get the idea. In practice this will become a long list, and it is one of the keys to increasing the quality of the results you return for queries. You’ll also want filter rules that exclude gambling, porn, hotels, travel and other common search engine spam.
Utilities to Arbitrarily Delete Indexed Documents
You’ll find that despite a good database of crawling rules, you still get undesirable results in your index. I had a tool developed that allows for SQL-style select queries against a Lucene index and allows deletions based on pattern matching and regular expressions. This is a handy way to delete docs that slip through the spider filters, or sites that use overly aggressive SEO. You’ll also probably want to filter sites with poorly constructed pages that, for example, use the same title on every page (alternatively, you could use document heading tags rather than the meta title as the basis for your index; more on that in a future post).
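The idea of deleting by pattern against a chosen field can be sketched as follows. This is not the tool described above and does not touch Lucene: the index is assumed to be a list of dictionaries, and `delete_matching` is a hypothetical helper that splits records into kept and deleted so the deletions can be reviewed before they are committed.

```python
import re


def delete_matching(records, field, pattern):
    """Split records into (kept, deleted) by a case-insensitive regex on one field."""
    regex = re.compile(pattern, re.IGNORECASE)
    kept, deleted = [], []
    for record in records:
        # Records missing the field are treated as non-matching and kept.
        if regex.search(record.get(field, "")):
            deleted.append(record)
        else:
            kept.append(record)
    return kept, deleted
```

Returning the deleted records as well as the survivors makes it easy to spot an over-broad pattern before it removes pages you wanted to keep.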
In the next post, I’ll give an overview of how you give users access to the index, and present results.
Posted by peter, in Search on Jul 05 07