Sunday, October 26, 2008

Acoustic fingerprinting and sound origins

This task of going through the millions of sound and image files on YTMND has had me thinking a lot about meta-data that we could be pulling out of files and displaying for everyone to see. The obvious plan of action would be to grab ID3 and EXIF info out of uploaded files and put it in the database somewhere. One issue is that the majority of the files uploaded are edited, chopped, shopped etc, so most of the data (if any) would be useless or only partially relevant.
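As a rough sketch of the ID3 half of that idea, here is what reading a tag could look like. It assumes plain ID3v1 only (the fixed 128-byte block at the very end of an MP3); ID3v2 and EXIF need considerably more work. `parse_id3v1` is a hypothetical helper for illustration, not anything that exists in the YTMND code.

```python
import struct
from typing import Optional


def parse_id3v1(data: bytes) -> Optional[dict]:
    """Return the ID3v1 fields found at the end of an MP3's bytes, or None."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    # An ID3v1 block always starts with the ASCII marker "TAG".
    if tag[:3] != b"TAG":
        return None
    # Layout after the marker: title, artist, album (30 bytes each),
    # year (4 bytes), comment (30 bytes), genre index (1 byte).
    title, artist, album, year, comment, genre = struct.unpack(
        "<30s30s30s4s30sB", tag[3:]
    )

    def clean(raw: bytes) -> str:
        # Fields are padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": clean(title),
        "artist": clean(artist),
        "album": clean(album),
        "year": clean(year),
        "comment": clean(comment),
        "genre": genre,
    }
```

An edited or re-encoded clip will often have no tag at all, in which case this just returns None, which is exactly the "useless or only partially relevant" problem mentioned above.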

I spent some time looking into fuzzy image matching algorithms for the dupecheck system I designed a while back, but there didn't seem to be any open source implementation available that went beyond extremely simple proof-of-concept code. It was clear from the research I did that implementing anything similar myself was beyond both my math education and the time I had available.

Recently I've found the Shazam application on the iPhone to be quite helpful. To summarize, you just hold your phone near a source of music, it records for around 8 seconds and then computes a hash somehow and sends it off to their servers where it tries to find a match. I've found the success rate to be fairly good, maybe 70% of the time (usually the more mainstream stuff).

There is also an open project called MusicBrainz which aims to be an open-source CD lookup/identification system and they've created some nice programs that use third party acoustic fingerprinting among other methods to identify MP3 files to add and correct meta-data. The most recent system they've started using for acoustic fingerprinting is MusicDNS which is proprietary but seems fairly young and approachable.

I was thinking about how a system like this could be beneficial to YTMND. Once I am feeling less pressured to get important things done, I'd like to play with the SDK from one of these services on some YTMND sound clips. At this point, many users go out of their way to put incorrect info in the origin fields, which I find disappointing. YTMND has been a great way to find new artists and music, so it would be nice to have this information automatically filled out for you.

I would be interested to see if anyone else has played with this sort of technology on non-standard music clips such as "mashups" or even artists like Girl Talk who use multiple clips of other songs in rapid succession. These days it is extremely common for music to use samples from other songs made over the last century. I imagine at this point it would be incredibly hard to accurately figure out the samples of every chunk used in a song or loop, but it would be pretty amazing to have some sort of tree-view that showed songs and what songs they sampled from etc.

Anyway, still waiting on that replacement power supply. Thanks to the IRS I don't currently have the monetary freedom to order a third replacement, so I am patiently harassing this company to ship the correct item.

Monday, September 22, 2008

Featured users algorithms

Some of you may have noticed that the featured users list just changed. There have been a few people either angry or confused as to why they were removed, and others asking about the algorithm, so I figured I'd write a quick post about it.

Featured users was an idea I came up with one night in response to ROY4L's thoughts on the motivation of users to produce quality content as well as my near religious following of Clay Shirky's "A Group Is Its Own Worst Enemy" essay.

"Featured users" was supposed to provide two main benefits; to find users who consistently make good content that we could highlight on the front page in an effort to make the front page less overwhelming and less of a "treasure hunt". Second, and more often overlooked was to create an incentive for the "cream of the crop" users to create new content and interact with the site more often.

The goal was (and still is) ambitious: figure out mathematically who the best content producers are. I was surprised that a simple algorithm could produce fairly good results. The last featured users list was roughly 98% generated by the algorithm, with the last 2% being me adding or removing users manually.

So here, for the first time, is a rundown of the incredibly simple featured users algorithm
(originally called 'user score'):

(average_site_score * 1.2)
(average_number_of_votes * 0.23)
x (number_of_favorites * 0.43)
---------------------------------------
= Your dumb score.
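
For anyone who would rather read that as code, here is a minimal sketch. It assumes the three weighted terms are multiplied together, which is one reading of the layout above and not something the formula spells out. `user_score` is a hypothetical function name.

```python
def user_score(average_site_score: float,
               average_number_of_votes: float,
               number_of_favorites: float) -> float:
    """Raw (un-normalized) user score using the weights listed above."""
    # ASSUMPTION: the three weighted terms are multiplied. The original
    # layout only shows an explicit "x" in front of the last term.
    return (
        (average_site_score * 1.2)
        * (average_number_of_votes * 0.23)
        * (number_of_favorites * 0.43)
    )
```

Under that reading, a user with no favorites or no votes scores zero no matter how high their average site score is.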

We calculate a score for all the users, then scale the top score to 10000 and scale every other score by the same factor, so each score is effectively a percentage of the top one. An example of how this turns out shows that the weight is very light towards the top and heavy towards the bottom. Here are some sample results from an old run:

#1 nutnics 10000
#2 ROY4L 9947.16
#3 phaseblue 8328.45
#4 max 6374.33
#5 astuteNacute 5957.35
#6 krebstar 4685.71
#7 syncan 4451.45
#8 kingstefan 3757.21
#9 ALMusic 3620.85
#10 PCF 3549.24

From there, the scores went down drastically, because users near the top skew the results for everyone else. The requirement for getting on the "list" was a score over 200, which only about 300 people achieve.
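
The scaling and cutoff steps described above could be sketched roughly like this. The function names and the sample users are made up for illustration; only the 10000 ceiling and the 200 cutoff come from the post.

```python
def normalize_scores(raw_scores: dict, top: float = 10000.0) -> dict:
    """Scale scores so the best user gets `top` and the rest scale with it."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best <= 0:
        # Nothing to scale against; everyone gets zero.
        return {user: 0.0 for user in raw_scores}
    return {
        user: round(score / best * top, 2)
        for user, score in raw_scores.items()
    }


def featured(normalized: dict, cutoff: float = 200.0) -> list:
    """Users whose normalized score is over the cutoff, best first."""
    ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
    return [user for user, score in ranked if score > cutoff]


# Hypothetical raw scores, not real YTMND data.
raw = {"user_a": 50.0, "user_b": 25.0, "user_c": 0.5}
scaled = normalize_scores(raw)
print(scaled)
print(featured(scaled))
```

Because everything is divided by the single best score, one runaway user at the top pushes everybody else's number down, which is the skew mentioned above.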

At first, I used more data to base the scores on: number of comments (this is why whetstone made the list), number of views, etc. I then realized sites like "Blue Ball Machine" skewed the averages for everyone, so I tried calculating them with the top 5% of each user's sites excluded (trying to discard anomalies), but the results were still really off.
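
Dropping the top 5% of a user's sites before averaging is a one-sided trimmed mean. A minimal sketch, assuming the count to drop is rounded up and that at least one site is always kept (both are my assumptions; `trimmed_average` is a hypothetical name):

```python
def trimmed_average(site_scores: list, drop_percent: int = 5) -> float:
    """Average of a user's site scores with the top `drop_percent` removed."""
    if not site_scores:
        return 0.0
    ranked = sorted(site_scores, reverse=True)
    # Integer ceiling of len * drop_percent / 100, avoiding float rounding.
    drop = (len(ranked) * drop_percent + 99) // 100
    # ASSUMPTION: never trim a user down to nothing.
    kept = ranked[drop:] or ranked
    return sum(kept) / len(kept)
```

With 20 sites this throws away the single biggest one, so one runaway hit no longer drags the whole average up.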

The numbers don't lie. This algorithm is working on a large enough set of data that a few up-voting alts won't make a difference. More people are viewing, favoriting and voting on the featured users than on those who aren't featured (even if you use a time period from before featured users existed).

Now the only thing that you can really muck with here is the weight on each piece of the algorithm. Normally, you can look at the results and change the algorithm to remove results you don't like or get results you do like, but a huge part of this is opinion based. Multiple people want DarthWang to be featured, but I find I think the problem is that you can't please everyone with a single list.

This time around, after I was persuaded to let it happen, BTape and Teknorat took the generated list and then added and removed people as they saw fit, which is what caused much of the recent change. So focus your rage towards them for the next couple weeks.

I also made a quick change to the featured users content box so that it filters out duplicate users. At any one point in time you won't see more than one site by each user, which I think will deal with a lot of the spam issues.
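
The duplicate filter amounts to keeping the first site seen for each user. A small sketch, where the list-of-dicts shape and the "user" key are assumptions for illustration and not the real content box data:

```python
def one_site_per_user(sites: list) -> list:
    """Keep only the first site from each user, preserving order."""
    seen = set()
    result = []
    for site in sites:
        if site["user"] in seen:
            continue  # this user already has a site in the box
        seen.add(site["user"])
        result.append(site)
    return result


# Hypothetical rows, ordered however the box is already sorted.
box = [
    {"user": "user_a", "site": "one"},
    {"user": "user_a", "site": "two"},
    {"user": "user_b", "site": "three"},
]
print(one_site_per_user(box))
```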

anyway, back to work, dongs.

Thursday, September 18, 2008

Asset conversion

One of the problems with moving over to Flash is that we can only import a finite set of file types at run-time. Flash doesn't support WAV, OGG or animated GIFs natively, so part of the move is to clean out the asset system and convert file types where needed.

When YTMND was originally created, there was no file type checking at all; people could basically upload anything and we would save it. Then we added simple file name checks, which removed a good portion of the misclicks, but invalid files with the proper file extensions still got through. The next variation of checks used mime-magic to actually check the file types to see if they were valid, but with the millions of different possibilities when encoding sound and images, mime-magic still wasn't perfect. In addition to that, the original mime-magic setup checked that the user uploaded a sound and an image, but didn't make sure they put them in the right fields.
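
To give an idea of what content-based checking looks like, here is a minimal sketch that sniffs a handful of well-known file signatures and also checks that the file landed in the right field. It is an illustration only, not the mime-magic setup or the new routine; `sniff` and `in_right_field` are hypothetical names and the signature list is far from complete.

```python
# (magic bytes at offset 0, detected type, category)
SIGNATURES = [
    (b"GIF87a", "gif", "image"),
    (b"GIF89a", "gif", "image"),
    (b"\xff\xd8\xff", "jpeg", "image"),
    (b"\x89PNG\r\n\x1a\n", "png", "image"),
    (b"ID3", "mp3", "sound"),    # MP3 that starts with an ID3v2 tag
    (b"OggS", "ogg", "sound"),
    (b"MThd", "midi", "sound"),
    (b"PK\x03\x04", "zip", "other"),
    (b"MZ", "exe", "other"),
]


def sniff(header: bytes) -> tuple:
    """Guess (type, category) from the first bytes of a file."""
    # WAV is a RIFF container with "WAVE" at offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return ("wav", "sound")
    for magic, kind, category in SIGNATURES:
        if header.startswith(magic):
            return (kind, category)
    # A bare MPEG audio frame starts with 11 set bits (frame sync).
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return ("mp3", "sound")
    return ("unknown", "other")


def in_right_field(header: bytes, field: str) -> bool:
    """True if the sniffed category matches the upload field name."""
    return sniff(header)[1] == field
```

Here `in_right_field(b"GIF89a", "sound")` is False, which is the check the original setup was missing: the file is perfectly valid, it is just in the wrong box.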

Today, there are literally thousands of sites using images as sounds and vice versa, and hundreds of sites using files we don't support like MIDI, OGG, etc. Hundreds more use Word documents, zip files, executables, MPEG clips, etc. So far I've found some pretty interesting file types. Once I do an initial pass on the whole file system, I will post statistics.


Now that I am at the stage where we need to convert the entire site over to work in Flash, I have written a much more thorough file type checking routine. One of the nice things about the fact that we are forced to do conversion on a lot of assets is that once we are in there converting, adding new file types will be easy. Hundreds of people have uploaded MIDI as sound files on their sites, which currently don't work at all. In the new conversion routine, MIDI files are converted to WAV and then compressed to MP3, so we can allow people to upload MIDI files if they want.

The new system will basically be backwards compatible with the old system, so SWF versions of WAV files are just treated as children of the original asset, i.e. the original asset is archived. This means if we allowed MIDI files, they would be converted to MP3 to be heard in Flash, but users would still be able to download the original MIDI file from the site's asset pages.
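
One way to picture the parent/child idea is a tiny tree where the upload is the root and every converted version hangs off it. This is a sketch under my own assumptions, not the actual asset schema; the `Asset` class, its fields, and the set of Flash-playable types are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Asset:
    filename: str
    file_type: str
    parent: Optional["Asset"] = field(default=None, repr=False)
    children: List["Asset"] = field(default_factory=list)

    def add_conversion(self, filename: str, file_type: str) -> "Asset":
        """Attach a converted version; this asset stays archived as-is."""
        child = Asset(filename, file_type, parent=self)
        self.children.append(child)
        return child

    def original(self) -> "Asset":
        """Walk up to the file the user actually uploaded."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    def for_playback(self, flash_types=("mp3", "swf")) -> Optional["Asset"]:
        """First version Flash can load (ASSUMED type list), else None."""
        if self.file_type in flash_types:
            return self
        for child in self.children:
            found = child.for_playback(flash_types)
            if found is not None:
                return found
        return None


# Hypothetical MIDI upload: converted to WAV, then compressed to MP3.
midi = Asset("song.mid", "midi")
wav = midi.add_conversion("song.wav", "wav")
mp3 = wav.add_conversion("song.mp3", "mp3")
print(midi.for_playback().filename, mp3.original().filename)
```

The player would ask for `for_playback()`, while the asset page would link to `original()`, so the download is always whatever the user uploaded.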

So what sort of files do you think it would be beneficial to add to the allowed upload list? OGG? MOD? NSF or all the other 8 bit music types? What about image types?

I have thought about allowing SWF/FLV since a lot of people make their animated GIFs in Flash, but not only do I think the compression requires a human hand, I am concerned that YTMND would become even more of a YouTube clone than it is now.

Anyway that's all for now. Post a comment if you have any ideas or questions or hit up the IRC if you want to chat about it.